Clean Version Control for Web Sites

Mark Eichin, The Herd Of Kittens



CVS has become the Free Software Solution for version control; Apache likewise for web service. WEBRCS attempts to provide a clean integration of the two, allowing effective version control of the content of a web site.


The goals of the project are simple: allow Apache to transparently serve files from a CVS repository. Periodic update of a "sandbox" directory would always be somewhat out of date, making it harder to test changes on the "live" server and requiring authors to have some understanding of the update timing. Commit-triggered update is feasible, though relatively complex, and contains multiple failure points, thus potentially needing a "resynchronize" step; it also doubles the total required storage.

A more straightforward approach is to let the web server directly access the repository and convert the files on the fly. While this might also seem expensive (certainly more so than direct file access), in practice it is cheap, since the RCS file format is optimized for rapid construction of the "head" revision.

Prototype Implementation

The initial implementation is split into two parts, the webrcs CGI script and the access.conf changes to invoke it.

The access.conf changes use the new ModRewrite feature of Apache 1.2. The actual additional lines are as follows:

RewriteEngine on
RewriteLog /var/log/apache/rewrite.log
RewriteRule ^/rtest/(.*)$ /cgi-bin/webrcs?$1 [PT]
ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
The first line enables the rewriting feature[1] while the second directs the rewrite log to an appropriate location. The last two lines must appear together in the config file; RewriteRule converts any URL starting with /rtest into a "get" query on the webrcs CGI script, and then Passes Through the result to the following ScriptAlias line, which actually makes it a CGI invocation.

[1] ModRewrite does not exist in Apache 1.1; a configuration error on this specific line is usually the clearest indication that an older server is in use.

In Apache 1.3, the configuration is instead

RewriteEngine On
RewriteLog /var/log/apache/rewrite.log
RewriteLogLevel 5
RewriteRule ^/rtest/(.*)$ /cgi/webrcs?$1 [T=application/x-httpd-cgi]
Note that the ScriptAlias command is no longer needed due to the explicit MIME-type T= tag.

The script itself simply has to take the pathname given (which appears in the QUERY_STRING environment variable), look it up in the repository, add a Content-type: header, and dump the contents to standard output. These can be treated as distinct pieces, each passing a partial result further down the chain.

$conf = $ENV{SCRIPT_FILENAME}.".conf";
do $conf if -r $conf;
&status(400, "cvsroot not defined") if ! -d "$cvsroot/CVSROOT";
&status(400, "realbase not defined") if ! defined $realbase;

First the script uses a crude mechanism to find a config file based on the name of the script itself. It then sanity-checks that the configuration actually points at a real repository by checking for the CVSROOT directory. Likewise we verify that $realbase is set so that later code can generate a proper redirect. If either check fails, we generate an appropriate HTTP error and use the status function to transmit it back to the server, which translates it into a Status-line to send to the client.

sub status {
    my $st = shift;
    my $msg = shift;
    print "Status: $st $msg\r\nContent-type: text/html\r\n\r\n";
    print "<html><head><title>$msg</title></head>\r\n";
    print "<body><h1>$msg</h1></body>\r\n";
    die "webrcs: $msg";
}

The status function itself sends back a Status header to let the server (and thus the browser) know about the error, and then some html to let the user know clearly as well.
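The same error path can be sketched in Python for comparison (the function name, the `out` parameter, and the SystemExit convention are illustrative, not part of webrcs itself):

```python
import sys

def status(code, msg, out=sys.stdout):
    # Send a CGI Status header plus a minimal HTML body, then abort,
    # mirroring the Perl status() above (illustrative sketch only).
    out.write("Status: %d %s\r\nContent-type: text/html\r\n\r\n" % (code, msg))
    out.write("<html><head><title>%s</title></head>\r\n" % msg)
    out.write("<body><h1>%s</h1></body>\r\n" % msg)
    raise SystemExit("webrcs: %s" % msg)
```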

$filename = $ENV{"QUERY_STRING"};
&status(404, "no path supplied") if not defined $filename;
&status(400, "bad request") if $filename =~ /\.\./;
&status(400, "bad request (Attic)") if $filename =~ /\/Attic/;

After configuring itself, the script actually reads the query from the user. If somehow there is no query, we report an error (perhaps the user has invoked the CGI directly, instead of via a rewrite rule, or is running the program by hand). It then performs a simple check on the validity of the path, and another check to make sure the user isn't trying to reach a file that we deleted. Further checks could be added, but since we don't do any wildcard processing on the path, they are probably not actually necessary.
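A minimal restatement of these checks, in Python for illustration (the `path_ok` helper and its return convention are invented for this sketch, matching the substring tests in the Perl above):

```python
def path_ok(path):
    # Reject missing paths, any '..' component (directory escape),
    # and anything reaching into an Attic (deleted-file) directory.
    if not path:
        return (404, "no path supplied")
    if ".." in path:
        return (400, "bad request")
    if "/Attic" in path:
        return (400, "bad request (Attic)")
    return None  # path passes all the checks
```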

if ( -d "$cvsroot/$cvsbase/$filename" ) {
    if ("$cvsroot/$cvsbase/$filename" =~ m+/$+) {
	$filename .= "index.html";
    } else {
	# should redirect to a trailing-/ version, so later paths work
	# but that means we have to invert the rewrite... so the config
	# file now needs $realbase...
	my $xp = "$realbase/$filename/";
	# http/1.1 says that should be http-version, but we haven't negotiated
	print "Status: 301 Moved Permanently\r\n";
	print "Location: $xp\r\n\r\n";
	print "<html><head><title>Don't drop trailing slashes</title></head>\r\n";
	print "<body><h1>Don't drop trailing slashes</h1><a href=\"$xp\">corrected path</a></body>\r\n";
	die "webrcs: slash-redirect to $xp";
    }
}
$rcspath = "$cvsroot/$cvsbase/$filename,v";
&status(404, "not found in repository ($rcspath)") if ! -r "$rcspath";

Next, we start acting like Apache would given the path. If what we were handed is actually a directory in the repository, first check whether it has a trailing slash; if it doesn't, redirect the client to the corrected form of the URL so that later relative-URL references work correctly. If the slash is in place, rewrite it into a reference to index.html the way the DirectoryIndex directive would. Then the script constructs the actual in-repository path, and generates a specific error message if it can't find it.
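The decision logic can be sketched as a small pure function (the `resolve` name and the serve/redirect tags are hypothetical, and the directory test is assumed to have been done already):

```python
def resolve(filename, is_dir):
    # Directory with trailing slash: serve its index.html, as
    # DirectoryIndex would.  Directory without: 301-redirect to the
    # slash-terminated URL.  Plain file: serve as-is.
    if is_dir:
        if filename.endswith("/"):
            return ("serve", filename + "index.html")
        return ("redirect", filename + "/")
    return ("serve", filename)
```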

%ext = ("gif" => "image/gif",
	"jpg" => "image/jpeg",
	"png" => "image/png",
	"html" => "text/html",
	"txt" => "text/plain",
	"prc" => "application/octet-stream", # pilot programs
	"pdb" => "application/octet-stream", # pilot databases
	"tgz" => "application/octet-stream", # gzipped tar file
	"ps" => "application/postscript",
	"mpeg" => "video/mpeg");
($ext) = ($filename =~ m/\.([a-z]*)$/);
$ct = $ext{$ext};
$ct = "text/plain" unless defined $ct;

Here the script strips off the extension (safely) and looks it up in a table of MIME content types. If it doesn't find one, it defaults to text/plain, though perhaps application/octet-stream would be better when we can tell the file isn't text. Rather than extend this table much further, however, this code should be replaced with code that parses /etc/mime.types instead, for compatibility (and leverage) with the native server.
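A sketch of that suggested replacement, assuming the standard /etc/mime.types layout of a MIME type followed by whitespace-separated extensions, with '#' starting a comment (the `parse_mime_types` name is illustrative and the sample table is abbreviated):

```python
def parse_mime_types(text):
    # Build an extension -> MIME type map from mime.types-format text.
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]      # strip comments
        fields = line.split()
        if len(fields) < 2:               # no extensions listed
            continue
        for ext in fields[1:]:
            table[ext] = fields[0]
    return table

sample = """\
# abbreviated mime.types sample
text/html       html htm
image/jpeg      jpeg jpg
application/postscript  ps
video/mpeg      mpeg mpg
"""
```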

$| = 1;
print qq{Content-type: $ct

};
system("/usr/bin/co", "-p", "$rcspath");

Finally, we force flushing on standard output and emit the Content-type MIME header chosen above. Then we emit the actual content of the file in the repository... and we're done.

Later Enhancements

More Headers

webrcs should probably generate more than just a Content-type header, to make the output more realistic and help caches deal better. For example, the timestamp of the repository file is a potentially useful worst-case value for the Last-Modified header, which could be refined later by parsing the RCS headers directly.
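That worst-case value could be derived from the ,v file's modification time, formatted as an RFC 1123 HTTP-date; a Python sketch (`last_modified` is a hypothetical helper, not part of webrcs):

```python
import os
from email.utils import formatdate

def last_modified(rcspath):
    # The ,v file changes on every commit, so its mtime is a safe
    # (if pessimistic) Last-Modified value for the head revision.
    return "Last-Modified: " + formatdate(os.stat(rcspath).st_mtime,
                                          usegmt=True)
```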

Deleted Files

It should be possible to make webrcs deal correctly with files that have been cvs removed in new versions of CVS that don't use the Attic directory.


Tagged Revisions

Although checking out particular tagged revisions would be slower than checking out the mainline, it would permit doing "development" of the server content on a branch, with "releases" done by merging back into the mainline, and different conf files pointing to tags for different source URLs.
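Selecting a tag would only mean adding a -r option to the co invocation; a sketch of the argument construction (`co_command` is an illustrative helper, and the tag would come from the per-script conf file):

```python
def co_command(rcspath, tag=None):
    # co -p sends the revision to standard output; -rTAG selects a
    # symbolically-named revision instead of the head.
    args = ["/usr/bin/co", "-p"]
    if tag is not None:
        args.append("-r" + tag)
    args.append(rcspath)
    return args
```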

Status Reporting

The status function could include some additional text specified in the conf file, perhaps indicating the owner of this repository or giving other advice.

Directive Mimicry

webrcs mimics (or should mimic) the behaviour of several Apache directives, including DirectoryIndex, ScriptAlias, and the MIME-type mapping normally driven by mime.types.

It would be ideal if the gateway would parse the settings out of the Apache config files, though having some other tool drop them in the webrcs config file would be a reasonable compromise.

Repository Reading

There are a number of improvements that could be made to the actual extraction of the file from the repository; several of them would be simpler if the gateway were rewritten to call C code directly, since it could then share code with CVS or RCS.

Better HTTP/CGI Support

It would be more efficient to support HEAD directly, instead of having the server discard output. Supporting FastCGI might also be valuable.
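Direct HEAD support amounts to emitting the headers while skipping the body generation (the expensive co run); a Python sketch of the shape such a change might take (`respond` and `body_fn` are illustrative names):

```python
import sys

def respond(method, content_type, body_fn, out=sys.stdout):
    # Headers go out for every method; the body (here a callable
    # standing in for the checkout step) only runs for non-HEAD.
    out.write("Content-type: %s\r\n\r\n" % content_type)
    if method != "HEAD":
        out.write(body_fn())
```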

Unpacked Files

One problem with storing CGI programs is that they generally can't be run dynamically out of the repository. A temporary version could be checked out per request, which is somewhat complex. A more reliable approach might be to use the checkoutlist feature to keep the CGI up to date directly in the repository at commit time.

CVS commit processing

CVS also supports a commitinfo interface for processing files before check-in. This would allow tools like weblint, HTML::Clean, or other analysis or style tools to be run on any committed modification.
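As a sketch, a commitinfo entry pairs a module-path regular expression with a program to run; CVS aborts the commit if the program exits nonzero (the module name and script path below are examples only, not part of webrcs):

```
# CVSROOT/commitinfo -- run an HTML checker before accepting commits
# (module pattern and wrapper-script path are illustrative)
^site	/usr/local/bin/webrcs-precommit
```

CVS invokes the named program with the repository directory and the files being committed as arguments, so the wrapper can pick out the .html files and hand them to weblint or HTML::Clean.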
