Amazon S3 - a large, far away, widely distributed gadget

2013-05-17 23:51:58 -0400

Originally posted on tumblr

Amazon S3 counts as a gadget, right? :-) I've been using it professionally for a while, and of course many of the services we take for granted until us-east-1 goes down use it too. Turns out that you can hook it in to a homebrew website without very much work...

The other day I traced a period of terrible performance (8s network latency getting out of the house) to a visit from Googlebot-Video/1.0 fetching an old AVI file from a post-hoc image stabilization project (now made mostly redundant by youtube's builtin stabilization feature.) The file was about 50M, and anyone interested in the project really wants the less-compressed original, shoving it to youtube really doesn't help... but it turns out that that's tiny by Amazon S3 standards, and the free tier covers it just fine.

There were a surprisingly small set of steps; I'm posting them here with the actual domains involved, since they're visible and public anyway, you just need to convert them to your own needs...

Create yourself an AWS account. (You already shop at amazon, right? Just log in and create one...)
"Better get a bucket." Go to the console web page, pick S3, hit "create bucket". Name it something obvious; in my case, avi.thok.org although something more generic like s3.thok.org would have been a common choice. Do this first, because the bucket namespace is global and isn't checked against DNS registration at all, so there's a very faint chance someone already has a bucket of that name; at this stage, if you find a collision you can just pick a different name, like s3-namespaces-can-you-speak-it.thok.org.
Install s3cmd (just git clone the github version and run it from the checkout - the one in ubuntu doesn't actually handle puts with redirects.)
Configure it: s3cmd --configure and get the Access and Secret keys from the console under "security"; don't bother to configure encryption or https because these are files that are already available by http, you don't want to deal with certificates, and you'll check the md5sums later.
Copy your files. Note that s3 doesn't have directories per se; you just put the path with slashes in place as you go, so s3cmd put --no-encrypt kicx1440.avi s3://avi.thok.org/me/publish/europython/day2/kicx1440.avi works just fine, without having to do anything about me/publish directly.
Make them world-readable. By default S3 is, correctly, private; s3cmd setacl --acl-public s3://avi.thok.org/me/publish/europython/day2/kicx1440.avi makes that single file public. At this point, there's a long convoluted url that will fetch this file, and you could stop here and just change the html that points to it, but let's handle this cleanly...
Edit your DNS zone and add avi IN CNAME s3.amazonaws.com. Carlton Bale gets credit for having the first google hit that actually said this would work. Once you've pushed this through, curl -L -v -I http://avi.thok.org/me/publish/europython/day2/kicx1440.avi works - note carefully, the -I gets curl to do a HEAD (-H was already taken?) so you get back headers, not 100m of video. You should see the Location header taking you over to S3, and then a convincing ETag (md5sum of the file, in this particular case) and Content-Length.
Edit your apache config and add RewriteRule ^/(me/.*\.avi)$ http://avi.thok.org/$1 [R,L] To pick this apart:
RewriteRule is the apache swiss-army-knife of URL mangling.
The first bit is a regular expression that matches the entire "path" (^ for start, $ for end) and grabs everything after the leading slash (thus the slash is outside the grouping parentheses.) Within this part of the path, it has to start with me/ and end with .avi but can have anything at all in between; if we wanted literally all AVI files, we'd drop the me/ part, but I have some small ones elsewhere on the site that I didn't want to bother hunting down and uploading.
The second bit is the new URL - avi.thok.org to point to the CNAME we set up above, $1 is the first set of parentheses in the match (so, me/xxx.avi.)
Finally, the last bit is what to do with this big of hatchet work; R says to make it a redirect (and because our result starts with http it automatically becomes an "external" redirect, in this case a 302, ie. "don't try to fetch this url, just tell the client to go away and find it themselves." You can't get theyah from heah, but you can get there from over there... the L is for "last" and just says to stop trying and don't do any more rewriting on this particular result.
Don't forget to actually /etc/init.d/apache2 reload or however your system spells that. At this point, you can curl -L -v -I http://www.thok.org/me/publish/europython/day2/kicx1440.avi (note that we're actually starting with the primary domain here, where the original problem started) and follow our HTTP/1.1 302 Found and then amazon's HTTP/1.1 307 Temporary Redirect and the bandwidth problem (remember the bandwidth problem? This song's about a bandwidth problem) is now gone.

Future refinements:

use [R=307] and make the first hop a Temporary Redirect as well. Not sure if that's correct, yet, but given that this all started with a search engine bot that wasn't aware of the human-readable "slow (home)" and "fast (MIT)" alternate links, it's worth looking into.
if thok.org were more of a CMS, automatically noticing avi files and pushing them to amazon would be a good transparent trick. For a total of five files on a home website? Not actually worth the trouble, even if the logs say I have at least a month before the bot comes around again :-)
actually process the logs by object size and see if anything else should get this treatment; in practice, these files got noticed so they're the right starting point, and this isn't anything you'd mistake for a major site.