Simple S3 Log Archival

UPDATE: if anyone knows of a non-broken syntax highlighting plugin for wordpress that supports bash or some other shell syntax, let me know :-/

Apache logs, database backups, and the like can get large on busy web sites. If you rotate logs or perform backups regularly, they also get numerous, and as we all know, large * numerous = expensive, or rapidly filling disk partitions, or both.

Amazon’s S3 service, along with a simple downloadable suite of tools, and a shell script or two can ease your life considerably. Here’s one way to do it:

  1. Get an Amazon Web Services account by going to the AWS website.
  2. Download the ‘aws’ command line tool from here and install it.
  3. Write a couple of shell scripts, and schedule them using cron.

Once you have your Amazon account, you’ll be able to get an access key and secret key. You can copy these to a file and aws will use them to authenticate operations against S3. The aws utility’s web site (in #2 above) has good documentation on how to get set up in a flash.
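
As a quick illustration, the setup boils down to something like the snippet below. The ~/.awssecret location and the two-line format are what the aws tool's documentation described when I set this up; double-check against the docs for whatever version you download:

# Store the credentials where the 'aws' tool can find them.
# Line 1 is the Access Key ID, line 2 is the Secret Access Key.
cat > ~/.awssecret <<'EOF'
YOUR_ACCESS_KEY_ID
YOUR_SECRET_ACCESS_KEY
EOF
chmod 600 ~/.awssecret   # keep the secret readable only by you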

With items 1 and 2 out of the way, you’re just left with writing a shell script (or two) and scheduling them via cron. Here are some simple example scripts I used to get started (you can add more complex/site-specific stuff once you know it’s working).

The first one is just a simple log compression script that gzips the log files and moves them out of the directory where the active log files live. It has nothing to do with Amazon Web Services, so you can use it on its own if you want:

#!/bin/bash

# Compress rotated web logs older than a day and move them to the archive directory.
LOGDIR='/mnt/fs/logs/httplogs'
ARCHIVE='/mnt/fs/logs/httplogs/archive'

if cd "$LOGDIR"; then
    # Only touch rotated logs (*_log.*) that haven't been modified in over a day.
    for i in `find . -maxdepth 1 -name "*_log.*" -mtime +1`; do
        gzip "$i"
    done

    mv "$LOGDIR"/*.gz "$ARCHIVE"/.
else
    echo "Failed to cd to log directory"
fi

Before launching this in any kind of production environment, you might want to add a few more features, like checking that the archive partition has enough free space before moving things onto it, but this is a decent start.
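
For instance, a bare-bones free-space check (reusing the $ARCHIVE variable from the script above; the 10GB threshold is an arbitrary placeholder, tune it for your setup) might look roughly like this:

# Minimal sketch of a free-space check before moving archives.
# 'df -P' prints available space in 1K blocks; bail out if we're under ~10GB.
MIN_FREE_KB=$((10 * 1024 * 1024))
FREE_KB=`df -P "$ARCHIVE" | awk 'NR==2 {print $4}'`
if [ "$FREE_KB" -lt "$MIN_FREE_KB" ]; then
    echo "Not enough free space on the archive partition; skipping move"
    exit 1
fi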

The second one is a wrapper around the aws ‘s3put’ command, and it moves stuff from the archive location to S3. It checks a return code, and then if things went ok, it deletes the local gzip files.

#!/bin/bash

# Push the compressed logs up to S3, then remove the local copies on success.
cd /mnt/fs/logs/httplogs/archive || exit 1

for i in *.gz; do
    [ -e "$i" ] || continue    # skip the literal '*.gz' when there's nothing to upload
    s3put addthis-logs/ "$i"
    if [ $? -eq 0 ]; then
        echo "Moved $i to s3"
        rm -f "$i"
    else
        echo "Failed to move $i to s3... Continuing"
    fi
done

I wish there were a way in aws to check for the existence of an object in a bucket without it trying to cat the file to stdout, but I don’t think there is. That would be a more reliable check than just looking at the return code. I’ll work on that at some point.
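
If it turns out the tool’s listing command (s3ls / ‘aws ls’) accepts a bucket/prefix argument, a post-upload check might look roughly like the sketch below. To be clear, that listing syntax is an assumption on my part, not something I’ve verified, so check the aws docs before relying on it:

# Hypothetical post-upload check; assumes 's3ls addthis-logs/<key>' prints
# matching key names. Verify the syntax against the aws documentation first.
if s3ls "addthis-logs/$i" | grep -q "$i"; then
    echo "Verified $i in the bucket; removing local copy"
    rm -f "$i"
else
    echo "Could not verify $i in the bucket; keeping local copy"
fi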

Scheduling all of this in cron is left as an exercise for the reader. I purposely split the work into two scripts so I could run the compression script every day but the archival script only once a week or so. You could also write a third script that checks the disk usage on your log partition and runs either or both of the other scripts if it climbs too high.
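
For what it’s worth, the crontab entries could look something like this; the script names and log path are placeholders, so substitute wherever you actually keep yours:

# m h dom mon dow  command
# Compress yesterday's logs every night at 1am; push to S3 on Sundays at 3am.
0 1 * * * /usr/local/bin/compress_logs.sh >> /var/log/log_archive.log 2>&1
0 3 * * 0 /usr/local/bin/archive_logs_to_s3.sh >> /var/log/log_archive.log 2>&1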

I used ‘aws’ because it was the first tool I found, by the way. I have only recently found ‘boto’, a Python-based utility that looks like it’s probably the equivalent of the Perl-based ‘aws’. I’m happy to have found that and look forward to giving it a shot!

  • http://standalone-sysadmin.blogspot.com Matt Simmons

    This is interesting.

    What sort of things do you use AWS for? Are you using S3 and EC2?

  • m0j0

    Hi Matt,

    I use S3 for log archival, and soon I’ll be moving database backups (old ones) to S3 as well. Other things might migrate there over time as I become better acquainted with the service. I’m still not confident enough to host anything “live” there, but I know people are doing that.

    I started out by diving head first into Hadoop, S3, EC2… the works. What I found is that it requires you to really immerse yourself in the ways of AWS. It’s a great service, but there’s a lot of stuff that isn’t done – mostly in the area of administrative tools. I also had a couple of conceptual problems relating to failover/redundancy/architecting-for-failure within the EC2 service environment, and I had so many other things on my plate (still do, unfortunately) that this has been put on the back burner for the time being.

    Things are progressing rapidly in EC2-land. Soon we’ll have persistent storage, which actually solves a number of the other issues I had with EC2 (via hacks that would rely on persistent storage), and IP addresses that are at least somewhat predictable. People are also writing better tools to manage all of this stuff, and there’s more reading material about how to make running services in that environment a less sanity-depleting experience. :)

  • http://standalone-sysadmin.blogspot.com Matt Simmons

    Cool, thanks for the rundown on it!

    I’m going to keep an eye on this. We recently invested in 20 blades for our primary and secondary sites, and a 50k SAN, so I don’t think we’d use it this round, but in the future I can see where this would be very handy.

    How long have you been using it?

  • http://bzimmer.ziclix.com Brian Zimmer

    For the syntax highlighting, you might want to try Pygments, which claims to support bash. I wrote about my use of it here though I don’t use it as a plugin.

  • http://weblog.bluepenguin.us Paul Holbrook

    It’s a clever idea, but how cost effective is it? At .15/GB/month, 100 gig of S3 storage costs you $180 a year.

    I guess it depends on your alternatives. Certainly commodity hard drives are far cheaper, but enterprise level SAN/NAS storage is much more.