Data munging with Vim and AWK

So, I had some data in a file. It was temporal data. It looked like this:

100 4/15 16:50
143 4/15 16:51
121 4/15 16:52
209 4/15 16:53
105 4/15 16:54
321 4/15 16:55
173 4/15 16:56
205 4/15 16:57
197 4/15 16:58
211 4/15 16:59

But I needed it to be in ISO 8601 format so I could plot it with Timeplot. The data represents hits per minute from an Apache log file. I also needed the time to show up in the first column and the hits in the second column. Here’s what I needed the data to look like:

2008-04-15T16:54 105
2008-04-15T16:55 321
2008-04-15T16:56 173
2008-04-15T16:57 205
2008-04-15T16:58 197
2008-04-15T16:59 211

Well, I knew that the dates I had in the file were from 2008, and all of the other bits are there, just in the wrong format. Here’s what I did to get things in the right format for Timeplot:

:%s/4\//2008-04-/g # search for “4/” and replace it with “2008-04-”

Now my data looks like this:

173 2008-04-15 16:56

But not all of the minutes were two digits for some reason (I don’t remember how I parsed the log to get into this state – it was hurried and… well… wrong). I had times that looked like “17:9”, so I had to zero-pad the minutes that were only single digits.

:%s/:\(.\)$/:0\1/g # find a “:” followed by a single character at the end of the line, and replace that with “:0” followed by whatever that character was.

So now my minutes look right.

144 2008-04-15 16:09

Now I needed to replace spaces between the date and time values with a “T” as per ISO 8601 rules for date and time representations in a single string:

:%s/\(-..\)\s/\1T/g # find a “-” followed by any two characters, followed by a space, and replace all of that with the “-” and those two characters, followed by a “T” – effectively swapping the space for a “T”.

That worked well.

213 2008-04-15T16:45

At this point I had everything knocked, but I forgot that some of my *hours* were also single digits :-/

:%s/T\(.\):/T0\1:/g # find a “T” followed by a single character and a “:”, and replace that with “T0”, the character, and the “:”.

There. That did it. Now I just need to comma-separate the values, which is simple after all of this nonsense:

:%s/ /,/g # c’mon, you get this one, right?

Great! Except that the datetime string needs to be the *first* column. Here’s where awk comes in handy:

cat hitspermin_bad.txt | awk -F, '{print $2,$1}' > hitspermin_good.txt

You’ll notice that, since I could see the data and know the source, I didn’t bother explicitly telling Vim to look for *numbers* – I just used “.” to say “find any character”. If I had less confidence in the data I would’ve used “\d” to make sure I had numeric digits there.
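For what it’s worth, the whole munge can also be done in one pass with awk alone. Here’s a sketch (assuming, as above, that the raw file is hitspermin_bad.txt, the year is 2008, and fields are hits, month/day, hour:minute):

```shell
# One-pass version of the whole transformation: reads "hits M/D H:M"
# lines and prints "YYYY-MM-DDTHH:MM hits", zero-padding as it goes.
awk '{
    split($2, d, "/")    # d[1] = month, d[2] = day
    split($3, t, ":")    # t[1] = hour,  t[2] = minute
    printf "2008-%02d-%02dT%02d:%02d %s\n", d[1], d[2], t[1], t[2], $1
}' hitspermin_bad.txt > hitspermin_good.txt
```

The `%02d` format takes care of both the single-digit minutes and the single-digit hours, so there’s no separate padding step.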

Of course, the better solution is to properly parse the log file in the first place, but the log file in this case was 25GB!! Of course I’ll go back and change my script (I used loghetti with a custom (read: flawed) output filter), and test it on smaller data, and eventually get it to be more reliable, but to get a quick Timeplot graph together, this was a fast, if iterative and somewhat annoying, way to go. It also gave me a chance to exercise my Vim search and replace skillz.

What Makes a Good Sysadmin?

I went to the June meeting of the LOPSA-NJ group tonight, where we held a roundtable discussion talking about traits that make a good system administrator. Here are some of the traits we came up with.

Adaptability

Organizations are not static. Nor are their IT departments. If you’re in a big company, being adaptable might mean keeping your job through cutbacks, because from a business perspective, ‘adaptable’ might mean ‘repurposable’. Adaptability also means you have enough of a technical background to apply prior knowledge and experience to new problems, new paradigms, new platforms, etc.

I once was on a client site (around 2000-2001), on the last day of a project wrap-up, and I was talking with a business manager and the company’s IT director about their Exchange issues. I suggested that, since they had several Solaris and Linux boxes, and weren’t using any fancy features of Exchange to lock them to it, they might consider using Sendmail instead. After answering a few questions and talking a bit more, they decided to take it under advisement. I was to return in two weeks for a follow-up on a different project, and we’d discuss it then.

When I returned, the two guys whose jobs were to maintain the Exchange servers were gone. Turns out, the IT director had replaced the Exchange servers with freshly ground Sendmail servers after hours one night, and the admins came in the next morning to find… no Exchange servers! Just an old Red Hat welcome screen on the server terminal!

The first admin left the same day. The second one tried to stick it out, but it didn’t go well, and he left as well. Of course, there are other issues at play in this scenario, but one of them is adaptability. Besides, I love that story.

Inquisitive

This is a trait that came out of discussions that revolved around things that could’ve gone better if the admin hadn’t assumed that what they were doing was the right way to do things. I think it boils down to being conscious of one’s own ignorance, and reaching out to resources or people that can help you be more confident that you’re taking the right approach.

The other half of this trait is being genuinely curious about technology, and having a passion for exploring technology. Keeping up with new tools and techniques is, imho, an essential part of being a good system administrator. Every different technology you learn about makes your knowledge of all the other ones that much deeper, and being inquisitive and poking at new tools is a part of a sysadmin’s “lifelong learning” process.

Pragmatism

While it’s great to know about the latest developments in the sysadmin space, there should be a clear separation between what’s “up and coming” and what is “production ready”. The fact that some “0.01 alpha” web server does something in a way that is theoretically better than Apache does not merit replacing the Apache servers that are running your company’s site. Telling management that it does will likely be a career-limiting move in the near term. If you like the technology, follow it. Contribute to it. Document it. Help to make it better in some way. Run your own site with it, and try to get your friends to try it out. Put yourself on the mailing lists where people go to get help or complain about it. In short, invest some of yourself in it. When/if it finally catches on, you’ll be one of very few experts around. This is exactly what happened to me with Linux ;-)

Grace Under Fire

Working well under pressure is essential, especially when all of the things you’ve done to help avoid pressure-cooker situations fail you. If you can’t keep a clear head when the world is crashing all around you, you’re more likely to make things worse than better. Perhaps this comes with age or seasoning, or maybe it’s a learned process – I don’t know which. I *do* know that I used to get really downright angry with myself when I couldn’t figure out a problem or when something I did caused hardship in some way, but as I gained experience, or age, or a greater level of introspection, my tendency to get worked up beyond a normal sense of urgency disappeared.

Personality

Not all sysadmins are particularly good at dealing with less technical folk. Heck, some sysadmins aren’t good at dealing with other technical folks… or anyone else! Some of them know it, but some of them *don’t*, and I’m not sure which is worse. Nobody, in any job role, wants to work with a grump, or a naysayer, or a zealot, or an arrogant bastard who thinks all users are idiots who don’t deserve to use their precious technology, and certainly are not worthy of their attention and support.

Yeah, BOFH is a great comic. But what makes me cringe sometimes is that I’ll read something there and think “someone out there has done this, and he has the same title as me”.

Time Management

Well, we’re a group of systems administrators who live in NJ, and Tom Limoncelli is a member. He didn’t make this month’s meeting, so we felt obligated to mention time management as an essential sysadmin skill. It’s not that we don’t genuinely believe it, it’s just that we’ve all seen all of Tom’s talks, multiple times, and we all know the gospel by now, so well that this doesn’t even need to be said. The surprising thing is how long it took for someone to break down and say it.

Time management is important. What’s just as important is how you deal with other people in a way that helps you manage your time, which Tom covers quite well in his book, but you don’t see in many book reviews. If you haven’t read it, and you’re not already on one of the various time management bandwagons, I highly recommend Time Management for System Administrators, as it does a great job of boiling down what’s in all of the other fluffy time management books out there, and putting all of that stuff in the context of what we do for a living.

Problem Solver

Most of the conversation over the course of the evening consisted of stories about people involved in interesting situations with technology. But really, the stories were about the people, not the technology. This attribute managed to squeeze a single technical requirement out of me, personally, and I think after some refining I narrowed that technical requirement down to something like “should be able to solve problems using code, in the absence of a more practical/available solution”. Of course, the “Pragmatism” requirement should enable a good sysadmin to decipher when code is the best route to take.

Outside of that, the generic ability to rationally think through a problem to get at its root and then solve it is invaluable. Not only that, it is essential. Required for success, even. While I think our jobs as sysadmins put this skill pretty high on the list of required skills, I don’t think that you necessarily have to have learned problem solving in that context, and most probably don’t. Good problem-solving skills are honed from childhood for some, and are learned in later years for others. Either way, the skill probably isn’t a result of being a sysadmin. Quite the contrary, good problem solvers might be attracted to system administration because the job is perhaps viewed as primarily a problem-solving (and maybe, thereby, heroic?) role.

There are entire shelves in book stores about how to become a better problem solver. They’re just not typically in the geek section of the book store, because (ta-da!) problem solving is a much more generic, high level topic. Not that someone couldn’t write a problem-solving book for sysadmins or something (hmmmm….).

O’Reilly OSCON… and Brew Fest!

I’m going to the O’Reilly Open Source Convention (OSCON) again this year. I went in 2006 as well, and had a blast, in addition to learning quite a bit, and meeting tons of people whom I’ve been acquainted with online for a long time. That was 2 years ago. Since then I’ve been acquainted with lots *more* people online, and I’m hoping I’ll meet at least some of them this year.

If you’re not going to OSCON, you’re not only missing out on a great technical conference that will leave you physically tired from all of the activity and at the same time unable to sleep from the ideas sparked by the day’s events, you’re also missing the Oregon Brewers Festival, which takes place just as OSCON is wrapping up.

I have a medium-sized home brewery that a buddy and I built from scratch. Over the years we’ve brewed and tasted all kinds of beer. But you can’t get all beers everywhere, so traveling is a good opportunity to taste wild and exotic beers, or just local beers you can’t get at home. It’s odd, but while you can get easy access to beers from Germany, Belgium, Poland, England, Scotland, and Ireland, you would be hard pressed to find a good number of great beers from the West Coast of the US on the East Coast of the US. And the West Coast has a lot happening, beer-wise.

Beer festivals are also where some brewers pull out all the stops. In ’06 I went with a buddy and actually had not one, but TWO different watermelon beers – a variety I had not even heard of until I showed up at the counter. One was pretty good, the other tasted like Watermelon Bubblicious, but the experience was fantastic. Ever have rock candy made from hops? Pretty good I tell you!

Anyway, I was thinking of getting a larger group together to attend this year’s Brew Fest, so if any geeks out there have an interest in beer, let me know. And if you DON’T have an interest in beer, you should DEFINITELY let me know. I’ve converted numerous friends and family members who said they didn’t like beer; now they’re familiar with styles they will actually go out and buy, unprovoked, voluntarily! Saying you don’t like beer is like saying you don’t like food. There are just too many kinds of beer to say you don’t like beer. Maybe you don’t like hops, in which case you might like hefeweizen, but have probably never heard of it. Maybe you don’t like really fizzy beer, in which case you might like various Belgian ales, a Barleywine, a porter, or any beer with a less fizzy, more creamy, or less prevalent head on it.

Anyway, I’m going, and it’s fun. If you have an interest, do join in, whether you go with a group I put together or not!

Notes on Book Shopping from a Tech Bibliophile

Hi. My name is Brian, and I’m a tech bibliophile.

I have owned more books covering more technologies than I care to admit. Some of my more technical friends have stood in awe of the number of tech books I own. I am also constantly rotating old books that almost *can’t* be useful anymore out of my collection because there’s just no room to keep them all, and it would be an almost embarrassingly large collection if not for the fact that I have no shame or guilt associated with my need for dead trees.

If you need further proof:

  • I have, on more than one occasion, suggested to my wife that we take a walk around our local mall so I could browse the computer section of the book store, not to buy, but just to keep up with the new titles and stuff.
  • Ok, I usually buy.
  • I also go into book stores whenever I’m out of town to get a comparison of what seems to be popular in different areas of the state/country/world.
  • I just got a head rush because I just remembered that, since I’m attending OSCON in Portland, OR this year, I’ll get to go back to Powell’s, which, mind you, has a huge, city-block-sized store, which is very nice, but *also* has an entire store dedicated to geekery that rivals anything like it that I’ve ever seen, and contains a computer museum! (You can see some shots of it in my flickr set from OSCON ’06)
  • I once owned a book about VBScript.

I have also co-authored a book for O’Reilly, and in addition to my day job (I’m the director of IT for AddThis.com), I also work for a publisher, MTA, the publisher of Python Magazine, php|architect, as well as a line of books. Oh yeah. I’m into it. It’s bad.

I’ve learned quite a bit about buying books, and some of that learning came from unexpected places. There’s even more that I don’t know, but at least now I know that I don’t know it, and can try to figure out more stuff :-)

So here are a few things to keep in mind when you need to buy a technical book, or one just tugs at your impulse buy strings.

Give Any New Version 6 Months Before Buying a Book About It

The first books about PHP 5 were dreadful. I never, ever return books to a book store, even if I don’t particularly care for them, but I returned a book about PHP 5 because the level of inadequacy was just insulting to me as a consumer. This was quite some time ago (when PHP 5 books first hit the shelves), and thinking about it now I’m still amazed at how terrible that book was. Of course, PHP 5 is just one example. Way, way back in the day (1998-9 or so?) when the first books about Java 2 hit the shelves (some might remember that booksellers actually put stickers over the part of the title that said “1.2” when it was renamed “2”), I had the same experience.

It’s not exclusive to languages either. When the first MySQL books came out that said “covers MySQL 5”, they just barely covered MySQL 5. In fact, there’s a new edition of High Performance MySQL coming out that is *going* to say “covers MySQL 5.1” on it, and it’s not really going to cover much about 5.1, so says one of the book’s authors (whose honesty I greatly appreciate, by the way – I’d love to see that from the various book publishers).

At the OS level, I’m mostly a Linux guy, and at this point I wouldn’t take a book about a specific version of a specific distribution of Linux if you paid me to take it. Those books are mostly rehashes of the last version of the book put together as marketing objects. I know, because when the “<distro><version> Bible” series first came out, I read them (I think RedHat was the only distro covered initially), and I followed up with later versions of the books, and was always disappointed. Nowadays, I don’t know how you can think that a book about something as fast moving as Fedora Core is going to be useful. Maybe if you’re learning it for the first time something like this can work out, but if you’re looking to exploit new features, you’re really better off just reading the release notes and changelog.

Lesson learned. Books take time to write, to edit, to format, to print, to distribute, and to get on the shelves. Keep that in mind when you see a book about Python 3000 on the shelves within days of a GA release of Python 3000. It’s likely that that book was completely written and in an editor’s hands 3 months ago, and writing for that project began probably 9 months ago… 9 months before Python 3000 was a reality in this example. Some changes can be accounted for during the writing process, but a book that is released 6 months after the release of a new technology is likely to be built on more solid ground (of course, this is only part of assessing the quality of the book – but I suspect it’s often overlooked).

I’d also like to note that this probably wasn’t the case quite so much in the days when, for each language or technology or application or whatever, there were far fewer titles in print on the topic, and an authoritative title was more easily identified. Nowadays, the number of books about Ruby is dizzying to witness on the shelves of your local retailer. I just don’t think there was a market to support that kind of sensationalistic publishing model back when, say, C++ hit the scene. Maybe I’m mistaken there and some more… distinguished folks can enlighten all of us.

Take reviews with several grains of salt

Book reviews are lame, unless you know the source. When I say “know”, I don’t mean “have heard of”. I mean “know” in the sense that you have some idea what this reviewer is working with on a day-to-day basis, you know what their leanings are within the technological landscape, and you recognize that person as an authority on some topic at least loosely related to the book being reviewed.

I wouldn’t put much faith in the reviews on Amazon unless it is an established title that’s in its second edition. First edition books that all of a sudden have 20 reviews on Amazon within the first week of the release are probably reviews done by other authors who work for the same publisher, or who have some other motivation for writing the review.

You can learn to identify lame reviews or astroturfing on sight (now that you’re aware of it, it’s not all that hard to recognize), so be on the lookout. If you can, google the reviewer by name. Some of those folks work for the same publisher, and their reviews should likely just be discarded. I hate astroturfing, but I guess the publishers feel like they have to do it to compete with everyone else who is doing it and creating buzz around their titles. Sad.

By the way, astroturfing in this context means sending everyone you know (and/or everyone who works for you, or wants to) out to write reviews, talk up the book, link to the book’s web site or the author’s blog (where his book is probably displayed prominently), run ads on their blogs, or mention the book on irc, digg, del.icio.us, slashdot, etc. If you get enough people to do this, it gives the impression that there’s a lot of buzz and “grass roots” enthusiasm around the book. Except the “grass” is fake. Hence “astroturfing”. This is the kind of thing that Digg fights against all the time. Mostly unsuccessfully. It goes like this:

  1. Big Media Inc publishes article on big web site.
  2. Big Media emails all editors, writers, bloggers, designers, etc., to go blog, talk, post, link, submit the article everywhere they can.
  3. Some of these people have multiple accounts on each service you can possibly submit links to, some have multiple blogs, they link to other peoples’ blogs who are also talking up the article… you get the idea.
  4. Big Media’s article is read by thousands, for no really good reason other than they happen to be good astroturfers.

…But I digress. Just take reviews with a grain of salt. Same goes for big numbers on Digg and other like services.

Look for “Timeless Tomes”

The K&R book on C is a timeless tome. The GoF book on Design Patterns is a timeless tome. Stevens on TCP/IP is a timeless tome. C.J. Date’s early Intro to Database Systems is a timeless tome. These books came out a shockingly long time ago considering how often they are referenced and recommended and handed down through generations of technologists. If you need a solid foundation in some technology like this, you should look for books on the topic that have stood the test of time.

However, time isn’t always your friend, and some of these tomes are enormous. That’s why there are books like “Learn Java in 24 Hours”. If you go after this type of book, fine. I have tons of books like this. Just know that going through it does NOT mean you “know” Java. See here for details.

Timeless tomes seem harder to find now that there are stores with 150,000 titles in stock. They get lost in the noise. They’re out there, though. I have a built-in Amazon storefront on LinuxLaboratory.org that I try to keep updated with books I have read and found genuinely useful. I’m a little behind on that, but the books there are a mix of huge tomes (Understanding and Deploying LDAP is enormous), and useful reference or “contextual” books that explain how to use a technology in a particular context (Perl for System Administration, for example, is a good book). The next book I need to add there is “The Art of SQL”, which completely rocks and I highly recommend if you *already know* SQL.

Look at the Copyright Date

Technology moves at breakneck speed. Some books that are still on the shelves that say “PHP and MySQL” cover versions that aren’t even supported anymore. Oracle 8i books are still around. Some books about Apache only make passing references to Apache 2. It would take some time to sit around flipping through pages to figure out if the version you need information about is covered. If you have some familiarity with the subject, checking the copyright date is a quick reference that can let you know if this book is the one you need. It can also help you avoid the dreaded “written before the technology was GA” problem mentioned above. If you know that FooLang 24 came out in February of 2008, the book in your hands that says “FooLang 24” on the cover should not have a 2007 copyright, ideally.

Be Wary of Growth in Second Editions

First: there are “Volumes” and there are “Editions”. A second volume is a completely different book from the first volume. A second *edition* is an updating of the first edition. It’s the same basic material. Or… that’s how it used to be. Nowadays, marketing sometimes dictates that new editions should include whole new sections about new and exciting buzzwords of the day and stuff like that. Have you seen the most recent edition of “Programming Python”? It’s probably the thickest technology book I own, even beating out Understanding and Deploying LDAP Directories. I have no idea if anything in there was put upon the author by O’Reilly – and I’m not making accusations (I’ve worked for O’Reilly and have no reason to believe they’re guilty of this practice) – I’m just saying that the first edition was probably around half the size of the second.

For what it’s worth, I own the latest edition of Programming Python, and am not sorry I bought it. In my editing work for Python Magazine, I came across code that used seemingly every conceivable Python module, and I had to be able to quickly reference and read up on stuff that was in unfamiliar territory. Of course, we have tech editors (who rock, by the way), but I still needed to make sure the text was explaining things in a way that made sense and didn’t contradict the code (or vice versa). That book covers a ton of stuff, and I was glad to have it.

I’ve worked with a good number of publishers, and I have definitely been encouraged to make mention of different things I had no interest in writing about, because it was good for Google rankings, or blog buzz, or tag clouds, or whatever. I have friends in tech publishing circles (and tech authors) who have confirmed that this *does* happen.

Understand that publishers, no matter how granola they look, run businesses, and businesses need to grow and make money, which is an enormously difficult feat to pull off in publishing. Eventually, they hire marketing people, and priorities can conflict, and bad things can happen. This is not a diatribe against the publishers. It’s a guide for the reader and technical bibliophiles.

My $.02.

As usual, the more information the better, so share your thoughts!!

Simple S3 Log Archival

UPDATE: if anyone knows of a non-broken syntax highlighting plugin for wordpress that supports bash or some other shell syntax, let me know :-/

Apache logs, database backups, etc., on busy web sites, can get large. If you rotate logs or perform backups regularly, they can get large and numerous, and as we all know, large * numerous = expensive, or rapidly filling disk partitions, or both.

Amazon’s S3 service, along with a simple downloadable suite of tools, and a shell script or two can ease your life considerably. Here’s one way to do it:

  1. Get an Amazon Web Services account by going to the AWS website.
  2. Download the ‘aws’ command line tool from here and install it.
  3. Write a couple of shell scripts, and schedule them using cron.

Once you have your Amazon account, you’ll be able to get an access key and secret key. You can copy these to a file and aws will use them to authenticate operations against S3. The aws utility’s web site (in #2 above) has good documentation on how to get set up in a flash.
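If memory serves (do check the aws tool’s docs to confirm – the file layout here is an assumption on my part), the tool reads your credentials from ~/.awssecret: access key ID on the first line, secret key on the second. So setup looks something like this, with obviously fake keys:

```shell
# Put your AWS credentials where the 'aws' tool expects them.
# Assumed layout: line 1 = access key ID, line 2 = secret key.
# The keys below are placeholders - substitute your real ones.
cat > ~/.awssecret <<'EOF'
AKIAIOSFODNN7EXAMPLE
wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
EOF
chmod 600 ~/.awssecret
```

The chmod matters: these keys are as good as a password for your S3 buckets.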

With items 1 and 2 out of the way, you’re just left with writing a shell script (or two) and scheduling them via cron. Here are some simple example scripts I used to get started (you can add more complex/site-specific stuff once you know it’s working).

The first one is just a simple log compression script that gzips the log files and moves them out of the directory where the active log files are. It has nothing to do with Amazon web services. You can use it on its own if you want:

#!/bin/bash

LOGDIR='/mnt/fs/logs/httplogs'
ARCHIVE='/mnt/fs/logs/httplogs/archive'

cd $LOGDIR
if [ $? -eq 0 ]; then
    for i in `find . -maxdepth 1 -name "*_log.*" -mtime +1`; do
        gzip "$i"
    done

    mv $LOGDIR/*.gz $ARCHIVE/.
else
    echo "Failed to cd to log directory"
fi

Before launching this in any kind of production environment, you might want to add some more features, like checking to make sure the archive partition has enough space before trying to copy things to it and stuff like that, but this is a decent start.
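For example, a pre-flight space check might look something like this. This is a sketch, not production code – the paths match the script above, and `du -ck`/`df -Pk` are POSIX-ish but worth verifying on your platform:

```shell
#!/bin/bash
# Bail out before archiving if the archive partition can't hold the
# gzipped logs we're about to move into it.

LOGDIR='/mnt/fs/logs/httplogs'
ARCHIVE='/mnt/fs/logs/httplogs/archive'

# Kilobytes the compressed logs will occupy
needed=`du -ck $LOGDIR/*.gz 2>/dev/null | awk 'END {print $1}'`
# Kilobytes free on the archive partition (-P forces POSIX output)
avail=`df -Pk $ARCHIVE | awk 'NR==2 {print $4}'`

if [ "${needed:-0}" -gt "${avail:-0}" ]; then
    echo "Not enough space in $ARCHIVE (need ${needed}K, have ${avail}K)"
    exit 1
fi
```

Run this at the top of the compression script (or source it from there) and the script dies cleanly instead of filling the partition halfway through.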

The second one is a wrapper around the aws ‘s3put’ command, and it moves stuff from the archive location to S3. It checks a return code, and then if things went ok, it deletes the local gzip files.

#!/bin/bash

cd /mnt/fs/logs/httplogs/archive || exit 1
for i in `ls *.gz`; do
    s3put addthis-logs/ $i
    if [ $? -eq 0 ]; then
        echo "Moved $i to s3"
        rm -f $i
        continue
    else
        echo "Failed to move $i to s3... Continuing"
    fi
done

I wish there was a way in aws to check for the existence of an object in a bucket without it trying to cat the file to stdout, but I don’t think there is. This would be a more reliable check than just checking the return code. I’ll work on that at some point.

Scheduling all of this in cron is an exercise for the user. I purposely created two scripts to do this work, so I could run the compression script every day, but the archival script once every week or something. You could also write a third script that checks your disk space in your log partition and runs either or both of these other scripts if it gets too high.
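For the record, the crontab entries for a schedule like the one described above might look something like this (the script paths and times are made up – adjust to taste):

```
# m h dom mon dow  command
# Compress yesterday's logs every night at 00:30
30 0 * * *    /usr/local/bin/compress_logs.sh
# Push the compressed archive up to S3 early Sunday morning
45 3 * * 0    /usr/local/bin/archive_logs.sh
```

The fifth field (day of week, 0 = Sunday) is what gets you the “compress daily, archive weekly” split from two simple scripts.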

I used ‘aws’ because it was the first tool I found, by the way. I have only recently found ‘boto’, a Python-based utility that looks like it’s probably the equivalent of the Perl-based ‘aws’. I’m happy to have found that and look forward to giving it a shot!

Funny what you learn about yourself when you buy an iPhone

Not ripping off xkcd – this is seriously the best graphic I’ve ever generated.

This is *not* a ripoff of xkcd (though I read that regularly, and so should you) – this is seriously the best graphic I can come up with, and it does the job. Yesterday I looked at doing all kinds of stuff to my iPhone. I wanted to see if I could get Python and a full-fledged Django installation on my iPhone and build the first web 2.0 application created completely from the bathroom. Just kidding, but I wanted to do some pretty evil stuff. Turns out that, as of now, there isn’t a clean, simple way to get a lot of stuff to work without hacking the iPhone in some way, or resorting to things that are completely ridiculous. I’m sorry, but there has to be an easier way to get SSH and a terminal on there. It’s just not that critical right now, and this thing was damned expensive.

I decided to wait it out and see what comes our way over the summer.

Couple of Python Design Pattern Links

I’m a little relieved to learn that I know slightly more about design patterns than I initially thought. Still, there’s tons to learn, and I’ve been checking the O’Reilly “Upcoming Titles” list to see if “Design Patterns in Python” ever shows up, but I’m always disappointed (if you know of any such upcoming book, let me know!).

In the meantime, I was able to find a couple of decent links to resources covering DP in a Python context:

Alex Martelli gave a talk on Google Developer Day 2007, and the full, 45-minute video can be found here.

Ryan Ginstrom’s blog has a pretty gentle review of Python implementations of a few of the GoF originals. He doesn’t cover when each pattern is appropriate, but just reading the code and matching it up with a name is nice, because I have my own code that I can now talk to people about using that language.

Explosion at The Planet Causes 9000-server Outage

Here’s the email I received on Saturday from The Planet, where I have some dedicated servers hosted:

Dear Valued Customers:
This evening at 4:55 in our H1 data center, electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room. Thankfully, no one was injured. In addition, no customer servers were damaged or lost.
We have just been allowed into the building to physically inspect the damage. Early indications are that the short was in a high-volume wire conduit. We were not allowed to activate our backup generator plan based on instructions from the fire department.
This is a significant outage, impacting approximately 9,000 servers and 7,500 customers. All members of our support team are in, and all vendors who supply us with data center equipment are on site. Our initial assessment, although early, points to being able to have some service restored by mid-afternoon on Sunday. Rest assured we are working around the clock.
We are in the process of communicating with all affected customers. We are planning to post updates every hour via our forum and in our customer portal. Our interactive voice response system is updating customers as well.
There is no impact in any of our other five data centers.
I am sorry that this accident has occurred and apologize for the impact.

That’s pretty rough. Lucky for me, nothing I do is solely dependent on any of those machines. However, I think it’s probably pretty common for startups to rely heavily on what amount to single points of failure due to a lack of funds or manpower/in-house skills to set things up in a way that looks something like “best practices” from an architecture standpoint.

Building a startup on dedicated hosting is one thing. Running a production web site at a single site, perhaps with a single machine hosting each service (or even a few or *all* of the services) is something dangerously different. However, building a failover solution that spans outages at the hosting-provider level can also be quite difficult and perhaps expensive. You have to really come up with hard numbers that will help you gauge your downtime tolerance against your abilities and budget. While a restrictive budget might mean less automation and more humans involved in failover, it can be done.

What kinds of cross-provider failover solutions have saved your bacon in the past? I’m always looking for new techniques in this problem domain, so share your ideas and links!