Hadoop, EC2, S3, and me

I’m playing with a lot of rather large data sets. I’ve just been informed recently that these data sets are child’s play, because I’ve only been exposed to the outermost layer of the onion. The amount of data I *will* have access to (a nice way of saying “I’ll be required to wrangle and munge”) is many times bigger. Someone read an article about how easy it is to get Hadoop up and running on Amazon’s EC2 service, and next thing you know, there’s an email saying “hey, we can move this data to S3, access it from EC2, run it through that cool Python code you’ve been working with, and distribute the processing through Hadoop! Yay! And it looks pretty straightforward! Get on that!”

Oh joyous day.

I’d like to ask that people who find success with Hadoop+EC2+S3 stop writing documentation that makes this procedure appear to be “straightforward”. It’s not.

One thing that *is* cool, for Python programmers, is that you actually don’t have to write Java to use Hadoop. You can write your map and reduce code in Python and use it just fine.
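
To give a flavor of it, here’s a minimal sketch of the mapper half of a simple word-count job for Hadoop Streaming (the reducer half turns up in the comments below). The word-count task and file layout are just for illustration, not my actual loghetti code; Streaming pipes input lines to the script on stdin and reads tab-separated key/value pairs back on stdout:

#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper (word count, for illustration only).
# Streaming feeds input lines on stdin; we emit tab-separated key/value pairs.
import sys

def main():
    for line in sys.stdin:
        for word in line.split():
            # key<TAB>value; Hadoop sorts by key before the reducer sees anything
            print "%s\t%d" % (word, 1)

if __name__ == "__main__":
    main()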

I’m not blaming Hadoop or EC2, really, because after a full day of banging my head on this I’m still not quite sure which one is at fault. I *did* read a forum post from someone who had a similar problem to the one I wound up with, and it turned out to be a bug in Amazon’s SOAP API, which the Amazon EC2 command line tools use. Things just don’t work when that happens. Tip 1: if you hit an issue, don’t assume the problem is that you’re just not getting something; bugs appear to be fairly common.

Ok, so tonight I decided “I’ll just skip the whole hadoop thing, and let’s see how loghetti runs on some bigger iron than my macbook pro”. I moved a test log to S3, fired up an EC2 instance, ssh’d right in, and there I am… no data in sight, and no obvious way to get at it. This surprised me, because I thought that S3 and EC2 were much more closely related. After all, Amazon Machine Images (used to fire up said instance) are stored on S3. So where’s my “s3-copy” command? Or better yet, why can’t I just *mount* an s3 volume without having to install a bunch of stuff?

This goes down as one of the most frustrating things I’ve ever had to set up. It kinda reminds me of the time I had to set up a beowulf cluster of about 85 nodes using donated, out-of-warranty PC hardware. I spent what seemed like months just trying to get the thing to boot. Once I got over the hump, it ran like a top, but it was a non-trivial hump.

As of now, it looks like I’ll probably need to build and register my own image. A good number of the available public images are older versions of Linux distros for some reason. Maybe people have orphaned them and moved on to greener pastures. Maybe they’re in production and nobody has seen a need to change them. Either way, I’ll register a clean install image with the stuff I need and trudge onward.

  • http://www.bouncybouncy.net Justin

    The python example for hadoop isn’t really mapreduce. It doesn’t manipulate the data the way mapreduce should, and won’t scale. The java implementation of that same example does it correctly…

    Basically, they shouldn’t be using a hash table to store the counts, they should be merging the sorted output together, then simply looping through and summing the counts for each word. Their example will fail to run as soon as the number of distinct words no longer fits in memory.

  • http://www.protocolostomy.com m0j0

    Thanks, Justin.

    The point of the article, for me anyway, was that you didn’t have to code in Java to use Hadoop. For me, that’s the big win, because having coded in Java in the past, I will avoid it at all costs, even if it means implementing something like Torque and Maui instead of Hadoop to get my work done.

    That said, your comment is still spot on for those Pythonistas who might’ve been taking that simplistic example as a “best practices” document for how to code the actual mapreduce process.

    So, Justin, can you provide some Python code that would improve upon that example? Does anyone?

  • http://www.bouncybouncy.net Justin

    hmm, I am re-reading the hadoop docs on this:

    http://hadoop.apache.org/core/docs/r0.15.2/streaming.html#Large+files+and+archives+in+Hadoop+Streami

    It says:

    “the mapper collects the line oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) will be the value.”

    and the same for reducing. This seems to imply that it does sort and merge the data; if it does, then something like this should work for the reducer:

    #!/usr/bin/env python
    import sys
    import itertools
    from operator import itemgetter

    def get_file(f):
        # yield (word, count) pairs split out of each tab-separated input line
        for line in f:
            yield line.split()

    def main():
        data = get_file(sys.stdin)
        # groupby works here because streaming hands the reducer its input sorted by key
        for word, counts in itertools.groupby(data, itemgetter(0)):
            tot = sum(int(x[1]) for x in counts)
            print "%s\t%d" % (word, tot)

    if __name__ == "__main__":
        main()

    Notice how this is much more along the lines of the java code that does the same thing. The only difference is that with the java code, mapreduce does the “groupby” for you.
    Wordpress will probably butcher it; if it does, I’ll just make a blog post with it :)

  • Arthur

    I take it you’ve seen boto?

    For a python geek that’s probably the easiest way to get stuff from s3 onto ec2. I’ve taken a copy of Dug Song’s s3tools and adapted s3ftp.py to use boto, so I tend to just use that (or a variant) to get what I need if I’m doing ad-hoc command line stuff.

  • http://www.protocolostomy.com m0j0

    Hi Arthur,

    Yes, I’ve seen boto, and played with it briefly. To be honest, while I guess I’m a python geek, as a general rule I’m really more interested in getting work done than supporting a particular language (or technology in general). I personally found boto to be really awkward to use.

    I was introduced to it here –> http://jimmyg.org/2007/09/01/amazon-ec2-for-people-who-prefer-debian-and-python-over-fedora-and-java/

    That article is a fantastic introduction to ec2 in general, but in following the boto bits, I found it to be a rather incomplete tool if your goal is to quickly fire off administrative tasks. It can probably be a good foundation for putting together a suite of python-based tools to mimic Amazon’s own ec2-* tools, but those seem to work well enough.

    Regardless, I appreciate you bringing it up again, because I *had* somehow missed that it supports moving files from s3 to ec2 (there’s a rough sketch of that below, after the comments). Thanks for the reminder!

  • http://pos.thum.us/ Etienne Posthumus

    Just another plug for boto: it is finely crafted, especially if you want to start transferring large files that can’t be .read() and sent via HTTP in a single go. It handles things much more gracefully than the sample S3 code from AWS.

    And it is worthwhile making your own images; you can have a bunch of favorite packages pre-installed and configured, which can save you time.
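
Since boto came up a few times in the comments, here’s a rough sketch of what pulling a log file from S3 down to an EC2 instance can look like with it. The bucket, key, and local file names here are made up, and you’d plug in your own AWS credentials; it leans on boto’s S3Connection, get_bucket, get_key, and get_contents_to_filename:

#!/usr/bin/env python
# Rough sketch: fetch an object from S3 onto local disk (e.g. on an EC2 instance) with boto.
# Bucket, key, and path names are placeholders; use your own AWS credentials.
from boto.s3.connection import S3Connection

def fetch_from_s3(access_key, secret_key, bucket_name, key_name, local_path):
    conn = S3Connection(access_key, secret_key)
    bucket = conn.get_bucket(bucket_name)
    key = bucket.get_key(key_name)
    # writes the object straight to disk rather than pulling it all into memory first
    key.get_contents_to_filename(local_path)

if __name__ == "__main__":
    fetch_from_s3("MY_ACCESS_KEY", "MY_SECRET_KEY",
                  "my-log-bucket", "access.log", "/mnt/access.log")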