Comments on: Hadoop, EC2, S3, and me

By: Etienne Posthumus

Etienne Posthumus — Tue, 25 Mar 2008 09:26:20 +0000

Just another plug for boto, it is finely crafted. Especially if you want to start transferring large files that can’t be .read() and sent via HTTP in a single go. It does much more graceful handling of things than the sample S3 code from AWS.

And it is worthwhile making your own images: you can have a bunch of favorite packages pre-installed and configured, which can save you time.

By: m0j0

m0j0 — Sat, 22 Mar 2008 19:44:35 +0000

Hi Arthur,

Yes, I’ve seen boto, and played with it briefly. To be honest, while I guess I’m a python geek, as a general rule I’m really more interested in getting work done than supporting a particular language (or technology in general). I personally found boto to be really awkward to use.

I was introduced to it here: http://jimmyg.org/2007/09/01/amazon-ec2-for-people-who-prefer-debian-and-python-over-fedora-and-java/

That article is a fantastic introduction to ec2 in general, but in following the boto bits, I found it to be a rather incomplete tool if your goal is to quickly fire off administrative tasks. It can probably be a good foundation for putting together a suite of python-based tools to mimic Amazon’s own ec2-* tools, but those seem to work well enough.

Regardless, I appreciate you bringing it up again, because I *had* actually missed somehow that it supports moving files from s3 to ec2. Thanks for the reminder!

By: Arthur

Arthur — Sat, 22 Mar 2008 18:03:57 +0000

I take it you've seen boto? For a python geek that's probably the easiest way to get stuff from s3 onto ec2. I've taken a copy of Dug Song's s3tools and adapted s3ftp.py to use boto, so I tend to just use that (or a variant) to get what I need if I'm doing ad-hoc command line stuff.

By: Justin

Justin — Fri, 21 Mar 2008 17:57:36 +0000

hmm, I am re-reading the hadoop docs on this:

http://hadoop.apache.org/core/docs/r0.15.2/streaming.html#Large+files+and+archives+in+Hadoop+Streami

It says:

“the mapper collects the line oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) will be the value.”

and the same for reducing. This seems to imply that it does sort and merge the data. If it does, then something like this should work for the reducer:

#!/usr/bin/env python
import sys
import itertools
from operator import itemgetter

def get_file(f):
    # Yield each input line split into [word, count] fields.
    for line in f:
        yield line.split()

def main():
    data = get_file(sys.stdin)
    # groupby only merges adjacent rows, so this relies on sorted input.
    for word, counts in itertools.groupby(data, itemgetter(0)):
        tot = sum(int(x[1]) for x in counts)
        print("%s\t%d" % (word, tot))

if __name__ == "__main__":
    main()

Notice how this is much more along the lines of the Java code that does the same thing. The only difference is that with the Java code, mapreduce does the “groupby” for you.
WordPress will probably butcher it; if it does, I’ll just make a blog post with it 🙂
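
For anyone who wants to try the idea without a cluster, here is a minimal local sketch of the same word count. It assumes, as the quoted docs seem to imply, that the reducer sees its input sorted by key. The names map_words, reduce_counts and run_locally are hypothetical, and the plain sorted() call is only a stand-in for whatever sorting the framework does between the map and reduce steps.

```python
import itertools
from operator import itemgetter


def map_words(lines):
    """Mapper sketch: emit one "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_counts(sorted_lines):
    """Reducer sketch: sum the counts for each run of identical keys."""
    # Split at the first tab only: key before it, value after it.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in itertools.groupby(pairs, itemgetter(0)):
        yield "%s\t%d" % (word, sum(int(value) for _, value in group))


def run_locally(lines):
    """Assumed stand-in for the framework: sort mapper output, then reduce."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    for out in run_locally(["the cat", "the dog"]):
        print(out)
```

If the sort is skipped, groupby only merges adjacent rows, so a word that appears in two separate places comes out as two separate totals. That is why the reducer depends on sorted input.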

By: m0j0

m0j0 — Fri, 21 Mar 2008 15:15:58 +0000

Thanks, Justin.

The point of the article, for me anyway, was that you didn’t have to code in Java to use Hadoop. For me, that’s the big win, because having coded in Java in the past, I will avoid it at all costs, even if it means implementing something like Torque and Maui instead of Hadoop to get my work done.

That said, your comment is still spot on for those Pythonistas who might’ve been taking that simplistic example as a “best practices” document for how to code the actual mapreduce process.

So, Justin, can you provide some Python code that would improve upon that example? Does anyone?