Archive for the ‘Sysadmin’ Category

Python Date Manipulation

Tuesday, July 6th, 2010

This post is the result of some head-scratching and note taking I did for a reporting project I undertook recently. It’s not a complete rundown of Python date manipulation, but hopefully the post (and the comments) will help you, and maybe me too :)

The head-scratching is related to the fact that there are several different time-related objects spread out over a few different time-related modules in Python, and I have found plenty of instances where I needed to mix and match methods and objects from different modules to get what I needed, even for tasks that seemed pretty simple at first glance. Here are a few nits to get started with:

  • strftime/strptime can generate the “day of week” where Sunday is 0, but there’s no way to tell any of the conversion functions like gmtime() that you want your week to start on Sunday as far as I know. I’m happy to be wrong, so leave comments if I am. It seems odd that you can do a sort of conversion like this when you output, but not within the calculation logic.
  • If you have a struct_time object in localtime format and want to convert it to an epoch date, time.mktime() works, but if your struct_time object is in UTC format, you have to use calendar.timegm() — this is lame and needs to go away. Just add timegm() to the time module (possibly renamed?).
  • time.ctime() will convert an epoch date into nicely formatted local time, but there’s no single function that provides the equivalent output for UTC time (see the snippet just below this list).
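The closest you can get to a UTC ctime() is chaining asctime() and gmtime(), which at least live in the same module (values here match the session later in this post):

>>> import time
>>> etime = 1274384049
>>> time.ctime(etime)
'Thu May 20 15:34:09 2010'
>>> time.asctime(time.gmtime(etime))
'Thu May 20 19:34:09 2010'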

There are too many methods and modules for dealing with date manipulation in Python, such that performing fairly common tasks requires importing a few different modules and using different object types and methods from each. I’d love this to be cleaned up. I’d love it more if I were qualified to do it. More learning probably needs to happen for that. Anyway, just my $.02.

Mission 1: Calculating Week Start/End Dates Where Week Starts on Sunday

My mission: Pull epoch dates from a database. They were generated on a machine whose clock is set not to UTC but to local time (GMT-4). Given the epoch date, find the start and end of the previous week, where the first day of the week is Sunday and the last day of the week is Saturday.

So, I need to be able to get a week start/end range, from Sunday at 00:00 through Saturday at 23:59:59. My initial plan of attack was to calculate midnight of the current day, and then base my calculations for Sunday 00:00 on that, using simple timedelta(days=x) manipulations. Then I could do something like calculate the next Sunday and subtract a second to get Saturday at 23:59:59.

Nothing but ‘time’

In this iteration, I’ll try to accomplish my mission using only the ‘time’ module and some epoch math.

Seems like you should be able to easily get the epoch value for midnight of the current epoch date, and display it easily with time.ctime(). This isn’t quite true, however. See here:

>>> etime = int(time.time())
>>> time.ctime(etime)
'Thu May 20 15:26:40 2010'
>>> etime_midnight = etime - (etime % 86400)
>>> time.ctime(etime_midnight)
'Wed May 19 20:00:00 2010'
>>>

The reason this doesn’t do what you might expect is that time.ctime() outputs local time, which for me is UTC-4 (I live near NY, USA, and we’re currently in DST; the timezone is EDT now, and EST in winter). When you do math on the raw epoch timestamp (etime), you’re working with a bare integer that has no idea about time zones, so you have to account for the offset yourself. Let’s try again:

>>> etime = int(time.time())
>>> etime
1274384049
>>> etime_midnight = (etime - (etime % 86400)) + time.altzone
>>> time.ctime(etime_midnight)
'Thu May 20 00:00:00 2010'
>>>

So, why is this necessary? It might be clearer if we throw in a call to gmtime() and also make the math bits more transparent:

>>> etime
1274384049
>>> time.ctime(etime)
'Thu May 20 15:34:09 2010'
>>> etime % 86400
70449
>>> (etime % 86400) / 3600
19
>>> time.gmtime(etime)
time.struct_time(tm_year=2010, tm_mon=5, tm_mday=20, tm_hour=19, tm_min=34, tm_sec=9, tm_wday=3, tm_yday=140, tm_isdst=0)
>>> midnight = etime - (etime % 86400)
>>> time.gmtime(midnight)
time.struct_time(tm_year=2010, tm_mon=5, tm_mday=20, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=140, tm_isdst=0)
>>> time.ctime(midnight)
'Wed May 19 20:00:00 2010'
>>> time.altzone
14400
>>> time.altzone / 3600
4
>>> midnight = (etime - (etime % 86400)) + time.altzone
>>> time.gmtime(midnight)
time.struct_time(tm_year=2010, tm_mon=5, tm_mday=20, tm_hour=4, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=140, tm_isdst=0)
>>> time.ctime(midnight)
'Thu May 20 00:00:00 2010'
>>>

What’s that now? You want what? You want the epoch timestamp for the previous Sunday at midnight? Well, let’s see. The time module in Python doesn’t do deltas per se. You can calculate things out using the epoch bits and some math if you wish. The only bit that’s really missing is the day of the week our current epoch timestamp lives on.

>>> time.ctime(midnight)
'Thu May 20 00:00:00 2010'
>>> struct_midnight = time.localtime(midnight)
>>> struct_midnight
time.struct_time(tm_year=2010, tm_mon=5, tm_mday=20, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=140, tm_isdst=1)
>>> dow = struct_midnight.tm_wday
>>> dow
3
>>> midnight_sunday = midnight - ((dow + 1) * 86400)
>>> time.ctime(midnight_sunday)
'Sun May 16 00:00:00 2010'

You can do this going forward in time from the epoch time as well. Remember, we also want to grab 23:59:59 on the Saturday after the epoch timestamp you now have:

>>> saturday_night = midnight + (((5 - dow) + 1) * 86400) - 1
>>> time.ctime(saturday_night)
'Sat May 22 23:59:59 2010'
>>>
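For reference, here’s the whole dance rolled up into one function. It’s a sketch that mirrors the math above, except it picks between altzone and timezone based on whether DST is in effect for the timestamp, rather than hardcoding altzone:

import time

def week_range(etime):
    # Use the DST-aware UTC offset for this timestamp.
    offset = time.altzone if time.localtime(etime).tm_isdst else time.timezone
    # Local midnight on the day containing etime.
    midnight = (etime - (etime % 86400)) + offset
    # tm_wday: Monday is 0 ... Sunday is 6.
    dow = time.localtime(midnight).tm_wday
    midnight_sunday = midnight - ((dow + 1) * 86400)
    saturday_night = midnight + (((5 - dow) + 1) * 86400) - 1
    return (midnight_sunday, saturday_night)

# week_range(1274384049) == (1273982400, 1274587199)
# i.e. 'Sun May 16 00:00:00 2010' through 'Sat May 22 23:59:59 2010'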

And that’s how you do date manipulation using *only* the time module. Elegant, no?

No. Not really.

Unfortunately, the alternatives also aren’t the most elegant in the world, imho. So let’s try doing this all another way, using the datetime module and timedelta objects.

Now with datetime!

The documentation for the datetime module says:

“While date and time arithmetic is supported, the focus of the implementation is on efficient member extraction for output formatting and manipulation.”

Hm. Sounds a lot like what the time module functions do. Some conversion here or there, but no real arithmetic support. We had to pretty much do it ourselves mucking about with epoch integer values. So what’s this buy us over the time module?

Let’s try to do our original task using the datetime module. We’re going to start with an epoch timestamp, and calculate the values for the previous Sunday at midnight, and the following Saturday at 23:59:59.

The first thing I had a hard time finding was a way to deal with the notion of a “week”. I thought I’d found it in ‘date.timetuple()’, which help(date.timetuple) says is “compatible with time.localtime()”. I guess they must mean that the output is the same as time.localtime()’s, because I can’t find any other way in which it is similar. Running time.localtime() with no arguments returns a struct_time object for the current time. date.timetuple() requires arguments or it’ll throw an error, and to make you extra frustrated, the arguments it takes aren’t in the docs or the help() output.

So maybe they mean it takes the same arguments as time.localtime(), eh? Not so much — time.localtime() takes an int representing an epoch timestamp. Trying to feed an int to date.timetuple throws an error saying it requires a ‘date’ object.

So, the definition of “compatible” is a little unclear to me in this context.
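As best I can tell, “compatible” refers to the return value: timetuple() is an instance method that takes no arguments (the “argument” it wants is a date object as its implicit self), and what it returns is the same struct_time type that time.localtime() gives you:

>>> import datetime
>>> datetime.date(2010, 5, 20).timetuple()
time.struct_time(tm_year=2010, tm_mon=5, tm_mday=20, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=140, tm_isdst=-1)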

So here I’ve set about finding today, then “last saturday”, and then “the sunday before the last saturday”:

import datetime

def get_last_whole_week(today=None):
    # a date object
    date_today = today or datetime.date.today()

    # day 0 is Monday. Sunday is 6.
    dow_today = date_today.weekday()

    if dow_today == 6:
        days_ago_saturday = 1
    else:
        # If day between 0-5, to get last saturday, we need to go to day 0 (Monday), then two more days.
        days_ago_saturday = dow_today + 2

    # Make a timedelta object so we can do date arithmetic.
    delta_saturday = datetime.timedelta(days=days_ago_saturday)

    # saturday is now a date object representing last saturday
    saturday = date_today - delta_saturday

    # timedelta object representing '6 days'...
    delta_prevsunday = datetime.timedelta(days=6)

    # Making a date object. Subtract the days from saturday to get "the Sunday before that".
    prev_sunday = saturday - delta_prevsunday

    return (prev_sunday, saturday)

This gets me date objects representing the start and end time of my reporting range… sort of. I need them in epoch format, and I need to specifically start at midnight on Sunday and end on 23:59:59 on Saturday night. Sunday at midnight is no problem: timetuple() sets time elements to 0 anyway. For Saturday night, in epoch format, I should probably just calculate a date object for two Sundays a week apart, and subtract one second from one of them to get the last second of the previous Saturday.

Here’s the above function rewritten to return a tuple containing the start and end dates of the previous week. It can optionally be returned in epoch format, but the default is to return date objects.

import datetime
import time

def get_last_whole_week(today=None, epoch=False):
    # a date object
    date_today = today or datetime.date.today()
    print "date_today: ", date_today

    # By default day 0 is Monday. Sunday is 6.
    dow_today = date_today.weekday()
    print "dow_today: ", dow_today

    if dow_today == 6:
        days_ago_saturday = 1
    else:
        # If day between 0-5, to get last saturday, we need to go to day 0 (Monday), then two more days.
        days_ago_saturday = dow_today + 2
    print "days_ago_saturday: ", days_ago_saturday
    # Make a timedelta object so we can do date arithmetic.
    delta_saturday = datetime.timedelta(days=days_ago_saturday)
    print "delta_saturday: ", delta_saturday
    # saturday is now a date object representing last saturday
    saturday = date_today - delta_saturday
    print "saturday: ", saturday
    # timedelta object representing '6 days'...
    delta_prevsunday = datetime.timedelta(days=6)
    # Making a date object. Subtract the 6 days from saturday to get "the Sunday before that".
    prev_sunday = saturday - delta_prevsunday

    # we need to return a range starting with midnight on a Sunday, and ending w/ 23:59:59 on the
    # following Saturday... optionally in epoch format.

    if epoch:
        # saturday is date obj = 'midnight saturday'. We want the last second of the day, not the first.
        saturday_epoch = time.mktime(saturday.timetuple()) + 86399
        prev_sunday_epoch = time.mktime(prev_sunday.timetuple())
        last_week = (prev_sunday_epoch, saturday_epoch)
    else:
        saturday_str = saturday.strftime('%Y-%m-%d')
        prev_sunday_str = prev_sunday.strftime('%Y-%m-%d')
        last_week = (prev_sunday_str, saturday_str)
    return last_week
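A quick sanity check, feeding it the date from the earlier interactive session (the debug prints will fire too; I’ve left them out here):

import datetime

last_week = get_last_whole_week(datetime.date(2010, 5, 20))
# last_week == ('2010-05-09', '2010-05-15')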

It would be easier to just have some attribute for datetime objects that lets you set the first day of the week to be Sunday instead of Monday. It wouldn’t completely alleviate every conceivable issue with calculating dates, but it would be a help. The calendar module has a setfirstweekday() function that lets you set the first weekday to whatever you want. I gather this is mostly for formatting output of matrix calendars, but it would be useful if it could be used in date calculations as well. Perhaps I’ve missed something? Clues welcome.

Mission 2: Calculate the Prior Month’s Start and End Dates

This should be easy. What I hoped would happen is I’d be able to get today’s date, and then create a timedelta object for ‘1 month’, and subtract, having Python take care of things like changing the year when the current month is January. Calculating this yourself is a little messy: you can’t just use “30 days” or “31 days” as the length of a month, because:

  1. “January 31” – “30 days” = “January 1” — that’s the same month, not the previous one.
  2. “March 1” – “31 days” = “January 29” (in a non-leap year) — that skips the previous month entirely.

Instead, what I did was this:

  1. created a datetime object for the first day of the current month (hard coding the ‘day’ argument)
  2. used a timedelta object to subtract a day, which gives me a datetime object for the last day of the prior month (with year changed for me if needed),
  3. used that object to create a datetime object for the first day of the prior month (again hardcoding the ‘day’ argument)

Here’s some code:

import datetime

today = datetime.datetime.today()
first_day_current = datetime.datetime(today.year, today.month, 1)
last_day_previous = first_day_current - datetime.timedelta(days=1)
first_day_previous = datetime.datetime(last_day_previous.year, last_day_previous.month, 1)
print 'Today: ', today
print 'First day of this month: ', first_day_current
print 'Last day of last month: ', last_day_previous
print 'First day of last month: ', first_day_previous

This outputs:

Today:  2010-07-06 09:57:33.066446
First day of this month:  2010-07-01 00:00:00
Last day of last month:  2010-06-30 00:00:00
First day of last month:  2010-06-01 00:00:00

Not nearly as onerous as the week start/end range calculations, but I kind of thought that among all of these modules, one of them would be able to find me the start and end of the previous month. The raw material for creating this is, I suspect, buried somewhere in the source code for the calendar module, which can tell you the start and end dates for a month, but can’t do any date calculations to give you the previous month. The datetime module can do calculations, but it can’t tell you the start and end dates for a month. The datetime.timedelta object’s largest granularity is ‘week’ if memory serves, so you can’t just do ‘timedelta(months=1)’, because the deltas are all converted internally to a fixed number of days, seconds, or microseconds, and a month isn’t a fixed number of any of them.
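For what it’s worth, calendar.monthrange() will at least hand you the number of days in a given month, so a sketch of the same calculation leaning on it might look like this (I’m not claiming it’s better, just different):

import calendar
import datetime

today = datetime.date.today()

# Roll back one month, wrapping the year when the current month is January.
if today.month == 1:
    prev_year, prev_month = today.year - 1, 12
else:
    prev_year, prev_month = today.year, today.month - 1

first_day_previous = datetime.date(prev_year, prev_month, 1)
# monthrange() returns (weekday of day 1, number of days in the month).
last_day_previous = datetime.date(prev_year, prev_month,
                                  calendar.monthrange(prev_year, prev_month)[1])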

Converge!

While I could probably go ahead and use dateutil, which is really darn flexible, I’d rather be able to do this without a third-party module. And dateutil’s flexibility is not without its complexity. It’s not an insurmountable task to learn, but it’s not like you can directly transfer your experience with the built-in modules to using dateutil.

I don’t think merging all of the time-related modules in Python would be necessary or even desirable, really, but I haven’t thought deeply about it. Perhaps a single module could provide a superclass for the various time-related objects currently spread across three modules, and they could share some base level functionality. Hard to conceive of a timedelta object not floating alone in space in that context, but alas, I’m thinking out loud. Perhaps a dive into the code is in order.

What have you had trouble doing with dates and times in Python? What docs have I missed? What features are completely missing from Python in terms of time manipulation that would actually be useful enough to warrant inclusion in the collection of included batteries? Let me know your thoughts.

Brain Fried Over NoSQL

Saturday, June 26th, 2010

So, I’m working on a pet project. It’s in stealth mode. Just kidding — I don’t believe in stealth mode 😉

It’s a twitter analytics dashboard that actually does useful things with the mountains of data available from the various Twitter APIs. I’m writing it in Python using Tornado. I did the first mockup for it just two nights ago.

It’s already a lot of fun. I’ve worked with Tornado before and like it a lot. I have most of the base infrastructure questions answered, because this is a pet project and they’re mostly easy and in some sense “don’t matter”. But that’s what has me stuck.

It Doesn’t Matter

It’s true. Past a certain point, belaboring choices of what tools to use where is pointless and is probably premature optimization. I’ve been working with startups for the past few years, and I’m painfully aware of what happens when a company takes too long to react to their popularity. I want to architect around that at the start, but I’m resisting. It’s a pet project.

But if it doesn’t matter, that means I can choose tools that are going to be fun to dig into and learn about. I’ve been so busy writing code to help avoid or buffer impact to the database that I haven’t played a whole lot with the NoSQL choices out there, and there are tons of them. And they all have a different world view and a unique approach to providing solutions to what I see as somewhat different problems.

Why NoSQL?

Why not? I’ve been working with relational database systems since 1998. I’ve worked on large data reporting projects, a couple of huge data warehousing projects, and financial transaction systems; I worked for Sybase as a consulting DBA and project manager for a while; I was into MySQL and PostgreSQL by 2000 and used them in production environments starting around 2001-02. I understand them fairly well. I also understand BDB and other “flat-file” databases and object stores. SQLite has become unavoidable in the past few years as well. It’s not like I don’t understand the compromises I’m making going to a NoSQL system.

There’s a good bit of talk from the RDBMS camp (seriously, why do they need their own camp?) about why NoSQL is bad. Lots of people who know me would put me in the RDBMS camp, and I’m telling you not to cry yourself to sleep out of guilt over a desire to get to know these systems. They’re interesting, and they solve some huge issues surrounding scalability with greater ease than an RDBMS.

Like what? Well, cost for one. If I could afford Oracle I’d sooner use that than go NoSQL in all likelihood. I can’t afford it. Not even close. Oracle might as well charge me a small planet for their product. It’s great stuff, but out of reach. And what about sharding? Sharding a relational database sucks, and to try to hide the fact that it sucks requires you to pile on all kinds of other crap like query proxies, pools, and replication engines, all in an effort to make this beast do something it wasn’t meant to do: scale beyond a single box. All this stuff also attempts to mask the reality that you’ve also thrown your hands in the air with respect to at least 2 letters that make up the ACID acronym. What’s an RDBMS buying you at that point? Complexity.

And there’s another cost, by the way: no startup I know has the kind of enormous hardware that an enterprise has. They have access to commodity hardware. Pizza boxes. Don’t even get me started on storage. I’ve yet to see SSD or flash storage at a startup. I currently work at MyYearbook.com, and there are some pretty hefty database servers there, but it can hardly be called a startup anymore. Hell, they’re even profitable! 😉

Where Do I Start?

One nice thing about relationland is I know the landscape pretty well. Going to NoSQL is like dropping me in a country I’ve never heard of where I don’t really speak the language. I have some familiarity with key-value stores from dealing with BDB and Memcache, and I’ve played with MongoDB a bit (using pymongo), but that’s just the tip of the iceberg.

I heard my boss mention Tokyo Tyrant a few times, so I looked into it. It seems to be one of the more obscure solutions out there from the standpoint of adoption, community, documentation, etc., but it does appear to be very capable on a technical level. However, my application is going to be number-heavy, and I’m not going to need to own all of the data required to provide the service. I can probably get away with just incrementing counters in Memcache for some of this work. For persistence I need something that will let me do aggregation *FAST* without having to create aggregation tables, ideally. Using a key/value store for counters really just seems like a no-brainer.
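For instance, counter increments with python-memcached are about as simple as it gets. A minimal sketch, assuming a memcached instance on localhost (and glossing over the small race between the two calls):

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def bump(key):
    # incr() returns None if the key doesn't exist yet, so seed it on first use.
    if mc.incr(key) is None:
        mc.set(key, 1)

bump('tweets_seen')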

That said, I think what I’ve decided to do, since it doesn’t matter, is punt on this decision in favor of getting a working application up quickly.

MySQL

Yup. I’m going to pick one or two features of the application to implement as a ‘first cut’, and back them with a MySQL database. I know it well, Tornado has a built-in interface for it, and it’s not going to be a permanent part of the infrastructure (otherwise I’d choose PostgreSQL in all likelihood).

To be honest, I don’t think the challenges in bringing this application to life are really related to the data model or the engine/interface used to access it (though if I’m lucky that’ll be a major part of keeping it alive). No, the real problem I’m faced with is completely unrelated to these considerations…

Twitter’s API Service

Not the API itself, per se, but the service providing access to it, and the way it’s administered, is going to be a huge challenge. It’s not just the Twitter website that’s inconsistent; the API service is right there with it. Not only that, but the type of data I really need to make this application useful isn’t immediately available from the API as far as I can tell.

Twitter maintains rate limits on the API. You can only make so many calls over so short a period of time. That alone makes providing an application like this to a lot of people a bit of a challenge. Compounding the issue is that, when there are failwhales washing up on the shores, those limits can be dynamically decreased. Ugh.

I guess it’s not a project for the faint of heart, but it’ll drive home some golden rules that are easy to neglect in other projects, like planning for failure (of both my application, and Twitter). Also, it’ll be a lot of fun.

Why Open Shop In California?

Thursday, June 3rd, 2010

DISCLAIMER: I live on the East Coast, so these are perceptions and opinions that I don’t put forth as facts. I’m more asking a question to start a dialog than professing knowledge.

So, I just heard a report claiming that there are more IT jobs than techs to fill them in Southern California. Anyone who ever reads a tech job board and/or TechCrunch has also no doubt taken note that a vast majority of startups seem to be starting up there, and that there are just a metric asston of jobs there anyway.

This boggles my mind. This is a place with an extremely high cost of living, making labor more expensive. At the same time, aren’t there rolling power outages in CA? Does that not affect corporations or something? Do they just move their datacenters across the border to another state?

Between what I would think is an amazingly high labor cost and what I would think is an unfavorable place in terms of simple things like availability of power, I would think more places would look elsewhere for expansion or startups.

I live within spitting distance of at least 5 universities with engineering departments that I think would rate at the very least “solid”; many would rate better. I would guess that I could get to any Ivy League school in 6 hours or less, driving (3 are within an hour of my NJ home). MIT and Stevens are very good non-Ivy schools, and lots of others like Rutgers, NJIT, Penn State, and NYU are here too, and those are just a few of the ones between NYC and Philadelphia, which are less than 2 hours apart. So… there’s a labor pool here.

Is it tax breaks? Some aspect of the political atmosphere? Transportation? Is San Francisco such a clean, safe, friendly city that you just deal with the nonsense to live there?

What’s your take on this?

Per-machine Bash History

Monday, May 10th, 2010

I do work on a lot of machines no matter what environment I’m working in, and a lot of the time each machine has a specific purpose. One thing that really annoys me when I work in an environment with NFS-mounted home directories is that if I log into a machine I haven’t used in some time, none of the history specific to that machine is around anymore.

If I had a separate ~/.bash_history file on each machine, this would likely solve the problem. It’s pretty simple to do as it turns out. Just add the following lines to ~/.bashrc:

# Give each host its own history file.
srvr=$(hostname)
export HISTFILE="$HOME/.bash_history_${srvr}"

Don’t be alarmed when you source ~/.bashrc and you don’t see the file appear in your home directory. Unless you’ve configured things otherwise, history is only written at the end of a bash session. So go ahead and source bashrc, run a few commands, end your session, log back in, and the file should be there.
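If you’d rather not wait for the session to end, bash can also flush history as you go. This is a common tweak, not something the setup above requires:

# Append to $HISTFILE instead of overwriting it when the shell exits...
shopt -s histappend
# ...and write out new history lines after every command.
export PROMPT_COMMAND='history -a'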

I’m not actually sure if this is going to be a great idea for everyone. If you work in an environment where you run the same commands from machine to machine, it might be better to just leave things alone. In my case, I’m running psql/mysql connection commands and the like that differ depending on the machine I’m on and the connection perms it has.

Tornado’s Big Feature is Not ‘Async’

Sunday, April 4th, 2010

I’ve been working with the Tornado web server pretty much since its release by the Facebook people several months ago. If you’ve never heard of it, it’s a sort of hybrid Python web framework and web server. On the framework side of the equation, Tornado has almost nothing. It’s completely bare bones when compared to something like Django. On the web server side, it is also pretty bare bones in terms of hardcore features like Apache’s ability to act as a proxy and set up virtual hosts and all of that stuff. It does have some good performance numbers, though, and the feature that seems to drive people to Tornado is that it’s asynchronous, and pretty fast.

I think some people come away from their initial experiences with Tornado a little disheartened because only upon trying to benchmark their first real app do they come face to face with the reality of “asynchronous”: Tornado can be the best async framework out there, but the minute you need to talk to a resource for which there is no async driver, guess what? No async.

Some people might even leave the ring at this point, and that’s a shame, because to me the async features in Tornado aren’t what attract me to it at all.

Why Tornado, if Not For Async?

For me, there’s an enormous win in going with Tornado (or other things like it), and to get this benefit I’m willing to deal with some of Tornado’s warts and quirks. I’m willing to deal with the fact that the framework provides almost nothing I’m used to having after being completely spoiled by Django. What’s this magical feature you ask? It’s simply the knowledge that, in Tornado-land, there’s no such thing as mod_wsgi. And no mod_python either. There’s no mod_anything.

This means I don’t have to think about sys.path, relative vs. absolute paths, whether to use daemon or embedded mode, “Cannot be loaded as Python module” errors, “No such module” errors, permissions issues, subtle differences between Django’s dev server and Apache/mod_wsgi, reconciling all of these things when using/not using virtualenv, etc. It means I don’t have to metascript my way into a working application. I write the app. I run the app.

Wanna see how to create a Tornado app? Here’s one right here:

import tornado.httpserver
import tornado.ioloop
import tornado.web

# A handler maps HTTP methods (here, GET) to responses.
class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("This is a Tornado app")

# Route the root URL to the handler.
application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(8888)
    tornado.ioloop.IOLoop.instance().start()

Save this to whatever file you want, run it, and do ‘curl http://localhost:8888’; you’ll see ‘This is a Tornado app’ on your console.

Simplistic? Yes, absolutely. But when you can just run this script, put it behind nginx, and have it working in under five minutes, you dig a little deeper and see what else you can do with this thing. Turns out, you can do quite a bit.

Can I Do Real Work With This?

I’ve actually been involved in a production launch of a non-trivial service running on Tornado, and it was mind-numbingly easy. It was several thousand lines of Python, all of which was written by two people, and the prototype was up and running inside of a month. Moving from prototype to production was a breeze, and the site has been solid since its launch a few months ago.

Do You Miss Django?

I miss *lots* of things about Django, sure. Most of all I miss Django’s documentation, but Tornado is *so* small that you actually can find what you need in the source code in 2 minutes or less, and since there aren’t a ton of moving parts, when you find what you’re looking for, you just read a few lines and you’re done: you’re not going to be backtracking across a bunch of files to figure out the process flow.

I also miss a lot of what I call Django’s ‘magic’. It sure does a lot to abstract away a lot of work. In place of that work, though, you’re forced to take on a learning curve that is steeper than most. I think it’s worth getting to know Django if you’re a web developer who hasn’t seen it before, because you’ll learn a lot about Python and how to architect a framework by digging in and getting your hands dirty. I’ve read seemingly most books about Django, and have done some development work in Django as well. I love it, but not for the ease of deployment.

I spent more time learning how to do really simple things with Django than it took to:

  1. Discover Tornado
  2. Download/install and run ‘hello world’
  3. Get a non-trivial, commercial application production-ready and launch it.

Deadlines, indeed!

Will You Still Work With (Django/Mingus/Pinax/Coltrane/Satchmo/etc)?

Sure. I’d rather not host it, but if I have to I’ll get by. These applications are all important, and I do like developing with them. It’s mainly deployment that I have issues with.

That’s not to say I wouldn’t like to see a more mature framework made available for Tornado either. I’ve worked on one, though it’s not really beyond the “app template” phase at this point. Once the app template is able to get out of its own way, I think more features will start to be added more quickly… but I digress.

In the end, the astute reader will note that my issue isn’t so much with Django-like frameworks (though I’ll note that they don’t suit every purpose), but rather with the current trend of using mod_wsgi for deployment. I’ll stop short of bashing mod_wsgi, because it too is an important project that has done wonders for the state of Python in web development. It really does *not* fit my brain at all, though, and I find when I step into a project that’s using it and it has mod_wsgi-related problems, identifying and fixing those problems is typically not a simple and straightforward affair.

So, if you’re like me and really want to develop on the web with Python, but mod_wsgi eludes you or just doesn’t fit your brain, I can recommend Tornado. It’s not perfect, and it doesn’t provide the breadth of features that Django does, but you can probably get most of your work done with it in the time it took you to get a mod_wsgi “Hello World!” app to not return a 500 error.

PyTPMOTW: py-amqplib

Saturday, April 3rd, 2010

What’s This Module For?

To interact with a queue broker implementing version 0.8 of the Advanced Message Queueing Protocol (AMQP) standard. Copies of various versions of the specification can be found here. At time of writing, 0.10 is the latest version of the spec, but it seems that many popular implementations used in production environments today are still using 0.8, presumably awaiting a finalization of v.1.0 of the spec, which is a work in progress.

What is AMQP?

AMQP is a queuing/messaging protocol implemented by server daemons (called ‘brokers’) like RabbitMQ, ActiveMQ, Apache Qpid, Red Hat Enterprise MRG, and OpenAMQ. Though messaging protocols used in the enterprise have historically been proprietary, the AMQP working group takes a bold and vocal stance that AMQP will be:

  • Broadly applicable for enterprise use
  • Totally open
  • Platform agnostic
  • Interoperable

The working group consists of several huge companies with a vested interest in a protocol that meets these requirements. Most are either enterprises that are (or were) victims of the proprietary lock-in that came with what will now likely become ‘legacy’ protocols, or implementers of the protocol, who will sell products and services around their implementations. Here’s a brief list of those involved in the AMQP working group:

  • JPMorgan Chase (the initial developers of the protocol, along with iMatix)
  • Goldman Sachs
  • Red Hat Software
  • Cisco Systems
  • Novell

Message brokers can facilitate an awful lot of flexibility in an architecture. They can be used to integrate applications across platforms and languages, enable asynchronous operations for web front ends, and modularize and more easily distribute complex processing operations.

Basic Publishing

The first thing to know is that when you code against an AMQP broker, you’re dealing with a hierarchy: a ‘vhost’ contains one or more ‘exchanges’ which themselves can be bound to one or more ‘queues’. Here’s how you can programmatically create an exchange and queue, bind them together, and publish a message:

from amqplib import client_0_8 as amqp

conn = amqp.Connection(userid='guest', password='guest', host='localhost', virtual_host='/', ssl=False)

# Create a channel object, queue, exchange, and binding.
chan = conn.channel()
chan.queue_declare('myqueue', durable=True)
chan.exchange_declare('myexchange', type='direct', durable=True)
chan.queue_bind('myqueue', 'myexchange', routing_key='myq.myx')

# Create an AMQP message object and publish it to the exchange, using
# the same routing key the queue was bound with.
msg = amqp.Message('This is a test message')
chan.basic_publish(msg, 'myexchange', 'myq.myx')

As far as we know, we have one exchange and one queue on our server right now, and if that’s the case, then technically the routing key I’ve used isn’t required. However, I strongly suggest that you always use a routing key to avoid really odd (and implementation-specific) behavior like getting multiple copies of a message on the consumer side of the equation, or getting odd exceptions from the server. The routing key can be arbitrary text like I’ve used above, or you can use a common formula of ‘<queue>.<exchange>’ as your routing key (that’s where ‘myq.myx’ above comes from). Just remember that without the routing key, the minute more than one queue is bound to an exchange, the exchange has no way of knowing which queue to route a message to. Remember: you don’t publish to a queue; you publish to an exchange and tell it which queue the message belongs in via the routing key.

Basic Consumption

Now that we’ve published a message, how do we get our hands on it? There are two methods: ‘basic_get’, which will ‘get’ a single message from the queue, and ‘basic_consume’, which technically doesn’t get *any* messages: it registers a handler with the server and tells it to send messages along as they arrive, which is great for high-volume messaging operations.

Here’s the ‘basic_get’ version of a client to grab the message we just published:

# basic_get returns None if the queue is empty.
msg = chan.basic_get(queue='myqueue', no_ack=False)
chan.basic_ack(msg.delivery_tag)

In the above, I’ve used the same channel I used to publish the message to get it back again using the basic_get operation. I then acknowledged receipt of the message by sending the server a ‘basic_ack’, passing along the delivery_tag the server included as part of the incoming message.

Consuming Mass Quantities

Using basic_consume takes a little more thought than basic_get, because basic_consume does nothing more than register a method with the server to tell it to start sending messages down the pipe. Once that’s done, however, it’s up to you to do a chan.wait() to wait for messages to show up, and find some elegant way of breaking out of this wait() operation. I’ve seen and used different techniques myself, and the right thing will depend on the application.

The basic_consume method also requires a callback method which is called for each incoming message, and is passed the amqp.Message object when it arrives.

Here’s a bit of code that defines a callback method, calls basic_consume, and does a chan.wait():

from amqplib import client_0_8 as amqp

conn = amqp.Connection(userid='guest', password='guest', host='localhost', virtual_host='/', ssl=False)
chan = conn.channel()

consumer_tag = 'foo'

def process(msg):
    txt = msg.body
    if '-1' in txt:
        print 'Got -1'
        # Tell the server to stop sending, and close the channel.
        chan.basic_cancel(consumer_tag)
        chan.close()
    else:
        print 'Got message!'

chan.basic_consume('myqueue', callback=process, consumer_tag=consumer_tag)
while True:
    print 'Message processed. Next?'
    try:
        chan.wait()
    except IOError as out:
        print "Got an IOError: %s" % out
        break
    if not chan.is_open:
        print "Done processing. Later"
        break

So, basic_consume tells the server ‘Start sending any and all messages!’. The server registers a consumer with the name given by the consumer_tag argument; if you don’t supply one, the server assigns one and it becomes the return value of basic_consume(). I define one here because I don’t want to run into race conditions where I call basic_cancel() with a consumer_tag variable that doesn’t exist yet, or is out of scope, or whatever. In the callback, I look for a sentinel message whose body contains ‘-1’, and at that point I call basic_cancel (passing in the consumer_tag so the server knows who to stop sending messages to), and I close the channel. In the ‘while True’ loop, we check the channel’s status and break out if it’s no longer open.

The above example starts to uncover some issues with py-amqplib. It’s not clear how errors coming back from the server are handled, as opposed to errors caused by the processing code, for example. It’s also a little clumsy trying to determine the logic for breaking out of the loop. In this case there’s a sentinel message sent to the queue representing the final message on the stack, at which point our ‘process()’ callback closes the channel, but then the channel has to check its own status to move forward. Just returning False from process() doesn’t break out of the while loop, because it’s not looking for a return value from that function. We could have our process() function raise an error of its own as well, which might be a bit more elegant, if also a bit more work.

Moving Ahead

What I’ve covered here actually handles perhaps 90% of the common cases for amqplib, but there’s plenty more you can do with it. There are various exchange types, including fanout exchanges and topic exchanges, which can facilitate more interesting messaging and pub/sub models. A fanout exchange, for example, copies each message it receives to every queue bound to it, ignoring the routing key entirely. Here’s a quick sketch, reusing the channel object from the earlier examples:
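chan.exchange_declare('broadcast', type='fanout', durable=True)
chan.queue_declare('listener1', durable=True)
chan.queue_declare('listener2', durable=True)
chan.queue_bind('listener1', 'broadcast')
chan.queue_bind('listener2', 'broadcast')

# Both listener1 and listener2 get a copy of this message.
chan.basic_publish(amqp.Message('Hello, everyone'), 'broadcast')

To learn more about the exchange types, here are a few places to go for information: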

Broadcasting your logs with RabbitMQ and Python
Rabbits and Warrens
RabbitMQ FAQ section “Messaging Concepts: Exchanges”

Quick Loghetti Update

Monday, March 15th, 2010

For the familiar and impatient: Loghetti has moved to github and has been updated. An official release hasn’t been made yet, but cloning the repository and installing argparse will result in perfectly usable code. More on the way.

For the uninitiated, Loghetti is a command line log sifting/reporting tool written in Python to parse Apache Combined Format log files. It was initially released in late 2008 on Google Code. I used loghetti for my own work, which involved sifting log files with tens of millions of lines. Needless to say, it needed to be reasonably fast, and give me a decent amount of control over the data returned. It also had to be easy to use; just because it’s fast doesn’t mean I want to retype my command because of confusing options or the like.

So, loghetti is reasonably fast, and reasonably easy, and gives a reasonable amount of control to the end user. It’s certainly a heckuva lot easier than writing regular expressions into ‘grep’ and doing the ol’ ‘press & pray’.

Loghetti suffered a bit over the last several months because one of its dependencies broke backward compatibility with earlier releases. Such is the nature of development. Last night I finally got to crack open the code for loghetti again, and was able to put a solution together in an hour or so, which surprised me.

I was able to completely replace Doug Hellmann’s CommandLineApp with argparse very, very quickly. Of course, CommandLineApp was taking on responsibility for actually running the app itself (the main loghetti class was a subclass of CommandLineApp), and was dealing with the options, error handling, and all that jazz. It’s also wonderfully generic, and is written so that pretty much any app, regardless of the type of options it takes, could run as a CommandLineApp.

argparse was not a fast friend of mine. I stumbled a little over whether I should just have argparse update the namespace of my main class, or pass the Namespace object around, or… something else. Eventually, I got what I needed, and not much more. Both approaches are sketched below.
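For the curious, the two approaches look roughly like this (the flags are hypothetical, just for illustration):

import argparse

parser = argparse.ArgumentParser(description='Sift Apache combined-format logs')
parser.add_argument('logfile')
parser.add_argument('--code', type=int, help='filter on HTTP response code')

# Option 1: have argparse set attributes directly on an existing object.
class App(object):
    pass

app = App()
parser.parse_args(['access.log', '--code', '404'], namespace=app)
print app.logfile, app.code

# Option 2: keep the returned Namespace and pass it around.
options = parser.parse_args(['access.log', '--code', '404'])
print options.logfile, options.code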

So loghetti now requires argparse, which is not part of the standard library. Why replace what I knew with some other (foreign) library? Because argparse is, as I understand it, slated for inclusion in the standard library, at which point optparse will be deprecated.

So, head on over to the GitHub repo, give it a spin, and send your pull requests and patches. Let the games begin!

If You Don’t Date Your Work, It Sucks.

Monday, January 18th, 2010

I probably get more upset than is reasonable when I come across articles with no date on them. I scroll furiously for a few minutes, try to see if the date was put in some stupid place like the fine print written in almost-white-on-white at the bottom of the post surrounded by ads. Then I skim the article looking for references to software versions that might clue me in on how old this material is. Then I check the sidebars to see if there’s some kind of “About this Post” block. Finally, I make a mental note of the domain in a little mental list I use to further filter my Google searches in the future. Then I close the browser window in disgust. If it weren’t completely gross and socially unacceptable to do so, I would spit on the floor every time this happened.

Why would you NOT date your articles? In almost every single theme for every single content management solution written in any language and backed by any database, “Date” is a default element. Why would you remove it? It is almost guaranteed to be more work to remove it. Why would you go through actual work to make your own writing less useful to others?

What happens when you don’t date your articles?

  1. People have no idea whether your article has anything to do with what they’re working on.  If you wrote an article about the Linux kernel in 1996, it’s of no use to me *now*, even if it was pretty hardcore at the time.
  2. Readers are forced to skim your article looking for references to software versions to see if your article is actually meaningful to them or not. Why make it hard for people to know whether your article is useful? The only reason I can think of is that you already know your articles are old, and not dating them ensures that people at least skim enough to see some of the ads on your site. You are irreversibly lame if you do this.
  3. It causes near seizures in people like me who really hate when you don’t date your work, as well as all of your past teachers, who no doubt demanded that you sign and date your work.
  4. Every time you don’t date an article online, a seal pup is clubbed to death in the arctic, and a polar bear gets stranded on a piece of ice.

At some point, I will make an actual list of web sites that regularly do not date their work. A sort of hall of shame for sites that fail to link their writing to some kind of time-based context. If you have sites you’d like to add, let me know in the comments.

Head first into javascript (and jQuery)

Tuesday, January 12th, 2010

So, I had to take a break from doing the Code Katas just as I was getting to the really cool one about Bloom Filters. The reason for the unexpected break from kata-ing was that I had a project thrown into my lap. I say “project” not because it was some huge thing that needed doing — lots of people reading this could probably have done it in a matter of a few hours — but because it involved two things I’ve never done any real work with: javascript, and jQuery.

My task? Well, first I had to recreate a page from a graphic designer’s mockup. So, take a JPEG image and create the CSS and stuff to make that happen. Already I’m out of my comfort zone, because historically I’m a back-end developer more comfortable with threading than CSS (95% of my code is written in Python and is daemonized/threaded/multiprocess/all-of-the-above or worse), but at least I’ve done enough CSS to get by.

Once the CSS was done, I was informed that I’d now need to take the tabular report I had just created and make it sort on any of the columns, get the data via AJAX calls to a flat file holding JSON dummy data, create nice drop-down date pickers so the end user could choose the report’s date range, page the data with a flickr-style pager so only 20 lines would show up per page, and alternate the row colors in the table (and make sure that doesn’t break when you sort!).

How to learn javascript and/or jQuery REALLY fast

How exactly do you learn enough javascript and jQuery to get this done in a matter of a few days (counting from *after* the CSS part was done)? Here are some links you should keep handy if you have a situation like this arise:

  • If Douglas Crockford says it, listen. I’d advise you start here (part I of a 4-part intro to javascript). That site also has his ‘Advanced Javascript’ series. He also wrote a book, which is small enough to read quickly, and well done.
  • Packt has a lot of decent resources for jQuery. Specifically, this article helped me organize what I was doing in my head. The code itself had some rather glaring issues — you’re not going to cut-n-paste this code and deploy it to production, but coming from scorched earth, I really learned a lot.
  • After the project was already over I found this nice writeup that covers quick code snippets and demos illustrating some niceties like sliding panels and disappearing table rows and how to do them with jQuery.
  • jQuery itself has some pretty decent documentation for those times when your cut-n-pasted code looks a little suspect or you’re just sure there’s a better way. Easy to read and concise.

Why I Wrote My Own Sorting/Paging in jQuery

Inevitably, someone out there is wondering why I didn’t just use tablesorter and tablesorter.pager, or Flexigrid, or something like that. The answer, in a nutshell, is paging. Sorting and paging operations, I learned both by experience and in my reading, *NEED* to know about each other. If they don’t, you’ll get lots of weird results, like sorting on just one page (or, sorting on just one page until you click to another page, which will look as expected, and then click back), or pages with rows on them that are just plain wrong, or… the list goes on. This is precisely the problem that the integrated “all-sorting-all-paging” tools like tablesorter try to solve. The issue is that I could not find a SINGLE ONE that did not have a narrow definition of what a pager was, and what it looked like.

I wanted (well, I was required to mimic the mockup, so “needed”) a flickr-style pager — modified. I needed to have each page of the report represented at the bottom of the report table by a block with the proper number in the block. The block would be clickable, and clicking it would show the corresponding page of data. This is more or less what Flickr does, but I didn’t need the “previous” and “next” buttons, and I didn’t need the “…” they use (rather effectively) to cut down on the number of required pager elements. So… just some blocks with page numbers. That’s it.

I started out using tablesorter for jQuery, and it worked great — it does the sorting for you, manages the alternating row colors, and is a pretty flexible sorter. Then I got to the paging part, and things went South pretty fast. While tablesorter.pager has a ‘moveToPage’ function, it’s not exposed so you can bind it to a CSS selector like the ‘moveToPrevious’, ‘moveToLast’, ‘moveToNext’ and other functions are. So, I tried to hack it into the pager code myself. I got weird results (though I feel more confident about approaching that now than I did even three days ago). There wasn’t any obvious way to do anything but give the user *only* first/last/previous/next buttons to control the paging. I moved on. I googled, I asked on jQuery IRC, I even wrote the developer of tablesorter. I got nothing.

I looked at 4 or 5 different tools, and was shocked to find the same thing! I didn’t go digging into the code of all of them, but their documentation all seemed to be in some kind of weird denial about the existence of flickr-style paging altogether!

So, I wrote my own. It wasn’t all that difficult, really. The code that worked was only slightly different from the code I’d fought with early on in the process. It just took some reading to get some of the basic tricks of the trade under my belt, and I got a tip or two from one of the gurus at work as well, and I was off to the races!

Lessons Learned

So, one thing I have to say for my boss is that he knows better than to throw *all* of those things at me at once. Had he come to me and said he wanted an uber-ajaxian reporting interface from outer space from the get-go, I might not have responded even as positively as I did (and I would rate my response as ‘tepid, but attempting a positive outlook’). It’s best to draw me in slowly, a task at a time, so I can get some sense of accomplishment and some feedback along the way instead of feeling like I still have a mountain to climb before it’s over.

I certainly learned that this javascript and jQuery (and AJAX) stuff isn’t really black magic. Once you get your hands dirty with it it’s kinda fun. I still don’t ever want to become a front end developer on a full-time basis (browser testing is where I *really* have zero patience, either for myself or the browsers), but this experience will serve me well in making my own projects a little prettier and slicker, and nicer to use. It’ll also help me understand more about what the front end folks are dealing with, since there’s tons of javascript at myYearbook.

So, I hope this post keeps some back end scalability engineer’s face from turning white when they’re given a front-end AJAX project to do. If you’ve ever had a similar situation happen to you (not necessarily related to javascript, but other technologies you didn’t know until you were thrown into a project), let’s hear the war stories!!

2009 Python Meme

Tuesday, December 29th, 2009

Heard about this from Tarek, and you can find more of them on Planet Python (where I found Tarek’s post).

  1. What’s the coolest Python application, framework or library you have discovered in 2009?

Probably Tornado. Tornado is an interesting application, because it blurs the line a bit between a framework like Django and a traditional web server. If you can picture it, it’s a barebones, lightweight, almost overly simplified Django, with a production-ready web server instead of Django’s built-in dev server. In reality, Tornado is (or feels) more integrated than that, but that leads to some interesting issues of its own.

Still, it’s been a heckuva lot of fun to play with. One thing that always concerned me about Django was the ORM. It’s fine for my little hobby website, or a simple wiki for my wife and me, and even some slightly more complex applications, but if you have a database-driven site that serves “lots and lots” of users and needs to manage complex relationships without ever slowing down… I don’t trust the ORM to do that. What’s more, I’m actually pretty skilled in data modeling, database administration, etc., and I understand abstraction. I don’t really require Django’s models (though, again, I love Django for doing low-traffic sites very quickly).

Playing with web frameworks is a lot of fun, and if you’ve played with a few, you’ll like the “clean slate” that Tornado gives you to mix-n-match your favorite features of the frameworks you’ve used. I’ve done some hacking around Tornado to provide some generic facilities I’m likely to use in just about every project I use Tornado for. This sort of pseudo-framework is available as Tornado-project-stub on github.

  2. What new programming technique did you learn in 2009?

Thread and process pool management. Whereas in previous roles I focused on optimization at the system and network level by testing/deploying new tools, poking at new paradigms, or just reworking/overhauling things that were modeled or configured suboptimally, my new role is something you really would call “scalability engineering”. I believe everything I’m involved in at the moment involves the words “distributed”, “asynchronous”, “multithreaded”, “multiprocess”, and other terms that imply “fast” (for varying definitions of fast, depending on the project).

Though I’ve had to understand threading and multiprocessing (and microthreads and other stuff too) in the past, and I’ve even written simple threaded/multiprocessing stuff in the past, I’m now knee deep in it, and am getting into more complex scenarios where I really *need* a pool manager, and would really *like* to pass objects around on the network. Happily, I’m finding that Python has facilities for all of this built in (see the sketch at the end of this answer).

Aside from that, I have to say that while most of what I’m doing now doesn’t involve techniques I’ve never heard of, I’m really reveling in the opportunity to put them into practice and actually use them. Also, since I now code full-time, I find the ability to code doesn’t ever escape my brain. I can code fast enough now that I can implement something two or three different ways to compare the solutions in no time!
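Here’s the kind of built-in facility I mean: a minimal process pool sketch using the multiprocessing module from the standard library (2.6 and up):

import multiprocessing

def work(item):
    # Stand-in for real per-item processing.
    return item * item

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    print pool.map(work, range(10))
    pool.close()
    pool.join()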

  3. What’s the name of the open source project you contributed to the most in 2009? What did you do?

Actually, it’s not released yet, but I’ve almost completed a rewrite of Golconde, a queue-based replication engine. I was able to make the queuing protocol, the message processor (the thing that takes queued messages and turns them into database operations), and the database backend swappable. Golconde was written to support STOMP queuing and PostgreSQL specifically. I’ve already used STOMP and AMQP interchangeably in the rewrite, and I’m now on to swapping out the database and message processor components, documenting how users can create their own plugins for these bits along the way.

Golconde is also threaded. My rewrite is currently threaded as well, but I’m going to change out the threads in favor of processes, because the facilities that can help the project moving forward (in the short term, even) come more from multiprocessing than from threading. One thing I’ve already accomplished is refactoring so that there need be only a single thread class, which makes using worker pools more convenient and natural. It’s coming together!

  4. What was the Python blog or website you read the most in 2009?

I read Planet Python every day, and keep up with Python Reddit as well. Besides the aggregators, Corey Goldberg and Jesse Noller seem to overlap my areas of interest a lot, so I find myself actually showing up at their blogs pretty often. Neither of them blogs enough, imo 😉
  5. What are the three top things you want to learn in 2010?
  1. Nose – because I want to become so good at testing that it makes me more productive, not less. Right now I’m just clumsy at testing, and I come across situations that I just plain don’t know how to write a test for. I need to know more about writing mock objects and how to write tests for threaded/multiprocessing applications. I know enough to think that Nose is probably the way to go (I’ve used both unittest and doctest before, so I’m not totally ‘green’ to the notion of testing in general), but I haven’t been able to work it into my development process yet.
  2. Erlang – There doesn’t seem to be a language that makes concurrency quite as brainless as Erlang. That said, learning the language and OTP platform is *not* brainless.
  3. Sphinx – I hear all the cool kids are using it. Some people whose opinions I trust love it, but I have some reservations based on my experience with it. The one thing that gives me hope is that Django’s documentation site, which I like the interface and features of, uses it.