Category: Scripting

Wonky Bunny Issue “Fixed”

By m0j0, July 19, 2010 7:31 am

For those who don’t know what the headline means:

  1. Bunny is an open source command line utility written in Python that provides a shell for talking to and testing AMQP brokers (tested on RabbitMQ).
  2. AMQP is a queuing protocol. It’s defined as a binary wire-level protocol as well as a command set. The spec also defines a good portion of the server semantics, so by that logic Bunny should work against other AMQP brokers besides RabbitMQ.
  3. RabbitMQ is written in Erlang atop OTP, so clustering is ‘free and easy’. My experience with RabbitMQ so far has been fantastic, though I’d like to see client libraries in general mature a bit further.

So, Bunny had this really odd quirk upon its first release. If you did something to cause an error that resulted in a connection being dropped, bunny wouldn’t trap the error. It would patiently wait for you to enter the next command, and fail miserably. The kicker is that I actually defined a ‘check_conn’ method to make sure that the connection was alive before doing anything else, and that really wasn’t working.

The reason is that py-amqplib (or perhaps its interpretation of the AMQP spec, which defines a Connection class) implements a high-level Connection class along with a Channel class (also defined in the spec). The Channel is what actually maps to what you and I as users care about: some “thing” that lets us communicate with the server, and without which we can’t talk to it.

With py-amqplib, a Connection is actually defined as a channel 0, and always channel 0. I gather that channel 0 gets some special treatment in other sections of the library code, and the object that lives at index ‘0’ in Connection.channels is actually defined as a Connection object, whereas others are Channel objects.

The result of all of this is that creating a channel in my code and then checking my own object’s ‘chan’ attribute is useless, because channels can be dropped on the floor in py-amqplib, and the only way I’ve found to detect that is to check the connection object’s ‘channels’ dictionary. So that’s what I do now. It seems to be working well.

Not only does bunny now figure out that your connection is gone, but it’ll also attempt a reconnect using the credentials you gave it in the last ‘connect’ command. You see, bunny extends the Python built-in cmd.Cmd object, which lets me define my whole program as a single class. That means that whatever you type in, like the credentials to the ‘connect’ command, can be kept handy, since the lifetime of the instance of this class is the same as the lifetime of a bunny session.
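
As a rough illustration of the pattern described above (not bunny’s actual source), here is a minimal sketch of a cmd.Cmd subclass that remembers the last ‘connect’ credentials and checks the connection’s ‘channels’ dictionary before doing anything else. FakeConnection, Shell, do_ping, and the channel id are all hypothetical stand-ins so the example runs without a broker; only the idea of a ‘channels’ dictionary on the connection comes from the description above.

```python
import cmd


class FakeConnection(object):
    """Hypothetical stand-in for a broker connection (not py-amqplib)."""

    def __init__(self, host, user, password):
        self.host = host
        # Assumption: open channels are tracked in a dict keyed by channel id.
        self.channels = {0: "connection", 1: "channel"}

    def drop_channel(self, chan_id):
        # Simulate the library silently discarding a channel after an error.
        self.channels.pop(chan_id, None)


class Shell(cmd.Cmd):
    prompt = "sketch> "

    def __init__(self, make_connection=FakeConnection, **kwargs):
        cmd.Cmd.__init__(self, **kwargs)
        self.make_connection = make_connection
        self.credentials = None   # kept for the lifetime of the session
        self.conn = None
        self.chan_id = 1          # hypothetical id of the channel we opened
        self.reconnects = 0

    def do_connect(self, line):
        """connect HOST USER PASSWORD"""
        self.credentials = tuple(line.split())
        self.conn = self.make_connection(*self.credentials)

    def check_conn(self):
        """Return True if usable, reconnecting with saved credentials if not."""
        if self.credentials is None:
            return False
        if self.conn is None or self.chan_id not in self.conn.channels:
            self.conn = self.make_connection(*self.credentials)
            self.reconnects += 1
        return True

    def do_ping(self, line):
        """ping: hypothetical command that needs a live channel."""
        if self.check_conn():
            self.stdout.write("ok\n")
        else:
            self.stdout.write("not connected; use 'connect' first\n")
```

Because the credentials live on the instance, any command can call check_conn() first and transparently get a fresh connection if the channel has vanished.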

So, in summary, bunny is more useful now, but it’s still not “done”. I made this fix over the weekend during an hour I unexpectedly found for myself. It’s “a” solution, but it’s not “the” solution. The real solution is to map out all of the errors that actually cause a connection to drop and give the user a bit more feedback about what happened. I also want to add more features (like support for getting some stats back from Alice to replace bunny’s really weak ‘qlist’ command).

Python Date Manipulation

By m0j0, July 6, 2010 9:56 am

This post is the result of some head-scratching and note taking I did for a reporting project I undertook recently. It’s not a complete rundown of Python date manipulation, but hopefully the post (and hopefully the comments) will help you and maybe me too :)

The head-scratching is related to the fact that there are several different time-related objects, spread out over a few different time-related modules in Python, and I have found myself in plenty of instances where I needed to mix and match various methods and objects from different modules to get what I needed (which I thought was pretty simple at first glance). Here are a few nits to get started with:

  • strftime/strptime can generate the “day of week” where Sunday is 0, but there’s no way to tell any of the conversion functions like gmtime() that you want your week to start on Sunday as far as I know. I’m happy to be wrong, so leave comments if I am. It seems odd that you can do a sort of conversion like this when you output, but not within the calculation logic.
  • If you have a struct_time object in localtime format and want to convert it to an epoch date, time.mktime() works, but if your struct_time object is in UTC format, you have to use calendar.timegm() — this is lame and needs to go away. Just add timegm() to the time module (possibly renamed?).
  • time.ctime() will convert an epoch date into nicely formatted local time, but there’s no function to provide the equivalent output for UTC time.
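
To make the last two nits concrete, here is a small sketch using only standard-library calls. The timestamp is the one used in the examples later in this post. The asctime(gmtime()) pairing is offered as a two-call workaround for UTC output, not as a single-function equivalent of ctime().

```python
import calendar
import time

etime = 1274384049  # Thu May 20 19:34:09 2010 UTC

# UTC struct_time -> epoch needs calendar.timegm(); time.mktime() would
# interpret the same fields as local time.
utc_struct = time.gmtime(etime)
assert calendar.timegm(utc_struct) == etime

# Local struct_time -> epoch is time.mktime().
local_struct = time.localtime(etime)
assert int(time.mktime(local_struct)) == etime

# ctime()-style formatting for UTC takes two calls: gmtime() then asctime().
utc_text = time.asctime(time.gmtime(etime))
print(utc_text)  # Thu May 20 19:34:09 2010
```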

There are too many methods and modules for dealing with date manipulation in Python, such that performing fairly common tasks requires importing and using a few different modules, different object types and methods from each. I’d love this to be cleaned up. I’d love it more if I were qualified to do it. More learning probably needs to happen for that. Anyway, just my $.02.

Mission 1: Calculating Week Start/End Dates Where Week Starts on Sunday

My mission: Pull epoch dates from a database. They were generated on a machine whose time does not use UTC, but rather local time (GMT-4). Given the epoch date, find the start and end of the previous week, where the first day of the week is Sunday, and the last day of the week is Saturday.

So, I need to be able to get a week start/end range, from Sunday at 00:00 through Saturday at 23:59:59. My initial plan of attack was to calculate midnight of the current day, and then base my calculations for Sunday 00:00 on that, using simple timedelta(days=x) manipulations. Then I could do something like calculate the next Sunday and subtract a second to get Saturday at 23:59:59.

Nothing but ‘time’

In this iteration, I’ll try to accomplish my mission using only the ‘time’ module and some epoch math.

Seems like you should be able to easily get the epoch value for midnight of the current epoch date, and display it easily with time.ctime(). This isn’t quite true, however. See here:

>>> import time
>>> etime = int(time.time())
>>> time.ctime(etime)
'Thu May 20 15:26:40 2010'
>>> etime_midnight = etime - (etime % 86400)
>>> time.ctime(etime_midnight)
'Wed May 19 20:00:00 2010'
>>>

The reason this doesn’t do what you might expect is that time.ctime() in this case outputs the local time, which in this case is UTC-4 (I live near NY, USA, and we’re currently in DST. The timezone is EDT now, and EST in winter). So when you do math on the raw epoch timestamp (etime), you’re working with a bare integer that has no idea about time zones. Therefore, you have to account for that. Let’s try again:

>>> etime = int(time.time())
>>> etime
1274384049
>>> etime_midnight = (etime - (etime % 86400)) + time.altzone
>>> time.ctime(etime_midnight)
'Thu May 20 00:00:00 2010'
>>>

So, why is this necessary? It might be clearer if we throw in a call to gmtime() and also make the math bits more transparent:

>>> etime
1274384049
>>> time.ctime(etime)
'Thu May 20 15:34:09 2010'
>>> etime % 86400
70449
>>> (etime % 86400) // 3600
19
>>> time.gmtime(etime)
time.struct_time(tm_year=2010, tm_mon=5, tm_mday=20, tm_hour=19, tm_min=34, tm_sec=9, tm_wday=3, tm_yday=140, tm_isdst=0)
>>> midnight = etime - (etime % 86400)
>>> time.gmtime(midnight)
time.struct_time(tm_year=2010, tm_mon=5, tm_mday=20, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=140, tm_isdst=0)
>>> time.ctime(midnight)
'Wed May 19 20:00:00 2010'
>>> time.altzone
14400
>>> time.altzone // 3600
4
>>> midnight = (etime - (etime % 86400)) + time.altzone
>>> time.gmtime(midnight)
time.struct_time(tm_year=2010, tm_mon=5, tm_mday=20, tm_hour=4, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=140, tm_isdst=0)
>>> time.ctime(midnight)
'Thu May 20 00:00:00 2010'
>>>
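
One caveat with the arithmetic above: truncating to the UTC day first and then adding the offset lands on the wrong local day whenever the local date and the UTC date differ (9 pm EDT is already tomorrow in UTC), and time.altzone is only the right offset while DST is in effect. Here is a minimal sketch of an order-of-operations fix, assuming a fixed offset in seconds west of UTC (the same sign convention as time.altzone) that the caller passes in. local_midnight is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86400


def local_midnight(etime, offset_west):
    """Epoch timestamp of the most recent local midnight at or before etime.

    offset_west is seconds west of UTC, e.g. 14400 for EDT (UTC-4).
    Assumes the offset is constant, so it ignores DST changeovers.
    """
    shifted = etime - offset_west                  # pretend local time is UTC
    shifted_midnight = shifted - (shifted % SECONDS_PER_DAY)
    return shifted_midnight + offset_west          # back to a real epoch value


afternoon = 1274384049   # Thu May 20 15:34:09 2010 EDT
evening = 1274403600     # Thu May 20 21:00:00 2010 EDT (Fri 01:00 UTC)

print(local_midnight(afternoon, 14400))  # 1274328000
print(local_midnight(evening, 14400))    # 1274328000
```

Both calls return the epoch value for Thu May 20 00:00:00 EDT, whereas truncating first and shifting afterwards would push the evening timestamp forward to Friday’s midnight.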

What’s that now? You want what? You want the epoch timestamp for the previous Sunday at midnight? Well, let’s see. The time module in Python doesn’t do deltas per se. You can calculate things out using the epoch bits and some math if you wish. The only bit that’s really missing is the day of the week our current epoch timestamp lives on.

>>> time.ctime(midnight)
'Thu May 20 00:00:00 2010'
>>> struct_midnight = time.localtime(midnight)
>>> struct_midnight
time.struct_time(tm_year=2010, tm_mon=5, tm_mday=20, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=140, tm_isdst=1)
>>> dow = struct_midnight.tm_wday
>>> dow
3
>>> midnight_sunday = midnight - ((dow + 1) * 86400)
>>> time.ctime(midnight_sunday)
'Sun May 16 00:00:00 2010'

You can do this going forward in time from the epoch time as well. Remember, we also want to grab 23:59:59 on the Saturday after the epoch timestamp you now have:

>>> saturday_night = midnight + ((6 - dow) * 86400) - 1
>>> time.ctime(saturday_night)
'Sat May 22 23:59:59 2010'
>>>
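
The same epoch math can be folded into one function. This is a sketch under a fixed-offset assumption: the caller passes the offset in seconds west of UTC (14400 for EDT), and week_bounds is a hypothetical name. Note that for a Sunday input the REPL formulas above yield the week that just ended, while this sketch yields the week that is starting.

```python
SECONDS_PER_DAY = 86400


def week_bounds(etime, offset_west):
    """Return (sunday_midnight, saturday_23_59_59) as epoch values for the
    Sunday-to-Saturday week containing etime.

    offset_west is seconds west of UTC (14400 for EDT). Assumes a constant
    offset, so weeks that span a DST changeover will be off by an hour.
    """
    shifted = etime - offset_west
    days = shifted // SECONDS_PER_DAY      # whole local days since 1970-01-01
    # 1970-01-01 was a Thursday; with Sunday as 0, Thursday is 4.
    days_since_sunday = (days + 4) % 7
    sunday = (days - days_since_sunday) * SECONDS_PER_DAY + offset_west
    saturday_night = sunday + 7 * SECONDS_PER_DAY - 1
    return sunday, saturday_night


start, end = week_bounds(1274384049, 14400)
print(start, end)  # 1273982400 1274587199

# Previous week: shift both bounds back seven days.
prev_start = start - 7 * SECONDS_PER_DAY
prev_end = end - 7 * SECONDS_PER_DAY
```

The two printed values correspond to ‘Sun May 16 00:00:00 2010’ and ‘Sat May 22 23:59:59 2010’ in EDT, matching the ctime() output above.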

And that’s how you do date manipulation using *only* the time module. Elegant, no?

No. Not really.

Unfortunately, the alternatives also aren’t the most elegant in the world, imho. So let’s try doing this all another way, using the datetime module and timedelta objects.

Now with datetime!

The documentation for the datetime module says:

“While date and time arithmetic is supported, the focus of the implementation is on efficient member extraction for output formatting and manipulation.”

Hm. Sounds a lot like what the time module functions do. Some conversion here or there, but no real arithmetic support. We had to pretty much do it ourselves mucking about with epoch integer values. So what’s this buy us over the time module?

Let’s try to do our original task using the datetime module. We’re going to start with an epoch timestamp, and calculate the values for the previous Sunday at midnight, and the following Saturday at 23:59:59.

The first thing I had a hard time finding was a way to deal with the notion of a “week”. I thought I’d found it in ‘date.timetuple()’, which help(date.timetuple) says is “compatible with time.localtime()”. I guess they must mean that the output is the same as time.localtime(), because I can’t find any other way in which it is similar. Running time.localtime() with no arguments returns a struct_time object for the current time. date.timetuple() requires arguments or it’ll throw an error, and to make you extra frustrated, the arguments it takes aren’t in the docs or the help() output.

So maybe they mean it takes the same arguments as time.localtime(), eh? Not so much — time.localtime() takes an int representing an epoch timestamp. Trying to feed an int to date.timetuple throws an error saying it requires a ‘date’ object.

So, the definition of “compatible” is a little unclear to me in this context.
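
For what it’s worth, the behavior is consistent with timetuple() being an instance method: it is called on an existing date object and takes no arguments, which would explain why calling it on the class asks for a ‘date’. The struct_time it returns can be handed to anything that accepts the output of time.localtime(), such as time.mktime(). A short sketch (the date is arbitrary):

```python
import datetime
import time

d = datetime.date(2010, 5, 20)

# Called on an instance, with no arguments.
tt = d.timetuple()
print(tt.tm_year, tt.tm_mon, tt.tm_mday)   # 2010 5 20
print(tt.tm_hour, tt.tm_min, tt.tm_sec)    # 0 0 0  (dates carry no time of day)
print(tt.tm_wday)                          # 3  (Monday is 0, so Thursday)

# "Compatible with time.localtime()" in practice: mktime() accepts it and
# returns the epoch value for local midnight on that date.
midnight_epoch = time.mktime(tt)
round_trip = datetime.date.fromtimestamp(midnight_epoch)
print(round_trip == d)
```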

So here I’ve set about finding today, then “last saturday”, and then “the sunday before the last saturday”:

import datetime

def get_last_whole_week(today=None):
    # a date object
    date_today = today or datetime.date.today()

    # day 0 is Monday. Sunday is 6.
    dow_today = date_today.weekday()

    if dow_today == 6:
        days_ago_saturday = 1
    else:
        # If day between 0-5, to get last saturday, we need to go to day 0 (Monday), then two more days.
        days_ago_saturday = dow_today + 2

    # Make a timedelta object so we can do date arithmetic.
    delta_saturday = datetime.timedelta(days=days_ago_saturday)

    # saturday is now a date object representing last saturday
    saturday = date_today - delta_saturday

    # timedelta object representing '6 days'...
    delta_prevsunday = datetime.timedelta(days=6)

    # Making a date object. Subtract the days from saturday to get "the Sunday before that".
    prev_sunday = saturday - delta_prevsunday

This gets me date objects representing the start and end time of my reporting range… sort of. I need them in epoch format, and I need to specifically start at midnight on Sunday and end on 23:59:59 on Saturday night. Sunday at midnight is no problem: timetuple() sets time elements to 0 anyway. For Saturday night, in epoch format, I should probably just calculate a date object for two Sundays a week apart, and subtract one second from one of them to get the last second of the previous Saturday.

Here’s the above function rewritten to return a tuple containing the start and end dates of the previous week. It can optionally be returned in epoch format, but the default is to return date strings in YYYY-MM-DD format.

import datetime
import time

def get_last_whole_week(today=None, epoch=False):
    # a date object
    date_today = today or datetime.date.today()
    print("date_today: ", date_today)

    # By default day 0 is Monday. Sunday is 6.
    dow_today = date_today.weekday()
    print("dow_today: ", dow_today)

    if dow_today == 6:
        days_ago_saturday = 1
    else:
        # If day between 0-5, to get last saturday, we need to go to day 0 (Monday), then two more days.
        days_ago_saturday = dow_today + 2
    print("days_ago_saturday: ", days_ago_saturday)
    # Make a timedelta object so we can do date arithmetic.
    delta_saturday = datetime.timedelta(days=days_ago_saturday)
    print("delta_saturday: ", delta_saturday)
    # saturday is now a date object representing last saturday
    saturday = date_today - delta_saturday
    print("saturday: ", saturday)
    # timedelta object representing '6 days'...
    delta_prevsunday = datetime.timedelta(days=6)
    # Making a date object. Subtract the 6 days from saturday to get "the Sunday before that".
    prev_sunday = saturday - delta_prevsunday

    # we need to return a range starting with midnight on a Sunday, and ending w/ 23:59:59 on the
    # following Saturday... optionally in epoch format.

    if epoch:
        # saturday is date obj = 'midnight saturday'. We want the last second of the day, not the first.
        saturday_epoch = time.mktime(saturday.timetuple()) + 86399
        prev_sunday_epoch = time.mktime(prev_sunday.timetuple())
        last_week = (prev_sunday_epoch, saturday_epoch)
    else:
        saturday_str = saturday.strftime('%Y-%m-%d')
        prev_sunday_str = prev_sunday.strftime('%Y-%m-%d')
        last_week = (prev_sunday_str, saturday_str)
    return last_week

It would be easier to just have some attribute for datetime objects that lets you set the first day of the week to be Sunday instead of Monday. It wouldn’t completely alleviate every conceivable issue with calculating dates, but it would be a help. The calendar module has a setfirstweekday() method that lets you set the first weekday to whatever you want. I gather this is mostly for formatting output of matrix calendars, but it would be useful if it could be used in date calculations as well. Perhaps I’ve missed something? Clues welcome.
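
One small trick that gets close: date.isoweekday() numbers Monday as 1 through Sunday as 7, so taking it modulo 7 yields a Sunday-first day number (Sunday 0 through Saturday 6) without any special-casing. Here is a sketch of the last-whole-week calculation built on that. last_whole_week is a hypothetical name, and it is meant to return the same dates as the function above, just as date objects.

```python
import datetime


def last_whole_week(today=None):
    """Return (sunday, saturday) date objects for the last complete
    Sunday-to-Saturday week before the week containing `today`."""
    today = today or datetime.date.today()
    # isoweekday(): Monday=1 ... Sunday=7, so % 7 gives Sunday=0 ... Saturday=6.
    days_since_sunday = today.isoweekday() % 7
    this_sunday = today - datetime.timedelta(days=days_since_sunday)
    prev_sunday = this_sunday - datetime.timedelta(days=7)
    prev_saturday = this_sunday - datetime.timedelta(days=1)
    return prev_sunday, prev_saturday


print(last_whole_week(datetime.date(2010, 5, 20)))
# (datetime.date(2010, 5, 9), datetime.date(2010, 5, 15))
```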

Mission 2: Calculate the Prior Month’s Start and End Dates

This should be easy. What I hoped would happen is I’d be able to get today’s date, and then create a timedelta object for ‘1 month’, and subtract, having Python take care of things like changing the year when the current month is January. Calculating this yourself is a little messy: you can’t just use “30 days” or “31 days” as the length of a month, because:

  1. “January 31” – “30 days” = “January 1”: not the previous month.
  2. “March 1” – “31 days” = “January 29” (January 30 in a leap year): also not the previous month, since it skips February entirely.
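
Both cases are easy to check with date arithmetic. The year 2010 is used here because it is not a leap year; in a leap year the second result would be January 30.

```python
import datetime

jan31 = datetime.date(2010, 1, 31)
mar1 = datetime.date(2010, 3, 1)

thirty_back = jan31 - datetime.timedelta(days=30)
thirty_one_back = mar1 - datetime.timedelta(days=31)

print(thirty_back)      # 2010-01-01, still January
print(thirty_one_back)  # 2010-01-29, February skipped
```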

Instead, what I did was this:

  1. created a datetime object for the first day of the current month (hard coding the ‘day’ argument),
  2. used a timedelta object to subtract a day, which gives me a datetime object for the last day of the prior month (with year changed for me if needed),
  3. used that object to create a datetime object for the first day of the prior month (again hardcoding the ‘day’ argument).

Here’s some code:

import datetime

today = datetime.datetime.today()
first_day_current = datetime.datetime(today.year, today.month, 1)
last_day_previous = first_day_current - datetime.timedelta(days=1)
first_day_previous = datetime.datetime(last_day_previous.year, last_day_previous.month, 1)
print('Today: ', today)
print('First day of this month: ', first_day_current)
print('Last day of last month: ', last_day_previous)
print('First day of last month: ', first_day_previous)

This outputs:

Today:  2010-07-06 09:57:33.066446
First day of this month:  2010-07-01 00:00:00
Last day of last month:  2010-06-30 00:00:00
First day of last month:  2010-06-01 00:00:00

Not nearly as onerous as the week start/end range calculations, but I kind of thought that, between all of these modules, one of them would be able to find me the start and end of the previous month. The raw material for creating this is, I suspect, buried somewhere in the source code for the calendar module, which can tell you the start and end dates for a month, but can’t do any date calculations to give you the previous month. The datetime module can do calculation, but it can’t tell you the start and end dates for a month. The datetime.timedelta object’s largest granularity is ‘week’ if memory serves, so you can’t just do ‘timedelta(months=1)’, because the deltas are all converted internally to a fixed number of days, seconds, and microseconds, and a month isn’t a fixed number of any of them.
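
For comparison, the calendar module does expose the month length directly via calendar.monthrange(), so the two modules can be combined. A minimal sketch follows; previous_month_range is a hypothetical helper, not something either module provides.

```python
import calendar
import datetime


def previous_month_range(today=None):
    """Return (first_day, last_day) date objects for the month before `today`."""
    today = today or datetime.date.today()
    if today.month == 1:
        year, month = today.year - 1, 12
    else:
        year, month = today.year, today.month - 1
    # monthrange() returns (weekday of the 1st, number of days in the month).
    days_in_month = calendar.monthrange(year, month)[1]
    return (datetime.date(year, month, 1),
            datetime.date(year, month, days_in_month))


print(previous_month_range(datetime.date(2010, 7, 6)))
# (datetime.date(2010, 6, 1), datetime.date(2010, 6, 30))
```

This trades the “subtract one day from the first of the month” trick for an explicit January check, so it is not obviously nicer, just a different split of the work between the two modules.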

Converge!

While I could probably go ahead and use dateutil, which is really darn flexible, I’d rather be able to do this without a third-party module. Also, dateutil’s flexibility is not without its complexity, either. It’s not an insurmountable task to learn, but it’s not like you can directly transfer your experience with the built-in modules to using dateutil.

I don’t think merging all of the time-related modules in Python would be necessary or even desirable, really, but I haven’t thought deeply about it. Perhaps a single module could provide a superclass for the various time-related objects currently spread across three modules, and they could share some base level functionality. Hard to conceive of a timedelta object not floating alone in space in that context, but alas, I’m thinking out loud. Perhaps a dive into the code is in order.

What have you had trouble doing with dates and times in Python? What docs have I missed? What features are completely missing from Python in terms of time manipulation that would actually be useful enough to warrant inclusion in the collection of included batteries? Let me know your thoughts.

Brain Fried Over NoSQL

By m0j0, June 26, 2010 10:16 pm

So, I’m working on a pet project. It’s in stealth mode. Just kidding — I don’t believe in stealth mode ;-)

It’s a twitter analytics dashboard that actually does useful things with the mountains of data available from the various Twitter APIs. I’m writing it in Python using Tornado. Here’s the first mockup I ever did for it, just like 2 nights ago:

It’s already a lot of fun. I’ve worked with Tornado before and like it a lot. I have most of the base infrastructure questions answered, because this is a pet project and they’re mostly easy and in some sense “don’t matter”. But that’s what has me stuck.

It Doesn’t Matter

It’s true. Past a certain point, belaboring choices of what tools to use where is pointless and is probably premature optimization. I’ve been working with startups for the past few years, and I’m painfully aware of what happens when a company takes too long to react to their popularity. I want to architect around that at the start, but I’m resisting. It’s a pet project.

But if it doesn’t matter, that means I can choose tools that are going to be fun to dig into and learn about. I’ve been so busy writing code to help avoid or buffer impact to the database that I haven’t played a whole lot with the NoSQL choices out there, and there are tons of them. And they all have a different world view and a unique approach to providing solutions to what I see as somewhat different problems.

Why NoSQL?

Why not? I’ve been working with relational database systems since 1998. I worked on large data reporting projects, a couple of huge data warehousing projects, financial transaction systems, I worked for Sybase as a consulting DBA and project manager for a while, I was into MySQL and PostgreSQL by 2000, used them in production environments starting around 2001-02… I understand them fairly well. I also understand BDB and other “flat-file” databases and object stores. SQLite has become unavoidable in the past few years as well. It’s not like I don’t understand the compromises I’m making going to a NoSQL system.

There’s a good bit of talk from the RDBMS camp (seriously, why do they need their own camp?) about why NoSQL is bad. Lots of people who know me would put me in the RDBMS camp, and I’m telling you not to cry yourself to sleep out of guilt over a desire to get to know these systems. They’re interesting, and they solve some huge issues surrounding scalability with greater ease than an RDBMS.

Like what? Well, cost for one. If I could afford Oracle I’d sooner use that than go NoSQL in all likelihood. I can’t afford it. Not even close. Oracle might as well charge me a small planet for their product. It’s great stuff, but out of reach. And what about sharding? Sharding a relational database sucks, and to try to hide the fact that it sucks requires you to pile on all kinds of other crap like query proxies, pools, and replication engines, all in an effort to make this beast do something it wasn’t meant to do: scale beyond a single box. All this stuff also attempts to mask the reality that you’ve also thrown your hands in the air with respect to at least 2 letters that make up the ACID acronym. What’s an RDBMS buying you at that point? Complexity.

And there’s another cost, by the way: no startup I know has the kind of enormous hardware that an enterprise has. They have access to commodity hardware. Pizza boxes. Don’t even get me started on storage. I’ve yet to see SSD or flash storage at a startup. I currently work at MyYearbook.com, and there are some pretty hefty database servers there, but it can hardly be called a startup anymore. Hell, they’re even profitable! ;-)

Where Do I Start?

One nice thing about relationland is I know the landscape pretty well. Going to NoSQL is like dropping me in a country I’ve never heard of where I don’t really speak the language. I have some familiarity with key-value stores from dealing with BDB and Memcache, and I’ve played with MongoDB a bit (using pymongo), but that’s just the tip of the iceberg.

I heard my boss mention Tokyo Tyrant a few times, so I looked into it. It seems to be one of the more obscure solutions out there from the standpoint of adoption, community, documentation, etc., but it does appear to be very capable on a technical level. However, my application is going to be number-heavy, and I’m not going to need to own all of the data required to provide the service. I can probably get away with just incrementing counters in Memcache for some of this work. For persistence I need something that will let me do aggregation *FAST* without having to create aggregation tables, ideally. Using a key/value store for counters really just seems like a no-brainer.

That said, I think what I’ve decided to do, since it doesn’t matter, is punt on this decision in favor of getting a working application up quickly.

MySQL

Yup. I’m going to pick one or two features of the application to implement as a ‘first cut’, and back them with a MySQL database. I know it well, Tornado has a built-in interface for it, and it’s not going to be a permanent part of the infrastructure (otherwise I’d choose PostgreSQL in all likelihood).

To be honest, I don’t think the challenges in bringing this application to life are really related to the data model or the engine/interface used to access it (though if I’m lucky that’ll be a major part of keeping it alive). No, the real problem I’m faced with is completely unrelated to these considerations…

Twitter’s API Service

Not the API itself, per se, but the service providing access to it, and the way it’s administered, is going to be a huge challenge. It’s not just the Twitter website that’s inconsistent; the API service goes right along with it. Not only that, but the type of data I really need to make this application useful isn’t immediately available from the API as far as I can tell.

Twitter maintains rate limits on the API. You can only make so many calls over so short a period of time. That alone makes providing an application like this to a lot of people a bit of a challenge. Compounding the issue is that, when there are failwhales washing up on the shores, those limits can be dynamically decreased. Ugh.

I guess it’s not a project for the faint of heart, but it’ll drive home some golden rules that are easy to neglect in other projects, like planning for failure (of both my application, and Twitter). Also, it’ll be a lot of fun.

Python IDE Frustration

By m0j0, May 13, 2010 9:22 pm

I didn’t think I was looking for a lot in an IDE. Turns out what I want is impossibly hard to find.

In the past 6 months I’ve tried (or tried to try):

  • Komodo Edit
  • Eclipse w/ PyDev
  • PyCharm (from the first EAP build to… yesterday)
  • Wingware
  • Textmate

Wingware

First, let’s get Wingware out of the way. I’m on a Mac, and if you’re not going to develop for the Mac, I’m not going to pay you hundreds of dollars for your product. Period. I don’t even use free software that requires X11. Lemme know when you figure out that coders like Macs and I’ll try Wingware.

Komodo Edit

Well, I wanted to try the IDE but I downloaded it, launched it once for 5 minutes (maybe less), forgot about it, and now my trial is over. I’ll email sales about this tomorrow. In the meantime, I use Komodo Edit.

Komodo Edit is pretty nice. One thing I like about it is that it doesn’t really go overboard forcing its world view down my throat. If I’m working on bunny, which is a one-file Python project I keep in a git repository, I don’t have to figure out their system for managing projects. I can just “Open File” and use it as a text editor.

It has “ok” support for Vi key bindings, and it’s not a plugin: it’s built in. The support has some annoying limitations, but for about 85% of what I need it to do it’s fine. One big annoyance is that I can’t write out a file and assign it a name (e.g. ‘:w /some/filename.txt’). It’s not supported.

Komodo Edit, unless I missed it, doesn’t integrate with Git, and doesn’t offer a Python console. Its capabilities in the area of collaboration in general are weak. I don’t absolutely have to have them, but things like that are nice for keeping focused and not having to switch away from the window to do anything else, so ideally I could get an IDE that has this. I believe Komodo IDE has these things, so I’m looking forward to trying it out.

Komodo is pretty quick compared to most IDEs, and has always been rock solid stable for me on both Mac and Linux, so if I’m not in the mood to use Vim, or I need to work on lots of files at once, Komodo Edit is currently my ‘go-to’ IDE.

PyCharm

PyCharm doesn’t have an officially supported release. I’ve been using Early Adopter Previews since the first one, though. When it’s finally stable I’m definitely going to revisit it, because to be honest… it’s kinda dreamy.

Git integration is very good. I used it with GitHub without incident for some time, but these are early adopter releases, and things happen: two separate EAP releases of PyCharm made my project files completely disappear without warning, error, or any indication that anything was wrong at all. Of course, this is git, so running ‘git checkout -f’ brought things back just fine, but it’s unsettling, so now I’m just waiting for the EAP to be over with and I’ll check it out when it’s done.

I think for the most part, PyCharm nails it. This is the IDE I want to be using assuming the stability issues are worked out (and I don’t have reason to believe they won’t be). It gives me a Python console, VCS integration, a good class and project browser, some nice code analytics, and more complex syntax checking that “just works” than I’ve seen elsewhere. It’s a pretty handsome, very intuitive IDE, and it leverages an underlying platform whose plugins are available to PyCharm users as well, so my Vim keys are there (and, by the way, the IDEAVim plugin is the most advanced Vim support I’ve seen in any IDE, hands down).

Eclipse with PyDev

One thing I learned from using PyCharm and Eclipse is that where tools like this are concerned, I really prefer a specialized tool to a generic one with plugins layered on to provide the necessary functionality. Eclipse with PyDev really feels to me like a Java IDE that you have to spend time laboriously chiseling, drilling, and hammering to get it to do what you need if you’re not a Java developer. The configuration is extremely unintuitive, with a profuse array of dialogs, menus, options, options about options and menus, menus about menus and options… it never seems to end.

All told, I’ve probably spent the equivalent of 2 working days mucking with Eclipse configuration, and I’ve only been able to get it “pretty close” to where I want it. The Java-loving underpinnings of the Eclipse platform simply cannot be suppressed, while things I had to layer on with plugins don’t show up in the expected places.

Add to this Eclipse’s world-view, which reads something like “there is no filesystem tree: only projects”, and you have a really damned annoying IDE. I’ve tried on and off for over a year to make friends with Eclipse because of the good things I hear about PyDev, but it just feels like a big hacky, duct-taped mess to me, and if PyCharm has proven anything to me, it’s that building a language specific IDE on an underlying platform devoted to Java doesn’t have to be like this. When I finally got it to some kind of usable point, and after going through the “fonts and colors” maze, it turns out the syntax highlighting isn’t really all that great!

A quick word about Vi key bindings in Eclipse: it’s not a pretty picture, but the best I’ve been able to find is a free tool called Vrapper. It’s not bad. I could get by with Vrapper, but I don’t believe it’s as mature and evolved as the IDEAVim plugin in PyCharm.

So, I’ll probably turn back to Eclipse for Java development (I’m planning on taking on a personal Android project), but I think I’ve given up on it for anything not Java-related.

Vim

Vim is technically ‘just an editor’, but it has some nice benefits, and with the right plugins, it can technically do all of the things a fancy IDE can. I use the taglist plugin to provide the project and class browser functionality, and the kicker here is that you can actually switch to the browser pane, type ‘/’ and the object or member you’re looking for, and jump to it in a flash. It’s also the most complete Vim key binding implementation available ;-)

The big win for me in using Vim though is remote work. Though I’d rather do all of my coding locally, there are times when I really have to write code on remote machines, and I don’t want to go through the rigmarole of coding, pushing my changes, going to my terminal, pulling down the changes, testing, failing, fixing the code on my machine, pushing my changes, pulling my changes… ugh.

So why not just use Vim? I could do it. I’ve been using Vim for many years and am pretty good with it, but I just feel like separating my coding from my terminal whenever I can is a good thing. I don’t want my code to look like my terminal, nor do I want my terminal to look like my IDE theme. I’m SUPER picky about fonts and colors in my IDE, and I’m not that picky about them in my terminal. I also want the option of using my mouse while I’m coding, mostly to scroll, and getting that to work on a Mac in Terminal.app isn’t as simple as you might expect (and I’m not a fan of iTerm… and its ability to do this comes at a cost as well).

MacVim is nice, solves the separation of Terminal and IDE, and I might give it a more serious try, but let’s face it, it’s just not an IDE. Code completion is still going to be mediocre, the interface is still going to be terminal-ish… I just don’t know. One thing I really love though is the taglist plugin. I think if I could just find a way to embed a Python console along the bottom of MacVim I might be sold.

One thing I absolutely love about Vim, the thing that Vim gets right that none of the IDEs get is colorschemes: MacVim comes with like 20 or 30 colorschemes! And you can download more on the ‘net! The other IDEs must lump colorscheme information into the general preferences or something, because you can’t just download a colorscheme as far as I’ve seen. The IDE with the worst color/font configuration? Eclipse – the one all my Python brethren seem to rave about. That is so frustrating. Some day I’ll make it to PyCon and someone will show me the kool-aid I guess.

The Frustrating Conclusion

PyCharm isn’t soup yet, Wingware is all but ignoring the Mac platform, Eclipse is completely wrong for my brain and I don’t know how anyone uses it for Python development, Komodo Edit is rock solid but lacking features, and Komodo IDE is fairly pricey and a 30-day trial is always just really annoying (and I kinda doubt it beats PyCharm for Python-specific development). MacVim is a stand-in for a real IDE and it does the job, but I really want more… integration! I also don’t like maintaining the plugins and colorschemes and *rc files and ctags, and having to understand its language and all that.

I don’t cover them here, but I’ve tried a bunch of the Linux-specific Python IDEs as well, and I didn’t like a single one of them at all. At some point I’ll spend more time with those tools to see if I missed something crucial that, once learned, might make it hug my brain like a warm blanket (and make me consider running Linux on my desktop again, something I haven’t done on a regular ongoing basis in about 4 years).

So… I don’t really have an IDE yet. I *did* however just realize that the laptop I’m typing on right now has never had a Komodo IDE install, so I’m off to test it now. Wish me luck!

PyTPMOTW: PsycoPG2

By m0j0, April 21, 2010 8:29 pm

What is this module for?

Interacting with a PostgreSQL database in Python.

What is PostgreSQL?

PostgreSQL is an open source relational database product. It has some more advanced features, like built-in networking-related and GIS-related datatypes, the ability to script stored functions in multiple languages (including Python), etc. If you have never heard of PostgreSQL, get out from under your rock!

Making Contact

Using the psycopg2 module to connect to a PostgreSQL database couldn’t be simpler. You can use the connect() method of the module, passing in either the individual arguments required to make contact (dbname, user, etc), or you can pass them in as one long “DSN” string, like this:

import psycopg2
import psycopg2.extensions

dsn = "host=localhost port=6000 dbname=testdb user=jonesy"
conn = psycopg2.connect(dsn)
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

The DSN value is a space-delimited collection of key=value pairs, which I construct before sending the dsn to the psycopg2.connect() method. Once we have a connection object, the very first thing I do is set the connection’s isolation level to ‘autocommit’, so that INSERT and UPDATE transactions are committed automatically without my having to call conn.commit() after each transaction. There are several isolation levels defined in the psycopg2.extensions package, and they’re defined in ‘extensions’ because they go beyond what is defined in the DB API 2.0 spec that is typically used as a reference in creating Python database modules.
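
If the connection settings already live in a dictionary or config file, the DSN string can be assembled rather than typed by hand. Here’s a minimal sketch of that idea; build_dsn is a hypothetical helper of my own naming, not part of psycopg2, and it assumes none of the values contain spaces (those would need quoting).

```python
def build_dsn(**params):
    """Join keyword arguments into a space-delimited key=value DSN string."""
    # Skip unset values and sort the keys so the result is deterministic.
    return " ".join(
        "%s=%s" % (key, value)
        for key, value in sorted(params.items())
        if value is not None
    )


dsn = build_dsn(host="localhost", port=6000, dbname="testdb", user="jonesy")
print(dsn)  # → dbname=testdb host=localhost port=6000 user=jonesy
```

The resulting string can be handed to psycopg2.connect() exactly like the hand-written DSN above.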

Simple Queries and Type Conversion

In order to get anything out of the database, we have to know how to talk to it. Of course this means writing some SQL, but it also means sending query arguments in a format understood by the database. I’m happy to report that psycopg2 does a pretty good job of making things “just work” when it comes to converting your input into PostgreSQL types, and converting the output directly into Python types for easy manipulation in your code. That said, understanding how to properly use these features can be a bit confusing at first, so let me address the source of a lot of early confusion right away:

cur = conn.cursor()
# Query arguments go in as a sequence, even when there is only one
cur.execute("""SELECT id, fname, lname, balance FROM accounts WHERE balance > %s""", (min_balance,))

Chances are, min_balance is an integer, but we’re using ‘%s’ anyway. Why? Because this isn’t really you telling Python to do a string formatting operation, it’s you telling psycopg2 to convert the incoming data using the default psycopg2 method, which converts integers into the PostgreSQL INT type. So, you can use “%s” in the ‘execute()’ method to properly convert integers, strings, dates, datetimes, timedeltas, lists, tuples and most other native Python types to a corresponding PostgreSQL type. There are adapters built into psycopg2 as well if you need more control over the type conversion process.
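
To see the difference between driver-side binding and string formatting without needing a PostgreSQL server, here’s a small sketch that swaps in the standard library’s sqlite3 module as a stand-in. This is not psycopg2: sqlite3 spells its placeholder ‘?’ where psycopg2 uses ‘%s’, and the table and rows are made up for the demo. The shape of the call is the same, though: SQL text first, then the values as a sequence.

```python
import sqlite3

# In-memory database with a couple of made-up rows, purely for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (id INTEGER, fname TEXT, lname TEXT, balance INTEGER)")
cur.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?, ?)",
    [(1, "Ann", "Lee", 50), (2, "Bob", "Ray", 500)],
)

min_balance = 100
# The value travels separately from the SQL text, wrapped in a one-element
# tuple. With psycopg2 the placeholder would be %s instead of ?.
cur.execute("SELECT id, fname FROM accounts WHERE balance > ?", (min_balance,))
rows = cur.fetchall()
print(rows)  # → [(2, 'Bob')]
```

Because the driver does the conversion, the integer never has to be turned into a string by hand, and nothing in the value can change the structure of the query.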

Cursors

Psycopg2 makes it pretty easy to get your results back in a format that is easy for the receiving code to deal with. For example, the projects I work on tend to use the RealDictCursor type, because the code tends to require accessing the parts of the resultset rows by name rather than by index (or just via blind looping). Here’s how to set up and use a RealDictCursor:

import psycopg2.extras

curs = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
curs.execute("SELECT id, name FROM users")
rs = curs.fetchall()
for row in rs:
    print(row['id'], row['name'])

It’s possible you have two sections of code that’ll rip apart a result set, and one needs by-name access, and the other just wants to loop blindly or access by index number. If that’s the case, just replace ‘RealDictCursor’ with ‘DictCursor’, and you can have it both ways!

Another nice thing about psycopg2 is the cursor.query attribute and cursor.mogrify method. Mogrify lets you see how a query will look after all input variables are bound, but before the query is sent to the server. Cursor.query holds the exact query that was last sent over the wire. I print cursor.query in my logging output all the time to catch out-of-order parameters, mismatched input types, etc. Here’s an example:

try:
    curs.callproc('myschema.myprocedure', callproc_params)
except Exception as out:
    print(out)
    print(curs.query)

Calling Stored Functions

Stored procedures or ‘functions’ in PostgreSQL-speak can be immensely useful in large complex applications where you want to enforce business rules in a single place outside the domain of the main application developers. It can also in some cases be more efficient to put functionality in the database than in the main application code. In addition, if you’re hiring developers, they should develop in the standard language for your environment, not SQL: SQL should be written by database administrators and developers, and exposed to the developers as needed, so all the developers have to do is call this newly-exposed function. Here’s how to call a function using psycopg2:

callproc_params = [uname, fname, lname, uid]
cur.callproc("myschema.myproc", callproc_params)

The first argument to ‘callproc()’ is the name of the stored procedure, and the second argument is a sequence holding the input parameters to the function. The input parameters should be in the order that the stored procedure expects them, and I’ve found after quite a bit of usage that the module typically is able to convert the types perfectly well without my intervention, with one exception…

The UUID Array

PostgreSQL has built-in support for lots of interesting data types, like INET types for supporting IP addresses and CIDR network blocks, and GIS-related data types. In addition, PostgreSQL supports a type that is an array of UUIDs. This comes in handy if you use a UUID to identify items and want to store an array of them to associate with an order, or you use UUIDs to track messages and want to store an array of them together to represent a message thread or conversation. Getting a UUID array into the database is really not too difficult. If you have a list of UUID strings, you can do a quick conversion, call one function, and then use the array like any other input parameter:

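import uuid
import psycopg2.extras

# Convert the UUID strings to uuid.UUID objects, then register the adapter
my_uuid_arr = [uuid.UUID(i) for i in my_uuid_arr]
psycopg2.extras.register_uuid()
callproc_params = [
    myvar1,
    myvar2,
    my_uuid_arr,
]

curs.callproc('myschema.myproc', callproc_params)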

Connection Status

It’s not a given that your database connection lives on from query to query, and you shouldn’t really just assume that because you did a query a fraction of a second ago that it’s still around now. Actually, to speak about things more Pythonically, you *should* assume the connection is still there, but be ready for failure, and check the connection status to diagnose and help get things back on track. You can check the ‘status’ attribute of your connection object. Here’s one way you might do it:

    @property
    def active_dbconn(self):
        return self.conn.status in [psycopg2.extensions.STATUS_READY, psycopg2.extensions.STATUS_BEGIN]

So, I’m assuming here that you have some object that has a connection object that it refers to as ‘self.conn’. This one-liner function uses the @property built-in Python decorator, so the other methods in the class can either check the connection status before attempting a query:

if self.active_dbconn:
    try:
        curs.execute(...)
    except Exception as out:
        logging.error("Houston we have a problem")

Or you can flip that around like this:

try:
    curs.execute(...)
except Exception as out:
    if not self.active_dbconn:
        logging.error("Execution failed because your connection is dead")
    else:
        logging.error("Execution failed in spite of live connection: %s" % out)

Read On…

A database is a large, complex beast. There’s no way to cover the entirety of a database or a module that talks to it in a simple blog post, but I hope I’ve been able to show some of the more common features, and maybe one or two other items of interest. If you want to know more, I’m happy to report that, after a LONG time of being unmaintained, the project has recently sprung back to life and is pretty well-documented these days. Check it out!

PyTPMOTW: PyYAML

By m0j0, April 12, 2010 8:47 pm

What’s This Module For?

Reading and writing files formatted using “YAML Ain’t Markup Language” (YAML), and converting YAML syntax into native Python objects and datatypes.

What is YAML?

According to the website which houses the YAML Specification:

YAML™ (rhymes with “camel”) is a human-friendly, cross language, Unicode
based data serialization language designed around the common native data
structures of agile programming languages. It is broadly useful for
programming needs ranging from configuration files to Internet messaging to
object persistence to data auditing.

My introduction to YAML came several years ago in the context of messaging, and I then had a run-in with YAML as a logging format (actually, I was trying to parse a MySQL slow query log by coaxing it into YAML format). However, when I started writing Python full time, working on several different initiatives, YAML quickly became the standard configuration format.

Why? Simplicity. Using YAML for our config files and PyYAML to parse them, any developer can figure out what’s happening in our application in a matter of minutes, even if Python is not their primary language. It’s also nice that the YAML syntax is parsed into native Python datatypes, so Python coders looking at a config file can start to get a pretty good picture of how the program basically works.

The other thing that makes it simpler than some other config-specific options is that there’s not a lot of underlying “stuff” to know about. YAML isn’t a configuration engine, it’s essentially just a way to deal with data structures without locking the format to a specific language.

I also happen to like that it’s not config-specific, because it means that if I later need a messaging format, I already know one, and am familiar with a certain Python module to work with it!

Basic Usage

Let’s write a very simple YAML configuration for the logging portion of an application:

%YAML 1.2
---
Logging:
   format: "%(levelname) -10s %(asctime)s %(module)s:%(funcName)s()  %(message)s"
   level: 10
...

I’ve put logging-related configuration in its own “section” (really data structure) here so when I want to configure other things in the application I can do so without shooting myself in the foot and having to be careful not to use the same key names, etc.

I’ve stored this configuration in a file called ‘log.conf’. From there you can easily play with it in an interpreter session:

>>> import yaml
>>> config_file = open('log.conf', 'r')
>>> config = yaml.safe_load(config_file)
>>> config
{'Logging': {'format': '%(levelname) -10s %(asctime)s %(module)s:%(funcName)s()  %(message)s', 'level': 10}}
>>>

With the configuration out of the way, let’s look at the code that would use it:

#!/usr/bin/env python

import logging
import yaml

def doit(uid):
    logging.debug("Working with uid: %s" % uid)

if __name__ == "__main__":
    with open('log.conf', 'r') as config_file:
        config = yaml.safe_load(config_file)
    logging.basicConfig(**config['Logging'])

    doit(22222)

logging.basicConfig() takes a keyword dictionary of optional configuration items. Here I’m just using the ‘format’ and ‘level’ options, but there are more.

The only thing I do inside the doit() function is use logging to output the value of ‘uid’ passed in. This is really a test that the format I’ve configured is actually being used.

The format is fairly intuitive: indentation defines a block, just like in Python. The ‘---’ and ‘...’ lines denote the beginning and end of the YAML document. You can have several documents in a file if you so choose. This might be done if you’re storing a feed or email threads in YAML format.
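
As a rough sketch of the several-documents case, PyYAML’s safe_load_all() yields one Python object per document in the stream. The two tiny “messages” below are invented for illustration.

```python
import yaml

# Two YAML documents in one stream, each wrapped in --- and ... markers.
stream = """\
---
subject: Lunch?
sender: alice
...
---
subject: "Re: Lunch?"
sender: bob
...
"""

# safe_load_all() returns a generator; list() pulls every document out of it.
messages = list(yaml.safe_load_all(stream))
for message in messages:
    print(message["sender"], "wrote:", message["subject"])
```

Note the quotes around the second subject: a colon followed by a space inside an unquoted value would otherwise be read as another key.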

Type Conversion

Type conversion to the built-in Python primitives works very well and is very intuitive in my experience. In the config above, the value for the ‘format’ key would be parsed as a string, and the value for the ‘level’ key as an ‘int’. The entire block above will become a dictionary, and there is YAML syntax you can use to create lists and lists of lists, etc., as well.

For example, let’s say I’m creating a Django-like web application framework and I’ve decided to store my URL-to-handler mappings in a YAML file. You could easily do it with a list of lists, which looks like this in YAML:

RequestHandlers:
- [/, framework.handlers.RootHandler]
- [/signup, framework.handlers.RegisterNow]
- [/login, framework.handlers.Login]
- [/faq, framework.handlers.FAQ]

This will form a list of lists that you can work with in your code that looks like this in the config dictionary:

{'RequestHandlers': [['/', 'framework.handlers.RootHandler'], ['/signup',
'framework.handlers.RegisterNow'], ['/login', 'framework.handlers.Login'],
['/faq', 'framework.handlers.FAQ']]}

If for some reason type conversion doesn’t work as you expect, or you need to represent, say, a boolean using a string like “y” or “Yes” instead of “True”, you can explicitly tag your value using tags defined in the YAML specification for this very purpose. Here’s how you’d explicitly tag “Yes” as a boolean, to ensure it’s not parsed as a string:

verbose: !!bool "Yes"

When this is parsed by PyYAML, it will be a Python boolean, and the value when printed to the screen will be ‘True’ (without quotes). There are several other explicit type tags, including ‘!!int’, ‘!!float’, ‘!!null’, ‘!!timestamp’ and more.
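
Here’s a quick sketch that exercises a few of these tags in one go. The key names other than ‘verbose’ are made up for the example; the point is only to show a quoted string being forced into another type, and a number being forced to stay a string.

```python
import yaml

doc = """\
verbose: !!bool "Yes"
port: !!int "6000"
ratio: !!float "0.5"
version: !!str 1.0
"""

config = yaml.safe_load(doc)
for key, value in sorted(config.items()):
    # Show each value next to the Python type it was converted to.
    print(key, repr(value), type(value).__name__)
```

Without the ‘!!str’ tag, the last value would come back as a float rather than the string ‘1.0’.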

If you like, you could alter our URL mapper from above and create a list of tuples. Note the use of the !!omap tag, which is short for ‘ordered mapping’:

RequestHandlers: !!omap
- /: framework.handlers.RootHandler
- /signup: framework.handlers.RegisterNow
- /login: framework.handlers.Login
- /faq: framework.handlers.FAQ

The resulting config dictionary looks like this:

{'RequestHandlers': [('/', 'framework.handlers.RootHandler'), ('/signup',
'framework.handlers.RegisterNow'), ('/login', 'framework.handlers.Login'),
('/faq', 'framework.handlers.FAQ')]}

More than once I’ve gone back to my YAML configuration to alter the type of data structure returned to better suit the code that uses it. It’s pretty convenient, and making the changes to both the configuration file and the code is typically easy enough to be considered a non-event.
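
Either structure leaves you with dotted names as plain strings, so the code still has to turn them into real objects. Here’s one way that might look, as a sketch only: resolve() is a hypothetical helper, and since framework.handlers doesn’t exist outside this example, two standard library classes stand in for the handlers.

```python
import importlib


def resolve(dotted_name):
    """Import 'package.module.Attr' and return the attribute it names."""
    module_name, _, attr = dotted_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Stand-in for the parsed config; stdlib classes replace framework.handlers.*
config = {"RequestHandlers": [
    ["/json", "json.JSONDecoder"],
    ["/odict", "collections.OrderedDict"],
]}

# Both the list-of-lists and the list-of-tuples forms unpack the same way.
routes = {path: resolve(name) for path, name in config["RequestHandlers"]}
print(routes["/json"])
```

Because each entry unpacks into a (path, name) pair, switching the config between the list-of-lists and !!omap forms doesn’t require touching this loop.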

Beyond Basic Data Types

The ‘level’ option in logging.basicConfig can be specified either as a word or a numeric value (internally, logging.DEBUG maps to the integer value 10). But what if you didn’t know this, or you didn’t have the option of using an integer? Specifying ‘logging.DEBUG’ in the config file wouldn’t have worked, because it would’ve come in as a string, and not an exposed module name.

If you don’t care about locking your configuration file to a language, PyYAML will let you do what you need using language-specific tags (these need a loader that is allowed to construct arbitrary Python objects; yaml.safe_load rejects them). So, for the purposes of our program, the following two lines in YAML produce the same effect:

level: 10
level: !!python/name:logging.DEBUG

You might also choose to do this because reading ‘logging.DEBUG’, even with the added tag overhead, is probably easier to understand than trying to figure out what “10” means.
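
If you would rather keep the file language-neutral and still get a readable level, one alternative (my own sketch, not something PyYAML does for you) is to store the level as a plain word and resolve it in code. level_from_name is a hypothetical helper, and the dictionary below stands in for the parsed config.

```python
import logging


def level_from_name(name):
    """Map a level name such as 'DEBUG' to its numeric logging value."""
    level = getattr(logging, str(name).upper(), None)
    if not isinstance(level, int):
        raise ValueError("unknown logging level: %r" % (name,))
    return level


# Stand-in for what a config containing "level: debug" would parse to.
config = {"Logging": {"level": "debug"}}

settings = dict(config["Logging"])
settings["level"] = level_from_name(settings["level"])
logging.basicConfig(**settings)
print(settings["level"])  # → 10
```

The trade-off is a few extra lines of code in exchange for a config file that any YAML parser, in any language, can read.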

If you’re developing code that allows users to write plugins, you can also let them add their plugins by adding a simple line to a ‘plugin’ section of the YAML config file in such a way that the config dictionary itself will contain an actual new instance of the proper object:

Plugins:
- !!python/object/new:MyPlugin.Processor [logfile='foo.log']
- !!python/object/new:FooPluginModule.CementMixers.RotaryMixer [consistency='chunky']

The above will produce a list of plugin instances with ‘args’ in the appended list fed to each class’s __init__ method. Don’t forget that if you want to access the plugins by name instead of looping over a list, you can easily make this a dictionary. Also, PyYAML supports passing more initialization info to the class constructor.

Anchors and Aliases

You can create a block in your YAML config file, and then reference it in other sections of the configuration, and it can save you a lot of lines in a more complex configuration. This is done using anchors and aliases. An anchor starts with “&” and an alias (a reference to the anchor) begins with a “*”. So, let’s say you have multiple plugins loaded (continuing on from the example), and they all need their own configuration, but they’ll all connect to the same exact database server, and use the same credentials and db name, etc. Just create the db config once, make it an anchor, and reference it as needed:

DB: &MainDB
   server: localhost
   port: 6000
   user: dbuser
   db: myappdb
Plugins:
   loghandler: !!python/object/new:MyLogHandler
      args: ['mylogfile.log']
      db: *MainDB

When this is read in, the dictionary defined in &MainDB will appear as the value for the dict key ['Plugins']['loghandler']['db']. If you wanted to pass the *entire* config structure to your plugin, you technically wouldn’t need this, but I typically would only pass the portion of the config structure specifically dealing with the plugin, because configs can get large, and there could be lots of stuff that have nothing to do with the plugin in the rest of the config.
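
Here’s a stripped-down sketch of the anchor and alias mechanics on their own. To keep it runnable with yaml.safe_load, I’ve dropped the !!python/object/new tag and made the plugin a plain mapping; the ‘logfile’ key is invented for the example.

```python
import yaml

doc = """\
DB: &MainDB
   server: localhost
   port: 6000
Plugins:
   loghandler:
      logfile: mylogfile.log
      db: *MainDB
"""

config = yaml.safe_load(doc)

# The alias expands to the anchored mapping wherever it is referenced.
db = config["Plugins"]["loghandler"]["db"]
print(db)

# Hand the plugin only its own slice of the config, not the whole structure.
plugin_config = config["Plugins"]["loghandler"]
print(sorted(plugin_config))
```

The plugin’s slice already carries the database settings, which is what makes passing only that portion of the config practical.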

Moving Ahead

Although 90% of your use of PyYAML might well consist of loading a YAML file or message and working with the resulting data structure, it’s nice to know that it does provide quite a bit of flexibility if you’re willing to look for it. Here are some links for further reading about PyYAML, including a couple of items not covered in this tutorial:

Pass more initialization data to classes specified with !!python/object/new

Create your own app-specific tags, a la ‘!!bool’ and ‘!!python’.

Dump Python Objects to YAML

Tornado’s Big Feature is Not ‘Async’

By m0j0, April 4, 2010 9:43 pm

I’ve been working with the Tornado web server pretty much since its release by the Facebook people several months ago. If you’ve never heard of it, it’s a sort of hybrid Python web framework and web server. On the framework side of the equation, Tornado has almost nothing. It’s completely bare bones when compared to something like Django. On the web server side, it is also pretty bare bones: it lacks hardcore features like Apache’s ability to be a proxy and set up virtual hosts and all of that stuff. It does have some good performance numbers though, and the feature that seems to drive people to Tornado is that it’s asynchronous, and pretty fast.

I think some people come away from their initial experiences with Tornado a little disheartened because only upon trying to benchmark their first real app do they come face to face with the reality of “asynchronous”: Tornado can be the best async framework out there, but the minute you need to talk to a resource for which there is no async driver, guess what? No async.

Some people might even leave the ring at this point, and that’s a shame, because to me the async features in Tornado aren’t what attract me to it at all.

Why Tornado, if Not For Async?

For me, there’s an enormous win in going with Tornado (or other things like it), and to get this benefit I’m willing to deal with some of Tornado’s warts and quirks. I’m willing to deal with the fact that the framework provides almost nothing I’m used to having after being completely spoiled by Django. What’s this magical feature you ask? It’s simply the knowledge that, in Tornado-land, there’s no such thing as mod_wsgi. And no mod_python either. There’s no mod_anything.

This means I don’t have to think about sys.path, relative vs. absolute paths, whether to use daemon or embedded mode, “Cannot be loaded as Python module” errors, “No such module” errors, permissions issues, subtle differences between Django’s dev server and Apache/mod_wsgi, reconciling all of these things when using/not using virtualenv, etc. It means I don’t have to metascript my way into a working application. I write the app. I run the app.

Wanna see how to create a Tornado app? Here’s one right here:

import tornado.httpserver
import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("This is a Tornado app")

application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(8888)
    tornado.ioloop.IOLoop.instance().start()

Save this to whatever file you want, run it, and do ‘curl http://localhost:8888’ and you’ll see ‘This is a Tornado app’ on your console.

Simplistic? Yes, absolutely. But when you can just run this script, put it behind nginx, and have it working in under five minutes, you dig a little deeper and see what else you can do with this thing. Turns out, you can do quite a bit.

Can I Do Real Work With This?

I’ve actually been involved in a production launch of a non-trivial service running on Tornado, and it was mind-numbingly easy. It was several thousand lines of Python, all of which was written by two people, and the prototype was up and running inside of a month. Moving from prototype to production was a breeze, and the site has been solid since its launch a few months ago.

Do You Miss Django?

I miss *lots* of things about Django, sure. Most of all I miss Django’s documentation, but Tornado is *so* small that you actually can find what you need in the source code in 2 minutes or less, and since there aren’t a ton of moving parts, when you find what you’re looking for, you just read a few lines and you’re done: you’re not going to be backtracking across a bunch of files to figure out the process flow.

I also miss a lot of what I call Django’s ‘magic’. It sure does a lot to abstract away a lot of work. In place of that work, though, you’re forced to take on a learning curve that is steeper than most. I think it’s worth getting to know Django if you’re a web developer who hasn’t seen it before, because you’ll learn a lot about Python and how to architect a framework by digging in and getting your hands dirty. I’ve read seemingly most books about Django, and have done some development work in Django as well. I love it, but not for the ease of deployment.

I spent more time learning how to do really simple things with Django than it took to:

  1. Discover Tornado
  2. Download/install and run ‘hello world’
  3. Get a non-trivial, commercial application production-ready and launch it.

Deadlines, indeed!

Will You Still Work With (Django/Mingus/Pinax/Coltrane/Satchmo/etc)?

Sure. I’d rather not host it, but if I have to I’ll get by. These applications are all important, and I do like developing with them. It’s mainly deployment that I have issues with.

That’s not to say I wouldn’t like to see a more mature framework made available for Tornado either. I’ve worked on one, though it’s not really beyond the “app template” phase at this point. Once the app template is able to get out of its own way, I think more features will start to be added more quickly… but I digress.

In the end, the astute reader will note that my issue isn’t so much with Django-like frameworks (though I’ll note that they don’t suit every purpose), but rather with the current trend of using mod_wsgi for deployment. I’ll stop short of bashing mod_wsgi, because it too is an important project that has done wonders for the state of Python in web development. It really does *not* fit my brain at all, though, and I find when I step into a project that’s using it and it has mod_wsgi-related problems, identifying and fixing those problems is typically not a simple and straightforward affair.

So, if you’re like me and really want to develop on the web with Python, but mod_wsgi eludes you or just doesn’t fit your brain, I can recommend Tornado. It’s not perfect, and it doesn’t provide the breadth of features that Django does, but you can probably get most of your work done with it in the time it took you to get a mod_wsgi “Hello World!” app to not return a 500 error.

Programmers that… can’t program.

By m0j0, March 15, 2010 7:06 pm

So, I happened across this post about hiring programmers, which references two other posts about hiring programmers. There seems to be a demand for blog posts about hiring programmers, but that’s not why I’m writing this. I’m writing because there was this sort of nagging irony that I couldn’t help but stumble upon.

In a blog post, Joel Spolsky talks about the mathematical inaccuracies associated with claims of “only hiring the top 1%”. It seemed pretty obvious to me that whether or not you’re hiring the top 1% of all programmers is pretty much unknowable, and when managers say they hire “the top 1%”, I assume they’re talking about the top 1% of their applicants. Note too that I always thought it was idiotic to point this out, because, well, isn’t that what you’re SUPPOSED to do? You’re not very well going to aim for the middle & hope for the best are you?

Apparently I’ve been giving too much credit to management. There I go giving people with ties on the benefit of the doubt again.

Then, in another blog post, Jeff Atwood talks about how it’s very difficult to even get interviews with programmers who can actually program. The problem is real.

The original blog post that pointed me at the two others is one by Roberto Alsina where he talks about his own methods for weeding out the non-programmers. He’s clearly seen the issue as well.

But if you open all three of these posts in separate tabs and read them, you’re likely to come away with the same basic problem I did:

  • Who the hell are these managers who can’t figure out a dead simple statistics problem?
  • How can a person fairly inept at simple math be qualified to make a hiring decision for anything but a summer intern?

That sorta blew my mind a little. But it blew my mind a lot when Atwood started describing the problems that interviewees *couldn’t* perform in an interview! One task described by Imran was called a ‘FizzBuzz’ question. Here’s one such question:

Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.

Here’s the part that blew my mind: He says, and I quote:

Most good programmers should be able to write out on paper a program which does this in a under a couple of minutes.

Want to know something scary ? – the majority of comp sci graduates can’t. I’ve also seen self-proclaimed senior programmers take more than 10-15 minutes to write a solution.

That’s amazing to me. I decided to quickly pop open a Python prompt and see if I could do it:

>>> for i in range(1,101):
...     if (i % 3 == 0) and (i % 5 == 0):
...             print i,'FizzBuzz'
...     elif i % 3 == 0:
...             print i, 'Fizz'
...     elif i % 5 == 0:
...             print i, 'Buzz'
...     else:
...             print i
...

Note that I’ve taken the liberty of printing out the numbers in addition to the required words. I’m playing the role of interviewer and interviewee here, and wanted to be able to easily verify that things were correct, since there was no time for unit testing :)

Turns out it worked on the first try! That was pasted directly from my terminal screen. I didn’t time myself, but it took far less than 5 minutes. This leads to my other question, of course, which is “if you’re going to complain about CS degree holders not writing good code, maybe it’s time to open the doors to non-CS degree holders?”
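
Since I mentioned there was no time for unit testing, here’s a sketch of a variant that sticks to the letter of the question (the word replaces the number) and is easy to check. Pulling the logic into a function, and testing 15 as the combined multiple, are my choices for the example rather than anything the original question asks for.

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:  # multiple of both three and five
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


lines = [fizzbuzz(i) for i in range(1, 101)]
print("\n".join(lines))
```

Returning strings instead of printing them means a couple of assertions on ‘lines’ can stand in for eyeballing a hundred rows of output.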

Seeking Elegant Pythonic Solution

By m0j0, February 1, 2010 4:31 pm

So, I have some code that queries a data source, and that data source sends me back an XML message. I have to parse the XML message so I can store information from it into a relational database. So, let’s say my XML response looks like this:

<xml>
<response>
<results count="2">
  <result>
    <fname>Brian</fname>
    <lname>Jones</lname>
    <gender>M</gender>
    <office_phone_ext>777</office_phone_ext>
    <mobile_phone>201-555-1212</mobile_phone>
  </result>
  <result>
    <fname>Molly</fname>
    <lname>Jones</lname>
    <home_phone>201-555-1234</home_phone>
  </result>
</results>
</response>
</xml>

So, as you can see, the attributes for each result returned for a query can differ, and if a result doesn’t have a value for some attribute, the corresponding xml element isn’t included at all for that result. If it were just 2 or 3 attributes, I could easily enough get around it by doing something like this:

def __init__(self, xmlresult):
    self.xmlresult = xmlresult
    # find() returns None when the child element is absent
    if self.xmlresult.find('fname') is not None:
        self.fname = self.xmlresult.findtext('fname')
    if self.xmlresult.find('lname') is not None:
        self.lname = self.xmlresult.findtext('lname')

Like I said, if it were just a few things I needed to check for, I’d do it this way and be done with it. It’s not just a few though — it’s like 50 attributes. Now what?

I decided lxml.objectify would be a great way to go. It would allow me to access these things as object attributes, which should mean I can do something like this:

self.fname = getattr(self.xmlresult, 'fname', None)
self.lname = getattr(self.xmlresult, 'lname', None)
...

So, you *can* do this, technically speaking. Trouble is, you’re asking for an attribute of an ObjectifiedElement object, and when you do that, it returns an object that is not a native Python datatype, which I did not realize when I first started using lxml.objectify. So, in the above, ‘self.fname’ will not be a Python string — it’ll be an lxml.objectify.StringElement object. Of course, my database driver, my ‘join()’ operations, and everything else in my code that relies on native Python datatypes is now broken.

What I actually need to do is get the ‘.pyval’ attribute of self.xmlresult.fname, if that attribute exists at all. So I want something that does what I mean, which is “self.fname = getattr(self.xmlresult, ‘fname.pyval’, None)”. And, of course, doing ‘getattr(self.xmlresult, ‘fname’, None).pyval’ doesn’t work because None has no attribute ‘pyval’. I’ve tried a couple of other hacks too, but I’ve learned enough Python to know that if it feels like a hack, there’s probably a better way. But I can’t find that better way. Ideas?
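
One idea, offered as a sketch rather than a tested answer: chain two getattr() calls, so a missing element and a missing ‘.pyval’ both fall through to the default, then loop over the field names instead of writing 50 assignments. The helper name ‘native’ is made up, and SimpleNamespace objects stand in for the lxml.objectify result so the example runs without lxml; I’m assuming a real ObjectifiedElement behaves the same way with getattr().

```python
from types import SimpleNamespace


def native(obj, name, default=None):
    """Return obj.<name>.pyval, or default if either attribute is missing."""
    element = getattr(obj, name, None)
    # getattr(None, "pyval", default) simply returns the default.
    return getattr(element, "pyval", default)


# SimpleNamespace objects stand in for an lxml.objectify result element here.
result = SimpleNamespace(
    fname=SimpleNamespace(pyval="Molly"),
    lname=SimpleNamespace(pyval="Jones"),
)

FIELDS = ["fname", "lname", "gender", "home_phone"]
record = {field: native(result, field) for field in FIELDS}
print(record)
```

Inside the class, the same loop could call setattr(self, field, native(xmlresult, field)) for each name in the list.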

What “Batteries Included” Means

By m0j0, January 22, 2010 10:15 pm

When I first got into Python, I read lots of blog posts that mentioned that Python was “the batteries included language”, but those same posts were short on any explanation of what that really meant. A few years and lots of projects later, I think I’m now qualified to at least give a beginner a basic understanding of what people mean when they say that.

What it means

Python has what’s called the “Standard Library”, a collection of modules that each make the tasks within a particular problem domain simpler. The standard library modules are all part of the standard Python installation, so you don’t have to add them yourself. Here is a short list of things I’ve used Python for, and the standard library modules I used to get the work done:

  • Wrote a simple filesystem backup routine that works on my Linux and Solaris servers, as well as my Mac laptop. For this I used the os, stat, bz2, gzip, time, datetime, and tarfile modules. Oh – and optparse!
  • The first iteration of the loghetti Apache log file analysis tool used one external module, which I’m replacing now with optparse. The other modules used are urlparse, cgi, datetime, operator, re, sys, and mmap.
  • I’ve written a couple of simple web API clients using nothing more than urllib/urllib2, ElementTree, and various bits of the xml package (xml.dom.minidom comes to mind).
  • I wrote a MySQL backup script using the sys, os, time, shutil, glob, tarfile, and optparse modules.

There are built-in modules for XML/HTML parsing, url parsing, network communications, threading and multiprocessing, image and audio manipulation, and lots of other tasks you’re likely to come across. The story of Python cannot be told from the standard library alone, but you can do an awful lot of work with what’s provided.
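To make the backup examples above a little more concrete, here is a minimal sketch of a directory backup that uses only the os, time, and tarfile modules. It is not the script described in the list; the function name backup_directory and the archive naming scheme are assumptions made for illustration.

```python
import os
import tarfile
import time


def backup_directory(source_dir, dest_dir):
    """Write a gzip-compressed tarball of source_dir into dest_dir; return its path."""
    stamp = time.strftime('%Y%m%d-%H%M%S')
    base = os.path.basename(os.path.abspath(source_dir))
    archive_path = os.path.join(dest_dir, '%s-%s.tar.gz' % (base, stamp))
    with tarfile.open(archive_path, 'w:gz') as archive:
        # arcname stores paths relative to the directory name, not the full path.
        archive.add(source_dir, arcname=base)
    return archive_path
```

The destination should live outside the source directory, otherwise the archive would try to include itself.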

Why you care

You care because you want to write the code that solves the problem at hand without worrying about a whole lot of low-level details like socket communications and memory management. I mean, it’s nice to know that Python exposes those low-level details to you should you ever get a wild hair, but if you frequently had occasion to really need that, you’d probably code in C.

Perhaps ironically, you also care about the batteries included because of what it means for those batteries that *aren’t* included. Chances are the Python-driven applications and external modules you wind up using make very heavy use of the standard library modules, making the code for those add-ons simpler to read and understand. It has the potential to make them more reliable and consistent as well.

I’d rather see a threaded application use Python’s threading module than code the threading implementation by hand. I’d rather see a Python web server use some descendant of Python’s socket module than code socket operations by hand. Having things like socket and threading in the standard library means that tools across various problem domains that happen to require certain common functionality work in a standard way where that functionality is concerned.

The same holds true for your own code. If you need to write a Jabber messaging server in Python, and six months later you need to write a queue-based networked job dispatching server, you’re going to be using similar modules, and you might even be able to reuse some of your old Jabber server code. Reuse for the win!

Remember LEGO? I was obsessed with LEGO. I remember becoming so intimately familiar with every type of block, window, platform, person, light post, wheel, and car base that envisioning my own custom-made Grand Galaxy Cruiser was easy, and actually building it (from pieces of probably 10 different LEGO sets) was only slightly more difficult. The standard library, in essence, is your LEGO set. Really, instead of “batteries included” Python’s mantra could well be “There’s a block for that” :)

It’s not all beer and skittles

Even in LegoLand there are pieces that don’t fit together the way you’d like. I wanted to build a house once with car windshields in place of windows. Turns out it’s not so simple.

Python 3 improves things a lot in terms of how the standard library is organized, but if you have an external module your code depends on, it might not be ready for Python 3, which means that you (like me and many others) are stuck in Python 2-land for a while. It’s not a big deal really, but I do find that I often need more than one standard module to do what should really be handled by one.

One example of this is the urllib and urllib2 modules which have an overlapping set of features and problems they address. As a result, you’ll often see these two modules used together. In my own code I’ve had to use both of these modules in addition to the cgi module, *and* the urlparse module. In Python 3, I’d only need urllib. Yay!
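As a rough illustration of that consolidation, the sketch below edits a URL’s query string using only Python 3’s urllib.parse. In Python 2 the same pieces were spread across modules: urlsplit and urlunsplit in urlparse, parse_qs in cgi (and later urlparse), and urlencode in urllib. The function name add_query_params is made up for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def add_query_params(url, **params):
    """Return url with params merged into its existing query string."""
    parts = urlsplit(url)
    # parse_qs maps each key to a list of values, e.g. {'q': ['python']}.
    query = parse_qs(parts.query)
    for key, value in params.items():
        query[key] = [value]
    # doseq=True expands each list back into repeated key=value pairs.
    new_query = urlencode(query, doseq=True)
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       new_query, parts.fragment))
```

Calling add_query_params('http://example.com/search?q=python', page=2) keeps the existing q parameter and adds page to the query string.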

XML is another place where I’ve needed multiple modules, and again there has been some consolidation in Python 3.

In the end, this doesn’t get in the way of getting work done very much. It’s just a pattern you start to notice as you work through more of the standard library and use it for your own projects.

Missing Batteries

Python will sometimes surprise you with what’s included. There’s a json module, for example, and the sqlite3 module, both of which are nice to have. But while there’s a whole section of the library devoted to protocols, neither LDAP nor SNMP is represented, and I need both of them :(
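For a taste of the json and sqlite3 batteries mentioned above, here is a small sketch that stores a JSON document in an in-memory SQLite database and reads it back. The table name, column names, and sample data are all invented for the example.

```python
import json
import sqlite3

# An in-memory database needs no file and disappears when the connection closes.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)')

event = {'host': 'web01', 'status': 200}  # made-up sample data
# The ? placeholder lets sqlite3 handle quoting of the serialized JSON.
conn.execute('INSERT INTO events (payload) VALUES (?)', (json.dumps(event),))

row = conn.execute('SELECT payload FROM events').fetchone()
restored = json.loads(row[0])
conn.close()
```

Neither module needs anything beyond a standard Python installation.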

It might also be nice to have more native file format support. YAML, for one, would be useful.

I also wrote a while back about how nice it would be to have a bash-like ‘select’ feature in Python. There’s already a cmd module which is a pretty good tool for creating interactive command line programs, but without a select-like feature, it’s a little limited.
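A select-like helper is not hard to sketch in plain Python. The version below is only an approximation of bash’s ‘select’, not a proposal for the cmd module: it prints a numbered menu and keeps asking until it gets a valid number. The function name and its read/write parameters (there so the input and output functions can be swapped out) are assumptions for illustration.

```python
def select(options, prompt='#? ', read=input, write=print):
    """Show a numbered menu and return the chosen option, loosely like bash's select."""
    while True:
        for number, option in enumerate(options, start=1):
            write('%d) %s' % (number, option))
        reply = read(prompt).strip()
        try:
            index = int(reply)
        except ValueError:
            continue  # not a number: show the menu again
        if 1 <= index <= len(options):
            return options[index - 1]
```

Inside a cmd.Cmd subclass, a do_* method could call something like select(['start', 'stop', 'status']) to let the user pick from a list instead of typing the value out.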

At the end of the day…

No language provides every single tool that every single developer could possibly need, nor does any language implement the tools it does have in the perfect way for every developer who will ever come along. I’ve written lots of code in Perl and PHP, and enough in Java, C++, and C, to know that Python does a fantastic job at making my life easier as a developer, and I’m really encouraged by what I’m seeing in Python 3.
