Category: Python

Released django-taxonomy on github

By m0j0, February 14, 2010 8:43 pm

Hi all,

I did a post several months ago about creating a generic taxonomy app for Django that was loosely coupled, unintrusive, and could evolve with an app that needed categories today, but then tags later, or labels later, or some other classification mechanism later. I wanted one app to just be generic enough to deal with it, so I created django-taxonomy…. and then did nothing with it.

Well, *I* did stuff with it, but I never put it anywhere where anyone else could do anything with it. So now I have: django-taxonomy is up on github. Please fork it and send me merge requests, because this app is not super high on my priority list, which is why I’m releasing it: it’ll get further and be more useful to you with the help of the community :)

Seeking Elegant Pythonic Solution

By m0j0, February 1, 2010 4:31 pm

So, I have some code that queries a data source, and that data source sends me back an XML message. I have to parse the XML message so I can store information from it into a relational database. So, let’s say my XML response looks like this:

<xml>
<response>
<results=2>
  <result>
    <fname>Brian</fname>
    <lname>Jones</lname>
    <gender>M</gender>
    <office_phone_ext>777</office_phone_ext>
    <mobile_phone>201-555-1212</mobile_phone>
  </result>
  <result>
    <fname>Molly</fname>
    <lname>Jones</lname>
    <home_phone>201-555-1234</home_phone>
  </result>
</results>
</xml>

So, as you can see, the attributes for each result returned for a query can differ, and if a result doesn’t have a value for some attribute, the corresponding xml element isn’t included at all for that result. If it were just 2 or 3 attributes, I could easily enough get around it by doing something like this:

def __init__(self, xmlresult):
  self.xmlresult = xmlresult
  if self.xmlresult.xpath('fname') is not None:
    self.fname = self.xmlresult.xpath('fname')
  if self.xmlresult.xpath('lname') is not None:
    self.lname = self.xmlresult.xpath('lname')

Like I said, if it were just a few things I needed to check for, I’d do it this way and be done with it. It’s not just a few though — it’s like 50 attributes. Now what?

I decided lxml.objectify would be a great way to go. It would allow me to access these things as object attributes, which should mean I can do something like this:

self.fname = getattr(self.xmlresult, 'fname', None)
self.lname = getattr(self.xmlresult, 'lname', None)
...

So, you *can* do this, technically speaking. Trouble is, you’re asking for an attribute of an ObjectifiedElement object, and when you do that, it returns an object that is not a native Python datatype, which I did not realize when I first started using lxml.objectify. So, in the above, ’self.fname’ will not be a Python string — it’ll be an lxml.objectify.StringElement object. Of course, my database driver, my ‘join()’ operations, and everything else in my code that relies on native Python datatypes is now broken.

What I actually need to do is get the ‘.pyval’ attribute of self.xmlresult.fname, if that attribute exists at all. So, something that does what I mean, which is “self.fname = getattr(self.xmlresult, ‘fname.pyval’, None). And, of course, doing ‘getattr(self.xmlresult, ‘fname’, None).pyval’ doesn’t work because None has no attribute ‘pyval’. I’ve tried a couple of other hacks too, but I’ve learned enough Python to know that if it feels like a hack, there’s probably a better way. But I can’t find that better way. Ideas?

What “Batteries Included” Means

By m0j0, January 22, 2010 10:15 pm

When I first got into Python, I read lots of blog posts that mentioned that Python was “the batteries included language”, but those same posts were short on any explanation of what that really meant. A few years and lots of projects later, I think I’m now qualified to at least give a beginner a basic understanding of what people mean when they say that.

What it means

Python has what’s called the “Standard Library”, which is a collection of modules to make some set of tasks within a particular problem domain simpler on you. The standard library modules are all part of the standard Python installation — you don’t have to add it to your installation. Here is a short list of things I’ve used Python for, and the standard library modules I used to get the work done:

  • Wrote a simple filesystem backup routine that works on my Linux and Solaris servers, as well as my Mac laptop. For this I used the os, stat, bz2, gzip, time, datetime, and tar modules. Oh – and optparse!
  • The first iteration of the loghetti Apache log file analysis tool used one external module, which I’m replacing now with optparse. The other modules used are urlparse, cgi, datetime, operator, re, sys, and mmap.
  • I’ve written a couple of simple web API clients using nothing more than urllib/urllib2, ElementTree, and various bits of the xml package (xml.minidom comes to mind).
  • I wrote a MySQL backup script using the sys, os, time, shutil, glob, tarfile, and optparse modules.

There are built-in modules for XML/HTML parsing, url parsing, network communications, threading and multiprocessing, image and audio manipulation, and lots of other tasks you’re likely to come across. The story of Python cannot be told from the standard library alone, but you can do an awful lot of work with what’s provided.

Why you care

You care because you want to write the code that solves the problem at hand without worrying about a whole lot of low-level details like socket communications and memory management. I mean, it’s nice to know that Python exposes those low-level details to you should you ever get a wild hair, but if you frequently had occasion to really need that, you’d probably code in C.

Perhaps ironically, you also care about the batteries included because of what it means for those batteries that *aren’t* included. Chances are the Python-driven applications and external modules you wind up using make very heavy use of the standard library modules, making the code for those add-ons simpler to read and understand. It has the potential to make them more reliable and consistent as well.

I’d rather see a threaded application using Python’s threading module rather than trying to code the threading implementation by hand. I’d rather see a Python web server using some descendant of Python’s socket module than coding socket operations by hand. Having things like socket and threading in the standard library means that tools across various problem domains that happen to require certain common functionality work in a standard way where that functionality is concerned.

The same holds true for your own code. If you need to write a Jabber messaging server in Python, and six months later you need to write a queue-based networked job dispatching server, you’re going to be using similar modules, and you might even be able to reuse some of your old Jabber server code. Reuse for the win!

Remember LEGO? I was obsessed with LEGO. I remember becoming so intimately familiar with every type of block, window, platform, person, light post, wheel, and car base that envisioning my own custom-made Grand Galaxy Cruiser was easy, and actually building it (from pieces of probably 10 different LEGO sets) was only slightly more difficult. The standard library, in essence, is your LEGO set. Really, instead of “batteries included” Python’s mantra could well be “There’s a block for that” :)

It’s not all beer and skittles

Even in LegoLand there are pieces that don’t fit together the way you’d like. I wanted to build a house once with car windshields in place of windows. Turns out it’s not so simple.

Python 3 improves things a lot in terms of how the standard library is organized, but if you have an external module your code depends on, it might not be ready for Python 3, which means (like me and many others), you’re stuck in Python 2-land for a while. It’s not a big deal really, but I do find that often times I need more than one standard module to do what should really be handled in one.

One example of this is the urllib and urllib2 modules which have an overlapping set of features and problems they address. As a result, you’ll often see these two modules used together. In my own code I’ve had to use both of these modules in addition to the cgi module, *and* the urlparse module. In Python 3, I’d only need urllib. Yay!

XML is another place where I’ve needed multiple modules, and again there has been some consolidation in Python 3.

In the end, this doesn’t get in the way of getting work done very much. It’s just a pattern you start to notice as you work through more of the standard library and use it for your own projects.

Missing Batteries

Python will sometimes surprise you with what’s included. There’s a json module, for example, and the sqlite3 module, both of which are nice to have. But while there’s a whole section of the library devoted to protocols, neither LDAP nor SNMP are represented, and I need both of them :(

It might also be nice to have more native file format support. YAML, for one, would be useful.

I also wrote a while back about how nice it would be to have a bash-like ’select’ feature in Python. There’s already a cmd module which is a pretty good tool for creating interactive command line programs, but without a select-like feature, it’s a little limited.

At the end of the day…

No language provides every single tool that every single developer could possibly need, nor do they implement the tools they do have in the perfect way for all developers who will ever come along. I’ve written lots of code in Perl, PHP, and enough in Java, C++, and C to know that Python does a fantastic job at making my life easier as a developer, and I’m really encouraged by what I’m seeing in Python 3.

Intro to Python 101 For Beginners

By m0j0, January 21, 2010 10:54 pm

If you code Python already, go somewhere else. I’m only talking to complete and total newbies to the language right now. I want to show them the stuff that I wished someone had put in one nice, neat blog post for easy consumption when I got started with the language. If that’s what you’re looking for, look no further. Here’s what you need to know.

Lay of the Land

Python seems like a bit of a strange place at first. Most new programming languages do. This is because most modern interpreted languages like Ruby and Python like to “eat their own dog food”, so Ruby’s web site uses a Ruby-based CMS, and the documentation is generated with Ruby-based tools (I believe it’s all RDoc). Python’s site is written in Python, but isn’t a specific framework. The documentation is presented using Sphinx, a documentation engine written in Python.

Python, at time of writing, has a 2.x distribution and a 3.x distribution available for download. There’s a chance I’ll be flamed for saying so, but download 2.x. Most of the blog posts you’re likely to turn to for help over the course of your first year with Python are still going to be 2.x-specific, and if your code involves external modules, a good number of them haven’t ported to 3.x yet. Hang tight – 3.x is awesome, but 2.x pretty much rocks too.

Documentation for Python is pretty good, and the first bit of documentation you need (after this post) is the Python Tutorial. It’ll get you rolling with the basic syntax of the language, data types, equality operators, conditionals, the works.

With the basics of the language under your belt, you need two other bits of documentation:

You’ll want to know about the modules that are included with the Python distribution before you go out seeking external modules. You can see a complete list of built-in modules at the Module Index. It’ll give you a good idea of what you can do with the language right out of the gate without any additional downloads.

You’ll also want to bookmark the Standard Library documentation. This is the meaty stuff, and is where I spend most of my doc-reading hours. If you know the module you want docs on, just type it on the end of the base url and you’re there. So if you want to know about the ‘threading’ module, go to http://docs.python.org/library/threading

One tip about the documentation on python.org: don’t use the search box they provide. The search functionality is slow, and after all that time, it’s almost certainly *not* going to give you what you were looking for. Go to Google and search there. If you know that what you’re looking for is on docs.python.org, then append “site:docs.python.org” to your Google search term. Works like a charm.

How to do ‘x’ with Python

The first project I wanted to use Python for involved LDAP, and there’s no LDAP-related module built into Python. Finding the right module to use was my first challenge. There used to be a resource known as the ‘Cheese Shop”, but it’s now been rolled into the Python Package Index, a.k.a “PyPI”. These packages are not endorsed by python.org or anything, PyPI is just an interface to help you figure out what the available options are. The search box on this page *is* useful, but the issue then becomes which module to choose — even for a relatively obscure requirement like “LDAP” there are lots of modules.

The obvious choice for LDAP is python-ldap, and it seems to be the canonical choice for those needing this support. Also, a Google search for “python and ldap” returns several articles about, and the home page for, the python-ldap module. So, in less than five minutes you can usually figure it out, but don’t download the module yet!

You can (it’s completely optional but worth it) install “setuptools”, which gives you a tool called “easy_install” that’ll run out to PyPI and get whatever module you want, and install it on your system. I install most modules this way and don’t remember ever having any major issues with it. That said, some of the cool kids lately have taken a liking to an easy_install replacement called pip. I admit it looks very nice, but I guess until easy_install bites me I’m not going to be super motivated to switch. Pip takes some additional steps to make it more reliable and cause fewer issues in the face of spotty wireless connectivity and things like that. It also tries to make output easier to understand, it supports package uninstalls, and several other niceties. It seems to be the future.

Community Support

Python coders are a fairly outspoken, though friendly lot, in my experience. If you ask a reasonable question, you’ll get a reasonable answer, often in the form of a link to where you need to be documentation-wise, but not with the nasty RTFM aftertaste other communities leave behind. For added comfort, attach a “link to docs welcome!” to your request for help. Yummy!

There are lots of Python mailing lists, the most helpful one for beginners (and beyond) is probably the tutor list. I learn something new every day lurking on that list. The answers you get there tend to be authoritative, from folks who are working on the Python language itself, so it’s not just another ‘net forum where the blind often times are leading the blind.

Two great sources for ideas on how to put your code together or help venturing into new territory with Python are Stack Overflow and the Python Cookbook. Python Cookbook is a great source for ideas, and Stack Overflow is a really good Q&A interface where you can get good advice to hard problems (or really easy ones).

There are IRC channels for Python, and I’m a heavy user of IRC in general, but I don’t venture into the Python IRC channels much. With all of the other resources available, I don’t typically miss it.

Ok, wtf does this traceback mean?

Tracebacks can look daunting if you’re new to the language (unless you come from a Java background). They’re not all that tough. Here’s some output from a Python interpreter interactive session (which you’ll learn about if you read the tutorial linked above):

>>> d
{'China': 'Shanghai', 'USA': 'NY'}
>>> d['USA']
'NY'
>>> d['foo']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'foo'
>>>

This is a short traceback, but you parse them in the same basic manner as long ones 90% of the time. The first thing I look at is the last line of the traceback, which tells me the kind of error that was encountered. In this case it was a KeyError. I’ve seen error at least once for every hair left on my head, but if you don’t know what it means, check out the handy dandy page that describes all of Python’s built-in exceptions! Exceptions are just Python code themselves, and you can create your own for your own use in your own code, by extending one of the existing exceptions, or the base Exception class.

After about a month of coding in Python regularly, your eyes will start to jump to the right places in a Python traceback, and you’ll find that the exception names actually are pretty intuitive, and since you can employ them yourself or extend them, you’ll probably get to know them quickly without using a reference.

Really? Whitespace?

You’ve probably heard about Python’s treatment of whitespace before. Its reputation for placing significance on whitespace in source code precedes the language, and I guess it turns some people off. I actually didn’t care, because I figured as long as I indented my code the way I normally did, the whitespace issue wouldn’t bother me much, and I was right.

If you code in Perl, C, or Java, they don’t require indentation, but you almost certainly indent anyway because the whitespace is significant to *you* even if it’s not significant in the language. If you write readable code, nicely indented, you’ll probably forget about Python’s whitespace requirements after your first hour coding in it.

The rules aren’t all that stringent either. This code works fine:

def foo(c):
                     return c**2

if __name__ == "__main__":
         print foo(2)

There’s no consistency in this code: my function indents like 16 spaces or something, and then my if statement at the bottom doesn’t. All Python wants you to do is be consistent: if line 1 of your function is indented 4 spaces, indent the rest of them 4 spaces. Now you’re probably thinking “Well who doesn’t do that?” Exactly. It’s not draconian, it’s practical. You could think of Python’s whitespace requirement as filtering out people you’d rather not code beside anyway :)

Books

I feel like I own most of the Python books in existence through 2008. I bought the printed library reference in (I think) 2000. The best ones?

The one that really got me going with Python was Dive Into Python, which is available free and readable online. I bought the printed book just to support the author’s work. Mark Pilgrim writes using a style that makes you feel like you’re sitting next to him at a conference in a really lame talk and he’s whispering all kinds of programming goodness at you and showing you programming porn on his laptop screen. He’s excited to be talking to you about Python in that book. Truly inspired.

Perhaps my favorite “huge tome”-style book that covers pretty much everything from beginning to advanced material is Core Python by Wesley Chun. He covers more topics more deeply than most other books. I half expected this to be very broad and not very deep coverage, but I still pick that book up quite often when I want to understand some feature of the language better. His explanations are, for some reason, more clear and concise than most. Maybe they just fit my brain better.

After that I went straight into the more specialized Python books, but those are mostly dated now (Python Network Programming, for one, could use an update. I’d buy it again, probably), though I did purchase Programming Python, and Python Cookbook as a sort of quick reference for doing things I’d never done before. I don’t use them much, to be honest.

The only other book I really got excited about was Tarek Ziade’s Expert Python Programming. It’s not a beginner’s book, but it has a lot of really great material in it for those building larger applications with Python. Highly recommended.

Must Read

If you’ve worked in another language, it probably goes without saying, but you’ll learn much more quickly if you can read some source code, and/or work on a project with more senior coders than yourself. Reading code and tweaking it is probably the fastest way to learn about any language.

Move Forward

I hope this puts you in the right direction with Python. Bookmark all of those links, follow the tutorial, and before you know it you’ll be master of all you survey.

Enjoy!

If You Don’t Date Your Work, It Sucks.

By m0j0, January 18, 2010 8:39 am

I probably get more upset than is reasonable when I come across articles with no date on them. I scroll furiously for a few minutes, try to see if the date was put in some stupid place like the fine print written in almost-white-on-white at the bottom of the post surrounded by ads. Then I skim the article looking for references to software versions that might clue me in on how old this material is. Then I check the sidebars to see if there’s some kind of “About this Post” block. Finally, I make a mental note of the domain in a little mental list I use to further filter my Google searches in the future. Then I close the browser window in disgust. If it weren’t completely gross and socially unacceptable to do so, I would spit on the floor every time this happened.

Why would you NOT date your articles? In almost every single theme for every single content management solution written in any language and backed by any database, “Date” is a default element. Why would you remove it? It is almost guaranteed to be more work to remove it. Why would you go through actual work to make your own writing less useful to others?

What happens when you don’t date your articles?

  1. People have no idea whether your article has anything to do with what they’re working on.  If you wrote an article about the Linux kernel in 1996, it’s of no use to me *now*, even if it was pretty hardcore at the time.
  2. Readers are forced to skim your article looking for references to software versions to see if your article is actually meaningful to them or not. Why make it hard for people to know whether your article is useful? The only reason I can think of is that you already know your articles are old, so not dating them insures that people at least skim enough to see some of the ads on your site. You are irreversibly lame if you do this.
  3. It causes near seizures in people like me who really hate when you don’t date your work, as well as all of your past teachers, who no doubt demanded that you sign and date your work.
  4. Every time you don’t date an article online, a seal pup is clubbed to death in the arctic, and a polar bear gets stranded on a piece of ice.

At some point, I will make an actual list of web sites that regularly do not date their work. A sort of hall of shame for sites that fail to link their writing to some kind of time-based context. If you have sites you’d like to add, let me know in the comments.

Head first into javascript (and jQuery)

By m0j0, January 12, 2010 8:56 pm

So, I had to take a break from doing the Code Katas just as I was getting to the really cool one about Bloom Filters. The reason for the unexpected break from kata-ing was that I had a project thrown into my lap. I say “project” not because it was some huge thing that needed doing — lots of people reading this could probably have done it in a matter of a few hours — but because it involved two things I’ve never done any real work with: javascript, and jQuery.

My task? Well, first I had to recreate a page from a graphic designer’s mockup. So, take a JPEG image and create the CSS and stuff to make that happen. Already I’m out of my comfort zone, because historically I’m a back-end developer more comfortable with threading than CSS (95% of my code is written in Python and is daemonized/threaded/multiprocess/all-of-the-above or worse), but at least I’ve done enough CSS to get by.

Once the CSS was done, I was informed that I’d now need to take the tabular reporting table I just created and make it sort on any of the columns, get the data via AJAX calls to a flat file that would store the JSON for the dummy data, create nice drop-down date pickers so date ranges in the report could be chosen by the end user, page the data using a flickr-style pager so only 20 lines would show up on a page, and alternate the row colors in the table (and make sure that doesn’t break when you sort!).

How to learn javascript and/or jQuery REALLY fast

How exactly do you learn enough javascript and jQuery to get this done in a matter of a few days (counting from *after* the CSS part was done)? Here are some links you should keep handy if you have a situation like this arise:

  • If Douglas Crockford says it, listen. I’d advise you start here (part I of a 4-part intro to javascript). That site also has his ‘Advanced Javascript’ series. He also wrote a book, which is small enough to read quickly, and well done.
  • Packt has a lot of decent resources for jQuery. Specifically, this article helped me organize what I was doing in my head. The code itself had some rather glaring issues — you’re not going to cut-n-paste this code and deploy it to production, but coming from scorched earth, I really learned a lot.
  • After the project was already over I found this nice writeup that covers quick code snippets and demos illustrating some niceties like sliding panels and disappearing table rows and how to do them with jQuery.
  • jQuery itself has some pretty decent documentation for those times when your cut-n-pasted code looks a little suspect or you’re just sure there’s a better way. Easy to read and concise.

Why I Wrote My Own Sorting/Paging in jQuery

Inevitably, someone out there is wondering why I didn’t just use tablesorter and tablesorter.pager, or Flexigrid, or something like that. The answer, in a nutshell, is paging. Sorting and paging operations, I learned both by experience and in my reading, *NEED* to know about each other. If they don’t, you’ll get lots of weird results, like sorting on just one page (or, sorting on just one page until you click to another page, which will look as expected, and then click back), or pages with rows on them that are just plain wrong, or… the list goes on. This is precisely the problem that the integrated “all-sorting-all-paging” tools like tablesorter try to solve. The issue is that I could not find a SINGLE ONE that did not have a narrow definition of what a pager was, and what it looked like.

I wanted (well, I was required to mimic the mockup, so “needed”) a flickr-style pager — modified. I needed to have each page of the report represented at the bottom of the report table by a block with the proper number in the block. The block would be clickable, and clicking it would show the corresponding page of data. This is more or less what Flickr does, but I didn’t need the “previous” and “next” buttons, and I didn’t need the “…” they use (rather effectively) to cut down on the number of required pager elements. So… just some blocks with page numbers. That’s it.

I started out using tablesorter for jQuery, and it worked great — it does the sorting for you, manages the alternating row colors, and is a pretty flexible sorter. Then I got to the paging part, and things went South pretty fast. While tablesorter.pager has a ‘moveToPage’ function, it’s not exposed so you can bind it to a CSS selector like the ‘moveToPrevious’, ‘moveToLast’, ‘moveToNext’ and other functions are. So, I tried to hack it into the pager code myself. I got weird results (though I feel more confident about approaching that now than I did even three days ago). There wasn’t any obvious way to do anything but give the user *only* first/last/previous/next buttons to control the paging. I moved on. I googled, I asked on jQuery IRC, I even wrote the developer of tablesorter. I got nothing.

I looked at 4 or 5 different tools, and was shocked to find the same thing! I didn’t go digging into the code of all of them, but their documentation all seemed to be in some kind of weird denial about the existence of flickr-style paging altogether!

So, I wrote my own. It wasn’t all that difficult, really. The code that worked was only slightly different from the code I’d fought with early on in the process. It just took some reading to get some of the basic tricks of the trade under my belt, and I got a tip or two from one of the gurus at work as well, and I was off to the races!

Lessons Learned

So, one thing I have to say for my boss is that he knows better than to throw *all* of those things at me at once. Had he come to me and said he wanted an uber-ajaxian reporting interface from outer space from the get-go, I might not have responded even as positively as I did (and I would rate my response as ‘tepid, but attempting a positive outlook’) . It’s best to draw me in slowly, a task at a time, so I can get some sense of accomplishment and some feedback along the way instead of feeling like I still have this mountain to climb before it’s over.

I certainly learned that this javascript and jQuery (and AJAX) stuff isn’t really black magic. Once you get your hands dirty with it it’s kinda fun. I still don’t ever want to become a front end developer on a full-time basis (browser testing is where I *really* have zero patience, either for myself or the browsers), but this experience will serve me well in making my own projects a little prettier and slicker, and nicer to use. It’ll also help me understand more about what the front end folks are dealing with, since there’s tons of javascript at myYearbook.

So, I hope this post keeps some back end scalability engineer’s face from turning white when they’re given a front-end AJAX project to do. If you’ve ever had a similar situation happen to you (not necessarily related to javascript, but other technologies you didn’t know until you were thrown into a project), let’s hear the war stories!!

2009 Python Meme

By m0j0, December 29, 2009 9:58 pm

Heard about this from Tarek, and you can find more of them on Planet Python (where I found Tarek’s post).

  1. What’s the coolest Python application, framework or library you have discovered in 2009?
  2. Probably Tornado. Tornado is an interesting application, because it blurs the line a bit between a framework like Django and a traditional web server. If you can picture it, it’s a barebones, lightweight, almost overly simplified Django, with a production-ready web server instead of Django’s built-in dev server. In reality, Tornado is (or feels) more integrated than that, but that leads to some interesting issues on its own.

    Still, it’s been a heckuva lot of fun to play with. One thing that always concerned me about Django was the ORM. It’s fine for my little hobby website, or a simple wiki for my wife and I to use, and even some slightly more complex applications, but if you have a database-driven site that serves “lots and lots” of users and it needs to manage complex relationships and never slow down…. I don’t trust the ORM to do that. What’s more, I’m actually pretty skilled in data modeling, database administration, etc., and I understand abstraction. I don’t really require Django’s models (though, again, I love Django for doing low-traffic sites very quickly).

    Playing with web frameworks is a lot of fun, and if you’ve played with a few, you’ll like the “clean slate” that Tornado provides you to mix-n-match your favorite features of the frameworks you’ve used. I’ve done some hacking around Tornado to provide some generic facilities I’m likely to use in just about every project I use Tornado for. The sort of pseudo-framework is available as Tornado-project-stub on github.

  3. What new programming technique did you learn in 2009 ?
  4. Thread and process pool management. Whereas in previous roles I focused on optimization at the system and network level by testing/deploying new tools, poking at new paradigms, or just reworking/overhauling things that were modeled or configured suboptimally, my new role is something you really would call “scalability engineering”. I believe everything I’m involved in at the moment involves the words “distributed”, “asynchronous”, “multithreaded”, “multiprocess”, and other terms that imply “fast” (for varying definitions of fast, depending on the project).

    Though I’ve had to understand threading and multiprocessing (and microthreads and other stuff too) in the past, and I’ve even written simple threaded/multiprocessing stuff in the past, I’m now knee deep in it, and am getting into more complex scenarios where I really *need* a pool manager, and would really *like* to pass objects around on the network. Happily, I’m finding that Python has facilities for all of this built-in.

    Aside from that, I have to say that while most of what I’m doing now doesn’t involve techniques I’ve never heard of, I’m really reveling in the opportunity to put them into practice and actually use them. Also, since I now code full-time, I find the ability to code doesn’t ever escape my brain. I can code fast enough now that I can implement something two or three different ways to compare the solutions in no time!

  5. What’s the name of the open source project you contributed the most in 2009 ? What did you do ?
  6. Actually, it’s not released yet, but I’ve almost completed a rewrite of Golconde, a queue-based replication engine. I was able to make the queuing protocol, the message processor (the thing that takes queued messages and turns them into database operations), and the database backend swappable. Golconde was written to support STOMP queuing and PostgreSQL specifically. I’ve already used STOMP and AMQP interchangeably in the rewrite, and I’m now on to swapping out the database and message processor components, documenting how users can create their own plugins for these bits along the way.

    Golconde is also threaded. My rewrite is currently threaded as well, but I’m going to change out the threads in favor of processes, because the facilities that can help the project moving forward (in the short term, even) come more from multiprocessing than from threading. One thing I’ve already accomplished is refactoring so that there need be only a single thread class, which makes using worker pools more convenient and natural. It’s coming together!

  7. What was the Python blog or website you read the most in 2009 ?
  8. I read Planet Python every day, and keep up with Python Reddit as well. Besides aggregators, Corey Goldberg and Jesse Noller seem to overlap my areas of interest a lot, so I find myself actually showing up at their blogs pretty often. Neither of them blog enough, imo ;-)
  9. What are the three top things you want to learn in 2010?
    1. Nose – because I want to become so good at testing that it makes me more productive, not less. Right now I’m just clumsy at testing, and I come across situations that I just plain don’t know how to write a test for. I need to know more about writing mock objects and how to write tests for threaded/multiprocessing applications. I know enough to think that Nose is probably the way to go (I’ve used both unittest and doctest before, so I’m not totally ‘green’ to the notion of testing in general), but I haven’t been able to work it into my development process yet.
    2. Erlang – There doesn’t seem to be a language that makes concurrency quite as brainless as Erlang. That said, learning the language and OTP platform is *not* brainless.
    3. Sphinx – I hear all the cool kids are using it. Some people whose opinions I trust love it, but I have some reservations based on my experience with it. The one thing that gives me hope is that Django’s documentation site, which I like the interface and features of, uses it.

CodeKata 4: Data Munging

By m0j0, December 28, 2009 11:10 pm

I’m continuing to take on the items in Dave Thomas’s Code Kata collection. It’s a nice way to spend a Sunday night, and it’s a good way to get my brain going again before work on Monday morning. It’s also fun and educational :)

CodeKata 4 is called “Data Munging”. It’s not very difficult data munging, really. I think the more interesting bit of the Kata is how to deal with files in different languages, what tools are well suited to the task, and trying to minimize code duplication.

Description

Code Kata 4 instructs us to download weather.dat, a listing of weather information for each day of June, 2002 in Morristown, NJ. We’re to write a program that identifies the day which has the smallest gap between min and max temperatures.

Then, it says to download football.dat, a listing of season statistics for Soccer teams. We’re to write a program that identifies the team with the smallest gap between goals scored for the team, and goals scored against the team.

Once those are done, we’re asked to try to factor out as much duplicate code as possible between the two programs, and then we’re asked a few questions to help us think more deeply about what just transpired.

My Attack Plan

The first thing that jumped into my brain when I saw the data files was “Awk would be perfect for this”. I fiddled with that for a little too long (my awk is a little rusty), and came up with this (for weather.dat):

>$ awk 'BEGIN {min=100000}; $1 ~ /^[1-3]/ {x[$1]=$2-$3; if (x[$1]<min){ min=x[$1]; winner=$1}} END {print winner, min} ' weather.dat.dat
14 2

It works, and awk, though it gets ugly to some, reads in a nice, linear way to me. You give it a filter expression, and then statements to act on the matching lines (in braces). What could be simpler?

After proving to myself that I hadn’t completely lost my awk-fu, I went about writing a Python script to deal with the problem. I read ahead in the problem description, though, and so my script contains separate blocks for the two data sets in one script:

#!/usr/bin/env python
import sys
import string

data = open(sys.argv[1], 'r').readlines()
data.sort()

if 'weather' in sys.argv[1]:
   winner = 1000000
   winnerday = None

   for line in data:
      #filter out lines that aren't numbered.
      if line.strip().startswith(tuple(string.digits)):
         # we only need the first three fields to do our work
         l = line.split()[:3]
         # some temps have asterisks attached to them.
         maxt = l[1].strip(string.punctuation)
         mint = l[2].strip(string.punctuation)
         diff = int(maxt) - int(mint)
         if diff < winner:
            winner = diff
            winnerday = l[0]
   print "On day %s, the temp difference was only %d degrees!" % (winnerday, winner)

if 'football' in sys.argv[1]:
   winner = 1000000
   winnerteam = None

   for line in data:
      if line.strip().startswith(tuple(string.digits)):
         l = line.split()
         team, f, a = l[1], int(l[6]), int(l[8])
         diff = abs(f - a)
         if diff < winner:
            winner = diff
            winnerteam = team
   print "Team %s had a for/against gap of only %d points!" % (winnerteam, winner)

Really, the logic employed is not much different from the awk solution:

  1. Set a default for ‘winner’ that’s unlikely to be rivaled by the data :)
  2. Set the default for the winning team or day to ‘None’
  3. Filter out unwanted lines in the dataset.
  4. Grab bits of each line that are useful.
  5. Assign each useful bit to a variable.
  6. Do math.
  7. Do comparisons
  8. Present results.

Refactoring

Part 2 of the Kata says to factor out as much duplicate code as possible. I was able to factor out almost all of it on the first shot at refactoring, leaving only the line of code (per file) to identify the relevant columns of each data set:

#!/usr/bin/env python
import sys
import string

data = open(sys.argv[1], 'r').readlines()
data.sort()
winner_val = 1000000
winner_id = None

for line in data:
   if line.strip().startswith(tuple(string.digits)):
      l = line.split()
      if 'weather' in sys.argv[1]:
         identifier, minuend, subtrahend = l[0], int(l[1].strip(string.punctuation)), int(l[2].strip(string.punctuation))
      elif 'football' in sys.argv[1]:
         identifier, minuend, subtrahend = l[1], int(l[6]), int(l[8])
      diff = abs(minuend - subtrahend)
      if diff < winner_val:
         winner_val = diff
         winner_id = identifier

print winner_id, winner_val

Not too bad. I could’ve done some other things to make things work differently: for example I could’ve let the user feed in the column header names of the ‘identifier’, ‘minuend’ and ’subtrahend’ columns in each data set, and then I could just *not* parse out the header line and instead use it to identify the list index positions of the bits I need for each line. It’d make the whole thing ‘just work’. It would also require more effort from the user. On the other hand, it would make things “just work” for just about any file with numbered lines of columnar data.

I have to admit that the minute I see columnar data like this, awk is the first thing I reach for, so I’m sure this affected my Python solution. The good news there is that my thinking toward columnar data is consistent, and so I treated both files pretty much the same way, making refactoring a 5-minute process.

In all, I enjoyed this Kata. Though I didn’t take it as far as I could have, it did make me think about how it could be improved and made more generic. Those improvements could incur a cost in terms of readability I suppose, but I think for this example it wouldn’t be a problem. I’m working on a larger project now where I have precisely this issue of flexibility vs. readability though.

I’m reengineering a rather gangly application to enable things like pluggable… everything. It talks to networked queues, so the protocol is pluggable. It talks to databases, so the database back end is pluggable, in addition to the actual data processing routines. Enabling this level of flexibility introduces some complexity, and really requires good documentation if we reasonably expect people to work with our code (the project should be released as open source in the coming weeks). Without the documentation, I’m sure I’d have trouble maintaining the code myself!

PyYaml with Aliases and Anchors

By m0j0, December 22, 2009 8:10 am

I didn’t know this little tidbit until yesterday and want to get it posted so I can refer to it later.

I have this YAML config file that’s kinda long and has a lot of duplication in it. This isn’t what I’m working on, but let’s just say that you have a bunch of backup targets defined in your YAML config file, and your program rocks because each backup target can be defined to go to a different destination. Awesome, right?

Well, it might be, but it might also just make your YAML config file grotesque (and error-prone). Here’s an example:

Backups:
    Home_Jonesy:
        host: foo
        dir: /Users/jonesy
        protocol: ssh
        keyloc: ~/.ssh/id_rsa.pub
        Destination:
            host: bar
            dir: /mnt/array23/homes/jonesy
            check_space: true
            min_space: 80G
            num_archives: 4
            compress: bzip2
    Home_Molly:
        host: eggs
        dir: /Users/molly
        protocol: sftp
        keyloc: ~/.ssh/id_rsa.pub
        Destination:
            host: bar
            dir: /mnt/array23/homes/jonesy
            check_space: true
            min_space: 80G
            num_archives: 4
            compress: bzip2

Now with two backups, this isn’t so bad. But if your environment has 100 backup targets and only one destination, or…. heck — even if there are three destinations — should you have to write out the definition of those same three destinations for each of 100 backup targets? What if you need to change how one of the destinations is connected to, or the name of a destination changes, or array23 dies?

Ideally, you’d be able to reference the same definition in as many places as you need it and have things “just work”, and if something needs to change, you just change it in one place. Enter anchors and aliases.

An anchor is defined just like anything else in YAML with the exception that you get to label the definition block using “&labelname”, and then you can (de)reference it elsewhere in your config with “*labelname”. So here’s how our above configuration would look:

BackupDestination-23: &Backup_To_ARRAY23
    host: bar
    dir: /mnt/array23/homes/jonesy
    check_space: true
    min_space: 80G
    num_archives: 4
    compress: bzip2
Backups:
    Home_Jonesy:
        host: foo
        dir: /Users/jonesy
        protocol: ssh
        keyloc: ~/.ssh/id_rsa.pub
        Destination: *Backup_To_ARRAY23
    Home_Molly:
        host: eggs
        dir: /Users/molly
        protocol: sftp
        keyloc: ~/.ssh/id_rsa.pub
        Destination: *Backup_To_ARRAY23

With only two backup targets, the benefit is small, but keep trying to imagine this config file with about 100 backup targets, and only one or two destinations. This removes a lot of duplication and makes things easier to change and maintain (and read!)

The cool thing about it is that if you already have code that reads the YAML config file, you don’t have to change it at all — PyYaml expands everything for you. Here’s a quick interpreter session:

>>> import yaml
>>> from pprint import pprint
>>> stream = file('foo.yaml', 'r')
>>> cfg = yaml.load(stream)
>>> pprint(cfg)
{'BackupDestination-23': {'check_space': True,
                          'compress': 'bzip2',
                          'dir': '/mnt/array23/homes/jonesy',
                          'host': 'bar',
                          'min_space': '80G',
                          'num_archives': 4},
 'Backups': {'Home_Jonesy': {'Destination': {'check_space': True,
                                             'compress': 'bzip2',
                                             'dir': '/mnt/array23/homes/jonesy',
                                             'host': 'bar',
                                             'min_space': '80G',
                                             'num_archives': 4},
                             'dir': '/Users/jonesy',
                             'host': 'foo',
                             'keyloc': '~/.ssh/id_rsa.pub',
                             'protocol': 'ssh'},
             'Home_Molly': {'Destination': {'check_space': True,
                                            'compress': 'bzip2',
                                            'dir': '/mnt/array23/homes/jonesy',
                                            'host': 'bar',
                                            'min_space': '80G',
                                            'num_archives': 4},
                            'dir': '/Users/molly',
                            'host': 'eggs',
                            'keyloc': '~/.ssh/id_rsa.pub',
                            'protocol': 'sftp'}}}

…And notice how everything has been expanded.

Enjoy!

Python Code Kata 2: Karate Chop (Binary Search)

By m0j0, December 13, 2009 2:59 pm

I’ve decided it’d be fun to go through the Code Kata put together by Dave Thomas, co-author of The Pragmatic Programmer, and co-founder of Pragmatic Programmers, LLC (publishers of the finest tech books on the market right now; they can’t expand their selection fast enough for me).

Anyway, the first Code Kata, while being truly interesting to ponder, doesn’t really involve code, so I’m starting with Kata 2, which is a binary search exercise. It occurred to me while reading it over that I’d never written a binary search in Python, because you typically don’t need to, since list objects have a method called ‘index()’ which will hand you the index position of your search target within the list. I decided to force myself not to rely on it, as an exercise. If you’ve never had to go through a computer science curriculum, but want to excel at programming, I suggest doing all of the Kata, and getting your hands on at least one good computer science text (a computer science grad student hanging around is nice, too!)

Binary Search Background

So, the idea behind a binary search is often likened to looking up a name in a phone book, which is a pretty good analogy. A phone book has thousands of pages and perhaps millions of names, and yet it doesn’t take us long to find any given name, no matter which part of the book it’s in. You open the book to the middle, and the name you’re looking for is either on one of the facing pages, in the first half of the book, or in the last half of the book. After glancing at a single page, you’ve reduced your haystack by 50%. You flip to the middle of the proper half of the book, and repeat the process. If you had a phone book listing every person in the entire world (say, 7,000,000,000 people), you could find any given person in 33 steps or less. That’s pretty efficient (in big O notation, it’s O(logn)).

You can use a binary algorithm to tease your friends, too. Bet a friend $1 that you can guess any number they pick between 1 and 100 in 7 guesses if they tell you if your guess is too low or too high. If you use a binary algorithm to make your guesses, it’s not possible to lose this bet as far as I know. Here’s an example:

Your buddy chooses number 70. Your guesses will be:

  1. 50 (too low)
  2. 75 (too high)
  3. 62 (too low)
  4. 69 (too low)
  5. 72 (too high)
  6. 71 (too high)
  7. 70 (that’s it!)

One more. Your buddy chooses the number 1. Your guesses will be:

  1. 50 (too high)
  2. 25 (too high)
  3. 13 (too high)
  4. 7 (too high)
  5. 4 (too high)
  6. 2 (too high)
  7. 1  (that’s it!)

But what if your buddy wants to up the stakes by increasing the scope to numbers between 1 and 1000? Say he picks 485:

  1. 500 (too high)
  2. 250 (too low)
  3. 375 (too low)
  4. 438 (too low)
  5. 469 (too low)
  6. 485 (that’s it!)

You got lucky on that last one, but you might’ve negotiated more guesses for yourself. To do that and still guarantee a win, figure out what log2N is, where N is the highest number they can choose. So, log2100 = 7, and log21000 = 10. Another tip: when calculating your next guess, always round your division UP. Notice that log27000000000 = 32.7, which is where I got the max of 33 steps above.

Ok, now on to the Kata!!

Binary Search in Python Take 1: Traditional Binary Search

This is just my take on a traditional binary search algorithm in Python:

def chop(needle, haystack):
   """
   Takes an integer target (needle) and a sorted list of integers (haystack),
   and performs a binary search for the target. Return the integer index of the
   target in the array, or -1 if it's not found.
   """
   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      mid = (lo+hi)/2
      midval = haystack[mid]
      if needle < midval:
         hi = mid
      elif needle > midval:
         lo = mid
      else:
         print "Found %d at index %d" % (needle, mid)
         break

This makes assumptions. Namely, it assumes a 0-indexed, sorted list with no missing values. So, my test lists are always output from range(N), where N is the number of elements I want to search. It works well enough. I think a CS prof would accept this. Let’s move on.

What if I break it in *THREE* (oooooohhhhhh)

If breaking the list in two is so cool, why don’t we break it in three pieces? We can do one extra check, and then cut out 2/3 of the haystack in one fell swoop! Ok, let’s see how it goes:

def chop3(needle, haystack):

   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      brk1 = (lo+hi)/3
      brk2 = (2 * (lo+hi)) / 3
      val1 = haystack[brk1]
      val2 = haystack[brk2]

      if needle < val1:
         hi = brk1
      elif needle > val1 and needle < val2:
         hi = brk2
         lo = brk1
      elif needle > val2:
         lo = brk2

This one attempts to cut the initial array into three parts instead of bisecting. This won’t work, actually. At some point, you’ll shrink the list down two where the lower bound is 1/2 the upper bound, at which point, all hell breaks loose. Think about how this works when lo is 16 and hi is 32. 16+32 = 48, and 48/3 = 16! This means brk1 never actually changes, and you’ve created perfect conditions for an infinite loop.

Instead, define the hi and lo variables and break points in the list differently:

def chop3_take2(needle, haystack):
   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      incr = (hi-lo)/3
      brk1 = lo+incr
      brk2 = lo + (incr * 2)
      val1 = haystack[brk1]
      val2 = haystack[brk2]
      if needle < val1:
         hi = brk1
      elif needle > val1 and needle < val2:
         hi = brk2
         lo = brk1
      elif needle > val2:
         lo = brk2
      elif needle == val1:
         break
      elif needle == val2:
         break

This one will work. It’s slightly more heavy lifting than chop2, but not too bad. In the calculation for the brk variables, I’ve decided to calculate an increment defined as 1/3 the distance between lo and hi. So, if lo=16 and hi=32, instead of brk1 becoming 16 and looping forever, it becomes the point 1/3 of the way between lo and hi. It makes fewer assumptions, which in my experience means fewer bugs. (32-16)/3 == 5 (the remainder in integer division is truncated), so brk1 is going to be 21 instead of 16.

From simple tests, it doesn’t appear that this extra bit munging buys you anything. On a list with 1 million elements, chop() and chop3_take2() run in almost the exact same amount of time. Increasing the list size to 10 million elements starts to show evidence that the initial chop() routine is going to scale better than chop3_take2(). I think chop() is also nicer to read and understand, which makes it a better candidate for production code.

Combine the two?

I sorta wondered what would happen if I only did the 2/3 reduction on the inital pass, and then passed the part of the array where the target exists to chop(). The function would look a lot like chop3_take2, but inside the if…elif parts, instead of setting new vals for lo or hi, we’d pass the part of haystack bounded by lo and hi to chop().

A nice idea except that, if you recall, chop() assumes a zero-indexed array, and our chop3_take2() function sets new values for lo in some instances, so what we really need is a chop_non_zero_index() function. Here’s one:

def chop_non_zero_index(needle, haystack):
   lo = 0
   hi = len(haystack)
   while lo < hi:
      mid = (lo+hi)/2
      midval = haystack[mid]
      if needle < midval:
         hi = mid
      elif needle > midval:
         lo = mid
      else:
         print "Found %d at index %d" % (needle, mid)
         break

And then we pass portions of our haystack to chop_non_zero_index() from a function called chop_salad(). The chop_salad() function does an initial pass to cut out 2/3 of the candidates, then delegates to chop_non_zero_index() for the rest.

def chop_salad(needle, haystack):
   """Same as take 1 above, but this time we pass to chop_non_zero_index
   after the first pass.
   """
   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      incr = (hi-lo)/3
      print "increment = %d" % incr
      brk1 = lo+incr
      print "brk1 = %d" % brk1
      brk2 = lo + (incr * 2)
      print "brk2 = %d" % brk2
      val1 = haystack[brk1]
      val2 = haystack[brk2]

      if needle < val1:
         chop_non_zero_index(needle, haystack[:brk1+1])
         break
      elif needle > val1 and needle < val2:
         chop_non_zero_index(needle, haystack[brk1:brk2+1])
         break
      elif needle > val2:
         chop_non_zero_index(needle, haystack[brk2:])
         break
      elif needle == val1:
         print "Found it @val1!"
         break
      elif needle == val2:
         print "Found it @val2!"
         break

At this point, we’re getting into doing more list manipulations, more comparisons, passing stuff between functions… not nearly as clean and simple as plain old chop(), and it doesn’t buy you anything in terms of performance either.

Extend the List Object, and using Recursion

I almost never extend built-in objects like list or dict. I couldn’t remember the last time I did it, but I wondered if there was any benefit to adding a bisect() operation to a list object, so I gave it a shot. It’s not hard or cool or anything. Here’s my list object extension:

class mylist(list):
   def bisect(self):
      return [mylist(self[:len(self)/2]),mylist(self[len(self)/2:])]

And to use it, you create a haystack that is a mylist object, then pass it to a recursive function that’ll call bisect()

def chop_bisect_object(needle, haystack):
   while True:
      head, tail = haystack.bisect()
      try:
         if needle < head[-1]:
            chop_bisect_object(needle, head)
            break
         elif needle > tail[0]:
            chop_bisect_object(needle, tail)
            break
         elif needle == head[0]:
            print "needle in head[0]"
            break
         elif needle == tail[0]:
            print "needle in tail[0]"
            break

if __name__ == "__main__":
   x = mylist(range(1000000))
   chop_bisect_object(needle,x)

Finally, Chef Guido Chops

Of course, this was all a fun exercise to get the synapses sparking. The reality is that, if all you really want to do is find out where in a list some value exists, you call l.index(val), where val is what you’re looking for. This method only runs slightly faster than the fastest algorithm I was able to come up with (for both smallish and largish haystacks), but it’s absolutely the most readable, and being that it’s so short, it has the least number of ways to introduce bugs, so it really is the right tool for the job in this case.

Onward

First, I don’t put forth these examples as flawless. Quite the contrary. I assume others have some cool ideas, or can point out some mistakes I’ve made, so please share in the comments!

I plan to go through the rest of the exercises in the Code Kata, and I also have a couple of interesting ones of my own I might throw in. I’ll try to post roughly once a week about this, so if you’re interested, subscribe to my RSS feed, or just the Python or CodeKata feeds.

Panorama Theme by Themocracy