2009 Python Meme

Heard about this from Tarek, and you can find more of them on Planet Python (where I found Tarek’s post).

  1. What’s the coolest Python application, framework or library you have discovered in 2009?

    Probably Tornado. Tornado is an interesting project, because it blurs the line a bit between a framework like Django and a traditional web server. If you can picture it, it’s a barebones, lightweight, almost overly simplified Django, with a production-ready web server instead of Django’s built-in dev server. In reality, Tornado is (or feels) more integrated than that, but that leads to some interesting issues of its own.

    Still, it’s been a heckuva lot of fun to play with. One thing that always concerned me about Django was the ORM. It’s fine for my little hobby website, or a simple wiki for my wife and me to use, and even some slightly more complex applications, but if you have a database-driven site that serves “lots and lots” of users and needs to manage complex relationships without ever slowing down… I don’t trust the ORM to do that. What’s more, I’m actually pretty skilled in data modeling, database administration, etc., and I understand abstraction. I don’t really require Django’s models (though, again, I love Django for building low-traffic sites very quickly).

    Playing with web frameworks is a lot of fun, and if you’ve played with a few, you’ll like the “clean slate” Tornado gives you to mix and match your favorite features from the frameworks you’ve used. I’ve done some hacking around Tornado to provide the generic facilities I’m likely to want in just about every project I use Tornado for. That sort of pseudo-framework is available as Tornado-project-stub on GitHub.

  2. What new programming technique did you learn in 2009?

    Thread and process pool management. In previous roles I focused on optimization at the system and network level by testing and deploying new tools, poking at new paradigms, or reworking things that were modeled or configured suboptimally; my new role is what you would really call “scalability engineering”. Everything I’m involved in at the moment involves the words “distributed”, “asynchronous”, “multithreaded”, “multiprocess”, and other terms that imply “fast” (for varying definitions of fast, depending on the project).

    Though I’ve had to understand threading and multiprocessing (and microthreads and other stuff) before, and I’ve even written simple threaded/multiprocessing code, I’m now knee deep in it, and I’m getting into more complex scenarios where I really *need* a pool manager, and would really *like* to pass objects around on the network. Happily, I’m finding that Python has facilities for all of this built-in (there’s a tiny sketch just after this answer).

    Aside from that, I have to say that while most of what I’m doing now doesn’t involve techniques I’ve never heard of, I’m really reveling in the opportunity to put them into practice and actually use them. Also, since I now code full-time, I find the ability to code doesn’t ever escape my brain. I can code fast enough now that I can implement something two or three different ways to compare the solutions in no time!
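
    As a small illustration of what I mean by built-in facilities (this is just a toy sketch, not anything from my actual work code), multiprocessing gives you both worker pools and managed objects you can share between processes:

from multiprocessing import Pool, Manager

def square(n):
   return n * n

if __name__ == '__main__':
   # a pool of worker processes, batteries included
   pool = Pool(processes=4)
   print pool.map(square, range(10))

   # a Manager proxies objects so separate processes can share them
   mgr = Manager()
   shared = mgr.dict()
   shared['answer'] = 42
   print shared['answer']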

  3. What’s the name of the open source project you contributed to the most in 2009? What did you do?

    Actually, it’s not released yet, but I’ve almost completed a rewrite of Golconde, a queue-based replication engine. I was able to make the queuing protocol, the message processor (the thing that takes queued messages and turns them into database operations), and the database backend swappable. Golconde was written to support STOMP queuing and PostgreSQL specifically. I’ve already used STOMP and AMQP interchangeably in the rewrite, and I’m now on to swapping out the database and message processor components, documenting how users can create their own plugins for these bits along the way.

    Golconde is also threaded. My rewrite is currently threaded as well, but I’m going to swap the threads out in favor of processes, because the facilities that can help the project move forward (in the short term, even) come more from multiprocessing than from threading. One thing I’ve already accomplished is refactoring so that only a single worker thread class is needed, which makes using worker pools more convenient and natural (a generic version of that pattern is sketched just below). It’s coming together!
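
    To be clear about what I mean by a single worker class (an illustrative sketch only, not Golconde code), the idea is that every worker just pulls callables off a shared queue, so one class serves the whole pool:

import threading
import Queue

def process_message(msg):
   print "handling %r" % (msg,)

class Worker(threading.Thread):
   """Generic worker: pulls (func, args) jobs off a shared queue and runs them."""
   def __init__(self, jobs):
      threading.Thread.__init__(self)
      self.jobs = jobs

   def run(self):
      while True:
         func, args = self.jobs.get()
         try:
            func(*args)
         finally:
            self.jobs.task_done()

jobs = Queue.Queue()
for i in range(4):
   w = Worker(jobs)
   w.setDaemon(True)
   w.start()

for msg in ['a', 'b', 'c']:
   jobs.put((process_message, (msg,)))
jobs.join()   # wait for the pool to drain the queue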

  4. What was the Python blog or website you read the most in 2009?

    I read Planet Python every day, and keep up with Python Reddit as well. Besides aggregators, Corey Goldberg and Jesse Noller seem to overlap my areas of interest a lot, so I find myself actually showing up at their blogs pretty often. Neither of them blogs enough, imo ;-)

  5. What are the three top things you want to learn in 2010?
    1. Nose – because I want to become so good at testing that it makes me more productive, not less. Right now I’m just clumsy at testing, and I come across situations that I just plain don’t know how to write a test for. I need to know more about writing mock objects and how to write tests for threaded/multiprocessing applications. I know enough to think that Nose is probably the way to go (I’ve used both unittest and doctest before, so I’m not totally ‘green’ to the notion of testing in general), but I haven’t been able to work it into my development process yet.
    2. Erlang – There doesn’t seem to be a language that makes concurrency quite as brainless as Erlang. That said, learning the language and OTP platform is *not* brainless.
    3. Sphinx – I hear all the cool kids are using it. Some people whose opinions I trust love it, but I have some reservations based on my experience with it. The one thing that gives me hope is that Django’s documentation site, which I like the interface and features of, uses it.

CodeKata 4: Data Munging

I’m continuing to take on the items in Dave Thomas’s Code Kata collection. It’s a nice way to spend a Sunday night, and it’s a good way to get my brain going again before work on Monday morning. It’s also fun and educational :)

CodeKata 4 is called “Data Munging”. It’s not very difficult data munging, really. I think the more interesting bit of the Kata is how to deal with files in different languages, what tools are well suited to the task, and trying to minimize code duplication.

Description

Code Kata 4 instructs us to download weather.dat, a listing of weather information for each day of June 2002 in Morristown, NJ. We’re to write a program that identifies the day with the smallest gap between the min and max temperatures.

Then, it says to download football.dat, a listing of season statistics for soccer teams. We’re to write a program that identifies the team with the smallest gap between goals scored for and goals scored against it.

Once those are done, we’re asked to try to factor out as much duplicate code as possible between the two programs, and then we’re asked a few questions to help us think more deeply about what just transpired.

My Attack Plan

The first thing that jumped into my brain when I saw the data files was “Awk would be perfect for this”. I fiddled with that for a little too long (my awk is a little rusty), and came up with this (for weather.dat):

$ awk 'BEGIN {min=100000}; $1 ~ /^[0-9]/ {x[$1]=$2-$3; if (x[$1]<min){ min=x[$1]; winner=$1}} END {print winner, min}' weather.dat
14 2

It works, and awk, though it gets ugly to some, reads in a nice, linear way to me. You give it a filter expression, and then statements to act on the matching lines (in braces). What could be simpler?

After proving to myself that I hadn’t completely lost my awk-fu, I went about writing a Python script to deal with the problem. I read ahead in the problem description, though, so my single script contains separate blocks for the two data sets:

#!/usr/bin/env python
import sys
import string

data = open(sys.argv[1], 'r').readlines()
data.sort()

if 'weather' in sys.argv[1]:
   winner = 1000000
   winnerday = None

   for line in data:
      #filter out lines that aren't numbered.
      if line.strip().startswith(tuple(string.digits)):
         # we only need the first three fields to do our work
         l = line.split()[:3]
         # some temps have asterisks attached to them.
         maxt = l[1].strip(string.punctuation)
         mint = l[2].strip(string.punctuation)
         diff = int(maxt) - int(mint)
         if diff < winner:
            winner = diff
            winnerday = l[0]
   print "On day %s, the temp difference was only %d degrees!" % (winnerday, winner)

if 'football' in sys.argv[1]:
   winner = 1000000
   winnerteam = None

   for line in data:
      if line.strip().startswith(tuple(string.digits)):
         l = line.split()
         team, f, a = l[1], int(l[6]), int(l[8])
         diff = abs(f - a)
         if diff < winner:
            winner = diff
            winnerteam = team
   print "Team %s had a for/against gap of only %d points!" % (winnerteam, winner)

Really, the logic employed is not much different from the awk solution:

  1. Set a default for ‘winner’ that’s unlikely to be rivaled by the data :)
  2. Set the default for the winning team or day to ‘None’
  3. Filter out unwanted lines in the dataset.
  4. Grab bits of each line that are useful.
  5. Assign each useful bit to a variable.
  6. Do math.
  7. Do comparisons.
  8. Present results.

Refactoring

Part 2 of the Kata says to factor out as much duplicate code as possible. I was able to factor out almost all of it on the first shot at refactoring, leaving only the line of code (per file) to identify the relevant columns of each data set:

#!/usr/bin/env python
import sys
import string

data = open(sys.argv[1], 'r').readlines()
data.sort()
winner_val = 1000000
winner_id = None

for line in data:
   if line.strip().startswith(tuple(string.digits)):
      l = line.split()
      if 'weather' in sys.argv[1]:
         identifier, minuend, subtrahend = l[0], int(l[1].strip(string.punctuation)), int(l[2].strip(string.punctuation))
      elif 'football' in sys.argv[1]:
         identifier, minuend, subtrahend = l[1], int(l[6]), int(l[8])
      diff = abs(minuend - subtrahend)
      if diff < winner_val:
         winner_val = diff
         winner_id = identifier

print winner_id, winner_val

Not too bad. I could’ve taken it further: for example, I could’ve let the user feed in the column header names of the ‘identifier’, ‘minuend’ and ‘subtrahend’ columns for each data set, and then, instead of throwing the header line away, used it to identify the list index positions of the bits I need from each line. That would require a little more effort from the user, but it would make things “just work” for just about any file with numbered lines of columnar data.
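
Something like this is what I have in mind (a rough, untested sketch; it assumes the header fields split to the same index positions as the data fields, which happens to hold for the columns we need in weather.dat but not necessarily for every file):

#!/usr/bin/env python
# Rough sketch (untested): the user names the identifier, minuend, and
# subtrahend columns on the command line, and we look their positions up
# in the file's header line instead of hard-coding them.
import sys
import string

fname, id_col, min_col, sub_col = sys.argv[1:5]
lines = open(fname, 'r').readlines()

# find the header line and map the requested column names to index positions
header = [l.split() for l in lines if id_col in l.split()][0]
id_idx, min_idx, sub_idx = [header.index(c) for c in (id_col, min_col, sub_col)]

winner_val, winner_id = None, None
for line in lines:
   if line.strip().startswith(tuple(string.digits)):
      l = line.split()
      diff = abs(int(l[min_idx].strip(string.punctuation)) -
                 int(l[sub_idx].strip(string.punctuation)))
      if winner_val is None or diff < winner_val:
         winner_val, winner_id = diff, l[id_idx]

print winner_id, winner_val

You’d run it as something like “./munge.py weather.dat Dy MxT MnT” (munge.py being whatever you name the script).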

I have to admit that the minute I see columnar data like this, awk is the first thing I reach for, so I’m sure this affected my Python solution. The good news there is that my thinking toward columnar data is consistent, and so I treated both files pretty much the same way, making refactoring a 5-minute process.

In all, I enjoyed this Kata. Though I didn’t take it as far as I could have, it did make me think about how it could be improved and made more generic. Those improvements could incur a cost in terms of readability I suppose, but I think for this example it wouldn’t be a problem. I’m working on a larger project now where I have precisely this issue of flexibility vs. readability though.

I’m reengineering a rather gangly application to enable things like pluggable… everything. It talks to networked queues, so the protocol is pluggable. It talks to databases, so the database back end is pluggable, in addition to the actual data processing routines. Enabling this level of flexibility introduces some complexity, and really requires good documentation if we reasonably expect people to work with our code (the project should be released as open source in the coming weeks). Without the documentation, I’m sure I’d have trouble maintaining the code myself!
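
To give a flavor of what “pluggable” means here (a toy sketch, not the project’s actual code), one common approach is a registry that maps the name used in the config file to the class implementing that back end:

# Toy plugin registry -- maps a config-file name to an implementation class.
BACKENDS = {}

def register(name):
   def wrapper(cls):
      BACKENDS[name] = cls
      return cls
   return wrapper

@register('postgresql')
class PostgresBackend(object):
   def apply(self, operation):
      pass   # the real database work would go here

def get_backend(name):
   """Instantiate whichever back end the config asked for."""
   return BACKENDS[name]()

A plugin author then only has to import register() and decorate their own class; nothing in the core has to change.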

PyYaml with Aliases and Anchors

I didn’t know this little tidbit until yesterday and want to get it posted so I can refer to it later.

I have this YAML config file that’s kinda long and has a lot of duplication in it. This isn’t what I’m working on, but let’s just say that you have a bunch of backup targets defined in your YAML config file, and your program rocks because each backup target can be defined to go to a different destination. Awesome, right?

Well, it might be, but it might also just make your YAML config file grotesque (and error-prone). Here’s an example:

Backups:
    Home_Jonesy:
        host: foo
        dir: /Users/jonesy
        protocol: ssh
        keyloc: ~/.ssh/id_rsa.pub
        Destination:
            host: bar
            dir: /mnt/array23/homes/jonesy
            check_space: true
            min_space: 80G
            num_archives: 4
            compress: bzip2
    Home_Molly:
        host: eggs
        dir: /Users/molly
        protocol: sftp
        keyloc: ~/.ssh/id_rsa.pub
        Destination:
            host: bar
            dir: /mnt/array23/homes/jonesy
            check_space: true
            min_space: 80G
            num_archives: 4
            compress: bzip2

Now, with two backups, this isn’t so bad. But if your environment has 100 backup targets and only one destination (or, heck, even three destinations), should you have to write out the same destination definitions for each of 100 backup targets? What if you need to change how one of the destinations is connected to, or the name of a destination changes, or array23 dies?

Ideally, you’d be able to reference the same definition in as many places as you need it and have things “just work”, and if something needs to change, you just change it in one place. Enter anchors and aliases.

An anchor is defined just like anything else in YAML, except that you label the definition block with “&labelname”; you can then reference it elsewhere in your config with the alias “*labelname”. Here’s how the above configuration would look:

BackupDestination-23: &Backup_To_ARRAY23
    host: bar
    dir: /mnt/array23/homes/jonesy
    check_space: true
    min_space: 80G
    num_archives: 4
    compress: bzip2
Backups:
    Home_Jonesy:
        host: foo
        dir: /Users/jonesy
        protocol: ssh
        keyloc: ~/.ssh/id_rsa.pub
        Destination: *Backup_To_ARRAY23
    Home_Molly:
        host: eggs
        dir: /Users/molly
        protocol: sftp
        keyloc: ~/.ssh/id_rsa.pub
        Destination: *Backup_To_ARRAY23

With only two backup targets, the benefit is small, but again, imagine this config file with about 100 backup targets and only one or two destinations. This removes a lot of duplication and makes things easier to change and maintain (and read!).

The cool thing about it is that if you already have code that reads the YAML config file, you don’t have to change it at all — PyYaml expands everything for you. Here’s a quick interpreter session:

>>> import yaml
>>> from pprint import pprint
>>> stream = file('foo.yaml', 'r')
>>> cfg = yaml.load(stream)
>>> pprint(cfg)
{'BackupDestination-23': {'check_space': True,
                          'compress': 'bzip2',
                          'dir': '/mnt/array23/homes/jonesy',
                          'host': 'bar',
                          'min_space': '80G',
                          'num_archives': 4},
 'Backups': {'Home_Jonesy': {'Destination': {'check_space': True,
                                             'compress': 'bzip2',
                                             'dir': '/mnt/array23/homes/jonesy',
                                             'host': 'bar',
                                             'min_space': '80G',
                                             'num_archives': 4},
                             'dir': '/Users/jonesy',
                             'host': 'foo',
                             'keyloc': '~/.ssh/id_rsa.pub',
                             'protocol': 'ssh'},
             'Home_Molly': {'Destination': {'check_space': True,
                                            'compress': 'bzip2',
                                            'dir': '/mnt/array23/homes/jonesy',
                                            'host': 'bar',
                                            'min_space': '80G',
                                            'num_archives': 4},
                            'dir': '/Users/molly',
                            'host': 'eggs',
                            'keyloc': '~/.ssh/id_rsa.pub',
                            'protocol': 'sftp'}}}

…And notice how everything has been expanded.
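
And because the alias expands into ordinary Python objects, any lookup code you already have keeps working. Continuing the same interpreter session:

>>> cfg['Backups']['Home_Molly']['Destination']['host']
'bar'
>>> cfg['Backups']['Home_Jonesy']['Destination'] == cfg['BackupDestination-23']
True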

Enjoy!

Python Code Kata 2: Karate Chop (Binary Search)

I’ve decided it’d be fun to go through the Code Kata put together by Dave Thomas, co-author of The Pragmatic Programmer, and co-founder of Pragmatic Programmers, LLC (publishers of the finest tech books on the market right now; they can’t expand their selection fast enough for me).

Anyway, the first Code Kata, while truly interesting to ponder, doesn’t really involve code, so I’m starting with Kata 2, which is a binary search exercise. It occurred to me while reading it over that I’d never written a binary search in Python, because you typically don’t need to: list objects have an index() method that hands you the position of your search target within the list. As an exercise, I decided to force myself not to rely on it. If you’ve never been through a computer science curriculum but want to excel at programming, I suggest doing all of the Kata, and getting your hands on at least one good computer science text (a computer science grad student hanging around is nice, too!)

Binary Search Background

So, the idea behind a binary search is often likened to looking up a name in a phone book, which is a pretty good analogy. A phone book has thousands of pages and perhaps millions of names, and yet it doesn’t take us long to find any given name, no matter which part of the book it’s in. You open the book to the middle, and the name you’re looking for is either on one of the facing pages, in the first half of the book, or in the last half of the book. After glancing at a single page, you’ve reduced your haystack by 50%. You flip to the middle of the proper half of the book, and repeat the process. If you had a phone book listing every person in the entire world (say, 7,000,000,000 people), you could find any given person in 33 steps or fewer. That’s pretty efficient (in big O notation, it’s O(log n)).

You can use a binary algorithm to tease your friends, too. Bet a friend $1 that you can guess any number they pick between 1 and 100 in 7 guesses if they tell you if your guess is too low or too high. If you use a binary algorithm to make your guesses, it’s not possible to lose this bet as far as I know. Here’s an example:

Your buddy chooses number 70. Your guesses will be:

  1. 50 (too low)
  2. 75 (too high)
  3. 62 (too low)
  4. 69 (too low)
  5. 72 (too high)
  6. 71 (too high)
  7. 70 (that’s it!)

One more. Your buddy chooses the number 1. Your guesses will be:

  1. 50 (too high)
  2. 25 (too high)
  3. 13 (too high)
  4. 7 (too high)
  5. 4 (too high)
  6. 2 (too high)
  7. 1  (that’s it!)

But what if your buddy wants to up the stakes by increasing the scope to numbers between 1 and 1000? Say he picks 485:

  1. 500 (too high)
  2. 250 (too low)
  3. 375 (too low)
  4. 438 (too low)
  5. 469 (too low)
  6. 485 (that’s it!)

You got lucky on that last one, but you might’ve negotiated more guesses for yourself. To figure out how many guesses you need to guarantee a win, take log2(N) and round it up, where N is the highest number they can choose. So for 1–100 you need ceil(log2(100)) = 7 guesses, and for 1–1000 you need ceil(log2(1000)) = 10. Another tip: when calculating your next guess, always round your division UP. Notice that log2(7,000,000,000) is about 32.7, which is where I got the max of 33 steps above.
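
A quick sanity check in Python (just the standard library’s math module):

import math

def guesses_needed(n):
   """Guesses a binary strategy needs to pin down a number between 1 and n."""
   return int(math.ceil(math.log(n, 2)))

print guesses_needed(100)           # 7
print guesses_needed(1000)          # 10
print guesses_needed(7000000000)    # 33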

Ok, now on to the Kata!!

Binary Search in Python Take 1: Traditional Binary Search

This is just my take on a traditional binary search algorithm in Python:

def chop(needle, haystack):
   """
   Takes an integer target (needle) and a sorted list of integers (haystack),
   and performs a binary search for the target. Returns the integer index of the
   target in the list, or -1 if it's not found.
   """
   lo = haystack[0]   # assumes haystack[0] == 0 -- see the note on assumptions below
   hi = len(haystack)
   while lo < hi:
      mid = (lo+hi)/2
      midval = haystack[mid]
      if needle < midval:
         hi = mid
      elif needle > midval:
         lo = mid + 1   # the +1 keeps us from looping forever when the needle is missing
      else:
         print "Found %d at index %d" % (needle, mid)
         return mid
   return -1

This makes assumptions. Namely, it assumes a 0-indexed, sorted list with no missing values. So, my test lists are always output from range(N), where N is the number of elements I want to search. It works well enough. I think a CS prof would accept this. Let’s move on.
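
A quick check looks like this (the needle value is arbitrary):

haystack = range(1000000)
chop(756371, haystack)   # prints: Found 756371 at index 756371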

What if I break it in *THREE* (oooooohhhhhh)

If breaking the list in two is so cool, why don’t we break it in three pieces? We can do one extra check, and then cut out 2/3 of the haystack in one fell swoop! Ok, let’s see how it goes:

def chop3(needle, haystack):

   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      brk1 = (lo+hi)/3
      brk2 = (2 * (lo+hi)) / 3
      val1 = haystack[brk1]
      val2 = haystack[brk2]

      if needle < val1:
         hi = brk1
      elif needle > val1 and needle < val2:
         hi = brk2
         lo = brk1
      elif needle > val2:
         lo = brk2

This one attempts to cut the array into three parts instead of bisecting it. It won’t work, actually. At some point you’ll shrink the range down to where the lower bound is half the upper bound, and then all hell breaks loose. Think about how this works when lo is 16 and hi is 32: 16+32 = 48, and 48/3 = 16! That means brk1 never actually changes, and you’ve created perfect conditions for an infinite loop. (It also never checks whether the needle *equals* val1 or val2, so even a direct hit would spin forever, which is another thing the next version fixes.)

Instead, define the hi and lo variables and break points in the list differently:

def chop3_take2(needle, haystack):
   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      # break points are 1/3 and 2/3 of the way through the *current* lo..hi span
      incr = (hi-lo)/3
      brk1 = lo + incr
      brk2 = lo + (incr * 2)
      val1 = haystack[brk1]
      val2 = haystack[brk2]
      if needle == val1:
         print "Found %d at index %d" % (needle, brk1)
         break
      elif needle == val2:
         print "Found %d at index %d" % (needle, brk2)
         break
      elif needle < val1:
         hi = brk1
      elif needle < val2:
         lo = brk1 + 1   # the +1s guarantee progress even when incr works out to 0
         hi = brk2
      else:
         lo = brk2 + 1

This one will work. It’s slightly more heavy lifting than the two-way chop(), but not too bad. For the brk variables, I calculate an increment defined as 1/3 of the distance between lo and hi, so if lo=16 and hi=32, instead of brk1 becoming 16 and looping forever, it becomes the point 1/3 of the way between lo and hi: (32-16)/3 == 5 (the remainder in integer division is truncated), so brk1 is going to be 21 instead of 16. It makes fewer assumptions, which in my experience means fewer bugs.

From simple tests, it doesn’t appear that this extra bit munging buys you anything. On a list with 1 million elements, chop() and chop3_take2() run in almost the exact same amount of time. Increasing the list size to 10 million elements starts to show evidence that the initial chop() routine is going to scale better than chop3_take2(). I think chop() is also nicer to read and understand, which makes it a better candidate for production code.
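
Here’s roughly the sort of timing harness I mean (a rough sketch; it assumes the functions above live in the same script, and the needle value is arbitrary):

import timeit

setup = "from __main__ import chop, chop3_take2; haystack = range(10000000)"
for fn in ('chop', 'chop3_take2'):
   # keep `number` small, since both functions print when they find the needle
   t = timeit.timeit("%s(7654321, haystack)" % fn, setup=setup, number=10)
   print "%s: %f seconds for 10 runs" % (fn, t)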

Combine the two?

I sorta wondered what would happen if I only did the 2/3 reduction on the initial pass, and then passed the part of the array where the target exists to chop(). The function would look a lot like chop3_take2, but inside the if…elif parts, instead of setting new values for lo or hi, we’d pass the part of haystack bounded by lo and hi to chop().

A nice idea except that, if you recall, chop() assumes a zero-indexed array, and our chop3_take2() function sets new values for lo in some instances, so what we really need is a chop_non_zero_index() function. Here’s one:

def chop_non_zero_index(needle, haystack):
   # same as chop(), but lo starts at 0, so it works on any slice of the
   # original list (the index it reports is relative to the slice it's given)
   lo = 0
   hi = len(haystack)
   while lo < hi:
      mid = (lo+hi)/2
      midval = haystack[mid]
      if needle < midval:
         hi = mid
      elif needle > midval:
         lo = mid + 1
      else:
         print "Found %d at index %d" % (needle, mid)
         break

And then we pass portions of our haystack to chop_non_zero_index() from a function called chop_salad(). The chop_salad() function does an initial pass to cut out 2/3 of the candidates, then delegates to chop_non_zero_index() for the rest.

def chop_salad(needle, haystack):
   """Same as take 1 above, but this time we pass to chop_non_zero_index
   after the first pass.
   """
   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      incr = (hi-lo)/3
      print "increment = %d" % incr
      brk1 = lo+incr
      print "brk1 = %d" % brk1
      brk2 = lo + (incr * 2)
      print "brk2 = %d" % brk2
      val1 = haystack[brk1]
      val2 = haystack[brk2]

      if needle < val1:
         chop_non_zero_index(needle, haystack[:brk1+1])
         break
      elif needle > val1 and needle < val2:
         chop_non_zero_index(needle, haystack[brk1:brk2+1])
         break
      elif needle > val2:
         chop_non_zero_index(needle, haystack[brk2:])
         break
      elif needle == val1:
         print "Found it @val1!"
         break
      elif needle == val2:
         print "Found it @val2!"
         break

At this point, we’re getting into doing more list manipulations, more comparisons, passing stuff between functions… not nearly as clean and simple as plain old chop(), and it doesn’t buy you anything in terms of performance either.

Extend the List Object, and using Recursion

I almost never extend built-in objects like list or dict. I couldn’t remember the last time I did it, but I wondered if there was any benefit to adding a bisect() operation to a list object, so I gave it a shot. It’s not hard or cool or anything. Here’s my list object extension:

class mylist(list):
   def bisect(self):
      return [mylist(self[:len(self)/2]),mylist(self[len(self)/2:])]

And to use it, you create a haystack that’s a mylist object, then pass it to a recursive function that calls bisect():

def chop_bisect_object(needle, haystack):
   """Recursive search using mylist.bisect()."""
   if len(haystack) <= 1:
      if haystack and haystack[0] == needle:
         print "Found %d" % needle
      else:
         print "%d not found" % needle
      return
   head, tail = haystack.bisect()
   if needle <= head[-1]:
      chop_bisect_object(needle, head)
   else:
      chop_bisect_object(needle, tail)

if __name__ == "__main__":
   x = mylist(range(1000000))
   chop_bisect_object(756371, x)

Finally, Chef Guido Chops

Of course, this was all a fun exercise to get the synapses sparking. The reality is that if all you really want to do is find out where in a list some value lives, you call l.index(val), where val is what you’re looking for. In my tests it only ran slightly faster than the fastest algorithm I was able to come up with (for both smallish and largish haystacks), but it’s absolutely the most readable, and since it’s a single call, it leaves the fewest places for bugs to creep in, so it really is the right tool for the job in this case.
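
In other words (with the one caveat that index() raises ValueError when the value isn’t present, so wrap it in a try/except if that can happen):

>>> haystack = range(1000000)
>>> haystack.index(756371)
756371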

Onward

First, I don’t put forth these examples as flawless. Quite the contrary. I assume others have some cool ideas, or can point out some mistakes I’ve made, so please share in the comments!

I plan to go through the rest of the exercises in the Code Kata, and I also have a couple of interesting ones of my own I might throw in. I’ll try to post roughly once a week about this, so if you’re interested, subscribe to my RSS feed, or just the Python or CodeKata feeds.

Python, PostgreSQL, and psycopg2’s Dusty Corners

The last time I wrote code with psycopg2 was around 2006, but I got reacquainted with it over the past couple of weeks, and I wanted to make some notes on a couple of features that are not well documented, imho. Portions of this post have been snipped from mailing list threads I was involved in.

Calling PostgreSQL Functions with psycopg2

So you need to call a function. Me too. I had to call a function called ‘myapp.new_user’. It expects a bunch of input arguments. Here’s my first shot after misreading some piece of some example code somewhere:

qdict = {'fname': self.fname, 'lname': self.lname, 'dob': self.dob, 'city': self.city, 'state': self.state, 'zip': self.zipcode}

sqlcall = """SELECT * FROM myapp.new_user( %(fname)s, %(lname)s,
%(dob)s, %(city)s, %(state)s, %(zip)s""" % qdict

curs.execute(sqlcall)

There’s no reason this should work, or that anyone should expect it to work (and building SQL with string formatting like this is an injection risk, to boot). I just wanted to include it in case someone else made the same mistake. Sure, the proper arguments are put in their proper places in ‘sqlcall’, but they’re not quoted at all.

Of course, I foolishly tried going back and putting quotes around all of those named string formatting arguments, and of course that fails when you have something like a quoted “NULL” trying to move into a date column. It has other issues too, like being error-prone and a PITA, but hey, it was pre-coffee time.

What’s needed is a solution whereby psycopg2 takes care of the formatting for us, so that strings become strings, NULLs are passed in a way that PostgreSQL recognizes them, dates are passed in the proper format, and all that jazz.

My next attempt looked like this:

curs.execute("""SELECT * FROM myapp.new_user( %(fname)s, %(lname)s,
%(dob)s, %(city)s, %(state)s, %(zip)s""", qdict)

This is, according to some articles, blog posts, and at least one reply on the psycopg mailing list, “the right way” to call a function using psycopg2 with PostgreSQL. I’m here to tell you that this is not correct, to the best of my knowledge. The only real difference between this attempt and the last is that I’ve replaced the “%” with a comma, which turns what *was* a string formatting operation into a proper SELECT with a psycopg2-recognized parameter list. I thought this would get psycopg2 to “just work”, but no such luck. I still had some quoting issues.

I have no idea where I read the little tidbit about psycopg2 being able to convert between Python and PostgreSQL data types, but I did. Right around the same time, I was thinking “it’s goofy to issue a SELECT to call a function that doesn’t really want to SELECT anything. Can’t callproc() do this?” Turns out callproc() really is the right way to do this (where “right” is defined by the DB-API, the spec for writing a Python database module). It also turns out that psycopg2 can and will do the type conversions. Properly, even (in my experience so far).

So here’s what I got to work:

callproc_params = [self.fname, self.lname, self.dob, self.city, self.state, self.zipcode]

curs.callproc('myapp.new_user', callproc_params)

This is great! Zero manual quoting or string formatting at all! And no “SELECT”. Just call the procedure and pass the parameters. The only thing I had to change in my code was to make my ‘self.dob’ into a datetime.date() object, but that’s super easy, and after that psycopg2 takes care of the type conversion from a Python date to a PostgreSQL date. Tomorrow I’m actually going to try calling callproc() with a list object inside the second argument. Wish me luck!

A quick cursor gotcha

I made a really goofy mistake. At the root of it, what I did was share a connection *and a cursor object* among all methods of a class I created to abstract database operations out of my code. So, I did something like this (this is not the exact code, and it’s untested. Treat it like pseudocode):

import psycopg2

class MyData(object):
   def __init__(self, dsn):
      self.conn = psycopg2.connect(dsn)
      # sharing one cursor among all methods -- this turns out to be the mistake
      self.cursor = self.conn.cursor()

   def get_users_by_regdate(self, regdate, limit=100):
      self.cursor.arraysize = limit
      self.cursor.callproc('myapp.uid_by_regdate', regdate)
      while True:
         results = self.cursor.fetchmany()
         if not results:
            break
         for row in results:
            yield row[0]   # uid_by_regdate hands back one uid per row

   def user_is_subscribed(self, uid):
      self.cursor.callproc('myapp.uid_subscribed', uid)
      result = self.cursor.fetchone()
      return result[0]

Now, in the code that uses this class, I want to grab all of the users registered on a given date, and see if they’re subscribed to, say, a mailing list, an RSS feed, a service, or whatever. See if you can predict the issue I had when I executed this:

    db = MyData(dsn)
    idcount = 0
    skip_count = 0
    for id in db.get_users_by_regdate([joindate]):
        idcount += 1
        print idcount
        param = [id]
        if db.user_is_subscribed(param):
            print "User subscribed"
            skip_count += 1
            continue
        else:
            print "Not good"
            continue

Note that the above is test code. I don’t actually want to continue to the top of the loop regardless of what happens in production :)

So what I found happening is that, if I just commented out the portion of the code that makes a database call *inside* the for loop, I could print ‘idcount’ all the way up to thousands of results (however many results there were). But if I left it in, only 100 results made it to ‘db.user_is_subscribed’.

Hey, 100 is what I’d set curs.arraysize to! Hey, I’m using the *same cursor* to make both calls! And with the for loop, the cursor is being asked to produce a second recordset while it’s still in the middle of producing the first one!

Tom Roberts, on the psycopg list, states the issue concisely:

The cursor is stateful; it only contains information about the last query that was executed. On your first call to “fetchmany”, you fetch a block of results from the original query, and cache them. Then, db.user_is_subscribed calls “execute” again. The cursor now throws away all of the information about your first query, and fetches a new set of results. Presumably, user_is_subscribed then consumes that dataset and returns. Now, the cursor is positioned at the end of results. The rows you cached get returned by your iterator, then you call fetchmany again, but there’s nothing left to fetch…

…So, the lesson is if you need a new recordset, you create a new cursor.
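
For completeness, here’s a sketch of the same class with each method grabbing its own cursor (same hypothetical myapp functions as above, and just as untested as the original):

import psycopg2

class MyData(object):
   def __init__(self, dsn):
      self.conn = psycopg2.connect(dsn)

   def get_users_by_regdate(self, regdate, limit=100):
      curs = self.conn.cursor()       # a fresh cursor, so this result set is all ours
      curs.arraysize = limit
      curs.callproc('myapp.uid_by_regdate', regdate)
      while True:
         results = curs.fetchmany()
         if not results:
            break
         for row in results:
            yield row[0]

   def user_is_subscribed(self, uid):
      curs = self.conn.cursor()       # separate cursor, separate state
      curs.callproc('myapp.uid_subscribed', uid)
      return curs.fetchone()[0]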

Lesson learned. I still think it’d be nice if psycopg2 had more/better docs, though.