Category: Sysadmin

If You Don’t Date Your Work, It Sucks.

By m0j0, January 18, 2010 8:39 am

I probably get more upset than is reasonable when I come across articles with no date on them. I scroll furiously for a few minutes, try to see if the date was put in some stupid place like the fine print written in almost-white-on-white at the bottom of the post surrounded by ads. Then I skim the article looking for references to software versions that might clue me in on how old this material is. Then I check the sidebars to see if there’s some kind of “About this Post” block. Finally, I make a mental note of the domain in a little mental list I use to further filter my Google searches in the future. Then I close the browser window in disgust. If it weren’t completely gross and socially unacceptable to do so, I would spit on the floor every time this happened.

Why would you NOT date your articles? In almost every single theme for every single content management solution written in any language and backed by any database, “Date” is a default element. Why would you remove it? It is almost guaranteed to be more work to remove it. Why would you go through actual work to make your own writing less useful to others?

What happens when you don’t date your articles?

  1. People have no idea whether your article has anything to do with what they’re working on.  If you wrote an article about the Linux kernel in 1996, it’s of no use to me *now*, even if it was pretty hardcore at the time.
  2. Readers are forced to skim your article looking for references to software versions to see if your article is actually meaningful to them or not. Why make it hard for people to know whether your article is useful? The only reason I can think of is that you already know your articles are old, so not dating them insures that people at least skim enough to see some of the ads on your site. You are irreversibly lame if you do this.
  3. It causes near seizures in people like me who really hate when you don’t date your work, as well as all of your past teachers, who no doubt demanded that you sign and date your work.
  4. Every time you don’t date an article online, a seal pup is clubbed to death in the arctic, and a polar bear gets stranded on a piece of ice.

At some point, I will make an actual list of web sites that regularly do not date their work. A sort of hall of shame for sites that fail to link their writing to some kind of time-based context. If you have sites you’d like to add, let me know in the comments.

Head first into javascript (and jQuery)

By m0j0, January 12, 2010 8:56 pm

So, I had to take a break from doing the Code Katas just as I was getting to the really cool one about Bloom Filters. The reason for the unexpected break from kata-ing was that I had a project thrown into my lap. I say “project” not because it was some huge thing that needed doing — lots of people reading this could probably have done it in a matter of a few hours — but because it involved two things I’ve never done any real work with: javascript, and jQuery.

My task? Well, first I had to recreate a page from a graphic designer’s mockup. So, take a JPEG image and create the CSS and stuff to make that happen. Already I’m out of my comfort zone, because historically I’m a back-end developer more comfortable with threading than CSS (95% of my code is written in Python and is daemonized/threaded/multiprocess/all-of-the-above or worse), but at least I’ve done enough CSS to get by.

Once the CSS was done, I was informed that I’d now need to take the tabular reporting table I just created and make it sort on any of the columns, get the data via AJAX calls to a flat file that would store the JSON for the dummy data, create nice drop-down date pickers so date ranges in the report could be chosen by the end user, page the data using a flickr-style pager so only 20 lines would show up on a page, and alternate the row colors in the table (and make sure that doesn’t break when you sort!).

How to learn javascript and/or jQuery REALLY fast

How exactly do you learn enough javascript and jQuery to get this done in a matter of a few days (counting from *after* the CSS part was done)? Here are some links you should keep handy if you have a situation like this arise:

  • If Douglas Crockford says it, listen. I’d advise you start here (part I of a 4-part intro to javascript). That site also has his ‘Advanced Javascript’ series. He also wrote a book, which is small enough to read quickly, and well done.
  • Packt has a lot of decent resources for jQuery. Specifically, this article helped me organize what I was doing in my head. The code itself had some rather glaring issues — you’re not going to cut-n-paste this code and deploy it to production, but coming from scorched earth, I really learned a lot.
  • After the project was already over I found this nice writeup that covers quick code snippets and demos illustrating some niceties like sliding panels and disappearing table rows and how to do them with jQuery.
  • jQuery itself has some pretty decent documentation for those times when your cut-n-pasted code looks a little suspect or you’re just sure there’s a better way. Easy to read and concise.

Why I Wrote My Own Sorting/Paging in jQuery

Inevitably, someone out there is wondering why I didn’t just use tablesorter and tablesorter.pager, or Flexigrid, or something like that. The answer, in a nutshell, is paging. Sorting and paging operations, I learned both by experience and in my reading, *NEED* to know about each other. If they don’t, you’ll get lots of weird results, like sorting on just one page (or, sorting on just one page until you click to another page, which will look as expected, and then click back), or pages with rows on them that are just plain wrong, or… the list goes on. This is precisely the problem that the integrated “all-sorting-all-paging” tools like tablesorter try to solve. The issue is that I could not find a SINGLE ONE that did not have a narrow definition of what a pager was, and what it looked like.

I wanted (well, I was required to mimic the mockup, so “needed”) a flickr-style pager — modified. I needed to have each page of the report represented at the bottom of the report table by a block with the proper number in the block. The block would be clickable, and clicking it would show the corresponding page of data. This is more or less what Flickr does, but I didn’t need the “previous” and “next” buttons, and I didn’t need the “…” they use (rather effectively) to cut down on the number of required pager elements. So… just some blocks with page numbers. That’s it.

I started out using tablesorter for jQuery, and it worked great — it does the sorting for you, manages the alternating row colors, and is a pretty flexible sorter. Then I got to the paging part, and things went South pretty fast. While tablesorter.pager has a ‘moveToPage’ function, it’s not exposed so you can bind it to a CSS selector like the ‘moveToPrevious’, ‘moveToLast’, ‘moveToNext’ and other functions are. So, I tried to hack it into the pager code myself. I got weird results (though I feel more confident about approaching that now than I did even three days ago). There wasn’t any obvious way to do anything but give the user *only* first/last/previous/next buttons to control the paging. I moved on. I googled, I asked on jQuery IRC, I even wrote the developer of tablesorter. I got nothing.

I looked at 4 or 5 different tools, and was shocked to find the same thing! I didn’t go digging into the code of all of them, but their documentation all seemed to be in some kind of weird denial about the existence of flickr-style paging altogether!

So, I wrote my own. It wasn’t all that difficult, really. The code that worked was only slightly different from the code I’d fought with early on in the process. It just took some reading to get some of the basic tricks of the trade under my belt, and I got a tip or two from one of the gurus at work as well, and I was off to the races!

Lessons Learned

So, one thing I have to say for my boss is that he knows better than to throw *all* of those things at me at once. Had he come to me and said he wanted an uber-ajaxian reporting interface from outer space from the get-go, I might not have responded even as positively as I did (and I would rate my response as ‘tepid, but attempting a positive outlook’) . It’s best to draw me in slowly, a task at a time, so I can get some sense of accomplishment and some feedback along the way instead of feeling like I still have this mountain to climb before it’s over.

I certainly learned that this javascript and jQuery (and AJAX) stuff isn’t really black magic. Once you get your hands dirty with it it’s kinda fun. I still don’t ever want to become a front end developer on a full-time basis (browser testing is where I *really* have zero patience, either for myself or the browsers), but this experience will serve me well in making my own projects a little prettier and slicker, and nicer to use. It’ll also help me understand more about what the front end folks are dealing with, since there’s tons of javascript at myYearbook.

So, I hope this post keeps some back end scalability engineer’s face from turning white when they’re given a front-end AJAX project to do. If you’ve ever had a similar situation happen to you (not necessarily related to javascript, but other technologies you didn’t know until you were thrown into a project), let’s hear the war stories!!

2009 Python Meme

By m0j0, December 29, 2009 9:58 pm

Heard about this from Tarek, and you can find more of them on Planet Python (where I found Tarek’s post).

  1. What’s the coolest Python application, framework or library you have discovered in 2009?
  2. Probably Tornado. Tornado is an interesting application, because it blurs the line a bit between a framework like Django and a traditional web server. If you can picture it, it’s a barebones, lightweight, almost overly simplified Django, with a production-ready web server instead of Django’s built-in dev server. In reality, Tornado is (or feels) more integrated than that, but that leads to some interesting issues on its own.

    Still, it’s been a heckuva lot of fun to play with. One thing that always concerned me about Django was the ORM. It’s fine for my little hobby website, or a simple wiki for my wife and I to use, and even some slightly more complex applications, but if you have a database-driven site that serves “lots and lots” of users and it needs to manage complex relationships and never slow down…. I don’t trust the ORM to do that. What’s more, I’m actually pretty skilled in data modeling, database administration, etc., and I understand abstraction. I don’t really require Django’s models (though, again, I love Django for doing low-traffic sites very quickly).

    Playing with web frameworks is a lot of fun, and if you’ve played with a few, you’ll like the “clean slate” that Tornado provides you to mix-n-match your favorite features of the frameworks you’ve used. I’ve done some hacking around Tornado to provide some generic facilities I’m likely to use in just about every project I use Tornado for. The sort of pseudo-framework is available as Tornado-project-stub on github.

  3. What new programming technique did you learn in 2009 ?
  4. Thread and process pool management. Whereas in previous roles I focused on optimization at the system and network level by testing/deploying new tools, poking at new paradigms, or just reworking/overhauling things that were modeled or configured suboptimally, my new role is something you really would call “scalability engineering”. I believe everything I’m involved in at the moment involves the words “distributed”, “asynchronous”, “multithreaded”, “multiprocess”, and other terms that imply “fast” (for varying definitions of fast, depending on the project).

    Though I’ve had to understand threading and multiprocessing (and microthreads and other stuff too) in the past, and I’ve even written simple threaded/multiprocessing stuff in the past, I’m now knee deep in it, and am getting into more complex scenarios where I really *need* a pool manager, and would really *like* to pass objects around on the network. Happily, I’m finding that Python has facilities for all of this built-in.

    Aside from that, I have to say that while most of what I’m doing now doesn’t involve techniques I’ve never heard of, I’m really reveling in the opportunity to put them into practice and actually use them. Also, since I now code full-time, I find the ability to code doesn’t ever escape my brain. I can code fast enough now that I can implement something two or three different ways to compare the solutions in no time!

  5. What’s the name of the open source project you contributed the most in 2009 ? What did you do ?
  6. Actually, it’s not released yet, but I’ve almost completed a rewrite of Golconde, a queue-based replication engine. I was able to make the queuing protocol, the message processor (the thing that takes queued messages and turns them into database operations), and the database backend swappable. Golconde was written to support STOMP queuing and PostgreSQL specifically. I’ve already used STOMP and AMQP interchangeably in the rewrite, and I’m now on to swapping out the database and message processor components, documenting how users can create their own plugins for these bits along the way.

    Golconde is also threaded. My rewrite is currently threaded as well, but I’m going to change out the threads in favor of processes, because the facilities that can help the project moving forward (in the short term, even) come more from multiprocessing than from threading. One thing I’ve already accomplished is refactoring so that there need be only a single thread class, which makes using worker pools more convenient and natural. It’s coming together!

  7. What was the Python blog or website you read the most in 2009 ?
  8. I read Planet Python every day, and keep up with Python Reddit as well. Besides aggregators, Corey Goldberg and Jesse Noller seem to overlap my areas of interest a lot, so I find myself actually showing up at their blogs pretty often. Neither of them blog enough, imo ;-)
  9. What are the three top things you want to learn in 2010?
    1. Nose – because I want to become so good at testing that it makes me more productive, not less. Right now I’m just clumsy at testing, and I come across situations that I just plain don’t know how to write a test for. I need to know more about writing mock objects and how to write tests for threaded/multiprocessing applications. I know enough to think that Nose is probably the way to go (I’ve used both unittest and doctest before, so I’m not totally ‘green’ to the notion of testing in general), but I haven’t been able to work it into my development process yet.
    2. Erlang – There doesn’t seem to be a language that makes concurrency quite as brainless as Erlang. That said, learning the language and OTP platform is *not* brainless.
    3. Sphinx – I hear all the cool kids are using it. Some people whose opinions I trust love it, but I have some reservations based on my experience with it. The one thing that gives me hope is that Django’s documentation site, which I like the interface and features of, uses it.

CodeKata 4: Data Munging

By m0j0, December 28, 2009 11:10 pm

I’m continuing to take on the items in Dave Thomas’s Code Kata collection. It’s a nice way to spend a Sunday night, and it’s a good way to get my brain going again before work on Monday morning. It’s also fun and educational :)

CodeKata 4 is called “Data Munging”. It’s not very difficult data munging, really. I think the more interesting bit of the Kata is how to deal with files in different languages, what tools are well suited to the task, and trying to minimize code duplication.

Description

Code Kata 4 instructs us to download weather.dat, a listing of weather information for each day of June, 2002 in Morristown, NJ. We’re to write a program that identifies the day which has the smallest gap between min and max temperatures.

Then, it says to download football.dat, a listing of season statistics for Soccer teams. We’re to write a program that identifies the team with the smallest gap between goals scored for the team, and goals scored against the team.

Once those are done, we’re asked to try to factor out as much duplicate code as possible between the two programs, and then we’re asked a few questions to help us think more deeply about what just transpired.

My Attack Plan

The first thing that jumped into my brain when I saw the data files was “Awk would be perfect for this”. I fiddled with that for a little too long (my awk is a little rusty), and came up with this (for weather.dat):

>$ awk 'BEGIN {min=100000}; $1 ~ /^[1-3]/ {x[$1]=$2-$3; if (x[$1]<min){ min=x[$1]; winner=$1}} END {print winner, min} ' weather.dat.dat
14 2

It works, and awk, though it gets ugly to some, reads in a nice, linear way to me. You give it a filter expression, and then statements to act on the matching lines (in braces). What could be simpler?

After proving to myself that I hadn’t completely lost my awk-fu, I went about writing a Python script to deal with the problem. I read ahead in the problem description, though, and so my script contains separate blocks for the two data sets in one script:

#!/usr/bin/env python
import sys
import string

data = open(sys.argv[1], 'r').readlines()
data.sort()

if 'weather' in sys.argv[1]:
   winner = 1000000
   winnerday = None

   for line in data:
      #filter out lines that aren't numbered.
      if line.strip().startswith(tuple(string.digits)):
         # we only need the first three fields to do our work
         l = line.split()[:3]
         # some temps have asterisks attached to them.
         maxt = l[1].strip(string.punctuation)
         mint = l[2].strip(string.punctuation)
         diff = int(maxt) - int(mint)
         if diff < winner:
            winner = diff
            winnerday = l[0]
   print "On day %s, the temp difference was only %d degrees!" % (winnerday, winner)

if 'football' in sys.argv[1]:
   winner = 1000000
   winnerteam = None

   for line in data:
      if line.strip().startswith(tuple(string.digits)):
         l = line.split()
         team, f, a = l[1], int(l[6]), int(l[8])
         diff = abs(f - a)
         if diff < winner:
            winner = diff
            winnerteam = team
   print "Team %s had a for/against gap of only %d points!" % (winnerteam, winner)

Really, the logic employed is not much different from the awk solution:

  1. Set a default for ‘winner’ that’s unlikely to be rivaled by the data :)
  2. Set the default for the winning team or day to ‘None’
  3. Filter out unwanted lines in the dataset.
  4. Grab bits of each line that are useful.
  5. Assign each useful bit to a variable.
  6. Do math.
  7. Do comparisons
  8. Present results.

Refactoring

Part 2 of the Kata says to factor out as much duplicate code as possible. I was able to factor out almost all of it on the first shot at refactoring, leaving only the line of code (per file) to identify the relevant columns of each data set:

#!/usr/bin/env python
import sys
import string

data = open(sys.argv[1], 'r').readlines()
data.sort()
winner_val = 1000000
winner_id = None

for line in data:
   if line.strip().startswith(tuple(string.digits)):
      l = line.split()
      if 'weather' in sys.argv[1]:
         identifier, minuend, subtrahend = l[0], int(l[1].strip(string.punctuation)), int(l[2].strip(string.punctuation))
      elif 'football' in sys.argv[1]:
         identifier, minuend, subtrahend = l[1], int(l[6]), int(l[8])
      diff = abs(minuend - subtrahend)
      if diff < winner_val:
         winner_val = diff
         winner_id = identifier

print winner_id, winner_val

Not too bad. I could’ve done some other things to make things work differently: for example I could’ve let the user feed in the column header names of the ‘identifier’, ‘minuend’ and ’subtrahend’ columns in each data set, and then I could just *not* parse out the header line and instead use it to identify the list index positions of the bits I need for each line. It’d make the whole thing ‘just work’. It would also require more effort from the user. On the other hand, it would make things “just work” for just about any file with numbered lines of columnar data.

I have to admit that the minute I see columnar data like this, awk is the first thing I reach for, so I’m sure this affected my Python solution. The good news there is that my thinking toward columnar data is consistent, and so I treated both files pretty much the same way, making refactoring a 5-minute process.

In all, I enjoyed this Kata. Though I didn’t take it as far as I could have, it did make me think about how it could be improved and made more generic. Those improvements could incur a cost in terms of readability I suppose, but I think for this example it wouldn’t be a problem. I’m working on a larger project now where I have precisely this issue of flexibility vs. readability though.

I’m reengineering a rather gangly application to enable things like pluggable… everything. It talks to networked queues, so the protocol is pluggable. It talks to databases, so the database back end is pluggable, in addition to the actual data processing routines. Enabling this level of flexibility introduces some complexity, and really requires good documentation if we reasonably expect people to work with our code (the project should be released as open source in the coming weeks). Without the documentation, I’m sure I’d have trouble maintaining the code myself!

PyYaml with Aliases and Anchors

By m0j0, December 22, 2009 8:10 am

I didn’t know this little tidbit until yesterday and want to get it posted so I can refer to it later.

I have this YAML config file that’s kinda long and has a lot of duplication in it. This isn’t what I’m working on, but let’s just say that you have a bunch of backup targets defined in your YAML config file, and your program rocks because each backup target can be defined to go to a different destination. Awesome, right?

Well, it might be, but it might also just make your YAML config file grotesque (and error-prone). Here’s an example:

Backups:
    Home_Jonesy:
        host: foo
        dir: /Users/jonesy
        protocol: ssh
        keyloc: ~/.ssh/id_rsa.pub
        Destination:
            host: bar
            dir: /mnt/array23/homes/jonesy
            check_space: true
            min_space: 80G
            num_archives: 4
            compress: bzip2
    Home_Molly:
        host: eggs
        dir: /Users/molly
        protocol: sftp
        keyloc: ~/.ssh/id_rsa.pub
        Destination:
            host: bar
            dir: /mnt/array23/homes/jonesy
            check_space: true
            min_space: 80G
            num_archives: 4
            compress: bzip2

Now with two backups, this isn’t so bad. But if your environment has 100 backup targets and only one destination, or…. heck — even if there are three destinations — should you have to write out the definition of those same three destinations for each of 100 backup targets? What if you need to change how one of the destinations is connected to, or the name of a destination changes, or array23 dies?

Ideally, you’d be able to reference the same definition in as many places as you need it and have things “just work”, and if something needs to change, you just change it in one place. Enter anchors and aliases.

An anchor is defined just like anything else in YAML with the exception that you get to label the definition block using “&labelname”, and then you can (de)reference it elsewhere in your config with “*labelname”. So here’s how our above configuration would look:

BackupDestination-23: &Backup_To_ARRAY23
    host: bar
    dir: /mnt/array23/homes/jonesy
    check_space: true
    min_space: 80G
    num_archives: 4
    compress: bzip2
Backups:
    Home_Jonesy:
        host: foo
        dir: /Users/jonesy
        protocol: ssh
        keyloc: ~/.ssh/id_rsa.pub
        Destination: *Backup_To_ARRAY23
    Home_Molly:
        host: eggs
        dir: /Users/molly
        protocol: sftp
        keyloc: ~/.ssh/id_rsa.pub
        Destination: *Backup_To_ARRAY23

With only two backup targets, the benefit is small, but keep trying to imagine this config file with about 100 backup targets, and only one or two destinations. This removes a lot of duplication and makes things easier to change and maintain (and read!)

The cool thing about it is that if you already have code that reads the YAML config file, you don’t have to change it at all — PyYaml expands everything for you. Here’s a quick interpreter session:

>>> import yaml
>>> from pprint import pprint
>>> stream = file('foo.yaml', 'r')
>>> cfg = yaml.load(stream)
>>> pprint(cfg)
{'BackupDestination-23': {'check_space': True,
                          'compress': 'bzip2',
                          'dir': '/mnt/array23/homes/jonesy',
                          'host': 'bar',
                          'min_space': '80G',
                          'num_archives': 4},
 'Backups': {'Home_Jonesy': {'Destination': {'check_space': True,
                                             'compress': 'bzip2',
                                             'dir': '/mnt/array23/homes/jonesy',
                                             'host': 'bar',
                                             'min_space': '80G',
                                             'num_archives': 4},
                             'dir': '/Users/jonesy',
                             'host': 'foo',
                             'keyloc': '~/.ssh/id_rsa.pub',
                             'protocol': 'ssh'},
             'Home_Molly': {'Destination': {'check_space': True,
                                            'compress': 'bzip2',
                                            'dir': '/mnt/array23/homes/jonesy',
                                            'host': 'bar',
                                            'min_space': '80G',
                                            'num_archives': 4},
                            'dir': '/Users/molly',
                            'host': 'eggs',
                            'keyloc': '~/.ssh/id_rsa.pub',
                            'protocol': 'sftp'}}}

…And notice how everything has been expanded.

Enjoy!

Python, PostgreSQL, and psycopg2’s Dusty Corners

By m0j0, December 1, 2009 10:07 pm

Last time I wrote code with psycopg2 was around 2006, but I was reacquainted with it over the past couple of weeks, and I wanted to make some notes on a couple of features that are not well documented, imho. Portions of this post have been snipped from mailing list threads I was involved in.

Calling PostgreSQL Functions with psycopg2

So you need to call a function. Me too. I had to call a function called ‘myapp.new_user’. It expects a bunch of input arguments. Here’s my first shot after misreading some piece of some example code somewhere:

qdict = {'fname': self.fname, 'lname': self.lname, 'dob': self.dob, 'city': self.city, 'state': self.state, 'zip': self.zipcode}

sqlcall = """SELECT * FROM myapp.new_user( %(fname)s, %(lname)s,
%(dob)s, %(city)s, %(state)s, %(zip)s""" % qdict

curs.execute(sqlcall)

There’s no reason this should work, or that anyone should expect it to work. I just wanted to include it in case someone else made the same mistake. Sure, the proper arguments are put in their proper places in ’sqlcall’, but they’re not quoted at all.

Of course, I foolishly tried going back and putting quotes around all of those named string formatting arguments, and of course that fails when you have something like a quoted “NULL” trying to move into a date column. It has other issues too, like being error-prone and a PITA, but hey, it was pre-coffee time.

What’s needed is a solution whereby psycopg2 takes care of the formatting for us, so that strings become strings, NULLs are passed in a way that PostgreSQL recognizes them, dates are passed in the proper format, and all that jazz.

My next attempt looked like this:

curs.execute("""SELECT * FROM myapp.new_user( %(fname)s, %(lname)s,
%(dob)s, %(city)s, %(state)s, %(zip)s""", qdict)

This is, according to some articles, blog posts, and at least one reply on the psycopg mailing list “the right way” to call a function using psycopg2 with PostgreSQL. I’m here to tell you that this is not correct to the best of my knowledge.The only real difference between this attempt and the last is I’ve replaced the “%” with a comma, which turns what *was* a string formatting operation into a proper SELECT with a psycopg2-recognized parameter list. I thought this would get psycopg2 to “just work”, but no such luck. I still had some quoting issues.

I have no idea where I read this little tidbit about psycopg2 being able to convert between Python and PostgreSQL data types, but I did. Right around the same time I was thinking “it’s goofy to issue a SELECT to call a function that doesn’t really want to SELECT anything. Can’t callproc() do this?” Turns out callproc() is really the right way to do this (where “right” is defined by the DB-API which is the spec for writing a Python database module). Also turns out that psycopg2 can and will do the type conversions. Properly, even (in my experience so far).

So here’s what I got to work:

callproc_params = [self.fname, self.lname, self.dob, self.city, self.state, self.zipcode]

curs.callproc('myapp.new_user', callproc_params)

This is great! Zero manual quoting or string formatting at all! And no “SELECT”. Just call the procedure and pass the parameters. The only thing I had to change in my code was to make my ’self.dob’ into a datetime.date() object, but that’s super easy, and after that psycopg2 takes care of the type conversion from a Python date to a PostgreSQL date. Tomorrow I’m actually going to try calling callproc() with a list object inside the second argument. Wish me luck!

A quick cursor gotcha

I made a really goofy mistake. At the root of it, what I did was share a connection *and a cursor object* among all methods of a class I created to abstract database operations out of my code. So, I did something like this (this is not the exact code, and it’s untested. Treat it like pseudocode):

class MyData(object):
   def __init__(self, dsn):
      self.conn = psycopg2.Connection(dsn)
      self.cursor = self.conn.cursor()

   def get_users_by_regdate(self, regdate, limit):
      self.cursor.arraysize = limit
      self.cursor.callproc('myapp.uid_by_regdate', regdate)
      while True:
         result = self.cursor.fetchmany()
         if not result:
            break
         yield result

   def user_is_subscribed(self, uid):
      self.cursor.callproc('myapp.uid_subscribed', uid)
      result = self.cursor.fetchone()
      val = result[0]
      return val

Now, in the code that uses this class, I want to grab all of the users registered on a given date, and see if they’re subscribed to, say, a mailing list, an RSS feed, a service, or whatever. See if you can predict the issue I had when I executed this:

    db = MyData(dsn)
    for id in db.get_users_by_regdate([joindate]):
        idcount += 1
        print idcount
        param = [id]
        if db.user_is_subscribed(param):
            print "User subscribed"
            skip_count += 1
            continue
        else:
            print "Not good"
            continue

Note that the above is test code. I don’t actually want to continue to the top of the loop regardless of what happens in production :)

So what I found happening is that, if I just commented out the portion of the code that makes a database call *inside* the for loop, I could print ‘idcount’ all the way up to thousands of results (however many results there were). But if I left it in, only 100 results made it to ‘db.user_is_subscribed’.

Hey, ‘100′ is what I’d set the curs.arraysize() to! Hey, I’m using the *same cursor* to make both calls! And with the for loop, the cursor is being called upon to produce one recordset while it’s still trying to produce the first recordset!

Tom Roberts, on the psycopg list, states the issue concisely:

The cursor is stateful; it only contains information about the last
query that was executed. On your first call to “fetchmany”, you fetch a
block of results from the original query, and cache them. Then,
db.user_is_subscribed calls “execute” again. The cursor now throws away all
of the information about your first query, and fetches a new set of
results. Presumably, user_is_subscribed then consumes that dataset and
returns. Now, the cursor is position at end of results. The rows you
cached get returned by your iterator, then you call fetchmany again, but
there’s nothing left to fetch…

…So, the lesson is if you need a new recordset, you create a new cursor.

Lesson learned. I still think it’d be nice if psycopg2 had more/better docs, though.

A Couple of Python Tools for Tornado or AMQP Users

By m0j0, November 23, 2009 9:20 am

Hi all,

Not lots of time to blog since starting my new gig as Senior Operations Developer at MyYearbook.com (which is pretty awesome, btw), but wanted to let anyone who’s interested know about a couple of tools I’ve released on github:

bunny is a CLI tool for doing quick tests and simple operations on an AMQP server (I use RabbitMQ for testing, but it should work with any AMQP server). It’ll do your basic create/delete of exchanges, queues, and bindings. It’s only capable of doing things that are part of AMQP, and I still need to finish the feature that tests functionality via a ’round trip’ test, where it sends a message, then gets it, then sends the body to stdout. If you find time before me to finish that, send a pull request or patch.

curtain is a mixin for use with the Tornado web server that provides HTTP Digest Authentication. I actually started with this recipe and altered/updated it in various ways. It’s still not complete according to RFC 2617 — I need to code for a few edge cases like getting a stale nonce, but what’s there does work, so if you (like me) were wondering what the performance comparison would look like between tornado doing digest auth and having a front-end proxy do it, have at it :)

Python Quirks in Cmd, urllib2, and decorators

By m0j0, October 28, 2009 10:06 pm

So, if you haven’t been following along, Python programming now occupies the bulk of my work day.

While I really like writing code and like it more using Python, no language is without its quirks. Let me say up front that I don’t consider these quirks bugs or big hulking issues. I’m not trying to bash the language. I’m just trying to help folks who trip over some of these things that I found to be slightly less than obvious.

Python’s Cmd Module and Handling Arguments

Using the Python Cmd module lets you create a program that provides an interactive shell interface to your users. It’s really simple, too. You just create a class that inherits from cmd.Cmd, and define a bunch of methods named do_<something>, where <something> is the actual command your user will run in your custom shell.

So if you want users to be able to launch your app, be greeted with a prompt, type “hello”, and have something happen in response, you just define a method called “do_hello” and whatever code you put there will be run when a user types “hello” in your shell. Here’s what that would look like:

import cmd

class MyShell(cmd.Cmd):
   def do_hello(self):
      print "Hello!"

# Kick off the shell
shell = MyShell()
shell.cmdloop()

Of course, what’s a shell without command line options and arguments? For example, I created a shell-based app using Cmd that allowed users to run a ‘connect’ command with arguments for host, port, user, and password. Within the shell, the command would look something like this:

> connect -h mybox -u jonesy -p mypass

Note that the “>” is the prompt, not part of the command.

The idea here is that you pass the arguments to the option flags, and you can set sane defaults in the application for missing args (for example, I didn’t provide a port here — I’m leaning on a default, but I did provide a host, since the default might be ‘localhost’).

Passing just one, single-word argument with Cmd is dead easy, because all of the command methods receive a string that contains *everything* on the line after the actual command. If you’re expecting such an argument, just make sure your ‘do_something’ method accepts the incoming string. So, to let users see what “hello” looks like in Spanish, we can accept “esp” as an argument to our command:

class MyShell(cmd.Cmd):
   def do_hello(self, arg):
      print "Hello! %s" % arg

The problems come when you want more than one argument, or when you want flags with arguments. For example, in the earlier “connect” example, my “do_connect” method is still only going to get one big, long string passed to it — not a list of arguments. So where in a normal program you might do something like:

class MyShell(cmd.Cmd):
   def do_connect(self, host='localhost', port='42', user='guest', password='guest'):
      #...connection code here...

In a Cmd method, you’re just going to define it like we did the do_hello method above: it takes ’self’ and ‘args’, where ‘args’ is one long line.

A couple of quick workarounds I’ve tried:

Parse the line yourself. I created a method in my Cmd app called ‘parseargs’ that just takes the big long line and returns a dictionary. My specific application only takes ‘name=value’ arguments, so I do this:

         d = dict([arg.split('=') for arg in args.split()])

And return the dictionary to the calling method. My connect method can then check for keys in the dictionary and set things up. It’s longer an a little more arduous, but not too bad.

Use optparse. You can instantiate a parser right inside your do_x methods. If you have a lot of methods that all need to take several flags and args, this could become cumbersome, but for one or two it’s not so bad. The key to doing this is creating a list from the Big Long Line and passing it to the parse_args() method of your parser object. Here’s what it looks like:

class MyShell(cmd.Cmd):
   def do_touch(self, line):
      parser = optparse.OptionParser()
      parser.add_option('-f', '--file', dest='fname')
      parser.add_option('-d', '--dir', dest='dir')
      (options,args) = parser.parse_args(line.split())

      print "Directory: %s" % options.dir
      print "File name: %s" % options.fname

This method is just an example, so don’t scratch your head looking for “import os” or anything :)

This is probably the more elegant solution, since it doesn’t require you to restrict your users to passing args in a particular way, and doesn’t require you to come up with fancy CLI argument parsing algorithms.

Using urllib2 for Pure XML Over HTTP

I wrote a web service client this week that does pure XML over HTTP to send queries to a service. I’ve written things like this before using Python, but it turns out, after looking back at my code, I was always either using XMLRPC, SOAP, or going through some wrapper that hid a lot from me in an effort to make my life easier (like the Google Data API). I’ve never had to try to send a pure XML payload over the wire to a web server.

I figured urllib2 was going to help me out here, and it did, but not before going through some pain due mainly to an odd pattern in various sources of documentation on the topic. I read docs at python.org, effbot.org, a couple of blogs, and did a Google search, and everything, everywhere, seems to indicate that the urllib2.Request object’s optional “data” argument expects a urlencoded string. From http://docs.python.org/library/urllib2.html?highlight=urllib2.request#urllib2.Request

data should be a buffer in the standard application/x-www-form-urlencoded format

The examples on every site I’ve found always pass whatever ‘data’ is through urllib.urlencode() before adding it to the request. I figured urllib2 was no longer my friend, and almost started looking at implementing an HTTPSClient object. Instead I decided to try just passing my unencoded data. What’s it gonna do, detect that my data wasn’t urlencoded? Maybe I’d learn something.

I learned that all of the documentation fails to account for this particular edge case. Go ahead and pass whatever the heck you want in ‘data’. If it’s what the server on the other end expects, you’ll be fine. :)

Decorators

I found myself in dark, dusty corners when I had to decide how and where inside of a much larger piece of code to implement a feature. I really wanted to use a decorator, and still think that’s what I’ll wind up doing, but then how to implement the decorator isn’t as straightforward as I’d like either.

Decorators are used to alter how a decorated function operates. They’re amazingly useful, because instead of implementing some bit of code in a bunch of methods that themselves live inside a bunch of classes across various modules, or creating an entire class or mixin to inherit from when you only need the code overhead in a couple of edge cases, you can just create a decorator and apply it only to the proper methods or functions.

The lesson I learned is to try very hard to make one solid decision about how your decorator will work up front. Will it be a class? That’s done somewhat differently than doing it with a function. Will the decorator take arguments? That’s handled differently in both implementations, and also requires changes to an existing decorator class that didn’t used to take arguments. I don’t know why I expected this to be more straightforward, but I totally did.

If you’re new to decorators or haven’t had to dig into them too deeply, I highly recommend Bruce Eckel’s series introducing Python decorators, which walks you through all of the various ways to implement them. Part I (of 3) is here.

New Job, Car, Baby, and Other News

By m0j0, October 22, 2009 9:02 pm

New Baby!

I know this is my geek blog, but geeks have kids too, so first I want to announce the birth of our second daughter, Sadie, who was born on September 15th. She’s now over a month old. This is the first time I’ve stayed up late enough to blog about her. Everyone is healthy, if slightly sleep-deprived :)

New Job!

The day before Sadie’s birth, I got a call with an offer for a job. A *full-time* job, as a Senior Operations Developer for MyYearbook.com. After learning about the cool and very geeky things going on at MyYearbook during the interview process, I couldn’t turn it down. I started on October 5, and it’s been a blast digging into all of the cool stuff going on there. While I’m certainly doing my fair share of PHP code review, maintenance, and general coding, I’m also getting plenty of hours in working out the Python side of my brain. I’m finding that while it’s easier switching gears than I had anticipated, I do make some really funny minor syntax errors, like using dot notation to access object attributes in PHP ;-P

What I find super exciting is something that might turn some peoples’ stomachs: at the end of my first week, I sat back and looked at my monitors to find roughly 15 tabs in Firefox open to pages explaining various tools I’d never gotten to use, protocols I’ve never heard of, etc. I had my laptop and desktop both configured with 2 virtual machines for testing and playing with new stuff. I had something north of 25 terminal windows open, and 8 files open in Komodo Edit.

Now THAT, THAT is FUN!

The projects I’m working on run the gamut from code cleanups that nobody else has had time to do (a good tool for getting my brain wrapped around various parts of the code base), to working on scalability solutions and new offerings involving my background in coding *and* system administration. It’s like someone cherry-picked a Bay Area startup and dropped it randomly 30 minutes from my house.

My own business is officially “not taking new clients”. I have some regular clients that I still do work for, so my “regulars” are still being served, but they’ve all been put on notice that I’m unavailable until the new year.

New Car!

I’m less excited about the new car, really. I used to drive a Jeep Liberty, and I loved it. However, in early September, before Sadie’s arrival, it became clear to me that putting two car seats in that beast wasn’t going to happen. The Jeep is great for drivers, and it has some cargo space. It’s not a great vehicle for passengers, though.

At the same time, I was running a business (this was before the job offer came along), and I was finding myself slightly uncomfortable delivering rather serious business proposals in a well-used 2003 Jeep. So, I needed something that could fit my young family (my oldest is 2 yrs), and that was presentable to clients. So, I got a Lexus ES350.

I like most things about the car, except for the audio system. It seems schizophrenic to me to have like 6 sound ‘zones’ to isolate the audio to certain sets of speakers, but then controls like bass and treble only go from 0 to 5. Huh? And the sound always sounds like it’s lying on the floor for some reason. It’s not at all immersive. The sound system on my Jeep completely kicked ass. I miss it. A lot.

Other News

I’ve submitted an article to Python Magazine about my (relatively) recent work with Django and my (temporarily stalled) overhaul of LinuxLaboratory.org, and my experiences with various learning resources related to Django. If you’re looking to get into Django, it’s probably a good read.

I’ve been getting into some areas of Python that were previously dark, dusty corners, so hopefully I’ll be writing more about Python here, because writing about something helps me to solidify things in my own brain. Short of that, it serves as a future reference point in case it didn’t get solidified enough :)

My sister launched The Dance Jones, a blog where she talks about fitness, balance, dance, and stuff I should probably pay much more attention to (I’m close to declaring war on my gut). Also, if you ever wanted to know how to shoulder shimmy (and who hasn’t wanted to do that?), you should check it out :)

Sys/DB Admin and Coder Seeks Others To Build Web “A-Team”

By m0j0, September 9, 2009 10:20 am

UPDATE: There’s no location requirement. I kind of assume that I’m not going to find the best people by geographically limiting my search for potential partners. :)

Me: Live in Princeton, NJ area. Over 10 years experience with UNIX/Linux administration, databases and data modeling, and PHP/Perl. About 3 years experience using Python for back-end scripting and system automation, and less than a year of Django experience. Former Director of Technology for AddThis.com (it was bought out), Infrastructure Architect at cs.princeton.edu, and systems consultant/trainer. Creator of Python Magazine, former Editor in Chief of both php|architect *and* Python Magazine, and co-author of “Linux Server Hacks, volume 2″ (O’Reilly).

You are one of these:

  • Web graphic designer who has worked on several web-based projects for clients in various industries, understands current best practices and standards, has the tools and experience necessary to create custom graphics, and has some familiarity (secondarily) with PHP and/or Python, Javascript and Ajax. If you regularly make use of table-based web designs or ActiveX controls, this isn’t you.
  • Hardcore web developer with at least 6 years experience doing nothing but web-based projects using Javascript and (at some point) *both* PHP and Python, and has worked with or has an interest in Django, Cake, and other frameworks, and understands that client needs often don’t coincide with the religion of fanboyism. Knowledge of Javascript, Ajax, web standards and security is essential here. If your last “big project” was volunteer work to build a website for your kid’s soccer team, this isn’t you.
  • A generalist webmaster (sysadmin/db admin/scripter) with at least 6 years experience working in production *nix environments with good familiarity in the areas of high availability, web servers (specifically Apache), proxy servers and monitoring, and has worked with/supported users like the ones mentioned above on web-based projects. If you have to look at the documentation to figure out how to implement a 301 redirect, this probably isn’t you.

Experience working on a team in larger projects with multiple people would be good. Note that I’m looking for people to partner with on projects, I’m not hiring full time employees. Future partnership in a proper business is certainly a possibility, but… baby steps! I do have a couple of domains that would be great for use with this kind of project if it ever progresses that far :-)

I know that other people are out there looking for people to partner with on projects, but there doesn’t appear to be a common place for them to interact. Maybe that can be a project we undertake together :)   — if there *is* a place where people meet up for this kind of thing, let me know!

Let’s have fun, and take over the world! Shoot me an email at “bkjones” @ Google’s mail domain.

Panorama Theme by Themocracy