Category: Linux

Brain Fried Over NoSQL

By m0j0, June 26, 2010 10:16 pm

So, I’m working on a pet project. It’s in stealth mode. Just kidding — I don’t believe in stealth mode ;-)

It’s a Twitter analytics dashboard that actually does useful things with the mountains of data available from the various Twitter APIs. I’m writing it in Python using Tornado. Here’s the first mockup I ever did for it, just like 2 nights ago:

It’s already a lot of fun. I’ve worked with Tornado before and like it a lot. I have most of the base infrastructure questions answered, because this is a pet project and they’re mostly easy and in some sense “don’t matter”. But that’s what has me stuck.

It Doesn’t Matter

It’s true. Past a certain point, belaboring choices of what tools to use where is pointless and is probably premature optimization. I’ve been working with startups for the past few years, and I’m painfully aware of what happens when a company takes too long to react to their popularity. I want to architect around that at the start, but I’m resisting. It’s a pet project.

But if it doesn’t matter, that means I can choose tools that are going to be fun to dig into and learn about. I’ve been so busy writing code to help avoid or buffer impact to the database that I haven’t played a whole lot with the NoSQL choices out there, and there are tons of them. And they all have a different world view and a unique approach to providing solutions to what I see as somewhat different problems.

Why NoSQL?

Why not? I’ve been working with relational database systems since 1998. I worked on large data reporting projects, a couple of huge data warehousing projects, financial transaction systems, I worked for Sybase as a consulting DBA and project manager for a while, I was into MySQL and PostgreSQL by 2000, used them in production environments starting around 2001-02… I understand them fairly well. I also understand BDB and other “flat-file” databases and object stores. SQLite has become unavoidable in the past few years as well. It’s not like I don’t understand the compromises I’m making going to a NoSQL system.

There’s a good bit of talk from the RDBMS camp (seriously, why do they need their own camp?) about why NoSQL is bad. Lots of people who know me would put me in the RDBMS camp, and I’m telling you not to cry yourself to sleep out of guilt over a desire to get to know these systems. They’re interesting, and they solve some huge issues surrounding scalability with greater ease than an RDBMS.

Like what? Well, cost for one. If I could afford Oracle I’d sooner use that than go NoSQL in all likelihood. I can’t afford it. Not even close. Oracle might as well charge me a small planet for their product. It’s great stuff, but out of reach. And what about sharding? Sharding a relational database sucks, and to try to hide the fact that it sucks requires you to pile on all kinds of other crap like query proxies, pools, and replication engines, all in an effort to make this beast do something it wasn’t meant to do: scale beyond a single box. All this stuff also attempts to mask the reality that you’ve also thrown your hands in the air with respect to at least 2 letters that make up the ACID acronym. What’s an RDBMS buying you at that point? Complexity.

And there’s another cost, by the way: no startup I know has the kind of enormous hardware that an enterprise has. They have access to commodity hardware. Pizza boxes. Don’t even get me started on storage. I’ve yet to see SSD or flash storage at a startup. I currently work at MyYearbook.com, and there are some pretty hefty database servers there, but it can hardly be called a startup anymore. Hell, they’re even profitable! ;-)

Where Do I Start?

One nice thing about relationland is I know the landscape pretty well. Going to NoSQL is like dropping me in a country I’ve never heard of where I don’t really speak the language. I have some familiarity with key-value stores from dealing with BDB and Memcache, and I’ve played with MongoDB a bit (using pymongo), but that’s just the tip of the iceberg.

I heard my boss mention Tokyo Tyrant a few times, so I looked into it. It seems to be one of the more obscure solutions out there from the standpoint of adoption, community, documentation, etc., but it does appear to be very capable on a technical level. However, my application is going to be number-heavy, and I’m not going to need to own all of the data required to provide the service. I can probably get away with just incrementing counters in Memcache for some of this work. For persistence I need something that will let me do aggregation *FAST* without having to create aggregation tables, ideally. Using a key/value store for counters really just seems like a no-brainer.
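As a rough sketch of the counter idea, here is what bumping per-key counters might look like. A plain dict stands in for Memcache, and the key layout and the bump() helper are made up for illustration; neither comes from a real client library.

```python
from collections import defaultdict

# Stand-in for a key/value store such as Memcache; a real client would
# replace this dict. The "user:metric" key layout is purely hypothetical.
counters = defaultdict(int)

def bump(user, metric, amount=1):
    """Increment one counter and return its new value."""
    key = "%s:%s" % (user, metric)
    counters[key] += amount
    return counters[key]

bump("m0j0", "mentions")
bump("m0j0", "mentions")
bump("m0j0", "retweets", 3)
```

Reading a number back is then a single key lookup, which is the appeal: no aggregation table, just a key per thing being counted.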

That said, I think what I’ve decided to do, since it doesn’t matter, is punt on this decision in favor of getting a working application up quickly.

MySQL

Yup. I’m going to pick one or two features of the application to implement as a ‘first cut’, and back them with a MySQL database. I know it well, Tornado has a built-in interface for it, and it’s not going to be a permanent part of the infrastructure (otherwise I’d choose PostgreSQL in all likelihood).

To be honest, I don’t think the challenges in bringing this application to life are really related to the data model or the engine/interface used to access it (though if I’m lucky that’ll be a major part of keeping it alive). No, the real problem I’m faced with is completely unrelated to these considerations…

Twitter’s API Service

Not the API itself, per se, but the service providing access to it, and the way it’s administered, is going to be a huge challenge. It’s not just the Twitter website that’s inconsistent; the API service goes right along with it. Not only that, but the type of data I really need to make this application useful isn’t immediately available from the API as far as I can tell.

Twitter maintains rate limits on the API. You can only make so many calls over so short a period of time. That alone makes providing an application like this to a lot of people a bit of a challenge. Compounding the issue is that, when there are failwhales washing up on the shores, those limits can be dynamically decreased. Ugh.
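One way to live with a limit that can shrink underneath you is to keep a local budget of calls and check it before every request. The sketch below is only that, a sketch: CallBudget and its method names are hypothetical, the numbers are placeholders rather than Twitter's real limits, and it assumes the limit applies to a fixed window of time.

```python
import time

class CallBudget(object):
    """Tracks how many API calls remain in the current window.

    Hypothetical helper: the limit and window are whatever the service
    reports, and the limit may be lowered at any time.
    """

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def set_limit(self, new_limit):
        # The service can shrink the limit mid-window (e.g. under load).
        self.limit = new_limit

    def try_acquire(self):
        """Return True if a call may be made now, False if we must wait."""
        now = self.clock()
        if now - self.window_start >= self.window:
            # A new window has begun, so the count starts over.
            self.window_start = now
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True

# Placeholder numbers, not the real limits.
budget = CallBudget(limit=3, window_seconds=60)
```

The application would call budget.try_acquire() before each request and queue or skip the work when it returns False, and call set_limit() whenever the service reports a lower ceiling.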

I guess it’s not a project for the faint of heart, but it’ll drive home some golden rules that are easy to neglect in other projects, like planning for failure (of both my application, and Twitter). Also, it’ll be a lot of fun.

Why Open Shop In California?

By m0j0, June 3, 2010 7:44 am

DISCLAIMER: I live on the East Coast, so these are perceptions and opinions that I don’t put forth as facts. I’m more asking a question to start a dialog than professing knowledge.

So, I just heard a report claiming that there are more IT jobs than techs to fill them in Southern California. Anyone who ever reads a tech job board and/or TechCrunch has also no doubt taken note that a vast majority of startups seem to be starting up there, and that there are just a metric asston of jobs there anyway.

This boggles my mind. This is a place with an extremely high cost of living, making labor more expensive. At the same time, aren’t there rolling power outages in CA? Does that not affect corporations or something? Do they just move their datacenters across the border to another state?

Between what I would think is an amazingly high labor cost and what I would think is an unfavorable place in terms of simple things like availability of power, I would think more places would look elsewhere for expansion or startups.

I live within spitting distance of at least 5 universities with engineering departments that I think would rate at the very least “solid”, many would rate better. I would guess that I could get to any Ivy League school in 6 hours or less, driving (3 are within an hour of my NJ home). MIT and Stevens are very good non-Ivy schools, and lots of other ones like Rutgers, NJIT, Penn State, NYU, and lots more are here, and those are just a few of the ones between NYC and Philadelphia, which are less than 2 hours apart. So…. there’s a labor pool here.

Is it tax breaks? Some aspect of the political atmosphere? Transportation? Is San Francisco such a clean, safe, friendly city that you just deal with the nonsense to live there?

What’s your take on this?

Python IDE Frustration

By m0j0, May 13, 2010 9:22 pm

I didn’t think I was looking for a lot in an IDE. Turns out what I want is impossibly hard to find.

In the past 6 months I’ve tried (or tried to try):

  • Komodo Edit
  • Eclipse w/ PyDev
  • PyCharm (from the first EAP build to… yesterday)
  • Wingware
  • Textmate

Wingware

First, let’s get Wingware out of the way. I’m on a Mac, and if you’re not going to develop for the Mac, I’m not going to pay you hundreds of dollars for your product. Period. I don’t even use free software that requires X11. Lemme know when you figure out that coders like Macs and I’ll try Wingware.

Komodo Edit

Well, I wanted to try the IDE but I downloaded it, launched it once for 5 minutes (maybe less), forgot about it, and now my trial is over. I’ll email sales about this tomorrow. In the meantime, I use Komodo Edit.

Komodo Edit is pretty nice. One thing I like about it is that it doesn’t really go overboard forcing its world view down my throat. If I’m working on bunny, which is a one-file Python project I keep in a git repository, I don’t have to figure out their system for managing projects. I can just “Open File” and use it as a text editor.

It has “ok” support for Vi key bindings, and it’s not a plugin: it’s built in. The support has some annoying limitations, but for about 85% of what I need it to do it’s fine. One big annoyance is that I can’t write out a file and assign it a name (e.g. ‘:w /some/filename.txt’). It’s not supported.

Komodo Edit, unless I missed it, doesn’t integrate with Git, and doesn’t offer a Python console. Its capabilities in the area of collaboration in general are weak. I don’t absolutely have to have them, but things like that are nice for keeping focused and not having to switch away from the window to do anything else, so ideally I could get an IDE that has this. I believe Komodo IDE has these things, so I’m looking forward to trying it out.

Komodo is pretty quick compared to most IDEs, and has always been rock solid stable for me on both Mac and Linux, so if I’m not in the mood to use Vim, or I need to work on lots of files at once, Komodo Edit is currently my ‘go-to’ IDE.

PyCharm

PyCharm doesn’t have an officially supported release. I’ve been using the Early Access Program (EAP) builds since the first one, though. When it’s finally stable I’m definitely going to revisit it, because to be honest… it’s kinda dreamy.

Git integration is very good. I used it with GitHub without incident for some time, but these are early adopter releases, and things happen: two separate EAP releases of PyCharm made my project files completely disappear without warning, error, or any indication that anything was wrong at all. Of course, this is git, so running ‘git checkout -f’ brought things back just fine, but it’s unsettling, so now I’m just waiting for the EAP to be over with and I’ll check it out when it’s done.

I think for the most part, PyCharm nails it. This is the IDE I want to be using assuming the stability issues are worked out (and I don’t have reason to believe they won’t be). It gives me a Python console, VCS integration, a good class and project browser, some nice code analytics, and more complex syntax checking that “just works” than I’ve seen elsewhere. It’s a pretty handsome, very intuitive IDE, and it leverages an underlying platform whose plugins are available to PyCharm users as well, so my Vim keys are there (and, by the way, the IDEAVim plugin is the most advanced Vim support I’ve seen in any IDE, hands down).

Eclipse with PyDev

One thing I learned from using PyCharm and Eclipse is that where tools like this are concerned, I really prefer a specialized tool to a generic one with plugins layered on to provide the necessary functionality. Eclipse with PyDev really feels to me like a Java IDE that you have to spend time laboriously chiseling, drilling, and hammering to get it to do what you need if you’re not a Java developer. The configuration is extremely unintuitive, with a profuse array of dialogs, menus, options, options about options and menus, menus about menus and options… it never seems to end.

All told, I’ve probably spent the equivalent of 2 working days mucking with Eclipse configuration, and I’ve only been able to get it “pretty close” to where I want it. The Java-loving underpinnings of the Eclipse platform simply cannot be suppressed, while things I had to layer on with plugins don’t show up in the expected places.

Add to this Eclipse’s world-view, which reads something like “there is no filesystem tree: only projects”, and you have a really damned annoying IDE. I’ve tried on and off for over a year to make friends with Eclipse because of the good things I hear about PyDev, but it just feels like a big hacky, duct-taped mess to me, and if PyCharm has proven anything to me, it’s that building a language specific IDE on an underlying platform devoted to Java doesn’t have to be like this. When I finally got it to some kind of usable point, and after going through the “fonts and colors” maze, it turns out the syntax highlighting isn’t really all that great!

A quick word about Vi key bindings in Eclipse: it’s not a pretty picture, but the best I’ve been able to find is a free tool called Vrapper. It’s not bad. I could get by with Vrapper, but I don’t believe it’s as mature and evolved as the IDEAVim plugin in PyCharm.

So, I’ll probably turn back to Eclipse for Java development (I’m planning on taking on a personal Android project), but I think I’ve given up on it for anything not Java-related.

Vim

Vim is technically ‘just an editor’, but it has some nice benefits, and with the right plugins, it can technically do all of the things a fancy IDE can. I use the taglist plugin to provide the project and class browser functionality, and the kicker here is that you can actually switch to the browser pane, type ‘/’ and the object or member you’re looking for, and jump to it in a flash. It’s also the most complete Vim key binding implementation available ;-)

The big win for me in using Vim though is remote work. Though I’d rather do all of my coding locally, there are times when I really have to write code on remote machines, and I don’t want to go through the rigmarole of coding, pushing my changes, going to my terminal, pulling down the changes, testing, failing, fixing the code on my machine, pushing my changes, pulling my changes… ugh.

So why not just use Vim? I could do it. I’ve been using Vim for many years and am pretty good with it, but I just feel like separating my coding from my terminal whenever I can is a good thing. I don’t want my code to look like my terminal, nor do I want my terminal to look like my IDE theme. I’m SUPER picky about fonts and colors in my IDE, and I’m not that picky about them in my terminal. I also want the option of using my mouse while I’m coding, mostly to scroll, and getting that to work on a Mac in Terminal.app isn’t as simple as you might expect (and I’m not a fan of iTerm… and its ability to do this comes at a cost as well).

MacVim is nice, solves the separation of Terminal and IDE, and I might give it a more serious try, but let’s face it, it’s just not an IDE. Code completion is still going to be mediocre, the interface is still going to be terminal-ish… I just don’t know. One thing I really love though is the taglist plugin. I think if I could just find a way to embed a Python console along the bottom of MacVim I might be sold.

One thing I absolutely love about Vim, the thing that Vim gets right that none of the IDEs get is colorschemes: MacVim comes with like 20 or 30 colorschemes! And you can download more on the ‘net! The other IDEs must lump colorscheme information into the general preferences or something, because you can’t just download a colorscheme as far as I’ve seen. The IDE with the worst color/font configuration? Eclipse – the one all my Python brethren seem to rave about. That is so frustrating. Some day I’ll make it to PyCon and someone will show me the kool-aid I guess.

The Frustrating Conclusion

PyCharm isn’t soup yet, Wingware is all but ignoring the Mac platform, Eclipse is completely wrong for my brain and I don’t know how anyone uses it for Python development, Komodo Edit is rock solid but lacking features, and Komodo IDE is fairly pricey and a 30-day trial is always just really annoying (and I kinda doubt it beats PyCharm for Python-specific development). MacVim is a stand-in for a real IDE and it does the job, but I really want more… integration! I also don’t like maintaining the plugins and colorschemes and *rc files and ctags, and having to understand its language and all that.

I don’t cover them here, but I’ve tried a bunch of the Linux-specific Python IDEs as well, and I didn’t like a single one of them at all. At some point I’ll spend more time with those tools to see if I missed something crucial that, once learned, might make it hug my brain like a warm blanket (and make me consider running Linux on my desktop again, something I haven’t done on a regular ongoing basis in about 4 years).

So… I don’t really have an IDE yet. I *did* however just realize that the laptop I’m typing on right now has never had a Komodo IDE install, so I’m off to test it now. Wish me luck!

Per-machine Bash History

By m0j0, May 10, 2010 2:12 pm

I do work on a lot of machines no matter what environment I’m working in, and a lot of the time each machine has a specific purpose. One thing that really annoys me when I work in an environment with NFS-mounted home directories is that if I log into a machine I haven’t used in some time, none of the history specific to that machine is around anymore.

If I had a separate ~/.bash_history file on each machine, this would likely solve the problem. It’s pretty simple to do as it turns out. Just add the following lines to ~/.bashrc:

# Give each host its own history file under the (shared) home directory.
srvr=$(hostname)
export HISTFILE="${HOME}/.bash_history_${srvr}"

Don’t be alarmed when you source ~/.bashrc and you don’t see the file appear in your home directory. Unless you’ve configured things otherwise, history is only written at the end of a bash session. So go ahead and source bashrc, run a few commands, end your session, log back in, and the file should be there.

I’m not actually sure if this is going to be a great idea for everyone. If you work in an environment where you run the same commands from machine to machine, it might be better to just leave things alone. For me, I’m running different psql/mysql connection commands and stuff like that which differ depending on the machine I’m on and the connection perms it has.

PyTPMOTW: py-amqplib

By m0j0, April 3, 2010 9:56 pm

What’s This Module For?

To interact with a queue broker implementing version 0.8 of the Advanced Message Queuing Protocol (AMQP) standard. Copies of various versions of the specification can be found here. At time of writing, 0.10 is the latest version of the spec, but it seems that many popular implementations used in production environments today are still using 0.8, presumably awaiting a finalization of v.1.0 of the spec, which is a work in progress.

What is AMQP?

AMQP is a queuing/messaging protocol that is implemented by server daemons (called ‘brokers’) like RabbitMQ, ActiveMQ, Apache Qpid, Red Hat Enterprise MRG, and OpenAMQ. Though messaging protocols used in the enterprise are historically proprietary, AMQP has a bold and vocal stance that AMQP will be:

  • Broadly applicable for enterprise use
  • Totally open
  • Platform agnostic
  • Interoperable

The working group consists of several huge enterprises who have a vested interest in a protocol that meets these requirements. Most are either huge enterprises who are (or were) a victim of the proprietary lock-in that came with what will now likely become ‘legacy’ protocols, or implementers of the protocols, who will sell products and services around their implementation. Here’s a brief list of those involved in the AMQP working group:

  • JPMorgan Chase (the initial developers of the protocol, along with iMatix)
  • Goldman Sachs
  • Red Hat Software
  • Cisco Systems
  • Novell

Message brokers can facilitate an awfully large amount of flexibility in an architecture. They can be used to integrate applications across platforms and languages, enable asynchronous operations for web front ends, and modularize and more easily distribute complex processing operations.

Basic Publishing

The first thing to know is that when you code against an AMQP broker, you’re dealing with a hierarchy: a ‘vhost’ contains one or more ‘exchanges’ which themselves can be bound to one or more ‘queues’. Here’s how you can programmatically create an exchange and queue, bind them together, and publish a message:

from amqplib import client_0_8 as amqp

conn = amqp.Connection(userid='guest', password='guest', host='localhost', virtual_host='/', ssl=False)

# Create a channel object, queue, exchange, and binding.
chan = conn.channel()
chan.queue_declare('myqueue', durable=True)
chan.exchange_declare('myexchange', type='direct', durable=True)
chan.queue_bind('myqueue', 'myexchange', routing_key='myq.myx')

# Create an AMQP message object

msg = amqp.Message('This is a test message')
chan.basic_publish(msg, 'myexchange', 'myq.myx')

As far as we know, we have one exchange and one queue on our server right now, and if that’s the case, then technically the routing key I’ve used isn’t required. However, I strongly suggest that you always use a routing key to avoid really odd (and implementation-specific) behavior like getting multiple copies of a message on the consumer side of the equation, or getting odd exceptions from the server. The routing key can be arbitrary text like I’ve used above, or you can use a common formula of using ‘.’ as your routing key. Just remember that without the routing key, the minute more than one queue is bound to an exchange, the exchange has no way of knowing which queue to route a message to. Remember: you don’t publish to a queue, you publish to an exchange and tell it which queue it goes in via the routing key.
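To make the "publish to an exchange, not a queue" point concrete, here is a toy model of a direct exchange in plain Python. It needs no broker and is not how amqplib or any real broker is implemented; DirectExchange and the second queue are invented for illustration, and plain lists stand in for queues.

```python
class DirectExchange(object):
    """Toy model of a 'direct' exchange; not how a real broker is built."""

    def __init__(self):
        # Maps a routing key to the list of queues bound with that key.
        self.bindings = {}

    def bind(self, queue, routing_key):
        self.bindings.setdefault(routing_key, []).append(queue)

    def publish(self, body, routing_key):
        # Every queue bound with a matching key gets its own copy;
        # queues bound with other keys never see the message.
        for queue in self.bindings.get(routing_key, []):
            queue.append(body)

myqueue, otherqueue = [], []
exchange = DirectExchange()
exchange.bind(myqueue, 'myq.myx')
exchange.bind(otherqueue, 'other.key')
exchange.publish('This is a test message', 'myq.myx')
```

After the publish, only myqueue holds the message. The publisher never named a queue; the binding and the routing key did the work.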

Basic Consumption

Now that we’ve published a message, how do we get our hands on it? There are two methods: basic_get, which will ‘get’ a single message from the queue, or ‘basic_consume’, which technically doesn’t get *any* messages: it registers a handler with the server and tells it to send messages along as they arrive, which is great for high-volume messaging operations.

Here’s the ‘basic_get’ version of a client to grab the message we just published:

msg = chan.basic_get(queue='myqueue', no_ack=False)
chan.basic_ack(msg.delivery_tag)

In the above, I’ve used the same channel I used to publish the message to get it back again using the basic_get operation. I then acknowledged receipt of the message by sending the server a ‘basic_ack’, passing along the delivery_tag the server included as part of the incoming message.
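One caveat with the snippet above: if the queue happens to be empty, there is no message to acknowledge. A slightly more defensive version might look like the following, where fetch_one is a hypothetical helper (not part of amqplib) and the None check rests on the assumption that basic_get returns None when nothing is waiting.

```python
def fetch_one(chan, queue):
    """Get and acknowledge a single message.

    Returns the message body, or None if nothing was waiting.
    Hypothetical helper, not part of amqplib.
    """
    msg = chan.basic_get(queue=queue, no_ack=False)
    if msg is None:
        # Assumption: an empty queue yields None rather than a Message.
        return None
    chan.basic_ack(msg.delivery_tag)
    return msg.body
```

With the channel from earlier, usage would be along the lines of body = fetch_one(chan, 'myqueue').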

Consuming Mass Quantities

Using basic_consume takes a little more thought than basic_get, because basic_consume does nothing more than register a method with the server to tell it to start sending messages down the pipe. Once that’s done, however, it’s up to you to do a chan.wait() to wait for messages to show up, and find some elegant way of breaking out of this wait() operation. I’ve seen and used different techniques myself, and the right thing will depend on the application.

The basic_consume method also requires a callback method which is called for each incoming message, and is passed the amqp.Message object when it arrives.

Here’s a bit of code that defines a callback method, calls basic_consume, and does a chan.wait():

consumer_tag = 'foo'
def process(msg):
   txt = msg.body
   if '-1' in txt:
      print 'Got -1'
      chan.basic_cancel(consumer_tag)
      chan.close()
   else:
      print 'Got message!'

chan.basic_consume('messages', callback=process, consumer_tag=consumer_tag)
while True:
   print 'Message processed. Next?'
   try:
      chan.wait()
   except IOError as out:
      print "Got an IOError: %s" % out
      break
   if not chan.is_open:
      print "Done processing. Later"
      break

So, basic_consume tells the server ‘Start sending any and all messages!’. The server registers a method with a name given by the consumer_tag argument, or it assigns one and it becomes the return value of basic_consume(). I define one here because I don’t want to run into race conditions where I want to call basic_cancel() with a consumer_tag variable that doesn’t exist yet, or is out of scope, or whatever. In the callback, I look for a sentinel message whose body contains ‘-1’, and at that point I call basic_cancel (passing in the consumer_tag so the server knows who to stop sending messages to), and I close the channel. In the ‘while True’, the channel object checks its status and exits if it’s not open.

The above example starts to uncover some issues with py-amqplib. It’s not clear how errors coming back from the server are handled, as opposed to errors caused by the processing code, for example. It’s also a little clumsy trying to determine the logic for breaking out of the loop. In this case there’s a sentinel message sent to the queue representing the final message on the stack, at which point our ‘process()’ callback closes the channel, but then the channel has to check its own status to move forward. Just returning False from process() doesn’t break out of the while loop, because it’s not looking for a return value from that function. We could have our process() function raise an error of its own as well, which might be a bit more elegant, if also a bit more work.
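For the curious, the "raise an error of its own" variant could be sketched roughly like this. StopConsuming and consume() are made-up names, and the whole thing assumes that an exception raised inside the callback propagates out of chan.wait(), which is worth verifying against the amqplib version in use.

```python
class StopConsuming(Exception):
    """Raised by the callback to signal that the sentinel has arrived."""

def process(msg):
    if '-1' in msg.body:
        raise StopConsuming()
    print('Got message!')

def consume(chan, queue, consumer_tag='foo'):
    """Wait for messages until the callback raises StopConsuming."""
    chan.basic_consume(queue, callback=process, consumer_tag=consumer_tag)
    try:
        while True:
            chan.wait()
    except StopConsuming:
        # Sentinel seen: stop the subscription, then clean up.
        chan.basic_cancel(consumer_tag)
        chan.close()
```

The loop no longer has to inspect the channel's state after every message; the only way out is the exception, and all of the teardown lives in one place.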

Moving Ahead

What I’ve covered here actually covers perhaps 90% of the common cases for amqplib, but there’s plenty more you can do with it. There are various exchange types, including fanout exchanges and topic exchanges, which can facilitate more interesting messaging and pub/sub models. To learn more about them, here are a couple of places to go for information:

Broadcasting your logs with RabbitMQ and Python
Rabbits and Warrens
RabbitMQ FAQ section “Messaging Concepts: Exchanges”

Quick Loghetti Update

By m0j0, March 15, 2010 7:23 pm

For the familiar and impatient: Loghetti has moved to github and has been updated. An official release hasn’t been made yet, but cloning the repository and installing argparse will result in perfectly usable code. More on the way.

For the uninitiated, Loghetti is a command line log sifting/reporting tool written in Python to parse Apache Combined Format log files. It was initially released in late 2008 on Google Code. I used loghetti for my own work, which involved sifting log files with tens of millions of lines. Needless to say, it needed to be reasonably fast, and give me a decent amount of control over the data returned. It also had to be easy to use; just because it’s fast doesn’t mean I want to retype my command because of confusing options or the like.

So, loghetti is reasonably fast, and reasonably easy, and gives a reasonable amount of control to the end user. It’s certainly a heckuva lot easier than writing regular expressions into ‘grep’ and doing the ol’ ‘press & pray’.

Loghetti suffered a bit over the last several months because one of its dependencies broke backward compatibility with earlier releases. Such is the nature of development. Last night I finally got to crack open the code for loghetti again, and was able to put a solution together in an hour or so, which surprised me.

I was able to completely replace Doug Hellmann’s CommandLineApp with argparse very, very quickly. Of course, CommandLineApp was taking on responsibility for actually running the app itself (the main loghetti class was a subclass of CommandLineApp), and was dealing with the options, error handling, and all that jazz. It’s also wonderfully generic, and is written so that pretty much any app, regardless of the type of options it takes, could run as a CommandLineApp.

argparse was not a fast friend of mine. I stumbled a little over whether I should just update the namespace of my main class via argparse, or if I should pass in the Namespace object, or… something else. Eventually, I got what I needed, and not much more.
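To illustrate the "pass in the Namespace object" option, here is a minimal sketch. The option names and the LogSifter class are invented for the example; they are not loghetti's actual interface.

```python
import argparse

def build_parser():
    # These option names are invented for illustration; they are not
    # loghetti's actual command line interface.
    parser = argparse.ArgumentParser(description='Sift a log file.')
    parser.add_argument('logfile')
    parser.add_argument('--code', type=int, help='HTTP status to match')
    parser.add_argument('--count', action='store_true')
    return parser

class LogSifter(object):
    def __init__(self, options):
        # Keep the Namespace as-is rather than copying it into self.__dict__.
        self.options = options

    def describe(self):
        return '%s code=%s count=%s' % (
            self.options.logfile, self.options.code, self.options.count)

options = build_parser().parse_args(['access.log', '--code', '404'])
app = LogSifter(options)
print(app.describe())
```

Holding on to the Namespace keeps option names from colliding with the class's own attributes, at the cost of an extra ‘.options’ on every lookup.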

So loghetti now requires argparse, which is not part of the standard library, so why replace what I knew with some other (foreign) library? Because argparse is, as I understand it, slated for inclusion in Python 3, at which point optparse will be deprecated.

So, head on over to the GitHub repo, give it a spin, and send your pull requests and patches. Let the games begin!

CodeKata 4: Data Munging

By m0j0, December 28, 2009 11:10 pm

I’m continuing to take on the items in Dave Thomas’s Code Kata collection. It’s a nice way to spend a Sunday night, and it’s a good way to get my brain going again before work on Monday morning. It’s also fun and educational :)

CodeKata 4 is called “Data Munging”. It’s not very difficult data munging, really. I think the more interesting bit of the Kata is how to deal with files in different languages, what tools are well suited to the task, and trying to minimize code duplication.

Description

Code Kata 4 instructs us to download weather.dat, a listing of weather information for each day of June, 2002 in Morristown, NJ. We’re to write a program that identifies the day which has the smallest gap between min and max temperatures.

Then, it says to download football.dat, a listing of season statistics for Soccer teams. We’re to write a program that identifies the team with the smallest gap between goals scored for the team, and goals scored against the team.

Once those are done, we’re asked to try to factor out as much duplicate code as possible between the two programs, and then we’re asked a few questions to help us think more deeply about what just transpired.

My Attack Plan

The first thing that jumped into my brain when I saw the data files was “Awk would be perfect for this”. I fiddled with that for a little too long (my awk is a little rusty), and came up with this (for weather.dat):

>$ awk 'BEGIN {min=100000}; $1 ~ /^[0-9]/ {x[$1]=$2-$3; if (x[$1]<min){ min=x[$1]; winner=$1}} END {print winner, min} ' weather.dat
14 2

It works, and awk, though it gets ugly to some, reads in a nice, linear way to me. You give it a filter expression, and then statements to act on the matching lines (in braces). What could be simpler?

After proving to myself that I hadn’t completely lost my awk-fu, I went about writing a Python script to deal with the problem. I read ahead in the problem description, though, and so my script contains separate blocks for the two data sets in one script:

#!/usr/bin/env python
import sys
import string

data = open(sys.argv[1], 'r').readlines()
data.sort()

if 'weather' in sys.argv[1]:
   winner = 1000000
   winnerday = None

   for line in data:
      #filter out lines that aren't numbered.
      if line.strip().startswith(tuple(string.digits)):
         # we only need the first three fields to do our work
         l = line.split()[:3]
         # some temps have asterisks attached to them.
         maxt = l[1].strip(string.punctuation)
         mint = l[2].strip(string.punctuation)
         diff = int(maxt) - int(mint)
         if diff < winner:
            winner = diff
            winnerday = l[0]
   print "On day %s, the temp difference was only %d degrees!" % (winnerday, winner)

if 'football' in sys.argv[1]:
   winner = 1000000
   winnerteam = None

   for line in data:
      if line.strip().startswith(tuple(string.digits)):
         l = line.split()
         team, f, a = l[1], int(l[6]), int(l[8])
         diff = abs(f - a)
         if diff < winner:
            winner = diff
            winnerteam = team
   print "Team %s had a for/against gap of only %d points!" % (winnerteam, winner)

Really, the logic employed is not much different from the awk solution:

  1. Set a default for ‘winner’ that’s unlikely to be rivaled by the data :)
  2. Set the default for the winning team or day to ‘None’
  3. Filter out unwanted lines in the dataset.
  4. Grab bits of each line that are useful.
  5. Assign each useful bit to a variable.
  6. Do math.
  7. Do comparisons
  8. Present results.

Refactoring

Part 2 of the Kata says to factor out as much duplicate code as possible. I was able to factor out almost all of it on the first shot at refactoring, leaving only the line of code (per file) to identify the relevant columns of each data set:

#!/usr/bin/env python
import sys
import string

data = open(sys.argv[1], 'r').readlines()
data.sort()
winner_val = 1000000
winner_id = None

for line in data:
   if line.strip().startswith(tuple(string.digits)):
      l = line.split()
      if 'weather' in sys.argv[1]:
         identifier, minuend, subtrahend = l[0], int(l[1].strip(string.punctuation)), int(l[2].strip(string.punctuation))
      elif 'football' in sys.argv[1]:
         identifier, minuend, subtrahend = l[1], int(l[6]), int(l[8])
      diff = abs(minuend - subtrahend)
      if diff < winner_val:
         winner_val = diff
         winner_id = identifier

print winner_id, winner_val

Not too bad. I could’ve done some other things to make things work differently: for example I could’ve let the user feed in the column header names of the ‘identifier’, ‘minuend’ and ‘subtrahend’ columns in each data set, and then I could just *not* parse out the header line and instead use it to identify the list index positions of the bits I need for each line. It’d make the whole thing ‘just work’. It would also require more effort from the user. On the other hand, it would make things “just work” for just about any file with numbered lines of columnar data.
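Here’s a rough sketch of a simpler cousin of that idea: instead of header names, the caller passes the column positions. The function name ‘smallest_spread’ and the sample rows are made up for illustration (they only mimic the shape of the two data files), and it’s written for Python 3, unlike the scripts above:

```python
import string


def smallest_spread(lines, id_col, a_col, b_col):
    """Return (identifier, difference) for the numbered row with the smallest gap."""
    best_id, best_diff = None, None
    for line in lines:
        fields = line.split()
        # Skip blank lines, headers and separators: only numbered rows count.
        if not fields or not fields[0][0].isdigit():
            continue
        # Some values carry a trailing asterisk, so strip punctuation first.
        a = int(fields[a_col].strip(string.punctuation))
        b = int(fields[b_col].strip(string.punctuation))
        diff = abs(a - b)
        if best_diff is None or diff < best_diff:
            best_id, best_diff = fields[id_col], diff
    return best_id, best_diff


# Made-up rows shaped like the weather file: day, max temp, min temp.
sample_weather = [
    "  Dy MxT   MnT",
    "   1  88    59",
    "   2  79    63",
    "   3  77*   75",
]

# Made-up rows shaped like the football file: goals for/against in columns 6 and 8.
sample_football = [
    "       Team            P     W    L   D    F      A     Pts",
    "    1. Alpha           38    26   9   3    79  -  36    87",
    "   -------------------------------------------------------",
    "    2. Beta            38    24   8   6    67  -  60    80",
]

print(smallest_spread(sample_weather, 0, 1, 2))
print(smallest_spread(sample_football, 1, 6, 8))
```

One caveat: stripping punctuation would also eat a leading minus sign, so this only suits data without negative values.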

I have to admit that the minute I see columnar data like this, awk is the first thing I reach for, so I’m sure this affected my Python solution. The good news there is that my thinking toward columnar data is consistent, and so I treated both files pretty much the same way, making refactoring a 5-minute process.

In all, I enjoyed this Kata. Though I didn’t take it as far as I could have, it did make me think about how it could be improved and made more generic. Those improvements could incur a cost in terms of readability I suppose, but I think for this example it wouldn’t be a problem. I’m working on a larger project now where I have precisely this issue of flexibility vs. readability though.

I’m reengineering a rather gangly application to enable things like pluggable… everything. It talks to networked queues, so the protocol is pluggable. It talks to databases, so the database back end is pluggable, in addition to the actual data processing routines. Enabling this level of flexibility introduces some complexity, and really requires good documentation if we reasonably expect people to work with our code (the project should be released as open source in the coming weeks). Without the documentation, I’m sure I’d have trouble maintaining the code myself!

PyYaml with Aliases and Anchors

By m0j0, December 22, 2009 8:10 am

I didn’t know this little tidbit until yesterday and want to get it posted so I can refer to it later.

I have this YAML config file that’s kinda long and has a lot of duplication in it. This isn’t what I’m working on, but let’s just say that you have a bunch of backup targets defined in your YAML config file, and your program rocks because each backup target can be defined to go to a different destination. Awesome, right?

Well, it might be, but it might also just make your YAML config file grotesque (and error-prone). Here’s an example:

Backups:
    Home_Jonesy:
        host: foo
        dir: /Users/jonesy
        protocol: ssh
        keyloc: ~/.ssh/id_rsa.pub
        Destination:
            host: bar
            dir: /mnt/array23/homes/jonesy
            check_space: true
            min_space: 80G
            num_archives: 4
            compress: bzip2
    Home_Molly:
        host: eggs
        dir: /Users/molly
        protocol: sftp
        keyloc: ~/.ssh/id_rsa.pub
        Destination:
            host: bar
            dir: /mnt/array23/homes/jonesy
            check_space: true
            min_space: 80G
            num_archives: 4
            compress: bzip2

Now with two backups, this isn’t so bad. But if your environment has 100 backup targets and only one destination, or, heck, even if there are three destinations, should you have to write out the definition of those same three destinations for each of 100 backup targets? What if you need to change how one of the destinations is connected to, or the name of a destination changes, or array23 dies?

Ideally, you’d be able to reference the same definition in as many places as you need it and have things “just work”, and if something needs to change, you just change it in one place. Enter anchors and aliases.

An anchor is defined just like anything else in YAML with the exception that you get to label the definition block using “&labelname”, and then you can (de)reference it elsewhere in your config with “*labelname”. So here’s how our above configuration would look:

BackupDestination-23: &Backup_To_ARRAY23
    host: bar
    dir: /mnt/array23/homes/jonesy
    check_space: true
    min_space: 80G
    num_archives: 4
    compress: bzip2
Backups:
    Home_Jonesy:
        host: foo
        dir: /Users/jonesy
        protocol: ssh
        keyloc: ~/.ssh/id_rsa.pub
        Destination: *Backup_To_ARRAY23
    Home_Molly:
        host: eggs
        dir: /Users/molly
        protocol: sftp
        keyloc: ~/.ssh/id_rsa.pub
        Destination: *Backup_To_ARRAY23

With only two backup targets, the benefit is small, but keep trying to imagine this config file with about 100 backup targets, and only one or two destinations. This removes a lot of duplication and makes things easier to change and maintain (and read!)

The cool thing about it is that if you already have code that reads the YAML config file, you don’t have to change it at all — PyYaml expands everything for you. Here’s a quick interpreter session:

>>> import yaml
>>> from pprint import pprint
>>> stream = open('foo.yaml', 'r')
>>> cfg = yaml.safe_load(stream)
>>> pprint(cfg)
{'BackupDestination-23': {'check_space': True,
                          'compress': 'bzip2',
                          'dir': '/mnt/array23/homes/jonesy',
                          'host': 'bar',
                          'min_space': '80G',
                          'num_archives': 4},
 'Backups': {'Home_Jonesy': {'Destination': {'check_space': True,
                                             'compress': 'bzip2',
                                             'dir': '/mnt/array23/homes/jonesy',
                                             'host': 'bar',
                                             'min_space': '80G',
                                             'num_archives': 4},
                             'dir': '/Users/jonesy',
                             'host': 'foo',
                             'keyloc': '~/.ssh/id_rsa.pub',
                             'protocol': 'ssh'},
             'Home_Molly': {'Destination': {'check_space': True,
                                            'compress': 'bzip2',
                                            'dir': '/mnt/array23/homes/jonesy',
                                            'host': 'bar',
                                            'min_space': '80G',
                                            'num_archives': 4},
                            'dir': '/Users/molly',
                            'host': 'eggs',
                            'keyloc': '~/.ssh/id_rsa.pub',
                            'protocol': 'sftp'}}}

…And notice how everything has been expanded.

Enjoy!

Python, PostgreSQL, and psycopg2’s Dusty Corners

By m0j0, December 1, 2009 10:07 pm

Last time I wrote code with psycopg2 was around 2006, but I was reacquainted with it over the past couple of weeks, and I wanted to make some notes on a couple of features that are not well documented, imho. Portions of this post have been snipped from mailing list threads I was involved in.

Calling PostgreSQL Functions with psycopg2

So you need to call a function. Me too. I had to call a function called ‘myapp.new_user’. It expects a bunch of input arguments. Here’s my first shot after misreading some piece of some example code somewhere:

qdict = {'fname': self.fname, 'lname': self.lname, 'dob': self.dob, 'city': self.city, 'state': self.state, 'zip': self.zipcode}

sqlcall = """SELECT * FROM myapp.new_user( %(fname)s, %(lname)s,
%(dob)s, %(city)s, %(state)s, %(zip)s)""" % qdict

curs.execute(sqlcall)

There’s no reason this should work, or that anyone should expect it to work. I just wanted to include it in case someone else made the same mistake. Sure, the proper arguments are put in their proper places in ‘sqlcall’, but they’re not quoted at all.

Of course, I foolishly tried going back and putting quotes around all of those named string formatting arguments, and of course that fails when you have something like a quoted “NULL” trying to move into a date column. It has other issues too, like being error-prone and a PITA, but hey, it was pre-coffee time.

What’s needed is a solution whereby psycopg2 takes care of the formatting for us, so that strings become strings, NULLs are passed in a way that PostgreSQL recognizes them, dates are passed in the proper format, and all that jazz.
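To make the idea concrete, here’s a small sketch of driver-side parameter binding. It uses the sqlite3 module as a stand-in, purely because it ships with Python and needs no server. That’s my substitution, not psycopg2: the table and values are made up, sqlite3 spells named placeholders ‘:name’ where psycopg2 uses ‘%(name)s’, and it’s written for Python 3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (fname TEXT, lname TEXT, dob TEXT)")

# A value with an embedded quote and a missing date: both would need
# hand-rolled handling if we built the SQL with string formatting.
qdict = {"fname": "Pat", "lname": "O'Brien", "dob": None}

# The dictionary is passed as a separate argument, so the driver binds
# the values itself; nothing is spliced into the SQL text.
conn.execute(
    "INSERT INTO users (fname, lname, dob) VALUES (:fname, :lname, :dob)",
    qdict,
)

row = conn.execute("SELECT fname, lname, dob FROM users").fetchone()
print(row)
```

The apostrophe survives untouched and the Python None comes back as None, meaning it was stored as a real NULL rather than the string “NULL”.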

My next attempt looked like this:

curs.execute("""SELECT * FROM myapp.new_user( %(fname)s, %(lname)s,
%(dob)s, %(city)s, %(state)s, %(zip)s)""", qdict)

This is, according to some articles, blog posts, and at least one reply on the psycopg mailing list “the right way” to call a function using psycopg2 with PostgreSQL. I’m here to tell you that this is not correct to the best of my knowledge. The only real difference between this attempt and the last is I’ve replaced the “%” with a comma, which turns what *was* a string formatting operation into a proper SELECT with a psycopg2-recognized parameter list. I thought this would get psycopg2 to “just work”, but no such luck. I still had some quoting issues.

I have no idea where I read this little tidbit about psycopg2 being able to convert between Python and PostgreSQL data types, but I did. Right around the same time I was thinking “it’s goofy to issue a SELECT to call a function that doesn’t really want to SELECT anything. Can’t callproc() do this?” Turns out callproc() is really the right way to do this (where “right” is defined by the DB-API which is the spec for writing a Python database module). Also turns out that psycopg2 can and will do the type conversions. Properly, even (in my experience so far).

So here’s what I got to work:

callproc_params = [self.fname, self.lname, self.dob, self.city, self.state, self.zipcode]

curs.callproc('myapp.new_user', callproc_params)

This is great! Zero manual quoting or string formatting at all! And no “SELECT”. Just call the procedure and pass the parameters. The only thing I had to change in my code was to make my ‘self.dob’ into a datetime.date() object, but that’s super easy, and after that psycopg2 takes care of the type conversion from a Python date to a PostgreSQL date. Tomorrow I’m actually going to try calling callproc() with a list object inside the second argument. Wish me luck!

A quick cursor gotcha

I made a really goofy mistake. At the root of it, what I did was share a connection *and a cursor object* among all methods of a class I created to abstract database operations out of my code. So, I did something like this (this is not the exact code, and it’s untested. Treat it like pseudocode):

import psycopg2

class MyData(object):
   def __init__(self, dsn):
      self.conn = psycopg2.connect(dsn)
      # One cursor shared by every method: this is the mistake
      self.cursor = self.conn.cursor()

   def get_users_by_regdate(self, regdate, limit):
      # fetchmany() returns 'arraysize' rows per call
      self.cursor.arraysize = limit
      self.cursor.callproc('myapp.uid_by_regdate', regdate)
      while True:
         result = self.cursor.fetchmany()
         if not result:
            break
         yield result

   def user_is_subscribed(self, uid):
      # Reuses self.cursor, which discards the recordset above
      self.cursor.callproc('myapp.uid_subscribed', uid)
      result = self.cursor.fetchone()
      val = result[0]
      return val

Now, in the code that uses this class, I want to grab all of the users registered on a given date, and see if they’re subscribed to, say, a mailing list, an RSS feed, a service, or whatever. See if you can predict the issue I had when I executed this:

    db = MyData(dsn)
    idcount = 0
    skip_count = 0
    # 100 is the batch size that ends up in cursor.arraysize
    for id in db.get_users_by_regdate([joindate], 100):
        idcount += 1
        print idcount
        param = [id]
        if db.user_is_subscribed(param):
            print "User subscribed"
            skip_count += 1
            continue
        else:
            print "Not good"
            continue

Note that the above is test code. I don’t actually want to continue to the top of the loop regardless of what happens in production :)

So what I found happening is that, if I just commented out the portion of the code that makes a database call *inside* the for loop, I could print ‘idcount’ all the way up to thousands of results (however many results there were). But if I left it in, only 100 results made it to ‘db.user_is_subscribed’.

Hey, ‘100’ is what I’d set cursor.arraysize to! Hey, I’m using the *same cursor* to make both calls! And with the for loop, the cursor is being called upon to produce a second recordset while it’s still in the middle of producing the first one!

Tom Roberts, on the psycopg list, states the issue concisely:

The cursor is stateful; it only contains information about the last
query that was executed. On your first call to “fetchmany”, you fetch a
block of results from the original query, and cache them. Then,
db.user_is_subscribed calls “execute” again. The cursor now throws away all
of the information about your first query, and fetches a new set of
results. Presumably, user_is_subscribed then consumes that dataset and
returns. Now, the cursor is position at end of results. The rows you
cached get returned by your iterator, then you call fetchmany again, but
there’s nothing left to fetch…

…So, the lesson is if you need a new recordset, you create a new cursor.
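If you want to see the effect without a PostgreSQL server handy, here’s a small sketch that reproduces it. It uses sqlite3 as a stand-in (my substitution: it’s in the standard library and its cursors follow the same DB-API pattern), with a made-up ten-row table, plain SELECTs instead of stored procedures, and Python 3 syntax:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (uid INTEGER PRIMARY KEY, subscribed INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(uid, uid % 2) for uid in range(1, 11)],
)


def uids_in_batches(cursor, batch_size):
    """Yield every uid, fetching batch_size rows at a time."""
    cursor.arraysize = batch_size
    cursor.execute("SELECT uid FROM users ORDER BY uid")
    while True:
        rows = cursor.fetchmany()
        if not rows:
            break
        for (uid,) in rows:
            yield uid


def is_subscribed(cursor, uid):
    """Run a second query on whichever cursor we are handed."""
    cursor.execute("SELECT subscribed FROM users WHERE uid = ?", (uid,))
    return bool(cursor.fetchone()[0])


# Broken: both queries go through one cursor, as in my class above.
shared = conn.cursor()
seen_shared = []
for uid in uids_in_batches(shared, 3):
    is_subscribed(shared, uid)
    seen_shared.append(uid)

# Fixed: each recordset gets its own cursor.
outer, inner = conn.cursor(), conn.cursor()
seen_separate = []
for uid in uids_in_batches(outer, 3):
    is_subscribed(inner, uid)
    seen_separate.append(uid)

print(len(seen_shared), len(seen_separate))
```

The shared-cursor loop stops after the first batch of 3, because the inner query replaces the outer recordset; the two-cursor loop sees all 10 rows. In the MyData class, the equivalent fix would be to call self.conn.cursor() inside each method instead of once in __init__.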

Lesson learned. I still think it’d be nice if psycopg2 had more/better docs, though.

Python Quirks in Cmd, urllib2, and decorators

By m0j0, October 28, 2009 10:06 pm

So, if you haven’t been following along, Python programming now occupies the bulk of my work day.

While I really like writing code and like it more using Python, no language is without its quirks. Let me say up front that I don’t consider these quirks bugs or big hulking issues. I’m not trying to bash the language. I’m just trying to help folks who trip over some of these things that I found to be slightly less than obvious.

Python’s Cmd Module and Handling Arguments

Using the Python Cmd module lets you create a program that provides an interactive shell interface to your users. It’s really simple, too. You just create a class that inherits from cmd.Cmd, and define a bunch of methods named do_<something>, where <something> is the actual command your user will run in your custom shell.

So if you want users to be able to launch your app, be greeted with a prompt, type “hello”, and have something happen in response, you just define a method called “do_hello” and whatever code you put there will be run when a user types “hello” in your shell. Here’s what that would look like:

import cmd

class MyShell(cmd.Cmd):
   def do_hello(self, line):
      # Cmd always passes the rest of the input line, even when it's empty
      print "Hello!"

# Kick off the shell
shell = MyShell()
shell.cmdloop()

Of course, what’s a shell without command line options and arguments? For example, I created a shell-based app using Cmd that allowed users to run a ‘connect’ command with arguments for host, port, user, and password. Within the shell, the command would look something like this:

> connect -h mybox -u jonesy -p mypass

Note that the “>” is the prompt, not part of the command.

The idea here is that you pass the arguments to the option flags, and you can set sane defaults in the application for missing args (for example, I didn’t provide a port here — I’m leaning on a default, but I did provide a host, since the default might be ‘localhost’).

Passing just one, single-word argument with Cmd is dead easy, because all of the command methods receive a string that contains *everything* on the line after the actual command. If you’re expecting such an argument, just make sure your ‘do_something’ method accepts the incoming string. So, to let users see what “hello” looks like in Spanish, we can accept “esp” as an argument to our command:

class MyShell(cmd.Cmd):
   def do_hello(self, arg):
      print "Hello! %s" % arg

The problems come when you want more than one argument, or when you want flags with arguments. For example, in the earlier “connect” example, my “do_connect” method is still only going to get one big, long string passed to it — not a list of arguments. So where in a normal program you might do something like:

class MyShell(cmd.Cmd):
   def do_connect(self, host='localhost', port='42', user='guest', password='guest'):
      #...connection code here...
      pass

In a Cmd method, you’re just going to define it like we did the do_hello method above: it takes ‘self’ and ‘args’, where ‘args’ is one long line.

A couple of quick workarounds I’ve tried:

Parse the line yourself. I created a method in my Cmd app called ‘parseargs’ that just takes the big long line and returns a dictionary. My specific application only takes ‘name=value’ arguments, so I do this:

         d = dict([arg.split('=') for arg in args.split()])

And return the dictionary to the calling method. My connect method can then check for keys in the dictionary and set things up. It’s longer and a little more arduous, but not too bad.
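For the curious, a slightly more defensive version of that helper might look like the sketch below. The ‘defaults’ parameter and the error handling are my additions for illustration, not what my app actually does, and it’s written for Python 3:

```python
def parseargs(line, defaults=None):
    """Turn 'name=value name2=value2' into a dictionary."""
    settings = dict(defaults or {})
    for token in line.split():
        # partition() splits on the first '=' only, so values may contain '='.
        name, sep, value = token.partition('=')
        if not sep or not name:
            raise ValueError("expected name=value, got %r" % token)
        settings[name] = value
    return settings


print(parseargs("host=mybox user=jonesy", {"host": "localhost", "port": "42"}))
```

Anything the user doesn’t supply falls back to the defaults, and a stray word without an ‘=’ fails loudly instead of blowing up inside dict().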

Use optparse. You can instantiate a parser right inside your do_x methods. If you have a lot of methods that all need to take several flags and args, this could become cumbersome, but for one or two it’s not so bad. The key to doing this is creating a list from the Big Long Line and passing it to the parse_args() method of your parser object. Here’s what it looks like:

import cmd
import optparse

class MyShell(cmd.Cmd):
   def do_touch(self, line):
      parser = optparse.OptionParser()
      parser.add_option('-f', '--file', dest='fname')
      parser.add_option('-d', '--dir', dest='dir')
      (options,args) = parser.parse_args(line.split())

      print "Directory: %s" % options.dir
      print "File name: %s" % options.fname

This method is just an example, so don’t scratch your head looking for “import os” or anything :)

This is probably the more elegant solution, since it doesn’t require you to restrict your users to passing args in a particular way, and doesn’t require you to come up with fancy CLI argument parsing algorithms.
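Applied to the earlier ‘connect’ command, the same approach might look like the sketch below. A few assumptions to flag: the ‘-P’ port flag and the ‘last_connect’ attribute are invented for the example, shlex.split() stands in for line.split() so quoted values survive, and it’s written for Python 3. Note that optparse reserves ‘-h’ for help unless you turn that off:

```python
import cmd
import optparse
import shlex


class ConnectShell(cmd.Cmd):
    def do_connect(self, line):
        # optparse claims -h for --help by default; disable that so -h can mean host.
        parser = optparse.OptionParser(add_help_option=False)
        parser.add_option('-h', '--host', dest='host', default='localhost')
        parser.add_option('-P', '--port', dest='port', type='int', default=42)
        parser.add_option('-u', '--user', dest='user', default='guest')
        parser.add_option('-p', '--password', dest='password', default='guest')
        # shlex.split() honours quotes, so -p 'my pass' stays a single argument.
        options, args = parser.parse_args(shlex.split(line))
        # Stash the parsed settings; real connection code would go here.
        self.last_connect = vars(options)
        print("Connecting to %(host)s:%(port)d as %(user)s" % self.last_connect)


shell = ConnectShell()
# onecmd() runs a single command line without entering the interactive loop.
shell.onecmd("connect -h mybox -u jonesy -p 'my pass'")
```

One thing to watch: on a bad flag, optparse prints an error and calls sys.exit(), which would take your whole shell down with it, so a real app would want to catch SystemExit or override the parser’s error() method.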

Using urllib2 for Pure XML Over HTTP

I wrote a web service client this week that does pure XML over HTTP to send queries to a service. I’ve written things like this before using Python, but it turns out, after looking back at my code, I was always either using XMLRPC, SOAP, or going through some wrapper that hid a lot from me in an effort to make my life easier (like the Google Data API). I’ve never had to try to send a pure XML payload over the wire to a web server.

I figured urllib2 was going to help me out here, and it did, but not before going through some pain due mainly to an odd pattern in various sources of documentation on the topic. I read docs at python.org, effbot.org, a couple of blogs, and did a Google search, and everything, everywhere, seems to indicate that the urllib2.Request object’s optional “data” argument expects a urlencoded string. From http://docs.python.org/library/urllib2.html?highlight=urllib2.request#urllib2.Request

data should be a buffer in the standard application/x-www-form-urlencoded format

The examples on every site I’ve found always pass whatever ‘data’ is through urllib.urlencode() before adding it to the request. I figured urllib2 was no longer my friend, and almost started looking at implementing an HTTPSClient object. Instead I decided to try just passing my unencoded data. What’s it gonna do, detect that my data wasn’t urlencoded? Maybe I’d learn something.

I learned that all of the documentation fails to account for this particular edge case. Go ahead and pass whatever the heck you want in ‘data’. If it’s what the server on the other end expects, you’ll be fine. :)
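Here’s roughly what that looks like, as a sketch. It uses urllib.request, which is the Python 3 name for urllib2; the URL and the XML payload are placeholders I made up, and the Content-Type header is my assumption about what a typical XML service wants, so check what yours expects:

```python
import urllib.request

# Hypothetical endpoint and payload, for illustration only.
SERVICE_URL = "http://example.invalid/query"
payload = b'<?xml version="1.0" encoding="UTF-8"?><query><term>python</term></query>'

# No urlencode() here: the raw XML bytes go straight into 'data'.
request = urllib.request.Request(
    SERVICE_URL,
    data=payload,
    headers={"Content-Type": "text/xml; charset=utf-8"},
)

# Supplying 'data' is what turns the request into a POST.
print(request.get_method())

# Against a real service you would then send it with:
# response = urllib.request.urlopen(request)
# body = response.read()
```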

Decorators

I found myself in dark, dusty corners when I had to decide how and where inside of a much larger piece of code to implement a feature. I really wanted to use a decorator, and still think that’s what I’ll wind up doing, but then how to implement the decorator isn’t as straightforward as I’d like either.

Decorators are used to alter how a decorated function operates. They’re amazingly useful, because instead of implementing some bit of code in a bunch of methods that themselves live inside a bunch of classes across various modules, or creating an entire class or mixin to inherit from when you only need the code overhead in a couple of edge cases, you can just create a decorator and apply it only to the proper methods or functions.

The lesson I learned is to try very hard to make one solid decision about how your decorator will work up front. Will it be a class? That’s done somewhat differently than doing it with a function. Will the decorator take arguments? That’s handled differently in both implementations, and also requires changes to an existing decorator class that didn’t used to take arguments. I don’t know why I expected this to be more straightforward, but I totally did.
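To show why the choice matters, here’s a minimal sketch of the four combinations side by side. All of the names (‘traced’, ‘traced_as’, ‘Traced’, ‘TracedAs’, and the ‘calls’ list) are invented for the example, and it’s written for Python 3:

```python
import functools

calls = []


def traced(func):
    """Function decorator, no arguments: it receives the function directly."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        calls.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper


def traced_as(label):
    """Function decorator with an argument: one extra level of nesting."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            calls.append(label)
            return func(*args, **kwargs)
        return wrapper
    return decorator


class Traced(object):
    """Class decorator, no arguments: __init__ gets the function."""
    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        calls.append(self.func.__name__)
        return self.func(*args, **kwargs)


class TracedAs(object):
    """Class decorator with an argument: __init__ gets the argument,
    and __call__ gets the function and must return the wrapper."""
    def __init__(self, label):
        self.label = label

    def __call__(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            calls.append(self.label)
            return func(*args, **kwargs)
        return wrapper


@traced
def add(a, b):
    return a + b


@traced_as("multiply")
def mul(a, b):
    return a * b


@Traced
def sub(a, b):
    return a - b


@TracedAs("divide")
def div(a, b):
    return a / b


results = [add(1, 2), mul(2, 3), sub(5, 3), div(8, 2)]
print(results, calls)
```

Notice how the role of __call__ flips between the two class versions: without arguments it runs on every call of the decorated function, with arguments it runs once at decoration time. That flip is exactly the change an existing argument-less decorator class needs when it starts taking arguments. Also, the plain ‘Traced’ form won’t bind ‘self’ when applied to a method unless you add a __get__ method to it.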

If you’re new to decorators or haven’t had to dig into them too deeply, I highly recommend Bruce Eckel’s series introducing Python decorators, which walks you through all of the various ways to implement them. Part I (of 3) is here.
