Category Archives: CodeKata

Number Spiral Fun

I love puzzles, and I came across the Oblong Number Spiral challenge on Code Golf over the past week and dove in. Here’s the basic idea, from the challenge at Code Golf:

This challenge involves you having to create a number spiral such as :

 1  2  3
10 11  4
 9 12  5
 8  7  6

Note how the numbers spiral in clockwise towards the centre. Here’s a larger example :

 1  2  3  4  5
18 19 20 21  6
17 28 29 22  7
16 27 30 23  8
15 26 25 24  9
14 13 12 11 10

So, the program should take input as two numbers separated by a space, which represent the width and height of the matrix you’ll output. Then you need to determine the location in the output for each number in the matrix, so when it’s sent to the screen, it looks like the examples above.

I’ve never done a spiral matrix before, but I suspected I’d probably start with a matrix full of dummy values that I’d replace using some algorithm that would understand how to turn “turn and move in a negative direction along the x axis” into placement in the matrix. I did a little reading and found I was on the right path, so I kept on trudging along and this is what I came up with:

#!/usr/bin/env python

def number_spiral(h, w):
   # total number of elements in array
   n = w * h

   # start at top left (row 0 column 0)
   row, column = 0,0

   # first move is on the same row, to the right
   d_row, d_column = 0, 1

   # fill 2d array with dummy values we'll overwrite later
   arr = [[None ]* w for z in range(h)]

   for i in xrange(1,n+1):
      arr[row][column] = i

      # next row and column
      nrow, ncolumn = row + d_row, column + d_column

      if 0 <= nrow < h and 0 <= ncolumn < w and arr[nrow][ncolumn] == None:
         # no out of bounds accesses or overwriting already-placed elements.
         row, column = nrow, ncolumn
      else:
         # change direction
         d_row , d_column = d_column, -d_row
         row, column = row + d_row, column + d_column

   # print it out!
   for a in range(h):
      for b in range(w):
         print "%2i" % arr[a][b],
      print

if __name__ == '__main__':
   number_spiral(5, 3)

CodeKata 4: Data Munging

I’m continuing to take on the items in Dave Thomas’s Code Kata collection. It’s a nice way to spend a Sunday night, and it’s a good way to get my brain going again before work on Monday morning. It’s also fun and educational :)

CodeKata 4 is called “Data Munging”. It’s not very difficult data munging, really. I think the more interesting bit of the Kata is how to deal with files in different languages, what tools are well suited to the task, and trying to minimize code duplication.

Description

Code Kata 4 instructs us to download weather.dat, a listing of weather information for each day of June, 2002 in Morristown, NJ. We’re to write a program that identifies the day which has the smallest gap between min and max temperatures.

Then, it says to download football.dat, a listing of season statistics for Soccer teams. We’re to write a program that identifies the team with the smallest gap between goals scored for the team, and goals scored against the team.

Once those are done, we’re asked to try to factor out as much duplicate code as possible between the two programs, and then we’re asked a few questions to help us think more deeply about what just transpired.

My Attack Plan

The first thing that jumped into my brain when I saw the data files was “Awk would be perfect for this”. I fiddled with that for a little too long (my awk is a little rusty), and came up with this (for weather.dat):

>$ awk 'BEGIN {min=100000}; $1 ~ /^[1-3]/ {x[$1]=$2-$3; if (x[$1]<min){ min=x[$1]; winner=$1}} END {print winner, min} ' weather.dat.dat
14 2

It works, and awk, though it gets ugly to some, reads in a nice, linear way to me. You give it a filter expression, and then statements to act on the matching lines (in braces). What could be simpler?

After proving to myself that I hadn’t completely lost my awk-fu, I went about writing a Python script to deal with the problem. I read ahead in the problem description, though, and so my script contains separate blocks for the two data sets in one script:

#!/usr/bin/env python
import sys
import string

data = open(sys.argv[1], 'r').readlines()
data.sort()

if 'weather' in sys.argv[1]:
   winner = 1000000
   winnerday = None

   for line in data:
      #filter out lines that aren't numbered.
      if line.strip().startswith(tuple(string.digits)):
         # we only need the first three fields to do our work
         l = line.split()[:3]
         # some temps have asterisks attached to them.
         maxt = l[1].strip(string.punctuation)
         mint = l[2].strip(string.punctuation)
         diff = int(maxt) - int(mint)
         if diff < winner:
            winner = diff
            winnerday = l[0]
   print "On day %s, the temp difference was only %d degrees!" % (winnerday, winner)

if 'football' in sys.argv[1]:
   winner = 1000000
   winnerteam = None

   for line in data:
      if line.strip().startswith(tuple(string.digits)):
         l = line.split()
         team, f, a = l[1], int(l[6]), int(l[8])
         diff = abs(f - a)
         if diff < winner:
            winner = diff
            winnerteam = team
   print "Team %s had a for/against gap of only %d points!" % (winnerteam, winner)

Really, the logic employed is not much different from the awk solution:

  1. Set a default for ‘winner’ that’s unlikely to be rivaled by the data :)
  2. Set the default for the winning team or day to ‘None’
  3. Filter out unwanted lines in the dataset.
  4. Grab bits of each line that are useful.
  5. Assign each useful bit to a variable.
  6. Do math.
  7. Do comparisons
  8. Present results.

Refactoring

Part 2 of the Kata says to factor out as much duplicate code as possible. I was able to factor out almost all of it on the first shot at refactoring, leaving only the line of code (per file) to identify the relevant columns of each data set:

#!/usr/bin/env python
import sys
import string

data = open(sys.argv[1], 'r').readlines()
data.sort()
winner_val = 1000000
winner_id = None

for line in data:
   if line.strip().startswith(tuple(string.digits)):
      l = line.split()
      if 'weather' in sys.argv[1]:
         identifier, minuend, subtrahend = l[0], int(l[1].strip(string.punctuation)), int(l[2].strip(string.punctuation))
      elif 'football' in sys.argv[1]:
         identifier, minuend, subtrahend = l[1], int(l[6]), int(l[8])
      diff = abs(minuend - subtrahend)
      if diff < winner_val:
         winner_val = diff
         winner_id = identifier

print winner_id, winner_val

Not too bad. I could’ve done some other things to make things work differently: for example I could’ve let the user feed in the column header names of the ‘identifier’, ‘minuend’ and ‘subtrahend’ columns in each data set, and then I could just *not* parse out the header line and instead use it to identify the list index positions of the bits I need for each line. It’d make the whole thing ‘just work’. It would also require more effort from the user. On the other hand, it would make things “just work” for just about any file with numbered lines of columnar data.

I have to admit that the minute I see columnar data like this, awk is the first thing I reach for, so I’m sure this affected my Python solution. The good news there is that my thinking toward columnar data is consistent, and so I treated both files pretty much the same way, making refactoring a 5-minute process.

In all, I enjoyed this Kata. Though I didn’t take it as far as I could have, it did make me think about how it could be improved and made more generic. Those improvements could incur a cost in terms of readability I suppose, but I think for this example it wouldn’t be a problem. I’m working on a larger project now where I have precisely this issue of flexibility vs. readability though.

I’m reengineering a rather gangly application to enable things like pluggable… everything. It talks to networked queues, so the protocol is pluggable. It talks to databases, so the database back end is pluggable, in addition to the actual data processing routines. Enabling this level of flexibility introduces some complexity, and really requires good documentation if we reasonably expect people to work with our code (the project should be released as open source in the coming weeks). Without the documentation, I’m sure I’d have trouble maintaining the code myself!

Python Code Kata 2: Karate Chop (Binary Search)

I’ve decided it’d be fun to go through the Code Kata put together by Dave Thomas, co-author of The Pragmatic Programmer, and co-founder of Pragmatic Programmers, LLC (publishers of the finest tech books on the market right now; they can’t expand their selection fast enough for me).

Anyway, the first Code Kata, while being truly interesting to ponder, doesn’t really involve code, so I’m starting with Kata 2, which is a binary search exercise. It occurred to me while reading it over that I’d never written a binary search in Python, because you typically don’t need to, since list objects have a method called ‘index()’ which will hand you the index position of your search target within the list. I decided to force myself not to rely on it, as an exercise. If you’ve never had to go through a computer science curriculum, but want to excel at programming, I suggest doing all of the Kata, and getting your hands on at least one good computer science text (a computer science grad student hanging around is nice, too!)

Binary Search Background

So, the idea behind a binary search is often likened to looking up a name in a phone book, which is a pretty good analogy. A phone book has thousands of pages and perhaps millions of names, and yet it doesn’t take us long to find any given name, no matter which part of the book it’s in. You open the book to the middle, and the name you’re looking for is either on one of the facing pages, in the first half of the book, or in the last half of the book. After glancing at a single page, you’ve reduced your haystack by 50%. You flip to the middle of the proper half of the book, and repeat the process. If you had a phone book listing every person in the entire world (say, 7,000,000,000 people), you could find any given person in 33 steps or less. That’s pretty efficient (in big O notation, it’s O(logn)).

You can use a binary algorithm to tease your friends, too. Bet a friend $1 that you can guess any number they pick between 1 and 100 in 7 guesses if they tell you if your guess is too low or too high. If you use a binary algorithm to make your guesses, it’s not possible to lose this bet as far as I know. Here’s an example:

Your buddy chooses number 70. Your guesses will be:

  1. 50 (too low)
  2. 75 (too high)
  3. 62 (too low)
  4. 69 (too low)
  5. 72 (too high)
  6. 71 (too high)
  7. 70 (that’s it!)

One more. Your buddy chooses the number 1. Your guesses will be:

  1. 50 (too high)
  2. 25 (too high)
  3. 13 (too high)
  4. 7 (too high)
  5. 4 (too high)
  6. 2 (too high)
  7. 1  (that’s it!)

But what if your buddy wants to up the stakes by increasing the scope to numbers between 1 and 1000? Say he picks 485:

  1. 500 (too high)
  2. 250 (too low)
  3. 375 (too low)
  4. 438 (too low)
  5. 469 (too low)
  6. 485 (that’s it!)

You got lucky on that last one, but you might’ve negotiated more guesses for yourself. To do that and still guarantee a win, figure out what log2N is, where N is the highest number they can choose. So, log2100 = 7, and log21000 = 10. Another tip: when calculating your next guess, always round your division UP. Notice that log27000000000 = 32.7, which is where I got the max of 33 steps above.

Ok, now on to the Kata!!

Binary Search in Python Take 1: Traditional Binary Search

This is just my take on a traditional binary search algorithm in Python:

def chop(needle, haystack):
   """
   Takes an integer target (needle) and a sorted list of integers (haystack),
   and performs a binary search for the target. Return the integer index of the
   target in the array, or -1 if it's not found.
   """
   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      mid = (lo+hi)/2
      midval = haystack[mid]
      if needle < midval:
         hi = mid
      elif needle > midval:
         lo = mid
      else:
         print "Found %d at index %d" % (needle, mid)
         break

This makes assumptions. Namely, it assumes a 0-indexed, sorted list with no missing values. So, my test lists are always output from range(N), where N is the number of elements I want to search. It works well enough. I think a CS prof would accept this. Let’s move on.

What if I break it in *THREE* (oooooohhhhhh)

If breaking the list in two is so cool, why don’t we break it in three pieces? We can do one extra check, and then cut out 2/3 of the haystack in one fell swoop! Ok, let’s see how it goes:

def chop3(needle, haystack):

   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      brk1 = (lo+hi)/3
      brk2 = (2 * (lo+hi)) / 3
      val1 = haystack[brk1]
      val2 = haystack[brk2]

      if needle < val1:
         hi = brk1
      elif needle > val1 and needle < val2:
         hi = brk2
         lo = brk1
      elif needle > val2:
         lo = brk2

This one attempts to cut the initial array into three parts instead of bisecting. This won’t work, actually. At some point, you’ll shrink the list down two where the lower bound is 1/2 the upper bound, at which point, all hell breaks loose. Think about how this works when lo is 16 and hi is 32. 16+32 = 48, and 48/3 = 16! This means brk1 never actually changes, and you’ve created perfect conditions for an infinite loop.

Instead, define the hi and lo variables and break points in the list differently:

def chop3_take2(needle, haystack):
   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      incr = (hi-lo)/3
      brk1 = lo+incr
      brk2 = lo + (incr * 2)
      val1 = haystack[brk1]
      val2 = haystack[brk2]
      if needle < val1:
         hi = brk1
      elif needle > val1 and needle < val2:
         hi = brk2
         lo = brk1
      elif needle > val2:
         lo = brk2
      elif needle == val1:
         break
      elif needle == val2:
         break

This one will work. It’s slightly more heavy lifting than chop2, but not too bad. In the calculation for the brk variables, I’ve decided to calculate an increment defined as 1/3 the distance between lo and hi. So, if lo=16 and hi=32, instead of brk1 becoming 16 and looping forever, it becomes the point 1/3 of the way between lo and hi. It makes fewer assumptions, which in my experience means fewer bugs. (32-16)/3 == 5 (the remainder in integer division is truncated), so brk1 is going to be 21 instead of 16.

From simple tests, it doesn’t appear that this extra bit munging buys you anything. On a list with 1 million elements, chop() and chop3_take2() run in almost the exact same amount of time. Increasing the list size to 10 million elements starts to show evidence that the initial chop() routine is going to scale better than chop3_take2(). I think chop() is also nicer to read and understand, which makes it a better candidate for production code.

Combine the two?

I sorta wondered what would happen if I only did the 2/3 reduction on the inital pass, and then passed the part of the array where the target exists to chop(). The function would look a lot like chop3_take2, but inside the if…elif parts, instead of setting new vals for lo or hi, we’d pass the part of haystack bounded by lo and hi to chop().

A nice idea except that, if you recall, chop() assumes a zero-indexed array, and our chop3_take2() function sets new values for lo in some instances, so what we really need is a chop_non_zero_index() function. Here’s one:

def chop_non_zero_index(needle, haystack):
   lo = 0
   hi = len(haystack)
   while lo < hi:
      mid = (lo+hi)/2
      midval = haystack[mid]
      if needle < midval:
         hi = mid
      elif needle > midval:
         lo = mid
      else:
         print "Found %d at index %d" % (needle, mid)
         break

And then we pass portions of our haystack to chop_non_zero_index() from a function called chop_salad(). The chop_salad() function does an initial pass to cut out 2/3 of the candidates, then delegates to chop_non_zero_index() for the rest.

def chop_salad(needle, haystack):
   """Same as take 1 above, but this time we pass to chop_non_zero_index
   after the first pass.
   """
   lo = haystack[0]
   hi = len(haystack)
   while lo < hi:
      incr = (hi-lo)/3
      print "increment = %d" % incr
      brk1 = lo+incr
      print "brk1 = %d" % brk1
      brk2 = lo + (incr * 2)
      print "brk2 = %d" % brk2
      val1 = haystack[brk1]
      val2 = haystack[brk2]

      if needle < val1:
         chop_non_zero_index(needle, haystack[:brk1+1])
         break
      elif needle > val1 and needle < val2:
         chop_non_zero_index(needle, haystack[brk1:brk2+1])
         break
      elif needle > val2:
         chop_non_zero_index(needle, haystack[brk2:])
         break
      elif needle == val1:
         print "Found it @val1!"
         break
      elif needle == val2:
         print "Found it @val2!"
         break

At this point, we’re getting into doing more list manipulations, more comparisons, passing stuff between functions… not nearly as clean and simple as plain old chop(), and it doesn’t buy you anything in terms of performance either.

Extend the List Object, and using Recursion

I almost never extend built-in objects like list or dict. I couldn’t remember the last time I did it, but I wondered if there was any benefit to adding a bisect() operation to a list object, so I gave it a shot. It’s not hard or cool or anything. Here’s my list object extension:

class mylist(list):
   def bisect(self):
      return [mylist(self[:len(self)/2]),mylist(self[len(self)/2:])]

And to use it, you create a haystack that is a mylist object, then pass it to a recursive function that’ll call bisect()

def chop_bisect_object(needle, haystack):
   while True:
      head, tail = haystack.bisect()
      try:
         if needle < head[-1]:
            chop_bisect_object(needle, head)
            break
         elif needle > tail[0]:
            chop_bisect_object(needle, tail)
            break
         elif needle == head[0]:
            print "needle in head[0]"
            break
         elif needle == tail[0]:
            print "needle in tail[0]"
            break

if __name__ == "__main__":
   x = mylist(range(1000000))
   chop_bisect_object(needle,x)

Finally, Chef Guido Chops

Of course, this was all a fun exercise to get the synapses sparking. The reality is that, if all you really want to do is find out where in a list some value exists, you call l.index(val), where val is what you’re looking for. This method only runs slightly faster than the fastest algorithm I was able to come up with (for both smallish and largish haystacks), but it’s absolutely the most readable, and being that it’s so short, it has the least number of ways to introduce bugs, so it really is the right tool for the job in this case.

Onward

First, I don’t put forth these examples as flawless. Quite the contrary. I assume others have some cool ideas, or can point out some mistakes I’ve made, so please share in the comments!

I plan to go through the rest of the exercises in the Code Kata, and I also have a couple of interesting ones of my own I might throw in. I’ll try to post roughly once a week about this, so if you’re interested, subscribe to my RSS feed, or just the Python or CodeKata feeds.