Stop Doing Things That Don’t Work (a.k.a: Excel and Virtual Private Servers are Evil)

Note that I’m talking about using these tools in some kind of professional way, and more specifically, I’m talking about using Excel as a database, and using VPS hosting to host “professional” web sites. By “professional”, I mean something other than your personal blog, picture gallery, or other relatively inconsequential site.

Excel is not a database

Here’s the thing: Excel isn’t a database. Most people who don’t work in IT don’t seem to understand this, and they’re deathly afraid to actually communicate with anyone in IT, so they take matters into their own hands, and create problems so big that IT is forced to get involved, because at some point this spreadsheet becomes “critical” to some business function. Then IT gets even more bitter toward the non-IT folk, validating some of the reasons the non-IT folk went that route in the first place, and virtually guaranteeing that they won’t come to the IT group next time either.

So, if you don’t work in IT and are not a geek, know this: Excel is not a database. Excel is not meant to manage data on a long-term basis. For everything you can do with Excel, there is almost certainly a better tool for the job. This isn’t to say that Excel is good for *nothing*, just that it’s generally not good in places where data needs to be managed over the longer term, shared with others, and relied upon for day-to-day operations of a business or department.

Find someone in IT who seems nice and “deals with databases”, and ask them what their thoughts are on the topic. Then tell them the *actual problem you’re trying to solve*, and ask how they would approach it. You’re not likely to hear “Excel” in the reply unless Excel is so rampant in your company that it’s become a corporate standard for creating data fiefdoms, which would be bad.

A VPS is Not “Professional Grade”. Ignore Adverts to the Contrary

No, really – I mean it. I’ve done plenty of consulting for companies who need some kind of fire put out for one of their web sites. Not long into the conversation I learn (for about 50% of the calls I get) that the site is externally hosted on a VPS. Occasionally I get people whose sites are, or are supposed to be, hosted on dedicated servers, but the actual VPS/dedicated server isn’t really the whole issue. The issue is with how these things are configured, and your ability to do what you need with them.

Marketing for VPS and dedicated server hosting often say “full root access” somewhere in the list of features. There are also specs like the CPU speed, amount of RAM, and bandwidth limits. All of these come together to give the unwitting customer the notion that they’re getting full root access to some kind of behemouth server with all kinds of resources. However, things go downhill when you see things like cPanel, Plesk, or anything else that looks like “easy management through web-based administrative interface”. Again, this is probably fine for something that gets 100 hits per month or so and isn’t critical. The minute you can attach a cost to the problems that can arise with your site, you need to ditch these hosting plans.

Why? There are numerous reasons, but I’ll start with three:

  1. There’s typically no failover or “high availability”: if one machine goes down, or one VPS on the same hardware goes nuts, you’ve just ceased to exist on the internet at all.
  2. The CPU and RAM advertised is used mostly by the bloated software used to automate the management and monitoring of the systems (in other words, it’s used by your hosting provider, not your own application).
  3. The system configurations I’ve seen in these environments borders on retarded, and since the end user is managing all of this through a web interface, the only folks left to blame are the providers. So when you have problems, they’re guaranteed to be extra-challenging to solve.

What kinds of system configuration issues? Well, how about every service turned on, every port open (and not filtered) by default? How about downright broken service configurations, ranging from named.conf (DNS) configs specifying features that *can’t* work as configured, to crippled package management tools that disallow package modifications because doing so would break the monitoring/management tools, to php.ini files that turn on displayErrors and turn *off* log_errors. In general, logging configurations are poor or worse, making problem-solving an uphill battle. Every time I log into a VPS I am typically shocked and appalled at what I find. Even if it’s $5 a month, it’s not worth it.

Think about it: if you have a VPS and you have database corruption, what happens? You call support, who will probably just confirm or deny that actions performed by them or their automated routines had anything to do with the corruption (if they were forced to uncleanly reboot the machine, for example, that might explain things). Usually, they’ll say they don’t have any record of any events on the server that might be an issue, and you’ll need to fix it yourself (that’s what you wanted “full root access” for in the first place, right?).

So, you get a system or database guy to look into things. He’ll find that there are no logs, broken configurations, and when he tries to make a change, it’s either overwritten by these wacky automated management routines, or it breaks some part of the web-based management interface. He’ll also find that, while your web site uses about 128MB of the 512MB of available RAM, the host is running software that takes up double that amount of RAM. Wow, what a deal you got!

All of these issues, by the way, can also occur on dedicated servers, but what sets VPS services apart is the performance: it is, at the very best, unpredictable, and often bad. Some hosts try to market their way around this by charging you more money for “low-density” VPS “solutions”. Don’t buy it. It’s not a density issue. Even if you only share the hardware that runs your VPS with *one* other VPS, if that other VPS goes crazy and starts performing huge amounts of disk reads and writes, your site, even if there are only 3 people looking at it, is going to be slow.

The solution? Well, evaluate whether or not you really need the control a VPS gives you. If you’re just running WordPress, a simple CMS, or a brochure web site, you almost certainly don’t need a VPS. Get a web hosting plan. They often offer one-click installations of wordpress and CMSes like PostNuke, PHP-Nuke, Joomla, Drupal, etc, along with phpMyAdmin for doing database operations. LinuxLaboratory.org runs on Drupal and MySQL, and houses a bunch of articles I’ve written about Linux, System/DB Administration, etc., that I’ve written over the years. It also presents a feed of the content on this blog, and it’s been running on a simple, cheap, web hosting plan for probably 7 or 8 years now. My uptime is better than the sites of friends of mine who decided they needed the control of a VPS. Same goes for this blog (though it’s a different provider). Heck, my beer blog runs on a *free* web hosting solution at DreamHost. It’s not super fast, but aside from that it serves its purpose well, and they have one-click installations for just about everything.

If you need to launch some kind of site that requires things not offered by a web hosting plan, then chances are you’re developing the site, or have some budget or staff for helping you setup/manage/troubleshoot the services you’ll run there. Check out Amazon EC2 and Google AppEngine, and look into dedicated hosting to see if any of those meet your needs.

If you have an IT department, you could, of course, try to work with them on a solution. This is almost always the best solution over the long haul.

Generating Reports with Charts Using Python: ReportLab

UPDATE (Mar. 26, 2010) Just realized I never posted the link to the PDF the code here generates: here it is. My bad.

I’ve been doing a little reporting project, and I’ve been searching around for quite some time for a good graphing and charting solution for general-purpose use. I had come across ReportLab before, but it just looked so huge and convoluted to me, given the simplicity of what I wanted at the time, that I moved on. This time was different.

This time I needed a lot of the capabilities of ReportLab. I needed to generate PDFs (this is not a web-based project), I needed to generate charts, and I wanted the reports I was generating to contain various types of text objects in addition to the charts and such.

I took the cliff-dive into the depths of the ReportLab documentation. I discovered three things:

  1. There is quite a lot of documentation
  2. ReportLab is quite a capable library
  3. The documentation actually defies the simplicity of the library.

It’s a decent bit easier than it looks in the documentation, so I thought I’d take you through an example. This example is dead simple, but I still think it’s a little more practical than what I was able to find. The ReportLab documentation refers to what sounds like a great reference example, but the problem is that the tarball I downloaded didn’t contain the files it was making reference to :(

I started out by investigating one of the small example projects in the “demo” directory of the ReportLab directory. It was called “gadflypaper” (Ironically, written by Aaron Watters. I worked in the cube outside of his office for several months last year — Hi Aaron!). Aaron’s example was very simple, and a great starting point to start understanding how to put together a very basic document. It’s not infested with abstractions — just a few simple functions, and a lot of text. I ripped out a lot of the text until I had just an example of each function in action, and then set to work.

The Basic Process

To simplify the work of doing page layout minutiae, I (like the example) used PLATYPUS, which is built into ReportLab and abstracts away some of the low-level layout details. If you *want* low-level control, however, you can do whatever you want with the pdfgen module, also included (and PLATYPUS is basically a layer built from it).

With PLATYPUS, you get access to a bunch of prebuilt layout-related objects, representing things like paragraphs, tables, frames, and other things. You also have access to page templates, so that dealing with things like frame placement is a little easier.

So, to give you a rundown of the high-level steps:

  1. Choose a page template, and use it to create a document object.
  2. Create your “flowables” (paragraphs, charts, images, etc), and put them all into a list object. In ReportLab documentation, this is often referred to as a list named “story”
  3. Pass the list object to the build() method of the document object you created in step 1.

Phase 1: Let’s Get Something Working

As a first phase, let’s just make sure we can do the simplest of documents. Here’s some code that should work if you have a good installation of ReportLab (I’m using whatever was the latest version in early October, 2008.) Note that we’ll be cleaning this up and simplifying it as we go along.

#!/usr/bin/env python

from reportlab.platypus import *
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.rl_config import defaultPageSize
from reportlab.lib.units import inch

PAGE_HEIGHT=defaultPageSize[1]
styles = getSampleStyleSheet()
Title = Paragraph("Generating Reports with Python", styles["Heading1"])
Author = Paragraph("Brian K. Jones", styles["Normal"])
URL = Paragraph("http://www.protocolostomy.com", styles["Normal"])
email = Paragraph("bkjones +_at_+ gmail.com", styles["Normal"])
Abstract = Paragraph("""This is a simple example document that illustrates how to put together a basic PDF with a chart.
I used the PLATYPUS library, which is part of ReportLab, and the charting capabilities built into ReportLab.""", styles["Normal"])

Elements = [Title, Author, URL, email, Abstract]

def go():
   doc = SimpleDocTemplate('gfe.pdf')
   doc.build(Elements)

go()

Not a lot of actual code here. It’s mostly variable assignments. The variables are mostly just strings, but because I want to control how they’re arranged, I need to make them “Flowables”. Remember that PLATYPUS puts together a document by processing a list of Flowable objects and drawing them onto the document. So all of our strings are “Paragraph” objects. You’ll note, too, that Paragraph objects can be styled using definitions accessed from getSampleStyleSheet, which returns a ‘style object’. If you create one of these at the Python interpreter, and call the resulting object’s ‘list()’ function, you’ll see what styles are available, and you’ll also see what attributes each style has. Try running this code to make sure things work. Change the strings if you like :)

Phase 2: Simple Cleanup

I haven’t yet created insane layers of abstraction in my own code, because I’ve been working on deadlines and doing things that are relatively simple. This will inevitably change :)  However, there are some things you can do to make life a bit simpler and cleaner.

#!/usr/bin/env python

from reportlab.platypus import *
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.rl_config import defaultPageSize
from reportlab.lib.units import inch

PAGE_HEIGHT=defaultPageSize[1]
styles = getSampleStyleSheet()
Title = "Generating Reports with Python"
Author = "Brian K. Jones"
URL = "http://www.protocolostomy.com"
email = "bkjones@gmail.com"
Abstract = """This is a simple example document that illustrates how to put together a basic PDF with a chart.
I used the PLATYPUS library, which is part of ReportLab, and the charting capabilities built into ReportLab."""
Elements=[]
HeaderStyle = styles["Heading1"]
ParaStyle = styles["Normal"]
PreStyle = styles["Code"]

def header(txt, style=HeaderStyle, klass=Paragraph, sep=0.3):
    s = Spacer(0.2*inch, sep*inch)
    Elements.append(s)
    para = klass(txt, style)
    Elements.append(para)

def p(txt):
    return header(txt, style=ParaStyle, sep=0.1)

def go():
    doc = SimpleDocTemplate('gfe.pdf')
    doc.build(Elements)

header(Title)
header(Author, sep=0.1, style=ParaStyle)
header(URL, sep=0.1, style=ParaStyle)
header(email, sep=0.1, style=ParaStyle)
header("ABSTRACT")
p(Abstract)

go()

So, this is still simple. Simplistic, even. All I did was move the repetitive bits to functions. The ‘header’ and ‘p’ functions are (for now) unaltered from the gadflypaper demo. The good part here is that strings can be defined as ‘just strings’. Paragraphs and headers are just plain old string variables, and then at the bottom I just call the ‘header’ and ‘p’ functions and pass in the variables. The order in which I call the functions determines the order my document will appear in.

Phase 3

There’s kind of an issue with the way these functions work, at least for my needs. The problem is that they just go ahead and add things to the “Elements” list automagically. This might be ok for some quick and dirty tasks, but in my case I found that I needed more control. Things were crossing page boundaries where I didn’t want them to, and if I want to add formatting or apply built-in functionality, I can’t do it on a per-object basis without loading up the argument list.

I also wanted to have a relatively easy way to move *sections* of reports around, where a section might consist of a heading, a paragraph, and a source code listing — three different “Flowable” objects. So I altered these functions to make them return flowables instead of just adding things to the Elements list for me:

#!/usr/bin/env python

from reportlab.platypus import *
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.rl_config import defaultPageSize
from reportlab.lib.units import inch

PAGE_HEIGHT=defaultPageSize[1]
styles = getSampleStyleSheet()
Title = "Generating Reports with Python"
Author = "Brian K. Jones"
URL = "http://www.protocolostomy.com"
email = "bkjones@gmail.com"
Abstract = """This is a simple example document that illustrates how to put together a basic PDF with a chart.
I used the PLATYPUS library, which is part of ReportLab, and the charting capabilities built into ReportLab."""
Elements=[]
HeaderStyle = styles["Heading1"]
ParaStyle = styles["Normal"]
PreStyle = styles["Code"]

def header(txt, style=HeaderStyle, klass=Paragraph, sep=0.3):
    s = Spacer(0.2*inch, sep*inch)
    para = klass(txt, style)
    sect = [s, para]
    result = KeepTogether(sect)
    return result

def p(txt):
    return header(txt, style=ParaStyle, sep=0.1)

def pre(txt):
    s = Spacer(0.1*inch, 0.1*inch)
    p = Preformatted(txt, PreStyle)
    precomps = [s,p]
    result = KeepTogether(precomps)
    return result

def go():
    doc = SimpleDocTemplate('gfe.pdf')
    doc.build(Elements)

mytitle = header(Title)
myname = header(Author, sep=0.1, style=ParaStyle)
mysite = header(URL, sep=0.1, style=ParaStyle)
mymail = header(email, sep=0.1, style=ParaStyle)
abstract_title = header("ABSTRACT")
myabstract = p(Abstract)
head_info = [mytitle, myname, mysite, mymail, abstract_title, myabstract]
Elements.extend(head_info)

code_title = header("Basic code to produce output")
code_explain = p("""This is a snippet of code. It's an example using the Preformatted flowable object, which
                 makes it easy to put code into your documents. Enjoy!""")
code_source = pre("""
def header(txt, style=HeaderStyle, klass=Paragraph, sep=0.3):
    s = Spacer(0.2*inch, sep*inch)
    para = klass(txt, style)
    sect = [s, para]
    result = KeepTogether(sect)
    return result

def p(txt):
    return header(txt, style=ParaStyle, sep=0.1)

def pre(txt):
    s = Spacer(0.1*inch, 0.1*inch)
    p = Preformatted(txt, PreStyle)
    precomps = [s,p]
    result = KeepTogether(precomps)
    return result

def go():
    doc = SimpleDocTemplate('gfe.pdf')
    doc.build(Elements)
    """)
codesection = [code_title, code_explain, code_source]
src = KeepTogether(codesection)
Elements.append(src)
go()

So, this isn’t too bad. It’s still functional programming. I’ll revamp it in another post to use objects, but for those readers who are still learning all of this, it might help to leave out the abstraction for now. What I liked about the gadflypaper demo was that it was quick and dirty. You could read it line by line, top to bottom, and understand what just happened without jumping back and forth between main() code and object code.

As you can see, I’m using the KeepTogether() method, in two different ways. In the functions, I use it so I don’t have to go back later and manually add spacer elements to the Elements array. Then, toward the bottom, I create a preformatted code snippet, and I use the KeepTogether method to make sure that all parts in the code section stay together without flowing across a page boundary. There are other options you can use to customize how your document deals with ‘orphan’ and ‘widow’ elements as well, so definitely check out the documentation for that (or keep reading this blog. i’ll get to it eventually).

So what’s left?

Phase 4: The Grand Finale

The rest of the code I add is to connect to a database, make a query, and then pass the data returned from the database to a function that creates a chart. I add the chart to the Elements, and we’re in business!

#!/usr/bin/env python
import MySQLdb
import sys
import string
from reportlab.graphics.shapes import Drawing
from reportlab.graphics.charts.linecharts import HorizontalLineChart
from reportlab.platypus import *
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.rl_config import defaultPageSize
from reportlab.lib.units import inch

dbhost = 'localhost'
dbname = 'httplog'
dbuser = 'jonesy'
dbpasswd = 'mypassword'

PAGE_HEIGHT=defaultPageSize[1]
styles = getSampleStyleSheet()
Title = "Generating Reports with Python"
Author = "Brian K. Jones"
URL = "http://www.protocolostomy.com"
email = "bkjones@gmail.com"
Abstract = """This is a simple example document that illustrates how to put together a basic PDF with a chart.
I used the PLATYPUS library, which is part of ReportLab, and the charting capabilities built into ReportLab."""
Elements=[]
HeaderStyle = styles["Heading1"]
ParaStyle = styles["Normal"]
PreStyle = styles["Code"]

def header(txt, style=HeaderStyle, klass=Paragraph, sep=0.3):
    s = Spacer(0.2*inch, sep*inch)
    para = klass(txt, style)
    sect = [s, para]
    result = KeepTogether(sect)
    return result

def p(txt):
    return header(txt, style=ParaStyle, sep=0.1)

def pre(txt):
    s = Spacer(0.1*inch, 0.1*inch)
    p = Preformatted(txt, PreStyle)
    precomps = [s,p]
    result = KeepTogether(precomps)
    return result

def connect():
   try:
      conn1 = MySQLdb.connect(host = dbhost, user = dbuser, passwd = dbpasswd, db = dbname)
      return conn1
   except MySQLdb.Error, e:
      print "Error %d: %s" % (e.args[0], e.args[1])
      sys.exit (1)

def getcursor(conn):
   cursor = conn.cursor()
   return cursor

def totalevents_hourly(rcursor):
    rcursor.execute("""select hour, count(*) as hits from hits group by hour;""")
    return rcursor

def graphout(catnames, data):
    drawing = Drawing(400, 200)
    lc = HorizontalLineChart()
    lc.x = 30
    lc.y = 50
    lc.height = 125
    lc.width = 350
    lc.data = data
    catNames = catnames
    lc.categoryAxis.categoryNames = catNames
    lc.categoryAxis.labels.boxAnchor = 'n'
    lc.valueAxis.valueMin = 0
    lc.valueAxis.valueMax = 1500
    lc.valueAxis.valueStep = 300
    lc.lines[0].strokeWidth = 2
    lc.lines[0].symbol = makeMarker('FilledCircle') # added to make filled circles.
    lc.lines[1].strokeWidth = 1.5
    drawing.add(lc)
    return drawing

def go():
    doc = SimpleDocTemplate('gfe.pdf')
    doc.build(Elements)

mytitle = header(Title)
myname = header(Author, sep=0.1, style=ParaStyle)
mysite = header(URL, sep=0.1, style=ParaStyle)
mymail = header(email, sep=0.1, style=ParaStyle)
abstract_title = header("ABSTRACT")
myabstract = p(Abstract)
head_info = [mytitle, myname, mysite, mymail, abstract_title, myabstract]
Elements.extend(head_info)

code_title = header("Basic code to produce output")
code_explain = p("""This is a snippet of code. It's an example using the Preformatted flowable object, which
                 makes it easy to put code into your documents. Enjoy!""")
code_source = pre("""
def header(txt, style=HeaderStyle, klass=Paragraph, sep=0.3):
    s = Spacer(0.2*inch, sep*inch)
    para = klass(txt, style)
    sect = [s, para]
    result = KeepTogether(sect)
    return result

def p(txt):
    return header(txt, style=ParaStyle, sep=0.1)

def pre(txt):
    s = Spacer(0.1*inch, 0.1*inch)
    p = Preformatted(txt, PreStyle)
    precomps = [s,p]
    result = KeepTogether(precomps)
    return result

def go():
    doc = SimpleDocTemplate('gfe.pdf')
    doc.build(Elements)
    """)
codesection = [code_title, code_explain, code_source]
src = KeepTogether(codesection)
Elements.append(src)

hourly_title = header("Hits logged, per hour")
hourly_explain = p("""This shows aggregate hits across a 24-hour period. """)

conn = connect()
cur = getcursor(conn)
te_hourly = totalevents_hourly(cur)
catnames = []
data = []
values = []
for row in te_hourly:
   catnames.append(str(row[0]))
   values.append(row[1])

data.append(values)
hourly_chart = graphout(catnames, data)
hourly_section = [hourly_title, hourly_explain, hourly_chart]
Elements.extend(hourly_section)

go()

So, I’ve muddied things up a bit. If you’ve written database code before, you can just look past it all. I don’t do anything magical there. In fact, the chart creation isn’t magical either. I’m sure there’s even a cleaner way to do it – but this works for the moment.

I get a connection object, use it to get a cursor, then pass the cursor to the query function, which passes back…. a query object: te_hourly. The chart I’m going to create needs ‘category’ names for the y-axis values, and then values to plot on the chart. In my case, the hour is row[0] and the total hits for that hour are in row[1]. I build my catnames and data lists, and then create “hourly_chart” by passing my lists to the graphout function. Finally, I add the chart, along with its title and explanation to the Elements list. Done!

For its part, the graphout function is mostly just a bunch of parameters I need to configure my HorizontalLineChart object. Once the chart is all set to go, I need to add it onto my Drawing object, and return the Drawing flowable object.

Not yet what I’d call “Beautiful Code”, but it works, and it’s likely to help some other folks wade through the ‘getting started’ hump with ReportLab. Hope it was useful.

When will the iPhone get features other phones have had for years?

The iPhone’s success is somewhat in spite of itself. Sure, you can access all kinds of applications that do all kinds of things, and they’ve really gotten a few things right, like (IMHO) the phone interface, but some of the things it lacks are starting to become embarrassments.

For example, for what, exactly, is the bluetooth interface intended? Probably 98% of the people who even know what bluetooth is and care that the iPhone has this support want to use it almost exclusively so that they can use their bluetooth headsets. Where I live, the law forbids using a phone while driving unless you have a hands-free capability (a.k.a a bluetooth headset). Unfortunately, the iPhone does *not* meet the need here, because there are NO VOICE COMMANDS! Sure, there are some cars that will take on some of that responsibility, and there are headsets coming out that have the commands built into them, but having this built into the phone is, imo, the right way to do this. I shouldn’t have to be in my car, or need a particular headset to get this feature.

Also, there is no MMS support on the iPhone, and the AT&T hack to work around this limitation might actually be *worse* than just having no solution at all. MMS is used to send and receive photos (among other things, like longer text messages and even video) on your phone. Again, lots of phones have had this for several years now. What happens when someone sends a photo? Well, AT&T stores the photo somewhere, and sends you a text message with a link to a site, and a username and password to use to access the photo. The username and password are both jumbled bunches of characters, and you’re supposed to stare at this text message for an hour until you can commit it to memory, and then click the link and go log in to see the picture. You can’t *do* anything with the picture, mind you — you can look at it, and then you can close your browser. I just tell people who want to send me stuff that the iPhone doesn’t support it, because it basically doesn’t.

There are other annoyances as well, like the inability to read email in landscape view, the crippled map application (ask in the comments if you want more on that one), and the App Store closes just because you initiated an upgrade (forcing you to go back into the store manually if you, say, have more than one app to upgrade), and more!

I bought my iPhone before the 3G came out, but the 3G doesn’t address these issues either. I’ll probably get a new phone in about one year from now, and if these issues still aren’t solved, it is not likely (given the rate that competition is catching up to the other *useful* iPhone features) that I’ll get an iPhone. So if you’re thinking about developing an app for Android, do it!

Clone a table in MySQL without killing your server

So, I recently ran into one of those situations where a customer complains that his MySQL database is slow, and “it worked great until about two weeks ago”. The first thing I look at in these situations is not the queries or any code, but the indexes. Once I determined that the indexes were almost nonsensical, I had a look at the queries themselves, which proved that the indexes were, in fact, nonsensical. Running the queries as written in the code, from a mysql shell, with EXPLAIN, I was able to further confirm that the indexes (most of them, anyway) were never used.

Easy right? Just reindex the table!

NOTE: I’m going to skip the talk about all of the database design issues on this gig and just go straight to solving the immediate problem at hand. Just know that I had nothing to do with the design.

But, supposing this table has 15 million rows and is running on a machine with 2GB of RAM and only a  single (well, mirrored) drive? It will take forever to reindex that table, and the machine can’t be made completely unavailable at any time, either by driving the load up so high that the Linux OOM killer reaps the mysql process, or by putting a lock on the table for, oh, 3 days or so :-)

The solution is to create a brand new table, which I’ll call “good” using the “SHOW CREATE TABLE” output from the bad table, which I’ll call “bad”. I do this right in the shell, actually. I run “SHOW CREATE TABLE bad”, cut and paste the output, remove the part that defines the indexes, rename the table to “good”, and bam! New, empty table.

Of course, you still have to define your new indexes, so run whatever statements are needed to do that. You might even want to populate it with some test data (say, 10000 rows from the bad table) to test your new indexes are being used as expected (or that they can be by altering the query and getting back the same results… but faster).

Once done, it’s time to start copying rows from bad to good. I’ve written a shell script to help with that part. It’s designed to run on a Linux host running MySQL.

The variables at the top of the script are pretty self-explanatory, except to note that there are separate NEWDB and OLDDB variables in case your new table also lives in a new database. The INCREMENT is the number of rows you want to copy over at a time. If you set INCREMENT to 1000, it’ll copy 1000 rows, check the load average, and if it’s under MAXLOAD, it’ll copy over another 1000 rows. It also keeps track of the number of rows in each database as it goes, since writes are still happening on the bad table while this is going on.

So here’s my nibbler script, in shell. I would’ve written it in Python, but it wasn’t my machine, and I couldn’t install the python mysql module :-/

#!/bin/bash

###
### Written by Brian K. Jones (bkjones@gmail.com)
### 
### Takes an increment, old db, and new db, and copies rows from olddb to newdb. 
### Along the way, it'll check system load and sleep if it's too high. 
### There's too much hard-coding right now, but it's a useful template, and 
### has been tested. The script takes no CLI arguments. 
###

INCR=10000
NEWDB=shiny
OLDDB=busted
OLDTABLE=bad
NEWTABLE=good
MAXLOAD=3
DBUSER=mydbuser
DBPASS=mydbpass

rows_old=`mysql -N -D ${OLDDB} -u ${DBUSER} -p${DBPASS} -e "select count(*) from ${OLDTABLE}"`
echo "rows_old is now ${rows_old}" 

rows_new=`mysql -N -D ${NEWDB} -u ${DBUSER} -p${DBPASS} -e "select count(*) from ${NEWTABLE}"`     ## num. rows in new table
echo "rows_new is now ${rows_new}" 

for (( rows_new=$rows_new; rows_new < $rows_old; rows_new+=$INCR )); do
        if [ $((rows_old - (rows_new + INCR))) -gt 0 ]; then         ## Check to see if there are at least $INCR rows left to copy over. 
            mysql -N -D ${NEWDB} -u ${DBUSER} -p${DBPASS} -e "INSERT INTO ${NEWTABLE} SELECT * FROM ${OLDDB}.${OLDTABLE} LIMIT ${rows_new},${INCR}"
      
            while [ "`awk -v max=${MAXLOAD} '$1 > max {print "TRUE"}' /proc/loadavg`" = "TRUE" ]; do 
               echo "sleeping due to load > ${MAXLOAD}"
               sleep 30
            done
            # we update rows_old because it'll be growing while this script runs. 
            rows_old=`mysql -N -D ${OLDDB} -u ${DBUSER} -p${DBPASS} -e "select count(*) from ${OLDTABLE}" `
            rows_new=`mysql -N -D ${NEWDB} -u ${DBUSER} -p${DBPASS} -e"select count(*) from ${NEWTABLE}"`
            time=`date +%R`
            echo "${time} -- rows_new = ${rows_new}, rows_old = ${rows_old}"  

        else                           ## There are < $INCR rows left. Select remaining rows. 
            remaining=$((rows_old - rows_new))
            mysql -N -D ${NEWDB} -u ${DBUSER} -p${DBPASS} -e "INSERT INTO ${NEWTABLE} SELECT * FROM ${OLDDB}.${OLDTABLE} LIMIT ${rows_new},${remaining}" 
            echo "All done!" 
            exit
        fi
done

Practical Django Projects Trouble?

I picked up Practical Django Projects by James Bennett just the other day. The book walks through the building of a few different projects, and tries to lead you down the path of “best practices” along the way. However, the book’s publisher decided to release the first edition of this book on the heels of the Django 1.0 release, and the book is specific to pre-1.0 versions of Django.

If you’re kinda new to Django, you’re likely to be lost. I had only done a couple of really simple tutorial projects before picking up this book, and was really unnerved when I hit a seemingly impossible obstacle on page 24 — the dreaded TinyMCE integration. At that point in the book, it starts to become really obvious that this is not a 1.0-compliant Django book.

If you’re having trouble with this book and you’re using Django 1.0, I can give you two tips that are saving my bacon as I work through these bumps:

  1. Go to the porting guide for 1.0, read it, and bookmark it. Better yet, just leave it open in your browser all the time.
  2. For the TinyMCE issue, there’s more than one possible issue — but if your development server is giving you 404s after doing a redirect (302) trying to find the tiny_mce.js file, move the relevant line in the URLConf up. The redirect is a hint that your url didn’t match anything.

The porting guide fixes at least 3 issues in the book by page 40.

Also, if you’re a django n00b looking for help, I started a #django-newbie IRC channel on irc.freenode.net. Come join us!

There’s now a #django-newbie chan on freenode

So, I’ve really just had it with the #django chan on freenode. I’ve gone there several times for help only to be treated like whale turd, essentially because I’m a newbie to django I suppose. They seem to have an issue with that there. It’s not that you can’t get answers there, mind you, it’s that the answers are served up cold and begrudgingly with generous side portions of arrogance and self-importance. This isn’t a judgment of every person on the chan — it’s the general take-away I’ve personally had.

However, I really like django :-)

So, rather than just whine and then sit idly by hoping things change, I started a chan for django newbies called, appropriately, #django-newbie on freenode. If you’re a newbie, come join in! If you’re not, come help the newbies find the docs they need to get where they want to go.