Archive for the ‘Python’ Category

Clone a table in MySQL without killing your server

Thursday, October 9th, 2008

So, I recently ran into one of those situations where a customer complains that his MySQL database is slow, and “it worked great until about two weeks ago”. The first thing I look at in these situations is not the queries or any code, but the indexes. Once I determined that the indexes were almost nonsensical, I had a look at the queries themselves, which proved that the indexes were, in fact, nonsensical. Running the queries as written in the code, from a mysql shell, with EXPLAIN, I was able to further confirm that the indexes (most of them, anyway) were never used.

Easy right? Just reindex the table!

NOTE: I’m going to skip the talk about all of the database design issues on this gig and just go straight to solving the immediate problem at hand. Just know that I had nothing to do with the design.

But, supposing this table has 15 million rows and is running on a machine with 2GB of RAM and only a  single (well, mirrored) drive? It will take forever to reindex that table, and the machine can’t be made completely unavailable at any time, either by driving the load up so high that the Linux OOM killer reaps the mysql process, or by putting a lock on the table for, oh, 3 days or so :-)

The solution is to create a brand new table, which I’ll call “good” using the “SHOW CREATE TABLE” output from the bad table, which I’ll call “bad”. I do this right in the shell, actually. I run “SHOW CREATE TABLE bad”, cut and paste the output, remove the part that defines the indexes, rename the table to “good”, and bam! New, empty table.

Of course, you still have to define your new indexes, so run whatever statements are needed to do that. You might even want to populate it with some test data (say, 10000 rows from the bad table) to test your new indexes are being used as expected (or that they can be by altering the query and getting back the same results… but faster).

Once done, it’s time to start copying rows from bad to good. I’ve written a shell script to help with that part. It’s designed to run on a Linux host running MySQL.

The variables at the top of the script are pretty self-explanatory, except to note that there are separate NEWDB and OLDDB variables in case your new table also lives in a new database. The INCREMENT is the number of rows you want to copy over at a time. If you set INCREMENT to 1000, it’ll copy 1000 rows, check the load average, and if it’s under MAXLOAD, it’ll copy over another 1000 rows. It also keeps track of the number of rows in each database as it goes, since writes are still happening on the bad table while this is going on.

So here’s my nibbler script, in shell. I would’ve written it in Python, but it wasn’t my machine, and I couldn’t install the python mysql module :-/


#!/bin/bash

###
### Written by Brian K. Jones (bkjones@gmail.com)
###
### Takes an increment, old db, and new db, and copies rows from olddb to newdb.
### Along the way, it'll check system load and sleep if it's too high.
### There's too much hard-coding right now, but it's a useful template, and
### has been tested. The script takes no CLI arguments.
###

INCR=10000
NEWDB=shiny
OLDDB=busted
OLDTABLE=bad
NEWTABLE=good
MAXLOAD=3
DBUSER=mydbuser
DBPASS=mydbpass

rows_old=`mysql -N -D ${OLDDB} -u ${DBUSER} -p${DBPASS} -e "select count(*) from ${OLDTABLE}"`
echo "rows_old is now ${rows_old}" 

rows_new=`mysql -N -D ${NEWDB} -u ${DBUSER} -p${DBPASS} -e "select count(*) from ${NEWTABLE}"`     ## num. rows in new table
echo "rows_new is now ${rows_new}" 

for (( rows_new=$rows_new; rows_new < $rows_old; rows_new+=$INCR )); do
        if [ $((rows_old - (rows_new + INCR))) -gt 0 ]; then         ## Check to see if there are at least $INCR rows left to copy over.
            mysql -N -D ${NEWDB} -u ${DBUSER} -p${DBPASS} -e "INSERT INTO ${NEWTABLE} SELECT * FROM ${OLDDB}.${OLDTABLE} LIMIT ${rows_new},${INCR}"

            while [ "`awk -v max=${MAXLOAD} '$1 > max {print "TRUE"}' /proc/loadavg`" = "TRUE" ]; do
               echo "sleeping due to load > ${MAXLOAD}"
               sleep 30
            done
            # we update rows_old because it'll be growing while this script runs.
            rows_old=`mysql -N -D ${OLDDB} -u ${DBUSER} -p${DBPASS} -e "select count(*) from ${OLDTABLE}" `
            rows_new=`mysql -N -D ${NEWDB} -u ${DBUSER} -p${DBPASS} -e"select count(*) from ${NEWTABLE}"`
            time=`date +%R`
            echo "${time} -- rows_new = ${rows_new}, rows_old = ${rows_old}"  

        else                           ## There are < $INCR rows left. Select remaining rows.
            remaining=$((rows_old - rows_new))
            mysql -N -D ${NEWDB} -u ${DBUSER} -p${DBPASS} -e "INSERT INTO ${NEWTABLE} SELECT * FROM ${OLDDB}.${OLDTABLE} LIMIT ${rows_new},${remaining}"
            echo "All done!"
            exit
        fi
done

Practical Django Projects Trouble?

Wednesday, October 8th, 2008

I picked up Practical Django Projects by James Bennett just the other day. The book walks through the building of a few different projects, and tries to lead you down the path of “best practices” along the way. However, the book’s publisher decided to release the first edition of this book on the heels of the Django 1.0 release, and the book is specific to pre-1.0 versions of Django.

If you’re kinda new to Django, you’re likely to be lost. I had only done a couple of really simple tutorial projects before picking up this book, and was really unnerved when I hit a seemingly impossible obstacle on page 24 — the dreaded TinyMCE integration. At that point in the book, it starts to become really obvious that this is not a 1.0-compliant Django book.

If you’re having trouble with this book and you’re using Django 1.0, I can give you two tips that are saving my bacon as I work through these bumps:

  1. Go to the porting guide for 1.0, read it, and bookmark it. Better yet, just leave it open in your browser all the time.
  2. For the TinyMCE issue, there’s more than one possible issue — but if your development server is giving you 404s after doing a redirect (302) trying to find the tiny_mce.js file, move the relevant line in the URLConf up. The redirect is a hint that your url didn’t match anything.

The porting guide fixes at least 3 issues in the book by page 40.

Also, if you’re a django n00b looking for help, I started a #django-newbie IRC channel on irc.freenode.net. Come join us!

There’s now a #django-newbie chan on freenode

Wednesday, October 8th, 2008

So, I’ve really just had it with the #django chan on freenode. I’ve gone there several times for help only to be treated like whale turd, essentially because I’m a newbie to django I suppose. They seem to have an issue with that there. It’s not that you can’t get answers there, mind you, it’s that the answers are served up cold and begrudgingly with generous side portions of arrogance and self-importance. This isn’t a judgment of every person on the chan — it’s the general take-away I’ve personally had.

However, I really like django :-)

So, rather than just whine and then sit idly by hoping things change, I started a chan for django newbies called, appropriately, #django-newbie on freenode. If you’re a newbie, come join in! If you’re not, come help the newbies find the docs they need to get where they want to go.

A merger, migration, mysql, python, and more news

Tuesday, September 30th, 2008

First, AddThis.com (where I was the director of IT) and Clearspring have merged! A side effect of that is that I’m now (happily, on purpose, by choice) a full-time consultant! I’ll have a web site up soonish. Until then, check back here for updates. If you’re a tech firm who needs help, and don’t mind remote workers, send mail to bkjones at Google’s mail service (.com).

Some folks thought I’d passed away due to the uncharacteristic lull in posting frequency on this blog. I’m very much alive — but working for a startup and maintaining a consulting business simultaneously is hard, especially when two large projects fall into your lap at the same time. So what have I been up to?

Well, as part of the now-public merger between the company I worked for and the new company, I was doing the migration of our infrastructure to theirs. That involved rewriting some backup scripts, writing a data synchronization routine (complete with backout capability — I’ll post code samples after I clean out all the site-specific stuff — it’s python and MySQLdb!), set up a different (and kinda cool) MySQL replication scheme, a different (and also kinda cool) failover scheme, test, test, test, coordinate with everyone down in VA at the new company to make sure everything was working and in place, and then “flip the big ol’ switch”.

Now I’ll be writing even more scripts, planning even more migration of infrastructure services components, doing more testing, and retiring old AddThis assets.

I’m happy to say that the folks at Clearspring have been an awesome team to work with. The culture there is kind of like “nobody here is any better than anybody else”, and it has a rather dramatic effect on productivity compared to other places where everyone thinks they’re not allowed to be wrong and have to preserve their jobs and make other people look stupid. Nope, if you’re a junior guy or someone from another department with a good idea, and it works, that’s great! It’ll be poked and prodded at, and if it passes muster, it’ll be deployed. And why shouldn’t it be? Generally, because egos get in the way. In my book, if you don’t stand in the way of something great, you’re more of a hero than if you do. The ideas don’t all have to be yours.

And, get this, they document their stuff! And it’s easy to use and browse! I’m on a documentation project right now for another client, and it’s challenging just to get people to talk about documentation. :-/

So what’s next for me? I’m honestly not sure at the moment. I’m consulting, it’s going well, I have more than enough work, but I also have a little bit of ‘the bug’, so don’t be surprised if I come back here and tell you that I’ve joined some “social multimedia web 2.0 mindmapping in the cloud with sharing” company or something like that in the near future.

More Adventures in Amazon EC2 (and EBS)

Wednesday, August 27th, 2008

Short Version: You can find a fantastic video here about bundling customized AMIs and registering them with Amazon so that you can launch as many instances of your new AMI as you want. The video is so good that I don’t bother writing out the steps to do the bundling (it would be pretty darn long). These are some notes about launching an AMI, customizing it, and mounting an EBS volume to it (the video linked above doesn’t cover EBS). Also, check out the ElasticFox tool which is a very good GUI for doing simple EC2 operations. Nice if you’re just getting started or doing some simple tests.

There are two ways you can go about creating a custom machine image (AMI) for use with Amazon EC2: You can create an image locally by dd’ing to a file, mounting it with “-o loop” creating a filesystem on it, and bootstrapping the whole thing yourself, or you can grab an existing AMI that will serve as a “good enough” base for you to make your customizations, then bundle the customized image.

I’ll be talking about the latter option, where you identify a “good enough” image, customize it for your needs, and save that as your AMI. Unless you’re doing some kind of highly specialized installation, or are a control freak, you shouldn’t really need to start from scratch. I was just building a test image, and wanted a CentOS 5.2 base installation.

Here’s the command you can use to browse the AMIs you have access to (they’re either public, or they’re yours):

$ ec2dim -a

If that command looks funny to you, it’s likely because you’re used to seeing the really long versions of the AWS commands. Amazon also provides shorter versions of the commands. No, really - have a look! The long version of this command is:

$ ec2-describe-images -a

Too long for my taste, but it’s nice to know it’s there.

So, rather than start from scratch, I grabbed a base image that was close enough for my needs, and customized it. It’s a 5.1 base image, pretty well stripped of things that I don’t need, and a few that I do, but that’s ok. I’d rather start with less than more.

So step one is to launch an instance of the AMI I’ve chosen to be my ‘base’. Simple enough to do:

$ ec2run ami-0459bc6d -k ec2-keypair

And that’s pretty much it. It takes a couple of minutes (literally) for the machine to actually become available. You can check to see if it’s still in “pending” state or if it’s available by running ‘ec2din’. Without arguments, that’ll show you the status of any instances you have pending or runnning. Once the instance is running, you’ll be able to glean the hostname from the information provided.

An important note at this point: Don’t confuse “image” with “instance”. For the OO types in the crowd, an “image” is an object. It does nothing by itself until you instantiate it and create an “instance” of that object. For sysadmins, the “image” is like a PXE boot image, which does nothing until you boot it, thereby creating an “instance”.

The reason I used “PXE” and “object” in the above is because of the implication it makes: you can launch as many instances of an object as you want from a single object definition. You can boot as many machines as you want from a single PXE boot image. Likewise, you can launch as many Amazon EC2 instances from an image as you want.

So, in the time it took you to read those last two paragraphs, your instance is probably running. I now grab the hostname for my instance, and ssh to it using my keypair:

$ ssh -i ec2-keypair root@<hostname>

Now that I’m in, I can customize the environment, and then “bundle” it, which will create a new AMI with all of my customizations. With the instance in question, I installed a LAMP stack, and a few other sundry tools I need to perform my testing. I also ran “yum -y upgrade” which will go off and upgrade the machine to CentOS 5.2.

One thing I want to do with this instance is test out the process for creating an EBS volume. The two pieces of information I need to do this are the size of the volume I want to create, and the “zone” I want to create it in. You can figure out which zone your instance is running in using ‘ec2din’ on your workstation (not in your instance). I took that information and created my image in the same zone using the ‘ec2addvol’ command. If you don’t have that command on your workstation, then you don’t have the latest version of the Amazon command line tools. Here’s the command I ran:

$ ec2addvol -z us-east-1b -s 25

To see how it went, run ‘ec2dvol’ by itself and it’ll show you the status of all of your volumes, as well as the unique name assigned to your volume, which you’ll need in order to attach the volume to your instance. To do the ‘attachment’, you need the name of the volume, the name of the instance (use ‘ec2din’), and you need to choose a device that you’ll tell your instance to mount. Here’s what I ran (on my workstation):

$ ec2attvol -d /dev/sdx -i i-xxxxxxxx -v vol-xxxxxxxx

Now you can go back to the shell on your instance, mount the device, create a file system, create a mount point, add it to fstab, and, as they say in the UK, “Bob’s yer uncle”. By the time I wrote this post, I had already shut down my instance, but here are the commands (caveat emptor: this is from memory):

# mkfs.ext3 /dev/sdx
# mkdir /vol
# mount /dev/sdx /vol

If that all works ok, you can add a line to /etc/fstab so that it’ll be mounted at boot time, but I haven’t yet figured out how to attach a volume to an instance at boot time. The mount doesn’t work if you don’t attach the volume to the instance first. You’ll get a “device doesn’t exist” error if you try it. Clues hereby solicited. I assume I could probably use ‘boto’ and some Python code to get this done, but doing the same with a shell script wrapper around the Amazon tools might also be possible — but I don’t know how reliable that would be, because you’re at the mercy of Amazon and how they decide their tools should present the data (and *if* they provide the data you need for a particular operation down the road).

So now I have an EBS volume, and an instance. The volume is attached to the instance, and I can do things with it. I’m testing some database stuff, so I copied a database over to the volume, which was now mounted, so I could just ’scp mydb.tbz root@<instance>:/vol/.’

Once my database is there, I can attach it to pretty much whatever I want, which makes it nice, because I can test the same database, and the same database code, and see how the different size Amazon instances affect the performance, which gives me more performance data to work with. For production purposes, I’ll have to look more closely at the IO metrics, play with attaching multiple volumes and spreading out the IO, and I also want to test the ’snapshot’ capabilities. It’s also nice to know that if I needed to launch this in production (there are no plans to do so, but you never know), I could upgrade the database “hardware” more or less instantly :-D

If anyone has code or tools to help automate the management of all of this stuff, please send links! If I come up with any myself, I’ll most likely post it here.

Now that I have a customized AMI with all of my packages installed and my config changes made, I need to bundle this so that I can boot as many instances of this particular configuration as I want. An important note about bundling this *particular* image is that you MUST run ‘depmod -a; modprobe loop’ before bundling, since this process basically abstracts the manual process of bundling an image, which involves mounting a file as a volume, which requires a loopback mount.

The video I used to do the bundling is here, and if you can live through the disgustingly bad burps and chirps in the (Flash version) audio, it’s an excellent tutorial for bundling custom AMIs. While the process *is* pretty straightforward, it involves a number of steps, and the video goes through all of them, and it worked perfectly the first time through.

PyWorks Conference Schedule Posted

Wednesday, August 27th, 2008

Hi all,

The schedule for PyWorks has been posted! I’m really excited about three things:

1) there are some really cool talks that I’m looking forward to attending. There are a couple of sysadmin-related talks, AppEngine, TurboGears, Django, and an area I’ve been especially slow to move into: testing (I know, shame on me). There’s lots more so be sure to check it out.

2) the conference scheduling process is over ;-)

3) I get to meet a lot of people face-to-face that I’ve worked with in the past on Python Magazine developing articles, or interacted with on IRC, etc. One thing I like about conferences surrounding open source technologies is you get to thank people face-to-face for the sweat they poured into some of the tools I use regularly. Mark Ramm, Kevin Dangoor, Michael Foord, Brandon Rhodes, and a collection of Python Magazine authors will be speaking there, and other Python Magazine folks and generally familiar faces will be in attendance.

Enjoy!

For those still unaware, PyWorks will be held in Atlanta, Nov. 12-14, 2008. It’s sponsored by MTA, the publisher of Python Magazine, as well as php|architect. In fact, the php|works conference will be held simultaneously with PyWorks, and attendees of one are free to access talks in the other at will. There will also be a “center track” that will cover some more generic topics of interest to developers without regard to the language in use. Check it out!

Why I Don’t Write Book Reviews

Friday, August 8th, 2008

I have a lot of interaction with publishing types. I write a lot, and I edit some, and I do tech reviews and stuff for some publishers, and I co-authored a book, and I’ve worked on two magazines, and a newspaper, and I’m generally fascinated by the technical book market and stuff like that. I’m also someone who is lucky enough that his job is also his hobby. I work in technology, and am always doing something technology related at home in my spare time. Needless to say, I read tons upon tons of technical books.

I almost never post book reviews, in spite of the fact that I read all of these books. Why? Well, to be honest, I couldn’t tell you. It just hasn’t occurred to me to write a book review. Could be because I don’t really value book reviews too much myself I guess. I mean, if there’s a really obvious consensus across a huge number of reviews, I might be swayed. But in general, I find that book reviews are too often the target of astroturfing campaigns.

If there’s a tech book you’d like a review of that deals with things I’m generally into, let me know and I’ll post a review, if I’ve read it (or want to read it). Here are subjects I’m likely to have read books about in the past couple of years:

  • Linux, UNIX, and administration thereof
  • Python (all levels — I just read pretty much whatever is out there)
  • web 2.0 APIs (mostly Google and Amazon)
  • Any book about any service that can be run in a *x environment (DNS, Apache, DHCP, Jabber, and most other things that open a port)
  • Anything related to generic SQL, database design, or (more specifically) mysql and postgresql.
  • HPC (cluster computing)
  • Generic programming, software, computer science, or high-level systems design books
  • Digital photography (I have a Canon Digital Rebel, if that helps — I do *not* use Photoshop)
  • PHP
  • Maybe some other stuff I’m forgetting

Design Patterns in System Administration

Sunday, August 3rd, 2008

Most readers of my blog know that I consult, in addition to usually having a day job. I started my career working for a consulting firm, and couldn’t let go of the endless fascinating problems that exist in the “technological landscape”, and in addition, the seemingly endless numbers of ways to solve them. I’ve learned more than tons about how people, and institutions, approach technical problems in system design, and maybe more importantly, how they think about the problems and solutions.

I’ve worked in huge enterprises (several Fortune 100 companies), academia (cs.princeton.edu, to be more exact), government (gfdl.noaa.gov, for example), and a few startups and small businesses. I also grew up around small business around the time that technology was starting to become affordable enough to creep into even small offices (I helped run wire for my father’s first modem-connected office network around 1988 or so, and my mother’s office — admittedly much larger — had a mainframe and a few terminals, from which ascii posters of JFK and MLK were printed and hung on my walls when I was as young as 7 or 8). Observing and working with people to solve technical problems continues to bring me a lot of joy, and present plenty of challenges.

Over the years, I’ve done a decent bit of what I’ll loosely call “programming”. 10 years ago I might not have qualified that, but working for 6 years in support of graduate computer science research has a way of humbling a guy (and, really, for most grad students, actually *doing* 6 years of graduate research is probably just as humbling, if not more). One thing I’ve tried to do is keep up with trends in how programs are deployed, how the teams of workers in what are considered separate problem domains interact to get the applications to be useful to people, how the systems are organized, and how programs are designed, and finally, how to program…. um… “better” (for some undefined but surely long-winded definition of that term). As I’m starting to witness something of a convergence of programming and systems work (at least in my neck of the woods), programming is something I’m spending even more time doing, and learning.

Design patterns, whether in the context of extreme programming, agile methodologies, or whatever the project management philosophy is, appear to be extremely useful, but I’ve wondered why there doesn’t appear to be any movement in the system administration community toward defining some patterns for solving problems in the realm of systems infrastructure architecture. A few years back I stumbled upon infrastructures.org, which I think is an excellent general methodology for building infrastructures, but I think a fuller treatment of the topic could be had. Preferably one that addresses a broader set of problems prevalent in a wider variety of environments. For example, I found the tools and methodologies there to map perfectly in government and academic environments, and portions of that work can be mapped onto small business problems, but it leaves enterprise environments, and some larger government environments with some unanswered questions or unaddressed problems.

I don’t blame the folks at infrastructures.org — on the contrary, I applaud their work! But why has it been so difficult to find solutions to problems those nice folks just didn’t have, or didn’t have to focus on in their part of the organization?

So much of what we do is tribal knowledge, or knowledge earned “the hard way” — in the trenches, at 4am, on a Sunday, uphill, both ways… etc., but while many of these stories sound similar enough to discern a pattern, and while horror stories at conferences are universally met with “me too”, and “you should’ve done x, y and z, and it wouldn’t have been an issue”, I have yet to see these patterns codified in any meaningful way in a single work, or perhaps, an organized volume of works (no, mailing lists do *not* constitute an organized volume of works).

If something as complex and diverse as programming can have patterns applied to it, I have to believe that the same could hold true for building systems. If there were such a work, it could potentially serve as a de facto “best practices” reference — one that could be referred to by both technicians and higher-level decision makers, define a common language that both could understand, and help overcome some of the inevitable “people issues” that sysadmins (and, indeed, managers) often blame for a lack of forward movement.

Does such a work exist? Is this in the works now? Though I try to keep my finger on the pulse of the publishing market, I have yet to see any real commitment to the idea that a large swath of problems in systems can be solved using variants of pre-defined patterns. It’s not that we’re not using them, of course, and it’s not that there aren’t large numbers of us who could probably recite them off the top of our heads, but if you’re one of those people, you’re a “senior” system administrator (or better), and if that’s the case, imagine what your career might’ve been like if you had such a reference, and also, let me know what the “you” with 1 year of sysadmin experience would’ve loved to have, or what the “you” of today would love to see the junior folks reading.

Is it time to switch VM apps on OS X?

Tuesday, July 29th, 2008

Thought I’d throw this out to the lazyweb for advice. I’ve been running OS X as my primary desktop OS since I got my macbook pro around May 2006 or so. I installed Parallels as soon as I found it, and it was great. I still have it installed. However, I’ve found two issues with Parallels in the past two years:

1. There’s no kind of easy upgrade path for existing parallels users that I can find. Their web site is a mess, (like most software vendors, oddly) and there’s no upgrade license that I could find.

2. It’s non-trivial to install of versions of linux that are released after the version of parallels you’re trying to install to, for whatever reason.

I remember having issues with installing an Ubuntu version, and a CentOS version, both newer than the parallels I had installed. Then I upgraded parallels, and they installed fine.

So today I tried to do the impossible: install CentOS 5.2 on a build of Parallels from March 2007. I get a kernel panic, which I remembered a hack for (add “linux kernel agp=off” to the boot: line), but even that didn’t have any effect. Guess I just pushed the limit too far this time.

So, since there’s no upgrade license to make it cheaper for me to stay with Parallels, I have no particular reason to stay loyal to them. However, I cannot find a single mention of Linux on VMWare Fusion’s web site (?!), so….

What’s everyone doing about virtualization on the OS X (Leopard) Desktop?

The promise of Drizzle

Sunday, July 27th, 2008

I got to actually speak to Brian Aker for maybe a total of 5 minutes after his micro-presentation about Drizzle, which took place at the Sun booth at OSCON 2008. I was a bit nervous to ask what questions I had out loud, because the things I had wondered about were things I really didn’t see too much discussion about out in the intarweb. I’m happy to report that, if Brian Aker is to be considered any kind of authority (hint: he is), my ideas are not completely ridiculous, so maybe I’ll start talking a bit more about them.

UPDATE: lest anyone get the wrong idea, Brian Aker did, in fact, state that views are not on the short list of priority items for Drizzle, but he did say that views are one of the features he finds most useful, and that they’d probably be higher on any future priority list than, say, stored procedures. So, take my notes below about views with a grain of salt. It’s not necessarily “coming”.

My three ideas were these:

  1. Materialized views: my experience with views in MySQL is that they just plain old don’t scale well compared to other database systems I’ve used. I used Sybase in 2000 and views scaled better for me then than MySQL views do now, and I’m using them in mostly the same way (which is to say that I’m not using them to do evil things - I’m using them in the way most of the database community agrees they should be used). In the past, I thought materialized views were “nice to have”, but now that I’m working with much larger data sets, without a need for my reporting to be 100% real-time, materialized views would be great. To be honest, I win either way with Drizzle in all likelihood, as Brian Aker has proclaimed that views in Drizzle will not look like views in MySQL. He confirmed that materialized views would be a great thing to have a closer look at, and I was happy with that reply.
  2. “Query Fragments”: I didn’t know they were called query fragments. What I explained to Brian Aker was that I wanted to harvest subsets of cached result sets. So, for example, if I do a date range query (I do that a lot), and the result set is cached, and then my app does another query which is identical save for that the date range is a subset of the one in the cached result set, I’d like to grab that data from the cache. My actual question to Brian was “can this be built in such a way that this would be a reliable, trustworthy result coming from some middle tier component?” And that’s when he told me about query fragments. He said that this idea was not at all crazy, and was also worthy of further discussion.
  3. Vertical data-padding. Well… that’s what I’m calling it. Here’s the logic: I have lots of temporal data. Tons of it. My queries are largely things like “for this user, show me all of the foo’s bar’d — group by day — where day is between these two days”. MySQL is good at these kinds of queries (assuming you index well and, basically, that you’ve read “High Performance MySQL” a couple of times… per edition), but there’s something missing. When I get the result set back, any of the days for which no foo’s were bar’d, I don’t get a record. This is perfectly within the realm of reasonable behavior, but I’ve never known MySQL to let things like being reasonable get in the way of helping their users, so my suggestion was that MySQL, in order to do the comparisons for the “WHERE” clause, *MUST* know what dates fall within the dates in, say, a “BETWEEN” statement. It would then seem logical to have some way to tell MySQL “If one of those dates has no value, return a NULL” (or 0, or an empty string, or something). I don’t know the real name for this proposed feature, but I call it “vertical data-padding” because you’re padding columnar data, which, in my mind, is visualized vertically. Just like when you do a “GROUP CONCAT” or something, I would refer to that as horizontal data padding. I explained to Brian that one way I’ve seen this handled is to have a lookup table of static dates that gets joined to the main data table. You do a left outer join with the date lookup table on the left, and you get back a row for every date whether there’s data in the right-hand table or not. This works when ‘n’ is small (like everything else), but it’s hurrendous when you have, say, 10 million rows to deal with. Then you’re in what I call the “No Bueno Zone”. Brian seemed interested in the problem, and I’ll be discussing that with him further when his life settles down a bit (he’s been at OSCON, and he’s still settling in at Sun).

I want to thank Brian Aker for his enthusiastic attitude toward helping others, and for all of the work he pours into all of this stuff. I also want to say that, for me, Drizzle is really exciting, not so much because the feature set is more or less cherry picked to map onto what I do for a living, but also because it represents an opportunity to get ideas in the door before a lot of legacy cruft makes it impossible to implement these somewhat idealistic features without rocking the boat for the millions of users already launching it into production.