Hadoop introductory video

This is a great talk with the head of Yahoo’s grid team about Hadoop, an open source distributed file system and MapReduce implementation. The video is long and interspersed with Yahoo! specifics you might not care about, but keep watching: they swing back to the tech, including how you can write MapReduce programs in whatever language you want, and how you can do actual Hadoop programming using Python.
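To give a feel for what "MapReduce in Python" looks like, here is a minimal word-count sketch in the style of a streaming job, where the mapper and reducer exchange tab-separated text records. It is not taken from the talk. The function names (`mapper`, `reducer`, `run_locally`) are made up for illustration, and on a real cluster the two halves would be separate scripts reading standard input and printing to standard output.

```python
import itertools


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; records must already be sorted by key."""
    pairs = (record.split("\t") for record in sorted_records)
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in one process (hypothetical helper)."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_locally(["the quick fox", "The lazy dog the end"]):
        print(record)
```

The sort between the two phases is what lets the reducer handle one key at a time without holding everything in memory. In a real job the framework does that sort for you.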

I’m excited about Hadoop, and I have tons of data to work with, but don’t currently have the cycles to devote to testing and playing with it.  I hope to be able to at least get something up and running with it soon!

Python Magazine on IRC

It occurs to me that lots of people have no idea that Python Magazine has an IRC channel!

Since I know a lot of our authors, and even more of our readers, actually use IRC, we’ve created a channel called #pymag on irc.freenode.net (FreeNode in most clients). The goal of the channel is to be a place where readers, authors, and the PyMag editorial staff can all interact about the articles that appear in Python Magazine. It’s also a place to trade ideas for new articles, meet other people who read or are involved in the magazine, and talk shop.

Note that questions about your subscription or delivery of the print edition, or anything else not directly related to the content in the magazine should still be sent to info ~at~ pythonmagazine ~dot~ com.

Hope to see you there!

Dear VirtueDesktops Guy

Note: if you’re a Leopard user wishing Apple’s Spaces was never invented, and you’d pay to have VirtueDesktops back, please leave a ‘me too’ post in the comments!

UPDATE: See this link to make Spaces not *completely* useless: http://www.macosxhints.com/article.php?story=2008021122525348

Please pick development back up for VirtueDesktops. Apple’s Spaces not only falls short, it totally sucks. I would be willing to pay $20 for VirtueDesktops on Leopard right now, and I know a whole bunch of other people would, too. Everyone I know who used VirtueDesktops in Tiger and is now stuck with Spaces says they hate it. I have pretty much stopped using any kind of desktops at all because Spaces is just frickin’ unusable.

I’m not even requesting feature enhancements. If it crashes or spaces out, or locks up my machine, ok – I’d like that fixed, but feature-wise, VirtueDesktops is great as-is, and kicks Spaces’ buttocks. With Spaces, I can’t label my desktops, and Spaces wants to flip back and forth between desktops every time I switch application focus because it doesn’t understand why I’d ever open Firefox on two different desktops. So, whereas in VirtueDesktops I could be on a random, empty desktop, click the Firefox icon, and get the top menu to change to Firefox, doing the same thing in Spaces brings me to whatever desktop Firefox is open on. From there, I can launch a new Firefox window, then go to the stupid desktop viewer and drag the new window from one space to another.

C’mon man, this is no way to be productive. Help us out.

I know, I know, I can start VirtueDesktops using sudo or chmod the binary, and probably come up with more hacks as Apple changes more stuff out from under me, but it’d be way nicer to have VirtueDesktops back in a state where it (for the most part) “Just Works(tm)”. Also, not everyone knows what ‘chmod’ is :-)

If it’s not worth your time, then take away the donation thing and replace it with a charge-per-download thing for $20.
Please?

Blogged with Flock

Do You, Um…. Brew?

A long time ago, a buddy of mine got me into brewing beer. I’m not talking about buying a can o’ syrup and adding water – I’m talking about buying grain, taking it home, milling it, mashing it, hopping it, fermenting it, bottling and kegging it…. *really* brewing beer.

Anyway, it turns out that a good number of the members of the local brew club in my area also work in IT. Since lots of other folks who visit here also work in IT, I figured I’d give a shout out to anyone who also brews, and point them at two things:

1. Some time ago, I started an IRC channel for home brewers. The channel is #homebrew on irc.freenode.net – please join us!
2. I’ve been the primary poster on the blog my buddy and I started to post our recipes and notes and stuff. You might find some useful stuff there http://www.bamfbeer.com


AppLogic Cheat Sheet

I’ve been using AppLogic for exactly one month today, and I’ve learned a whole heckuva lot about what it takes to build an infrastructure using the AppLogic grid operating system. One of the very first things I learned is that there is just a TON of documentation, but a very large portion of it is really heady, high-level theoretical stuff. The real practical, nuts-n-bolts, ‘click-here-type-this’, how-to-style knowledge can be had from there, but it’s kinda lost in all the nebulous conceptual stuff. I got fantastic help from 3tera in their forums, as well as from TGL, which is our actual grid host, so I wanted to take a moment and write down some tips here for my own reference, and to help anyone else who is tired of searching forum posts for this information.

1. Creating placeholder volumes, at time of writing, isn’t documented, and is a really important operation you’re going to need to know how to do. The point of confusion is that when you use a catalog appliance, you can see in the “Volumes” tab of the class editor that there’s a volume mounted to /dev/hdaN, and maybe some other volumes defined by default. This sets the expectation that if *you* create a volume, map it to /dev/hdaN, and then map a User Volume to that, then when the appliance starts up again, everything will magically fall into place. This is not how AppLogic works. You *still* need to start the application, log into the appliance, create the mount point, and edit /etc/fstab to get the mount to actually happen. You can test if it works without restarting your application by running ‘mount -a’ after you edit /etc/fstab. If that doesn’t work, then you might’ve forgotten to save the application after you created the volume. Save and restart, and try again.

2. Getting shell access to the controller is one of the first things you’ll want to do after finding your way around. The controller gives you access to all applications, all components of those applications, and all volumes used by those components. It’s quite powerful. You can get a shell on the controller from the dashboard just after you log in, but it’s kinda slow and clunky (“java-like” if you will), so you’ll want to use your own terminal application. In order to access the controller, you *must* use an ssh key. You can’t log in with a password at time of writing (and I prefer it that way). Generate an ssh key, and either send your public key to a support person and tell them to put it in place, or you can try using the “Shell Login” button on the dashboard and running this command in the resulting window:

user set <username@domain.com> sshkey="ssh-dss AAAAB3NzaC...afdk5lqEGOfJJnM+L4="

Once you do that, you can log in as root@ip.of.controller.host from your local terminal application. If you’re using openssh, you can create a ~/.ssh/config file and put in the following so that running “ssh controller” will take you to the right place automagically:


Host controller
    HostName ip.of.controller.host
    IdentityFile ~/.ssh/id_rsa
    User root

3. If you’ve set up SSH access to the controller, you can save yourself the trouble of setting up a development box or similar that acts as the only machine with write access to your web content by uploading content directly through the controller. Here’s the general syntax:

scp myfile root@controller:/app/<appname>/<component>/mnt/data/.

That will copy ‘myfile’ from the current working directory on your local machine to /mnt/data on the component named in <component>, in the app named in <appname>.

Since the controller has write access to everything, all you need to know is the upload path, and your web developers can use sftp to manage the site’s files. I haven’t set up my developers yet – but I’m using this sftp/scp access to volumes mounted on *running* appliances already myself during the infrastructure building and testing phase. The ability to do this through the controller has saved me a lot of headaches, as well as some disk, network, cpu, and memory resources. It also keeps you from trying what I tried: creating a placeholder volume mounted read-only by the web servers, but read-write by a dev server. It is *documented* that this will corrupt the volume. The controller would appear to be an exception to that rule.

4. If you’re using the bash shell, you might find these client side macros for bash useful!

5. In setting up MySQL, you typically do a ‘GRANT <somerights> ON <somedb>.* TO <someuser>@<somehost> IDENTIFIED BY…’ to give appropriate rights to your web application. However, you’ll find an interesting tidbit at this point: your database server cannot resolve your web server, and can’t connect to it at all except in response to calls the web server makes over the terminal you’ve defined on it for the sole purpose of talking to the database server. This terminal, in all default template applications at least (and it’s recommended, I would imagine), is defined to only support calls to a mysql database, and can only connect to the database server. So, assuming it’s the only thing connected to the database server, you *can* use '<someuser>'@'%' in your GRANT statement. In MySQL, ‘%’ is a wildcard meaning ‘any’. If you need to give different rights to connections coming in from different hosts to the same server and database on that server, I guess the expectation is that you do that by defining different users, which seems reasonable.

6. If your app fails, the first thing you should do is run ‘log list n=30’, which spits out the last 30 lines of the controller’s log. If a particular component causes your whole app to fail to start, you can stop the app, make changes to that component, mark the component as ‘standby’, and bring up the app (which will now succeed because the troubled component isn’t in the startup list). You can then use ‘component <name> start’ and ‘component <name> stop’ to see if it’s working properly. Also, try using ‘app start --debug’! Unfortunately, changes to the appliance in the GUI editor require an application restart to take effect. I hear they’re working on that :-/

7. Stuff I am still working on, which isn’t documented but which I know is possible: creating assemblies (in other words, creating a single component from multiple logical components), and linking *applications* together as if they’re components. The second one would be awesome to be able to do: if you have, say, 4 web servers, you can group them into two applications, separate from everything in the main application, and if you need to make changes to the web servers that would require a whole-app restart, you can restart just the app containing those two web servers. I’m working on figuring both of these out, but since it’s undocumented, I’m not sure whether it’s safe for production use. More learning to be done, as usual.

More tips? Put ‘em in the comments!!


Identifying Database Badness

I started my career on the database end of the technological landscape, as a consulting database reporting specialist, then later as a consulting DBA for Sybase. It was there that I discovered that the real fun was in system administration, but I still have a deep love for data organization, modeling, and management, even though I don’t get to do quite as much of it anymore.

I’m sure that some of my skills are rusty (my last DBA consulting gig was almost 10 years ago), but not so rusty that I don’t know true database badness when I see it. I’ve seen tons of it over the years in open source applications put together by developers who a) are not database people, and b) don’t realize they are not database people, or worse, c) think they are in fact database people, and so refuse to consult an *actual* database person.

If you’re not sure if your database exhibits signs of badness, I’ve put together some pointers that I hope will help you down the road to database goodness. Data modeling and data handling in general is an area that is heavily debated, so I expect the flames. All I ask is that you reflect for a moment on the comment I’m making in the context of the current general state of data modeling in the open source universe before you decide to burn me at the stake.

‘name’ and ‘value’ columns

I’ve seen this in more software than I’d like to admit, and not all of it open source.  It tends to crop up in places where there is an arbitrarily large number of attributes that each have a ‘name’ and a ‘value’. This seems like a great way to go at first, because instead of figuring out which attributes are assigned to a given object, you can just select ‘name’ and ‘value’ by some id that is presumably stored in the same table. Unfortunately, this is horribly bad form for a number of reasons:

  1. It doesn’t follow any normalization rules. Each column becomes functionally dependent on all of the others.
  2. You shatter the idea that tables represent entities, columns represent attributes, and keys represent instances of those entities.
  3. You lose the ability to do even simple joins, because what should be a column is now (maybe) a row.

That last item is the one that is likely to bite you first (and hardest). Suppose you have a user table like this:

id  username
1   jonesy
2   kermit

Instead of storing the user preferences in the user table, you store it in a table like this:

id  uid  name   value
1   1    email  jonesy@somedomain.com
2   1    bldg   COS
3   1    room   101
4   2    email  kermit@somedomain.com
5   2    bldg   COS

Great! Now, everyone in your user table has a desktop computer that’s been assigned to them. Like a good little DBA, you map the user id to the asset tag id for the machine, like so:

id  uid  tag
1   1    00999
2   2    00839

Ok! Now, write me a query that will give me the user name and email address for each tag.

Ewwwwwww.

If you’re a coder, you can probably do a couple of selects into various arrays and loop over them or do some kind of merge with a callback function or something to figure out which names and emails go with each tag, but how efficient is that? Supposing your open source project is installed at some huge company that has lots of people who have lots of desktops. Tens of thousands, even. This would not do at all!
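For what it's worth, the query can be pushed into SQL, but only by joining the name/value table once per attribute you want back. Here's a rough sketch using Python's built-in sqlite3 module and the sample rows from above. The table names (`users`, `prefs`, `tags`) are assumptions, since the tables aren't named in the text, and I've asked for the building and room too, just to show how the joins pile up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE prefs (id INTEGER PRIMARY KEY, uid INTEGER, name TEXT, value TEXT);
    CREATE TABLE tags  (id INTEGER PRIMARY KEY, uid INTEGER, tag TEXT);
    INSERT INTO users VALUES (1, 'jonesy'), (2, 'kermit');
    INSERT INTO prefs VALUES
        (1, 1, 'email', 'jonesy@somedomain.com'),
        (2, 1, 'bldg',  'COS'),
        (3, 1, 'room',  '101'),
        (4, 2, 'email', 'kermit@somedomain.com'),
        (5, 2, 'bldg',  'COS');
    INSERT INTO tags VALUES (1, 1, '00999'), (2, 2, '00839');
""")

# Every attribute needs its own aliased join against prefs,
# and every join has to spell the attribute name exactly right.
eav_rows = conn.execute("""
    SELECT t.tag, u.username, e.value AS email, b.value AS bldg, r.value AS room
    FROM tags t
    JOIN users u      ON u.id = t.uid
    LEFT JOIN prefs e ON e.uid = u.id AND e.name = 'email'
    LEFT JOIN prefs b ON b.uid = u.id AND b.name = 'bldg'
    LEFT JOIN prefs r ON r.uid = u.id AND r.name = 'room'
    ORDER BY t.tag
""").fetchall()

for row in eav_rows:
    print(row)
```

Three attributes means three extra joins against the same table, and nothing stops a user from having two 'email' rows, in which case the result quietly grows duplicate lines.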

Also, this is no way to ensure data integrity, because your ‘value’ column is forced to be some kind of VARCHAR or TEXT field, since you have no idea what type of data is going in there. It will differ depending on the ‘name’ column. Of course, the front end application code has to account for this. While they’re at it, the front end coders will also have to check the spelling of every single potential value that can possibly go into the ‘name’ column to make sure there aren’t separate ‘email’ and ‘EMAIL’ values. And what if you decide that a user should only ever be assigned to a single building? Front end coders will have to validate for that, too, instead of using a built in UNIQUE constraint. This is just painful all over.
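A small sketch of the integrity point, again with sqlite3. The `user_location` table and its columns are hypothetical, made up only to contrast with the name/value table. The idea is that with real columns, the database itself can refuse a second building for the same user or a nonsense room number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- name/value design: everything is TEXT, nothing is checked
    CREATE TABLE prefs (id INTEGER PRIMARY KEY, uid INTEGER, name TEXT, value TEXT);
    -- column design (hypothetical): one row per user, typed and constrained
    CREATE TABLE user_location (
        uid  INTEGER PRIMARY KEY,
        bldg TEXT    NOT NULL,
        room INTEGER CHECK (room > 0)
    );
""")

# The name/value table accepts a second building and a misspelled attribute.
conn.execute("INSERT INTO prefs (uid, name, value) VALUES (1, 'bldg', 'COS')")
conn.execute("INSERT INTO prefs (uid, name, value) VALUES (1, 'bldg', 'ENG')")
conn.execute("INSERT INTO prefs (uid, name, value) VALUES (1, 'EMAIL', 'not an email')")
eav_count = conn.execute("SELECT COUNT(*) FROM prefs").fetchone()[0]

# The column design rejects the same kinds of mistakes at insert time.
conn.execute("INSERT INTO user_location VALUES (1, 'COS', 101)")
rejected = []
for row in [(1, "ENG", 102), (2, "COS", -5)]:
    try:
        conn.execute("INSERT INTO user_location VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)

print(eav_count, rejected)
```

The first rejected row trips the primary key (one location per user), the second trips the CHECK constraint. None of that required a line of front end validation.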

The right way to put a table together is to figure out what entity the table is about. A ‘user’ table is a table about users. Therefore, the primary key should be the primary identifier for ‘user’. The rest of the columns in the table are attributes of ‘user’. In the event that a user can have more than one value for a given attribute (for example, I might have more than one ‘email’ value), make a separate table that maps user id’s to email addresses.
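As a sketch of what that looks like for the sample data, here is a `users` table with the attributes as real columns, plus the asset tag table, in sqlite3. The column layout and the `tags` table name are assumptions for illustration. With this shape, the "user name and email for each tag" question becomes a single ordinary join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY,      -- handle handed out by the database
        username TEXT NOT NULL UNIQUE,     -- the thing that really identifies a user
        email    TEXT,
        bldg     TEXT,
        room     INTEGER
    );
    CREATE TABLE tags (
        id  INTEGER PRIMARY KEY,
        uid INTEGER NOT NULL REFERENCES users(id),
        tag TEXT NOT NULL UNIQUE
    );
    INSERT INTO users VALUES
        (1, 'jonesy', 'jonesy@somedomain.com', 'COS', 101),
        (2, 'kermit', 'kermit@somedomain.com', 'COS', NULL);
    INSERT INTO tags VALUES (1, 1, '00999'), (2, 2, '00839');
""")

rows = conn.execute("""
    SELECT t.tag, u.username, u.email
    FROM tags t
    JOIN users u ON u.id = t.uid
    ORDER BY t.tag
""").fetchall()

for row in rows:
    print(row)
```

One join, no attribute names buried in string literals, and the room number is an actual integer.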

NOTE:
In a lot of cases, the unique identifier is a serial ‘id’ column. There are pitfalls with that as well, but just know that an autogenerated id field handed to you by the database should not be the only way of uniquely identifying a row in your table. You should think of the id field more as a handle to a row that you’ve uniquely identified using some other means whenever possible.

One column, multiple values

If you’re doing this, it’s going to cause you and your project big headaches later. Putting multiple values in a single column takes away the database’s ability to validate input, enforce various types of constraints, including UNIQUE and CHECK constraints, and eliminates the possibility of performing joins on that column. All of this means more front end code. In addition, it causes you to write more code on the front end to parse the values coming back from the database!
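Here's a quick illustration, assuming a hypothetical `users_packed` table that stores a comma-separated list of email addresses in one column (the users and addresses are invented). About all SQL can do with that column is a substring match, which gives false positives, so the exact answer ends up being computed in application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_packed (id INTEGER PRIMARY KEY, username TEXT, emails TEXT);
    INSERT INTO users_packed VALUES
        (1, 'ann',   'ann@somedomain.com,ann@other.org'),
        (2, 'joann', 'joann@somedomain.com');
""")

wanted = "ann@somedomain.com"

# Substring search also matches 'joann@somedomain.com'.
like_hits = [
    username
    for (username,) in conn.execute(
        "SELECT username FROM users_packed WHERE emails LIKE ? ORDER BY id",
        (f"%{wanted}%",),
    )
]

# An exact answer means fetching every row and parsing it on the front end.
exact_hits = [
    username
    for username, emails in conn.execute(
        "SELECT username, emails FROM users_packed ORDER BY id"
    )
    if wanted in emails.split(",")
]

print(like_hits, exact_hits)
```

The parsing version works, but it reads the whole table to answer a question an indexed column could answer directly.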

Multiple columns, one attribute

This is almost the opposite of the above: a single table with columns called, for example, “email1” and “email2”. This is problematic because, of course, you may decide at some point that having three email addresses is just fine. In that case, you’ll have to add another column to the table, get the data in there for the people you have data for, and fill in NULLs or empty strings for the rest. It’s ugly.

Like many other data design problems, this is another one that limits your ability to perform JOIN operations to exploit relationships between various data entities. Think about our users again, and how we wanted to find user names and email addresses for each asset tag. Which email should you use? In this scenario, you’ll either do a JOIN involving only one of the email address fields, or you’ll do a SELECT for each ‘emailN’ column you have, and it only goes downhill from there, as we’re back to scalability issues and general ugliness and inelegance.

The right thing to do here is create an email lookup table that has columns for userid and email address. Now you have no more NULL data to worry about, because there are only rows in this table for users who have email addresses. Also, a side effect of this design move is that users can have an arbitrary number of email addresses, because all it means is adding another row to the table.

NOTE:
To further normalize, you could actually create an ‘email_address’ table that just maps email addresses to id’s, and then instead of having uid and email columns, you’d have uid and mailid columns. I’ve gone with the assumption here that, while a user can have any number of email addresses, each email address is generally mapped to a single user, unless your application is a mailing list manager ;-)
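A minimal sketch of the email lookup table in sqlite3, under the assumption stated in the note (each address belongs to exactly one user). The table name `user_email` and the third user, 'gonzo', who has no email address at all, are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL UNIQUE);
    CREATE TABLE user_email (
        uid   INTEGER NOT NULL REFERENCES users(id),
        email TEXT    NOT NULL UNIQUE,     -- one address maps to one user
        PRIMARY KEY (uid, email)
    );
    CREATE TABLE tags (
        id  INTEGER PRIMARY KEY,
        uid INTEGER NOT NULL REFERENCES users(id),
        tag TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'jonesy'), (2, 'kermit'), (3, 'gonzo');
    INSERT INTO user_email VALUES
        (1, 'jonesy@somedomain.com'),
        (1, 'jonesy@other.org'),
        (2, 'kermit@somedomain.com');
    INSERT INTO tags VALUES (1, 1, '00999'), (2, 2, '00839'), (3, 3, '00700');
""")

# One row per (tag, address); users with no address still show up via LEFT JOIN.
rows = conn.execute("""
    SELECT t.tag, u.username, e.email
    FROM tags t
    JOIN users u           ON u.id = t.uid
    LEFT JOIN user_email e ON e.uid = u.id
    ORDER BY t.tag, e.email
""").fetchall()

for row in rows:
    print(row)
```

Adding a third address for jonesy is one more INSERT, not an ALTER TABLE, and gonzo takes up no space in `user_email` at all.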

Nothing is NULL

Using “NOT NULL” throughout the entire database causes as many problems as it solves, IMHO. NULL is, at least, something the code can check for consistently. Without a NULL, the coder is forced to know what the default value is for each column that you have made ‘NOT NULL’. It is not typically the case that every NOT NULL field in the db has a default value of '' (the empty string), because '' is not valid input for columns of non-string datatypes. Fields using date, time, numeric and float datatypes all have different, non-empty valid default values. Further, these valid default values may differ from db to db. I’m not saying that NULLs are a coder’s (or, indeed, a DBA’s) best friend, but they allow a level of consistency that’s hard to match. If a value isn’t there, “NULL” is a universally acceptable alternative regardless of the database or datatype in question.
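To make the consistency argument concrete, here's a small sqlite3 sketch comparing two hypothetical tables: one that allows NULLs and one that is NOT NULL everywhere with per-column defaults. The table names, columns, and default values are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE with_nulls (
        id INTEGER PRIMARY KEY, nickname TEXT, hired DATE, salary NUMERIC
    );
    CREATE TABLE with_defaults (
        id       INTEGER PRIMARY KEY,
        nickname TEXT    NOT NULL DEFAULT '',
        hired    DATE    NOT NULL DEFAULT '1970-01-01',
        salary   NUMERIC NOT NULL DEFAULT 0
    );
    INSERT INTO with_nulls (id) VALUES (1);
    INSERT INTO with_defaults (id) VALUES (1);
""")

null_row = conn.execute(
    "SELECT nickname, hired, salary FROM with_nulls").fetchone()
default_row = conn.execute(
    "SELECT nickname, hired, salary FROM with_defaults").fetchone()

# With NULLs, one test works for every column, whatever its type.
missing_with_nulls = [value is None for value in null_row]

# With NOT NULL everywhere, the code has to know each column's sentinel,
# and cannot tell a real salary of 0 from "no salary recorded".
sentinels = ("", "1970-01-01", 0)
missing_with_defaults = [
    value == sentinel for value, sentinel in zip(default_row, sentinels)
]

print(null_row, default_row)
```

The `sentinels` tuple is exactly the kind of per-column knowledge that ends up duplicated in front end code, and it changes if the schema's defaults ever do.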

One entity, multiple tables

This one can be a little less obvious. Let’s look at a generic “user” to illustrate. A great many applications implement security hierarchies based on the type of user. So, for example, there are admin users, guest users, and a couple of user types in between. My contention is that all of these admins and guests are still users, and therefore should all be put in a table called “users”, which is probably a much smaller table than you’re imagining, since it holds only that user data which all the various types of users have in common, which is often just “person” data like name and email address. Data that is specific to a role within the application would be kept in a lookup table, to allow for the possibility of a single user having multiple roles within the application. This problem has bitten seemingly every open source CMS in existence at one time or another.
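One way this could look, sketched in sqlite3: a single `users` table, a `roles` table, and a `user_roles` lookup table tying them together. The table names, the role names, and the `roles_for` helper are assumptions for the example, not anything a particular CMS actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE,
        email    TEXT
    );
    CREATE TABLE roles (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE user_roles (
        uid INTEGER NOT NULL REFERENCES users(id),
        rid INTEGER NOT NULL REFERENCES roles(id),
        PRIMARY KEY (uid, rid)             -- a user holds each role at most once
    );
    INSERT INTO users VALUES
        (1, 'jonesy', 'jonesy@somedomain.com'),
        (2, 'kermit', 'kermit@somedomain.com');
    INSERT INTO roles VALUES (1, 'admin'), (2, 'editor'), (3, 'guest');
    INSERT INTO user_roles VALUES (1, 1), (1, 2), (2, 3);
""")


def roles_for(username):
    """Return the role names held by one user (hypothetical helper)."""
    rows = conn.execute("""
        SELECT r.name
        FROM users u
        JOIN user_roles ur ON ur.uid = u.id
        JOIN roles r       ON r.id = ur.rid
        WHERE u.username = ?
        ORDER BY r.name
    """, (username,))
    return [name for (name,) in rows]


print(roles_for("jonesy"), roles_for("kermit"))
```

Promoting kermit to editor is a single INSERT into `user_roles`. Nobody has to be moved from a "guests" table to an "editors" table.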

Onward and Upward

I hope this has helped at least one person get their data in order. I’m also sure I’ve left out dozens of scenarios and common database badness that I’ve just forgotten about. Don’t even get me started on data validation at the app level – I purposely avoided app-level data handling mistakes and stuck to modeling problems because that alone is a big enough world to get lost in.

Enjoy!