Archive for the ‘Database’ Category

I have a lot of interaction with publishing types. I write a lot, and I edit some, and I do tech reviews and stuff for some publishers, and I co-authored a book, and I’ve worked on two magazines, and a newspaper, and I’m generally fascinated by the technical book market and stuff like that. I’m also someone who is lucky enough that his job is also his hobby. I work in technology, and am always doing something technology related at home in my spare time. Needless to say, I read tons upon tons of technical books.

I almost never post book reviews, in spite of the fact that I read all of these books. Why? Well, to be honest, I couldn’t tell you. It just hasn’t occurred to me to write a book review. Could be because I don’t really value book reviews too much myself I guess. I mean, if there’s a really obvious consensus across a huge number of reviews, I might be swayed. But in general, I find that book reviews are too often the target of astroturfing campaigns.

If there’s a tech book you’d like a review of that deals with things I’m generally into, let me know and I’ll post a review, if I’ve read it (or want to read it). Here are subjects I’m likely to have read books about in the past couple of years:

  • Linux, UNIX, and administration thereof
  • Python (all levels — I just read pretty much whatever is out there)
  • web 2.0 APIs (mostly Google and Amazon)
  • Any book about any service that can be run in a *x environment (DNS, Apache, DHCP, Jabber, and most other things that open a port)
  • Anything related to generic SQL, database design, or (more specifically) mysql and postgresql.
  • HPC (cluster computing)
  • Generic programming, software, computer science, or high-level systems design books
  • Digital photography (I have a Canon Digital Rebel, if that helps — I do *not* use Photoshop)
  • PHP
  • Maybe some other stuff I’m forgetting

I got to actually speak to Brian Aker for maybe a total of 5 minutes after his micro-presentation about Drizzle, which took place at the Sun booth at OSCON 2008. I was a bit nervous to ask what questions I had out loud, because the things I had wondered about were things I really didn’t see too much discussion about out in the intarweb. I’m happy to report that, if Brian Aker is to be considered any kind of authority (hint: he is), my ideas are not completely ridiculous, so maybe I’ll start talking a bit more about them.

UPDATE: lest anyone get the wrong idea, Brian Aker did, in fact, state that views are not on the short list of priority items for Drizzle, but he did say that views are one of the features he finds most useful, and that they’d probably be higher on any future priority list than, say, stored procedures. So, take my notes below about views with a grain of salt. It’s not necessarily “coming”.

My three ideas were these:

  1. Materialized views: my experience with views in MySQL is that they just plain old don’t scale well compared to other database systems I’ve used. I used Sybase in 2000 and views scaled better for me then than MySQL views do now, and I’m using them in mostly the same way (which is to say that I’m not using them to do evil things - I’m using them in the way most of the database community agrees they should be used). In the past, I thought materialized views were “nice to have”, but now that I’m working with much larger data sets, without a need for my reporting to be 100% real-time, materialized views would be great. To be honest, I win either way with Drizzle in all likelihood, as Brian Aker has proclaimed that views in Drizzle will not look like views in MySQL. He confirmed that materialized views would be a great thing to have a closer look at, and I was happy with that reply.
  2. “Query Fragments”: I didn’t know they were called query fragments. What I explained to Brian Aker was that I wanted to harvest subsets of cached result sets. So, for example, if I do a date range query (I do that a lot), and the result set is cached, and then my app does another query which is identical save for that the date range is a subset of the one in the cached result set, I’d like to grab that data from the cache. My actual question to Brian was “can this be built in such a way that this would be a reliable, trustworthy result coming from some middle tier component?” And that’s when he told me about query fragments. He said that this idea was not at all crazy, and was also worthy of further discussion.
  3. Vertical data-padding. Well… that’s what I’m calling it. Here’s the logic: I have lots of temporal data. Tons of it. My queries are largely things like “for this user, show me all of the foo’s bar’d — group by day — where day is between these two days”. MySQL is good at these kinds of queries (assuming you index well and, basically, that you’ve read “High Performance MySQL” a couple of times… per edition), but there’s something missing. When I get the result set back, any of the days for which no foo’s were bar’d, I don’t get a record. This is perfectly within the realm of reasonable behavior, but I’ve never known MySQL to let things like being reasonable get in the way of helping their users, so my suggestion was that MySQL, in order to do the comparisons for the “WHERE” clause, *MUST* know what dates fall within the dates in, say, a “BETWEEN” statement. It would then seem logical to have some way to tell MySQL “If one of those dates has no value, return a NULL” (or 0, or an empty string, or something). I don’t know the real name for this proposed feature, but I call it “vertical data-padding” because you’re padding columnar data, which, in my mind, is visualized vertically. Just like when you do a “GROUP CONCAT” or something, I would refer to that as horizontal data padding. I explained to Brian that one way I’ve seen this handled is to have a lookup table of static dates that gets joined to the main data table. You do a left outer join with the date lookup table on the left, and you get back a row for every date whether there’s data in the right-hand table or not. This works when ‘n’ is small (like everything else), but it’s hurrendous when you have, say, 10 million rows to deal with. Then you’re in what I call the “No Bueno Zone”. Brian seemed interested in the problem, and I’ll be discussing that with him further when his life settles down a bit (he’s been at OSCON, and he’s still settling in at Sun).

I want to thank Brian Aker for his enthusiastic attitude toward helping others, and for all of the work he pours into all of this stuff. I also want to say that, for me, Drizzle is really exciting, not so much because the feature set is more or less cherry picked to map onto what I do for a living, but also because it represents an opportunity to get ideas in the door before a lot of legacy cruft makes it impossible to implement these somewhat idealistic features without rocking the boat for the millions of users already launching it into production.

Launching a Startup in 3 Hours was a great talk given by Andrew Hyde (of techstars.org) and Gavin Doughtie (of Google). Both of the speakers are heavily involved in the recent trend of doing “Startup Weekends”, and techstars.org is an organization that hosts startup weekends all around the US (and I think internationally as well - Andrew mentioned one in Germany if I heard correctly).

The first half of the talk was about the general concept of a startup weekend, the problems it avoids (”we’ve been working for 9 months and haven’t launched anything”), the problems it brings up (”If you’re not using Java, you’re an idiot, so count me out!!”), and lots of details about how to organize, how to assign roles, and some common tools they use (like Basecamp and whatever your IM of choice is). There was also talk of legal issues, how (basically) to think about forming the company with the people involved, and decisions that need to be made at a business level aside from just the coding.
IMG_4514.JPG

The second half of the talk wasn’t a talk at all. Instead, people who had ideas stood up, presented their idea in a couple of sentences, and once the ideas were out there, we were told to break into groups and get to work! So people would get up and move over to the person whose idea they liked, and they’d start brainstorming. I decided to head out after about 30 minutes of observing and talking with people about ideas, but when I left, there were probably 6-8 groups of people engrossed in conversations, and the energy level was very high. Overall, it was a really exciting experience!

The evening plans didn’t wait for talks to be done. The IRC channel (#oscon on irc.freenode.net) was alive with talk of prospects for dinner and drinks after the conference. I myself was torn between a group going out for Lebanese and another going to Henry’s, but opted to go with my buddies from home to Henry’s.

It was worth it. If you haven’t been, Henry’s Tavern boasts 100 beers and hard ciders on tap (oddly, the beer list is the only menu *not* online - guess it changes too frequently). There are a ton of local beers that you can’t even get on the east coast just waiting for you to try, but there are also some rare treats, like the Belgian Lambic beers, which you don’t often see on tap. The food is a little pricey, but is really good, and the staff is very friendly. IMG_4491.JPGA couple of us were in a rush to get back by 7 for the BoF sessions, and when we asked the waittress how easy it was to catch a cab, she immediately informed us that she would have the hostess call one for us. About 2 minutes later we were in a cab on our way back (we wouldn’t have made it back in time if we had to walk back to catch the light rail).

I was not one of those rushing to a BoF, so I did a little poking around the area near the convention center. It was getting dark, and I didn’t want to stray too far, but I did find a couple of points of interest. First, there’s a bank right across the street from the convention center. I’d be willing to bet that the ATM there is less than the $3 the ATM inside the center charges.
IMG_4501.JPG
Beyond that is a paintball place. It was closed by the time I found it, and I don’t know if they run every day, or anything else, but interested parties might find it open during the lunch breaks or something if you wanted to check it out. The paintball place is located behind a building that is directly across the street from the conv. center. If you see the bank, it’s on the other side of the side street the bank sits on.

Tonight appears to be low-key from what I can tell. There’s currently no chatter on irc, the hotel bar had a few people chatting, and I might go down to catch the rush of people as they return from dinner and BoF sessions. Stay tuned tomorrow for more!

I think I have pictures of most of the basic parts of the conference at my OSCON Flickr set, and I thoroughly enjoyed day 1 of the conference. Of course, while *day* 1 is over, *night* 1 has yet to even begin. There are lots of BoF sessions, and maybe even more smaller meetups going on, as smaller groups take to discussing things over dinner and a beer or three.

I have to say, that I occasionally pop into irc channels for conferences I’m not even at and follow up on that because I’m involved a bit in conference planning as part of my work with Python Magazine (I’m helping to organize the PyWorks conference in November). This conference seems to have a pretty happy audience, if IRC chatter is any indication (and it usually is). Sure, there are a couple of weak spots in the wireless network, there are some fuzzy projectors, and there was a little confusion regarding breakfast this morning, but the important bits have been well-covered by the OSCON organizers and the “boots on the ground” here on site. Kudos to them all.

This afternoon I hopped to a couple of different talks: one on Memcached and MySQL, and the other on A/B testing. Both contained good content. Of course, I’m a systems guy primarily, so I sort of wanted more of an overview of memcached from the point of view of an admin who is deploying it rather than a developer implementing their code around it. I still got plenty of value out of that talk, and this *is* really more of an open source *developer’s* conference, so the expectations of 99% of the people in the room were met, I’m sure.

A/B testing is just not an exciting topic, and I would imagine that peoples’ bosses made them go to that talk whether they liked it or not. Not to say the talk wasn’t good - the parts I saw (I came in after the break) were good, and I learned from it, and that was the goal. If you’re a QA/QC person, I’m sure the talk was riveting, and there were a lot of good ideas and things I’d never considered flying by in the slides.

Overall, Day 1 is a win. I’ll cover more about this evening’s events in the pre-breakfast hours tomorrow. Stay tuned!

I got an early start. Too early. But I’m from the west coast, so my body thinks I slept in. I wandered around a bit, took a few pics which you can see at my Flickr OSCON set, and I discovered a couple of things that might be of interest:

  • The starbucks in the conference center charges over $2 for a small cup of joe. There’s a starbucks right across the street (you can see it from the breakfast area - seriously, it’s 5 seconds away), and they charge less than $2 for a medium (grande). That’s less than I pay at home.
  • The ATM outside the starbucks charges $3 for cash. I’ll report back when I find a cheaper one, but most places seem to take plastic here.
  • Every computer involved in this conference, from registration to the video screens that dot the common areas, are running Windows XP. Just sayin’.
  • The light rail system is free to go just about anywhere except for the airport, so there’s no excuse not to get out and see Portland and take in the food and beer and stuff.
  • For beer-lovers, not only is there the Oregon Brewers Festival starting at the tail end of this conference, but there’s apparently another festival that we missed *last* weekend!! Keep that in mind when you’re planning to come to OSCON next year.

Having a Google account is sometimes useful in ways you hadn’t planned for. For example, at a few different employers I’ve been at, I’ve had to prepare for reviews by providing a list of accomplishments to my supervisor. One decent tool for generating this list is email, though it can take some time. Another useful tool is the Web History feature of your Google account.

Though this isn’t necessarily indicative of everything I’ve accomplished in the first half of 2008 per se, it’s definitely indicative of the types of things I’ve generally been into so far this year, and it’s interesting to look back. What does your Web History say?

  • Gearman - this is used by some rather large web sites, notably Digg. It reminds me a little of having Torque and Maui, but geared toward more general-purpose applications. In fact, it was never clear to me that PBS/Maui couldn’t actually do this, but I didn’t get far enough into Gearman to really say that authoritatively.
  • How SimpleDB Differs from a Relational Database - Links off to some very useful takes on the “cloud” databases, which are truly fascinating creatures, but have a vastly different data management philosophy from the relational model we’re all used to.
  • Reblog - I found this in the footer of someone’s blog post. It’s kinda neat, but to be honest, I think you can do similar stuff using the Flock browser.
  • Google Finance APIs and Tools - did I ever mention that I had a Series 7 & 63 license two months after my 20th birthday? I love anything that I can think for very long periods of time about, where there’s lots and lots and LOTS of data to play with, where you can make correlations and answer questions nobody even thought to ask. Of course, soon after finding this page I found the actual Google Finance page, which answers an awful lot of potential questions. The stock screener is actually what I was looking to write myself, but with the data freely available, I’m sure it won’t be long before I find something else fun to do with it. I’m not a fan of Google’s “Feeds” model, but I’ve dealt with it before, and will do it again if it means getting at this data.
  • Bitpusher - it was recommended to me as an alternative to traditional dedicated server hosting. Worth a look.
  • S3 Firefox Organizer - This is a firefox plugin that provides an interface that looks a lot like an FTP GUI or something, but allows you to move files to and from “buckets” in Amazon’s S3 service.
  • Boto - A python library for writing programs that interact with the various Amazon Web Services. It’s not particularly well-documented, and it has a few quirks, but it is useful.
  • OmniGraffle - A Visio replacement for Apple OS X. I like it a lot better than Visio, actually. It has tons of contributed templates. You shouldn’t have any trouble making the switch. A little pricey, but I plunked down the cash, and have not been disappointed.
  • The Python Queue Module according to Doug - Doug Hellmann’s Python Module of the Week (PyMOTW) should be published in dead tree form some day. I happen to have some code that could make better use of queuing if it were a) written in Python, and b) used the Queue module. I was a little put off by the fact that every single tutorial I found on this module assumed you wanted to use threading, which I actually don’t, because I’m not smart enough…. though the last person I told that to said something to the effect of “the fact that you believe that means you’re smart enough”. Heh.
  • MySQL GROUP modifiers - turns out this isn’t what I needed for the problem I was trying to solve, but the “WITH ROLLUP” feature was new to me at the time I found it, and it’s kinda cool.
  • Wordpress “Subscribe to Comments” plugin - Baron suggested that it would be good to have this, and I had honestly not even thought about it. But looking around, this is the only plugin of its kind that I found, and it’s only tested up to WP 2.3x, and I’m on 2.5x. This is precisely why I hate plugins (as an end user, anyway. Loghetti supports plugins) ;-)
  • Lifeblogging - I had occasion to go back and flip through some of the volumes of journals I’ve kept since age 12, wondering if it might be time to digitize those in some form. I might digitize them, but they will *not* be public I don’t think. Way too embarrassing.
  • ldapmodrdn - for a buddy who hasn’t yet found all of the openldap command line tools. You can’t use ‘ldapmodify’ (to my knowledge) to *rename* an entry.
  • Django graphs - I haven’t yet tried this, because I’m still trying to learn Django in what little spare time I have, but it looks like there’s at least some effort towards this out there in the community. I have yet to see a newspaper that doesn’t have graphs *somewhere* (finance, sports, weather…), so I’m surprised Django doesn’t have something like this built-in.
  • URL Decode UDF for MySQL - I’ve used this. It works really well.
  • Erlang - hey, I’m game for anything. If I weren’t, I’d still be writing all of my code in Perl.
  • The difference between %iowait in sar and %util in iostat - I use both tools, and wanted the clarification because I was writing some graphing code in Python (using Timeplot, which rocks, by the way), and stumbled upon the question. Google to the rescue!
  • OSCON ‘08 - I’m going. Are you going? I’m also going to the Oregon Brewers Festival on the last day of OSCON, as I did in ‘06. Wonderful!
  • Explosion at one of my hosting providers - didn’t affect me, but… wow!
  • hypertable - *sigh* someday…when there’s time…
  • Small-scale hydro power - Yeah, I’m kind of a DIYer at heart. I do some woodworking, all my own plumbing, painting, flooring, I brew my own beer, I cook, I collect rain in big barrels, power sprinklers using pool runoff to give my lawn a jumpstart in spring… that kind of stuff. One day I noticed water coming out of a downspout fast enough to leap over one of my rain barrels and thought there must be some way to harness that power. Sadly, there really isn’t, so I did some research. It’s non-trivial.
  • You bet your garden - I also do my own gardening and related experiments.
  • RightScale Demo - WATCH YOUR VOLUME - a screencast showing off RightScale’s features. Impressive considering the work it would take me, a lone admin, to set something like this up. The learning curve involved in effectively/efficiently managing/scaling/monitoring/troubleshooting EC2 is non-trivial.
  • Homebrew Kegerator - Maybe if this startup is bought out I can actually afford this thing to put my homebrewed beer in. The 30-year-old spare fridge in the basement is getting a little… gamey.
  • The pound proxy daemon - I use this. It works well enough, but I’ve crashed it under load, too. I’ve also had at least one hosting provider misconfigure it on my behalf, and I had to go and tell them how to fix it :-/
  • Droid Sans Mono - a fantastic coding font. Installing this font is in my post-install routine for all of my desktops.
  • Generator tricks for systems programmers - David Beazley has made available a lot of Python source code and presentation slides from what I imagine was a great talk (if you’re a systems guy, which I am).
  • The Wide Finder Saga - I found this just as I was writing Loghetti. There are still some things in Mr. Lundh’s code that I haven’t implemented, but it was a fantastic lesson.
  • Using gnu sort for IP addresses - I’ve used sort in a lot of different ways over the years… but not for IP addresses. This is a nice hack for pulling this off with sort, but it doesn’t scale very well when you have millions of them, due to the sort utility’s ‘divide and conquer’ method of sorting.
  • Writing an Hadoop/MapReduce Program in Python - this got me over the hump.
  • Notes on using EC2/S3 - This got me over some other small humps
  • BeautifulSoup - found while searching for the canonical way to screen scrape with Python. I’d done it a million times in Perl, and you can do it with httplib and regex and stuff in Python if you want, but this way is at least a million times nicer.

Well, that’s a decent enough summary I guess. As you can see, I’ve been doing a good bit of Python scripting. Most of my code these days is written in Python instead of Perl, in part because I was given the choice, and in part because Python fits my brain and makes me want to write more code, to push myself more. I’ve also been dealing with things involving “cloud” computing and “scalability” — like Hadoop, and EC2/S3. I haven’t done as much testing of the Google utility computing services, but I’ve used their various APIs for some things.

So what’s in your history?

I’ve been working with what I used to call “utility computing” tools for about 6-9 months. However, for about the past 2 months, I’ve been seeing the term “cloud computing” all over the place, and there is so much buzz surrounding it that it’s reaching that magical point best described using Alan Greenspan’s words: “Irrational Exuberance”.

When Alan Greenspan used those words to describe the attitudes of investors toward the markets, what he was basically saying was that there were people who didn’t really know what they were doing, putting more money than they ought, into things they knew relatively little about. Further, he was saying that the decisions people were making with regards to where to put their money were a) bad, or at least b) not based on sound reasoning, or the ‘facts on the ground’.

This, I think, is where we are at with “cloud computing”. The blog post that put me over the edge is this one, for the record. I read Sean’s writings often enough, but this one strikes me as being a little off, a little sensationalistic, not based in reality, and a little misleading.

Maybe he just didn’t put enough qualifiers in there. His post might make more sense if he limited its scope and provided more facts, but I guess it’s just an opinion piece so he decided not to go that route, and that’s his prerogative I guess.

By limiting the scope, I mean he should’ve realized that there are millions of web sites currently scaling quite nicely without the use of cloud computing. In addition, some of the new ones that are having issues are also not using cloud computing, and when they hit bumps in the road, they make it through, and the great thing is that they also share their stories, and those stories indicate that a cloud (or, the current cloud offerings) wouldn’t have helped much (there’s lots of other evidence of that too). What would’ve helped is if they had paid more attention to:

  • monitoring
  • initial infrastructure design
  • their own app code and app design
These aren’t issues that cloud computing takes away. What’s more, cloud computing is something of a moving target, many of the solutions aren’t as mature as you’d want them to be if you’re betting the house on them (EC2 only recently got “elastic IPs” and persistent storage is still not there, AppEngine only supports Python and has some rather severe limitations on functionality of your app), and they introduce a potentially large learning curve both in terms of how the individual services work, as well as how the heck to make your app fit into the cloud solution of your choosing. Think SimpleDB scales? Well, it does, but it’s also not a relational database, and doesn’t guarantee…. much of anything, including data integrity. You can’t interface with it using the drivers, interfaces, and language you’re used to using, either, because it’s not just a mysql wrapper or something - it’s a new beast entirely. Enjoy!
This is not to mention, of course, that some people have absolutely no choice but to scale without the help of the cloud, because corporate policy, common sense, or other forces mean that they can’t have their data passing through non-corporate-owned machines and/or networks. Also, Sean omits any mention of the cost factor, which is often a huge driver in getting startups to use these services, but may not really make the move “worth it” in some cases.
Anyway, in short, all I’m really saying is that it’s disingenuous to say that the future of web computing is “the cloud” because “only the cloud can scale”. That’s just silly. Non-cloud infrastructures can scale fine depending on the balance between the demands of the application and the funds available. The future of web computing will probably involve shared, utility computing architectures, but the future doesn’t depend on cloud computing.

If there’s one thing that bothers me about using a ready-made solution like wordpress for my blog, it’s plug-ins. I hate software plug-ins. The first question every support engineer for any software product that supports plugins asks in response to a trouble report is “are you using any plugins?” And when you say “yep, I’m using plugins!” the reply from support is to disable them immediately and see if the trouble goes away. That’s a problem.

What’s worse, if the plugins are maintained by a third party (often the case), there’s no telling whether or not they’ll exist when the next version of the base software is released, or whether they’ll be supported in future versions of the software.

Two examples that touch my daily life are Firefox, and Wordpress.

Lately (since around March) I’ve been having lots of trouble with Firefox. I thought upgrading to Firefox 3 would’ve helped, but it really didn’t. Running it on OS X, Firefox hangs frequently enough that I’m actually considering using Safari (I do NOT like Safari). Know what happened right around that time? Ah - I found the firefox plugins for managing EC2 and S3. So today I’ll uninstall those and see if it helps.

With Wordpress, there are two things I’m missing: I need to let readers subscribe to comments via email, and I need better Google AdSense for Search integration with WordPress. Both things are kinda maybe supported in one version or another “but should work under…” - whatever. I don’t really want to spend my time downloading, reading the documentation to do the install, doing the install and configuration, etc., and then finding out that it doesn’t work, or worse, having it look on the surface like it works, but then finding later that it fails in evil-but-silent ways.

These two products are by no means exceptions. Moodle, PHP-Nuke, XOOPS, MediaWiki, Twiki, Postnuke… and for that matter, OpenLDAP, BIND, SSH, MySQL, Sendmail, PAM… all have plugins available written by other folks, and all have bitten me at one point or another. Usually when it comes time to upgrade the base software.

I’m not saying anything new here. People have had this problem with lots of different software products for a long time. My question is “why is this still a problem?” I’m not asking this because I have some magical obvious solution or answer, I’m asking because I feel like there’s probably more to it than I’m grasping. I’m not a masterful developer, or even a masterful software project manager, so I’m calling on all of you who are (or are closer than I am) to help me understand the problem. Some day, I might find myself in a position to take the wrong or right path where plug-ins are concerned, and I’d like to be more informed than I am so I can avoid putting users in the position I find myself in when I use other peoples’ software. Has Joel blogged this yet? If so, I can’t find it. Links please?

I was writing a utility in Python (using boto) to test/play with Amazon’s SQS service. As boto isn’t particularly well documented where SQS specifically is concerned, I also plan to post some examples (either here or on Linuxlaboratory.org, or both). When I had some trouble getting a message that was sent to a queue, I went to the Amazon documentation, and found this little gem in the Amazon Web Services FAQ

I am sure that my queue has messages, but a call to ReceiveMessage returned none. What could be the problem?

Due to the distributed nature of the queue, a weighted random set of machines is sampled on a ReceiveMessage call. That means only the messages on the sampled machines are returned. If the number of messages in the queue is small (less than 1000), it is likely you will get fewer messages than you requested. If the number of messages in the queue is extremely small, you might not receive any messages in a particular ReceiveMessage response. Your application should be prepared to poll the queue until a message is received. Note that with the 2008-01-01 version of Amazon SQS, you’re charged for each request you make, so set your polling frequency with that in mind.

So… if you were planning to decouple application components using SQS using an ‘eventual consistency’ model, keep in mind that they’re using the same model, and that they’re charging you for the privilege of eventually getting the messages you’ve already paid to put there, but aren’t necessarily available at any given point in time. I personally think this is a little goofy, and wrong.

If I put a message in a queue, I should be charged for actually getting the message. I should *not* be charged for checking to see if Amazon’s internal workings have made my messages available to me yet.