Scalability Best Practices: eBay

Following a link from the High Scalability blog, I found this really great article about scalability practices, as told by Randy Shoup at eBay. Randy is very good at explaining some of the more technical aspects in more or less plain English, and it even helped me find some wording I was looking for to help me explain the notion (and benefits) of functional partitioning. He also covers ideas that apply directly to your application code, your database architecture (including a little insight into their sharding strategy), and more. Even more about eBay’s architecture can be found here.

Make old svn revision the current revision

I ran across an issue that my google-fu has had some trouble handling. Maybe what I did is the only way to do it, in which case maybe this will help someone in need… but I rather like to think that someone here will have a much nicer solution.

I use Subversion at most of the clients I work for, for most of the projects I work on. It’s even used in the production/editorial process at both Python Magazine and php|architect Magazine. For my own work, I’m still using it mainly because I haven’t had the time to really get a good grip on Git yet.

Though I’ve been using svn almost since it was created, I can’t find a clean way to revert to an earlier revision of a file and then make that earlier version the new “current” version of the file. At first I thought I could just make a trivial change to the file and commit, but you can’t commit if the file is out of date with the repository (that’s why you’re supposed to run ‘svn up’ before committing – duh). I also didn’t see an option on any of the svn commands that looked like it would work, but maybe I missed something (do tell!).

What I wound up doing was moving the local copy of the file sideways, svn up’ing the file back down, moving the good copy back over (overwriting the one I just svn up’d), and committing. Nasty. It looks like this:

$ mv myfile.php myfile.php.bak
$ svn up myfile.php
$ mv myfile.php.bak myfile.php
$ svn commit myfile.php -m "All cleaned up"

It works, but it just seems like something that should be built into svn somehow. Those with sufficient clue are hereby solicited to share it :-)
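
One thing I haven’t tried yet, but which the “Undoing Changes” section of the Subversion book suggests should work, is a reverse merge: merge the changes between HEAD and the old revision back into your working copy, then commit as usual. A sketch, assuming revision 1234 is the version of the file you want to resurrect:

$ svn merge -r HEAD:1234 myfile.php
$ svn commit myfile.php -m "Reverted myfile.php to r1234"

The nice thing over my mv dance is that the revert shows up in the file’s history as an ordinary committed change. A blunter alternative is ‘svn cat -r 1234 myfile.php > myfile.php’ followed by a commit, which is basically my workaround in a single step.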

Multisourced Production Infrastructure: History, and a stab at the Future

Startups are pretty fascinating. I work for a startup, and one of my good friends works for another startup. I’ve also worked for 2 other startups, one during the first “bubble”, and another one a few years later. Oh my, how the world of web startups has changed in that time!

1999: You must have funding

The first startup I was ever involved in was a web startup: an online retailer, starting from nothing. My friend (a former coworker from an earlier job) had saved for years to get this idea off the ground. He was able to get a few servers and some PCs for the developers he hired, he took the cheapest office space in all of NYC (but it still managed to be a really cool space, in a way that only NYC can pull off), and he hosted every single service required to run the web site in-house. If I recall correctly, he had the web and database server on one machine, and I believe the primary DNS server was an old desktop machine he had lying around the house. This gave him the ability to build a complete, 100%-functional prototype and use it to shop for funding.

It worked. They got funding, they bought more and bigger servers. They got UPSes for them (yay!), they got more cooling, a nicer office, and they launched the site, pretty much exactly as the prototype existed, and things were pretty stable. Unfortunately, the VCs who took seats on the board after the first round of financing didn’t understand the notion of “The Long Tail”, so they eventually went under, but that’s not the point.

The point is, that was 8 or 9 years ago. It cost him a good bit of his hard-earned savings just to get to a place where he could build a prototype. A prototype! He only really knew Microsoft products, and buying licenses for Microsoft SQL Server and the developers’ tools (I forget what they were using as an IDE, but they were a ColdFusion shop) took quite a chunk of money. He really only had enough money to put together a prototype, so they were playing “beat the clock” — trying to get the prototype done, and shop for (and get) funding, before the money ran out, because they couldn’t afford the hardware, power, cooling, big-boy internet connection, and the rest of what goes into a production infrastructure. The Prototype->VC->Production methodology was pretty typical at the time.

2003: Generate Some Revenue

In 2003, a couple of years after the bubble burst, I was involved in another startup. This one was 100% self-funded, and it has been rather successful since. By this time, dedicated hosting was just affordable enough to be doable for a startup generating some revenue, and that’s what my friend did. He also outsourced DNS completely (through his registrar, if memory serves), but he still hosted his own email, backup, and some other services in-house. He had plenty of hiccups and outages in the first year, but overall it ran pretty well considering all of the things he *didn’t* have to be concerned with, like power, cooling, internet uplinks, cable trays, etc. The world was becoming a friendlier place for startups.

2008: Do it all, on the cheap

Nowadays, the world is a completely different place for startups, and a lot of this is due to the rich set of free (or very cheap) resources available to startups that make it possible for them to do a production launch without the VC funding that used to be required just to get the hardware purchased.

In 2008 you can outsource DNS for relatively little money, and it’ll have the benefit of being globally distributed and redundant beyond what you’re likely to build yourself. You can get Google Apps to host your email and share calendars and documents. You can store backups on Amazon’s S3. You can use something like Eclipse, Komodo Edit, or a number of language-specific environments like Wing IDE or Zend Studio to do “real development” (whatever your definition of that is) for free or relatively cheap. You can also get a free database that is reasonably well-stocked with features and capabilities, and a free web server that runs 65%+ of the internet sites in existence. And if you have the know-how (or can get it), you can host anything you want, including your entire production infrastructure (within reason, and subject to some caveats), on Amazon’s EC2, at a cost tied to what you use, which in a lot of cases is cheaper than either buying or leasing a dedicated server. Multisourcing has arrived!

In looking at this progression from “you must have funding”, to “you’re going to need to generate a little revenue”, to “do it all, on the cheap”, the really obvious question this all raises is:

“Now what?”

Well, this whole 2008 situation is making things better, but… how do I put this… “It’s not soup yet”.

First of all, there is no single platform where you can realistically do everything. Google’s AppEngine is nice, but it has its limitations. For example, you don’t have any control over the web servers that run it, so you can’t, say, add an Apache mod_rewrite rule, use a 301 redirect, or process your log files. Troubleshooting an application based solely on input from the people who are having issues with it would be difficult.

Amazon’s service gives you far more control, and if you need it, that’s great, but it completely changes how you architect a solution. I think some of these changes are good ones, and are things we should all be thinking about anyway, but Amazon forces you to make decisions about how to plan for failure from the moment you decide to go this route — even if it’s just for prototyping, because until persistent storage on EC2 is a reality available to the entire user base, whenever an EC2 instance disappears, so does every bit of the data you added to it. You’ll have to start from scratch when you bring up another instance. You’re also going to have to add more scripts and utilities to your toolbelt to manage the instances. What happens when one disappears? How do you fail over to another instance? How do you migrate an IP address from the failed instance to the running one? How you do all of these things, in addition to just building and installing systems, is different, and that means a learning curve, which means overhead in the form of time (and expense, since you have to pay for the Amazon services in order to learn on them).
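
To give a flavor of what that toolbelt looks like, here is a rough sketch of replacing a dead instance and moving its public IP to the replacement, using the ec2-api-tools and an elastic IP (all of the IDs and the address below are made up, and the exact flags may vary with your tool version):

# Launch a replacement instance from a pre-built AMI (hypothetical IDs)
$ ec2-run-instances ami-1a2b3c4d -k my-keypair

# Once it's running, re-point the elastic IP the failed instance was holding
$ ec2-associate-address -i i-9f8e7d6c 203.0.113.10

Of course, rebuilding the data that lived on the dead instance is the part none of this solves for you.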

There are also now “grid” solutions that virtualize and abstract all of your infrastructure, but give you familiar interfaces through which to manage it. One that I’ve used with some success is AppLogic, but other services like GoGrid and MediaTemple have offerings that emphasize different aspects of this niche “Infrastructure-as-a-service” market. Choose very carefully: really think about what you’ll want to do with your infrastructure, how you want to manage and monitor it, and how you’ll deliver your application, and think about how flexible you can be within the confines of a grid solution before you commit, because the gotchas can be kind of big and hairy.

None of these are whole solutions. However, any of them could, potentially, some day, become what we would now call a “perfect solution”. But it still wouldn’t be perfect in the eyes of the people who are building and deploying applications that have to scale into realms seemingly known only inside some brain vault that says “Google” on it. What those of us outside of that vault would like is not only Google-like scalability, but:

  • global distribution, without having to pledge our souls in exchange for Akamai services. It’s great that I can build an infrastructure on EC2 or GoGrid, but I’d like to deploy it to 10 different geographic locations and still control it centrally.
  • the ability to tightly integrate things like content distribution network (CDN) services with the rest of our infrastructure (because CDNs are great at serving, but not so much at metrics)
  • SAN-like (not NFS-like) access to all storage from any/all systems, without sacrificing the IO performance needed to scale a database properly.
  • As an admin, I want access to all logs from all services I outsource, no matter who hosts it. I don’t believe I can access, for example, our Google Apps logs, but maybe I’ve forgotten to click a tab somewhere.
  • A *RELATIONAL* database that scales like BigTable or SimpleDB

There’s more to it than this, even, but I’ve stopped short to make a point that needs making. Namely, that these are hard problems. These are problems that PhD candidates in computer science departments do research on. I understand that. The database issue is one that is of particular interest to me, and which I think is one of the hardest issues (not only because of its relationship to the storage issue, by the way). Data in the cloud, for the masses, as we’ve seen, involves making quite a few rather grandiose assumptions about how your schema looks. Since that’s not realistic, the alternative is to flatten the look of the data, and take a lot of the variables out of the equation, so they don’t have to make *any* assumptions about how you’ll use/organize the data. “Let’s make it not matter”. Genius, even if it causes me pain. But I digress…

The idea here is just to give people a clue about what direction (I think) things are headed in.

These are also very low-level wants. At a much, much, much higher level, I’d like to see one main, major thing happen with all of these services:

  • Get systems administrators involved in defining how these things are done

I’m not saying that because I want everything to stay the same and think a system administrator would be my voice for keeping it that way. I do *NOT* want things to stay the same, believe me. I’m saying it because it seems pretty obvious to me that the people putting these things together are engineers, not systems administrators. Engineers are the people you call when you want to figure out how to make something that is currently 16GB fit into 32MB of RAM. They are not the people you call when you want to provide a service/interface/grid/offering/whatever that allows systems folks to build what amounts to a company’s IT infrastructure on a grid/instance/App/whatever.

Here are a couple of examples:

When I first launched an AppLogic grid, a couple of things jumped out at me. The partitions on the components I launched were 90% full on first boot, they had no swap partition, and there was no consistency between OS builds, so you can’t assume that a fix on one machine can be blown out via dsh or clusterssh to the rest. The components were designed to be as small as possible, so as to use as little of the user’s allotted resources as possible. In addition, mount points created in the GUI management interface and then mapped to a component don’t cause the component to have any clue what you just did, which raises the question: “umm… why did I bother using the GUI to map this thing to this component if I just have to edit /etc/fstab and mount it in the running instance myself anyway?” That’s inconsistent with, say, allocating more RAM or storage, or defining a new network interface on the device in the GUI, where the change actually shows up on the component.
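
For the record, the manual part amounts to something like this from inside the running component (the device name and mount point here are placeholders for whatever the grid actually hands you):

# Tell the component about the volume the GUI supposedly already mapped, then mount it
$ echo '/dev/hdb1  /data  ext3  defaults  0 0' >> /etc/fstab
$ mount /data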

There is no part of EC2 or S3 that looks like a sysadmin was involved in that. It’s a programmer’s platform, from what I can tell. For programmers, by programmers. Luckily, I have enough background in programming that I kind of “get it”, but while I might be able to convince myself that there are parallels between how I approach programming and building infrastructures, it still represents a non-trivial context switch for me to move from working deeply at one to working deeply at the other, so mixing the two as a necessity for moving forward is less than nice.

There is no “database in the cloud” service that looks remotely like there was a database systems person involved at all, that I can tell. I’ll confess to not having used BigTable or SimpleDB, but the reason is that I can’t figure out how to make them useful to me at the moment. These tools are not relational, and my data, though it’s been somewhat denormalized for scaling purposes (compromises taken carefully and begrudgingly – I’d rather change database products, but it’s not in the cards), is nonetheless relational. I’ve considered all kinds of object schemes for data in the past, and I still think that there’s some data for which that can work well, but it’s not currently a solution for me. Once you look at the overhead in managing something like EC2, S3, AppLogic, etc., the very last thing you need is the overhead of a changing data storage/manipulation paradigm.

Should I be hiring systems folks, or developers? Both? Ugh. Just when I thought you could finally get away with building a startup with nothing more than an idea, a sysadmin and a coder, here they go roping me back into hiring a team of developers… to manage the systems… and the data. No good (and I mean *NO GOOD*) can come of developers managing data. I know, I’ve seen ‘em do it.

All of that said, I use all of this stuff. Multisourcing is here to stay – at least until someone figures a whole bunch of stuff out to make unisourcing a viable alternative for systems folks, or they collectively redefine what a “systems person” is, which is an extremely real possibility, but is probably quite a ways off. My $.02. Flame at will ;-)

Social Media, The Future of News, and Data Mining

I went to a very good panel discussion yesterday hosted by the Center for Information Technology Policy at Princeton University. There has been a conference going on there that covers a lot of the overlap between technology, law, and journalism, and the panel discussion yesterday, Data Mining, Visualization, and Interactivity was even more enlightening than I had anticipated.

The panel members included Matt Hurst, of Microsoft Live Labs, Kevin Anderson, blog editor for The Guardian, and David Blei, a professor at the Computer Science Dept., Princeton University. This made for a very lively discussion, covering a wide range of perspectives about social media, “what is news?”, how technology is changing how people interact with information (including news), how the news game is changing as a result (which was far more fascinating than it sounds), and how this unfathomably enormous stream of bits, enabled by lots of open APIs, feeds, and other data streams can be managed, mined, reduced, and presented in some value-added way (part of the value being the sheer reduction in noise).

Cool Tools for Finding News

Some of the tools presented by the panelists were new to me, and aside from being great tools for bloggers and other content publishers, there are some excellent examples of how to make effective use of the data you have access to through APIs like the Digg API.

BlogPulse

This was presented by Matt Hurst. It’s pretty neat – it’s a tool that essentially charts blog buzz for a given phrase over time, and it even lets you compare multiple phrases, which is really interesting as well. Check it out here.

I’d like to know more about how it derives the metrics, but in doing a couple of quick comparisons using the tool, it seems to line up to some degree with simple comparisons of the number of search results for different phrases on sites like Technorati and Bloglines. Interestingly, even though there appears to be lots more data available at Technorati, in my very limited experimenting the percent difference between search results for any two phrases appears to be similar, indicating that Bloglines may be a representative sample of Technorati’s data. More experimentation, of course, would be needed to lend any credibility whatsoever to that claim. It’s probably irrelevant anyway, because you can’t ask either service for any kind of historical data regarding search results :)

Twistori

This has the potential to be really interesting. Right now, it lets you pick from several different terms, like “love”, “wish”, “think” and “feel”, and after clicking one of those, it’ll start producing a constantly updating stream of twitters that contain those words. If this experiment is successful, I would imagine they’d eventually enable the same service for arbitrary keywords, which would be really powerful, and quite a lot of fun!

Tweetwheel

Oh, how boring my life according to Twitter is. I’m still in the schizophrenic stage of settling on one of these live ‘update your friends on what you’re doing whether they care or not’ services. Facebook, MySpace, Twitter, Jaiku… there are too many. I’m trying out the IMified route now to consolidate all the cruft. According to Tweetwheel, there are more places to update my status at any given moment than there are people who give a damn what my status is.

Anyway, Tweetwheel shows how you’re connected to people through Twitter. If you have lots of followers and follow lots of people, the wheel is really interesting to look at, as demonstrated by Kevin Anderson, whose wheel is much more “robust” than mine. At some point I’d like to see this idea expanded to cover other services like Facebook and even LinkedIn.

Digg Labs

You have to go to the Digg Labs site and see what people are doing with the Digg API. There are too many awesome utilities to cover them all here. It almost makes me wish I did fancy Flash UI stuff instead of back-end data mining and infrastructure administration.

At a higher level…

Most of the discussion about social media seems to be about measuring buzz created by bloggers (at least where news/content publishing is concerned). However, although things have shifted dramatically in a ‘consumers are producers’ direction, causing people to start rethinking the definition of news, this shift is driven as much by consumers who are still *only* consuming as by anyone else, and I didn’t see much in the way of tools that measure the interest of those people in any meaningful way. Perhaps the consensus is that bloggers are a representative sampling of the wider internet readership? I don’t know. I would disagree with that if it were the case.

I work for AddThis.com, which seeks to provide publishers of news and all kinds of other content with statistics that help them figure out not just what pages people happen to be landing on, but which ones they have elected to take a greater interest in, either by emailing it to a friend, adding it to their favorites, or posting it to digg, delicious, or some other service. Maybe some day there will be an AddThis API that’ll let you easily do even more interesting things with social media.

Get a List of Years and Months in a Date Range with PHP

I can’t find the initial inspiration for this to link to – sorry. I came across a forum post, or maybe a blog post, where someone needed to do something like return an array of the months (or maybe it was days or something) in a given date range. I had a place where I could’ve potentially used a similar bit of code, so I wrote this. When I was done, I realized it was much nicer than the code I had started with, but I had lost the link to go back and post the code there. In the event that this is useful to someone, I’ll post it here :)

UPDATE: My WP SyntaxHighlighter plugin originally refused to preserve the indentation in this code, and I haven’t a clue why. If you do, post a comment to share the clue.


<?php

function get_months($startstring, $endstring)
{
    // Work with absolute timestamps; the comparisons below are done on these values
    $time1 = strtotime($startstring);
    $time2 = strtotime($endstring);
    $year1 = date('Y', $time1);
    $year2 = date('Y', $time2);
    $years = range($year1, $year2);
    $months = array();

    foreach ($years as $year)
    {
        $months[$year] = array();
        while ($time1 < $time2)
        {
            if (date('Y', $time1) == $year)
            {
                // Record this month, then step forward one month
                // (note: '+1 month' keeps the same day of the month as the start date)
                $months[$year][] = date('m', $time1);
                $time1 = strtotime(date('Y-m-d', $time1) . ' +1 month');
            }
            else
            {
                break;
            }
        }
    }

    return $months;
}

?>

And here’s some sample code to make use of it:


<?php

$montharr = get_months('2003-01-04', '2005-09-18');
foreach (array_keys($montharr) as $year)
{
    foreach ($montharr[$year] as $month)
    {
        print "{$year}-{$month}\n";
    }
}

?>
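
For what it’s worth, running that against the sample range should print one line per month, starting with 2003-01 and ending with 2005-09.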

A Couple of MySQL Performance Tips

If you’re an advanced MySQL person, you might already know these, in which case, please read anyway, because I still have some questions. On the other hand, if you’re someone who launched an application without a lot of database background, thinking “MySQL Just Works”, you’ll eventually figure out that it doesn’t, and in that case maybe these tips will be of some use. Note that I’m speaking specifically about InnoDB and MyISAM, since this is where most of my experience is. Feel free to add more to this content in the comment area.

InnoDB vs. MyISAM

Which one to use really depends on the application, how you’re deploying MySQL, your plans for growth, and several other things. The very high-level general rule you’ll see touted on the internet is “lots of reads, use MyISAM; lots of writes, use InnoDB”, but this is really an oversimplification. Know your application, and know your data. If all of your writes are *inserts* (as opposed to updates or deletes), MyISAM allows for concurrent inserts, so if you’re already using MyISAM and 90% of your writes are inserts, it’s not necessarily true that InnoDB will be a big win, even if those inserts make up 50% of the database activity.
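
If you’re not sure which engine a given table is using today, the mysql client will tell you, and switching engines is a single statement (though it rebuilds the whole table, so don’t do it casually on a big, live table). The database and table names here are placeholders:

# Show the engine (plus on-disk data and index sizes) for one table
$ mysql photodb -e "SHOW TABLE STATUS LIKE 'photos'\G"

# Convert it; this rewrites the entire table, so test on a copy first
$ mysql photodb -e "ALTER TABLE photos ENGINE=InnoDB"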

In reality, even knowing your application and your data isn’t enough. You also need to know your system, and how MySQL (and its various engines) uses your system’s resources. If you’re using MyISAM and you’re starting to be squeezed for disk space, I would not recommend moving to InnoDB: it will tend to take up more space on disk for the same database. If you’re squeezed for RAM, I would also not move to InnoDB, because, while clustered indexes are a big win for a lot of application scenarios, they cause data to be stored along with the index, which makes it take up more space in RAM (when it is being cached in RAM).

In short, there are a lot of things to consider before making the final decision. Don’t look to benchmarks for much in the way of help — they’re performed in “lab” environments and do not necessarily model the real world, and almost certainly aren’t modeled after your specific application. That said, reading about benchmarks and what might cause one engine to perform better than another given a certain set of circumstances is a great way to learn, in a generic sort of way, about the engines.

Indexing

Indexes are strongly tied to performance. The wrong indexing strategy can cause straight selects on tables with relatively few rows to take an inordinately long amount of time to complete. The right indexing strategy can help you keep your application ‘up to speed’ even as data grows. But there’s a lot more to the story, and blind navigation through the maze of options when it comes to indexing is likely to result in poorer performance, not better. For example, indexing all of the columns in a table in various schemes all at once is likely to hurt overall performance, but at the same time, depending on the application needs, the size of the table, and the operations that need to be performed on it, there could be an argument for doing just that!

You should know that indexes (at least in MySQL) come in two main flavors: clustered, and non-clustered (there are other attributes like ‘hashed’, etc that can be applied to indexes, but let’s keep it simple for now). MyISAM uses non-clustered indexes. This can be good or bad depending on your needs. InnoDB uses clustered indexes, which can also be good or bad depending on your needs.

Non-clustered indexing generally means that the index consists of a key, and a pointer to the data the key represents. I said “generally” – I don’t know the really low-level details of how MySQL deals with its non-clustered indexes, but everything I’ve read leads me to believe it’s not much different from Sybase and MSSQL, which do essentially the same thing. The result of this setup is that doing a query based on an index is still a two-step operation for the database engine: it has to scan the index for the values in the index, and then grab the pointer to get at the data the key represents. If that data is being grabbed from disk (as opposed to memory), then the disk seeks will fall into the category of “random I/O”. In other words, even though the index values are stored in order, the data on disk probably is not. The disk head has to go running around like a chicken without a head trying to grab all of the data.

Clustered indexes, by comparison, kinda rock. Different products do it differently, but the general idea is that the index and the data are stored together, and in order. The good news here is that all of that random I/O you had to go through for sequential range values of the index goes away, because the data is right there, in the order dictated by the index. Another big win, which can be really dramatic (in my experience), is an index-covered query (a query that can be completely satisfied by data in the index). This results in virtually no I/O, and extremely fast queries, even on tables with a million rows or more. The price you pay for this benefit, though, can be large, depending on your system configuration: in order to keep all of that data together in the index, more memory is required. Since InnoDB uses clustered indexes and MyISAM doesn’t, this is what most people cite as the reason for InnoDB’s larger memory footprint. In my experience, I haven’t seen anything else to attribute it to. Thoughts welcome.
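
A quick way to see whether you’re actually getting an index-covered query is EXPLAIN: when the index alone satisfies the query, the Extra column says “Using index”. A sketch with made-up database, table, and column names, assuming a composite index on (userID, photoID):

# If Extra shows "Using index", the row data itself never had to be touched
$ mysql photodb -e "EXPLAIN SELECT userID, photoID FROM photos WHERE userID = 1000\G"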

Indexes can be tricky, and to some, indexing looks like a black art. While I’m a fan of proper schema design, and of the idea that data wants to be organized independently of the application(s) it serves, I think that once you get to indexing, it is imperative to understand how the application(s) use the data and interact with the database. There isn’t some generic set of rules for indexing that will result in good performance regardless of the application. On the other hand, you don’t have data integrity issues to concern yourself with when developing an index strategy – it’s purely a performance question. One question that arises often enough to warrant further discussion is “hey, this column is indexed, and I’m querying on that column, so why isn’t the index being used?”

The answer is diversity. If you’re running one of those crazy high-performance web 2.0 behemoth web sites, one thing you’ve no doubt tossed around is the idea of sharding your data. This means that, instead of having a table with 400,000,000 rows on one server, you break up that data along some logical demarcation point to make it smaller, so it can be more easily spread across multiple servers. In doing so, you might create 100 tables with 4,000,000 rows apiece. However, a common problem in figuring out how to shard the data is “hot spots”. For example, if you run Flickr, and your 400,000,000-row table maps user IDs to the locations of their photos, and you break up the data by user ID (maybe a “user_1000-2000” table for users with IDs between 1000 and 2000), then your tables can contain far less diverse data than you had before, and that could potentially cause *worse* performance than you had before, because MySQL may decide the index isn’t selective enough to bother using. I’ve tested this lots of times, and found that MySQL tends to make the right call in these cases. Perhaps it’s a bit counterintuitive, but if you test it, you’ll find the same thing.

For example, say that user 1000 has 400,000 photos (and therefore, 400,000 rows in the user_1000-2000 table), and the entire table contains a total of 1,000,000 rows. That means that user 1000 makes up 40% of the rows in the table. What should MySQL do? Should it perform 400,000 multi-step “find the index value, get the pointer to the data, go get the data” operations, or should it just perform a single pass over the whole table? At some point there must be a threshold at which performing a table scan becomes more efficient than using the index, and the MyISAM engine seems to set this threshold at around 30-35%. This doesn’t mean you made a huge mistake sharding your data — it just means you can’t assume that a simple index on ‘userID’ that worked in the larger table is going to suffice in the smaller one.
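
This is easy to check for yourself with EXPLAIN: a type of ALL means MySQL decided on a full table scan, while ref or range means it’s using the index. Again, the names are hypothetical:

# Compare the 'type' and 'rows' columns for a heavily-represented user vs. an average one
$ mysql photodb -e "EXPLAIN SELECT * FROM user_1000_2000 WHERE userID = 1000\G"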

But what if there just isn’t much diversity to be had? Well, perhaps clustered indexing can help you, then. If you switch engines to InnoDB, it’ll use a clustered index for the primary key index, and depending on what that index consists of, and how that matches up with your queries, you may find a solution there. What I’ve found in my testing is that, presumably due to the fact that data is stored, in order, along with the index, the “table scan” threshold is much higher, because the number of IO operations MySQL has to perform to get at the actual data is lower. If you have index-covered queries that are covered by the primary key index, they should be blazing fast, where in MyISAM you’d be doing a table scan and lots of random I/O.

For the record, and I’m still investigating why this is, I’ve also personally found that InnoDB’s secondary indexes seem to be faster than those in MyISAM, though I don’t believe there’s much in the way of an advertised reason why this might be. Input?

Joins, and Denormalization

For some time, I read about how sites like LiveJournal, Flickr, and lots of other sites dealt with scaling MySQL with my head turned sideways. “Denormalize?! Why would you do that?!” Sure enough, though, the call from on high at all of the conferences by all of the speakers seemed to be to denormalize your data to dramatically improve performance. This completely baffled me.

Then I learned how MySQL does joins. There’s no magic. There’s no crazy hashing scheme or merging sequence going on here. It is, as I understand it (I haven’t read the source), a nested loop. After learning this, and twisting some of my own data around and performing literally hundreds, if not thousands, of test queries (I really enjoy devising test queries), I cringed, recoiled, popped up and down from the ceiling to the floor a couple of times like that Jekyll and Hyde cartoon, and started (carefully, very carefully) denormalizing the data.
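
To make the kind of change concrete (with completely made-up table and column names): the normalized version joins users to photos on every read, while the denormalized version copies username into the photos table and pays for that duplication at write time instead.

# Normalized: a join (i.e., a nested loop) on every read
$ mysql photodb -e "SELECT p.photoID, u.username FROM photos p JOIN users u ON u.userID = p.userID WHERE p.photoID = 12345"

# Denormalized: username lives in photos too, so the read is a single-table lookup
$ mysql photodb -e "SELECT photoID, username FROM photos WHERE photoID = 12345"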

I cannot stress enough how carefully this needs to be done. It may not be completely obvious which data should be denormalized/duplicated/whatever. Take your time. There are countless references for how to normalize your data, but not a single one that’ll tell you “the right way” to denormalize, because denormalization itself is not considered by any database theorist to be “right”. Ever. In fact, I have read some great theorists, and they will admit that, in practice, there is room for a “lack of normalization”, but they just mean that if you only normalize to 3NF (3rd Normal Form), that suits many applications’ needs. They do *NOT* mean “it’s ok to take a decently normalized database and denormalize it”. To them, normalization is a one-way street. You get more normalized – never less. These theorists typically do not run highly scalable web sites. They seem to talk mostly in the context of reporting on internal departmental data sets with a predictable and relatively slow growth rate, and relatively small amounts of data. They do not talk about 10GB tables containing tens or hundreds of millions of rows, growing at a rate of 300,000-500,000 rows per day. For that, there is only anecdotal evidence that solutions work, and tribal war stories about what doesn’t.

My advice? If you cannot prove that removing a join results in a dramatic improvement in performance, I’d rather perform the join if it means my data stays relatively normalized. Denormalization may appear to be something that “the big boys” at those fancy startups are doing, but keep in mind that they’re doing lots of stuff they’d rather not do, and probably wouldn’t do if they had the option (and if MySQL didn’t have O(n^2)- or O(n^3)-like performance with regard to joins).

Do You Have an IO Bottleneck?

This is usually pretty easy to determine if you’re on a UNIX-like system. Most UNIX-like systems come with an ‘iostat’ command, or have one readily available. Different UNIX variants show different ‘iostat’ output, but the basic data is the same, and the number you’re looking for is “iowait” or “%iowait”. On Linux systems, you can run ‘iostat -cx 2’ and that’ll print out, every 2 seconds, the numbers you’re looking for. Basically, %iowait is the percentage of time (over the course of the last 2-second interval) that the CPU had to hang around waiting for I/O to complete so it would have data to work with. Get a read on what this number looks like when there’s nothing special going on. Then take a look at it on a moderately loaded server. Use these numbers to gauge when you might have a problem. For example, if %iowait never gets above 5% on a moderately loaded server, then 25% might raise an eyebrow. I don’t personally like when those numbers go into double digits, but I’ve seen %iowait on a heavily loaded server get as high as 98%!
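
For reference (this assumes Linux with the sysstat package installed; on a box without it, vmstat’s ‘wa’ column is a rough stand-in for %iowait):

# CPU summary plus extended per-device stats, refreshed every 2 seconds
$ iostat -cx 2

# Rough alternative: the 'wa' column here is also time spent waiting on I/O
$ vmstat 2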

Ok, time for bed

I find database-related things to be really a lot of fun. Developing interesting queries that do interesting things with data is to me what crossword puzzles are to some people: a fun brain exercise, often done with coffee. Performance tuning at the query level, database server level, and even OS level, satisfies my need to occasionally get into the nitty-gritty details of how things work. I kept this information purposely kind of vague to focus on high-level concepts with little interruption for drawn out examples, but if you’re reading this and have good examples that support or refute anything here, I’m certainly not above being wrong, so please do leave your comments below!
