If you’re an advanced MySQL person, you might already know these, in which case, please read anyway, because I still have some questions. On the other hand, f you’re someone who launched an application without a lot of database background, thinking “MySQL Just Works”, you’ll eventually figure out that it doesn’t, and in that case, maybe these tips will be of some use. Note that I’m speaking specifically about InnoDB and MyISAM, since this is where most of my experience is. Feel free to add more to this content in the comment area.
InnoDB vs. MyISAM
Which one to use really depends on the application, how you’re deploying MySQL, your plans for growth, and several other things. The very high-level general rule you’ll see touted on the internet is “lots of reads, use MyISAM; lots of writes, use InnoDB”, but this is really an oversimplification. Know your application, and know your data. If all of your writes are *inserts* (as opposed to updates or deletes), MyISAM allows for concurrent inserts, so if you’re already using MyISAM and 90% of your writes are inserts, it’s not necessarily true that InnoDB will be a big win, even if those inserts make up 50% of the database activity
In reality, even knowing your application and your data isn’t enough. You also need to know your system, and how MySQL (and its various engines) use your system’s resources. If you’re using MyISAM, and you’re starting to be squeezed for disk space, I would not recommend moving to InnoDB. InnoDB will tend to take up more space on disk for the same database, and the If you’re squeezed for RAM, I would also not move to InnoDB, because, while clustered indexes are a big win for a lot of application scenarios, it causes data to be stored along with the index, causing it to take up more space in RAM (when it is being cached in RAM).
In short, there are a lot of things to consider before making the final decision. Don’t look to benchmarks for much in the way of help — they’re performed in “lab” environments and do not necessarily model the real world, and almost certainly aren’t modeled after your specific application. That said, reading about benchmarks and what might cause one engine to perform better than another given a certain set of circumstances is a great way to learn, in a generic sort of way, about the engines.
Indexes are strongly tied to performance. The wrong indexing strategy can cause straight selects on tables with relatively few rows to take an inordinately long amount of time to complete. The right indexing strategy can help you keep your application ‘up to speed’ even as data grows. But there’s a lot more to the story, and blind navigation through the maze of options when it comes to indexing is likely to result in poorer performance, not better. For example, indexing all of the columns in a table in various schemes all at once is likely to hurt overall performance, but at the same time, depending on the application needs, the size of the table, and the operations that need to be performed on it, there could be an argument for doing just that!
You should know that indexes (at least in MySQL) come in two main flavors: clustered, and non-clustered (there are other attributes like ‘hashed’, etc that can be applied to indexes, but let’s keep it simple for now). MyISAM uses non-clustered indexes. This can be good or bad depending on your needs. InnoDB uses clustered indexes, which can also be good or bad depending on your needs.
Non-clustered indexing generally means that the index consists of a key, and a pointer to the data the key represents. I said “generally” – I don’t know the really low-level details of how MySQL deals with its non-clustered indexes, but everything I’ve read leads me to believe it’s not much different from Sybase and MSSQL, which do essentially the same thing. The result of this setup is that doing a query based on an index is still a two-step operation for the database engine: it has to scan the index for the values in the index, and then grab the pointer to get at the data the key represents. If that data is being grabbed from disk (as opposed to memory), then the disk seeks will fall into the category of “random I/O”. In other words, even though the index values are stored in order, the data on disk probably is not. The disk head has to go running around like a chicken without a head trying to grab all of the data.
Clustered indexes, by comparison, kinda rock. Different products do it differently, but the general idea is that the index and the data are stored together, and in order. The good news here is that all of that random I/O you had to go through for sequential range values of the index goes away, because the data is right there, and in the order dictated by the index. Another big win here which can be really dramatic (in my experience) is if you have an index-covered query (a query that can be completely satisfied by data in the index). This results in virtually no I/O, and extremely fast queries, even on tables with a million rows or more. The price you pay for this benefit, though, can be large, depending on your system configuration: in order to keep all of that data together in the index, more memory is required. Since InnoDB used clustered indexes, and MyISAM doesn’t, this is what most people cite as the reason for InnoDB’s larger memory footprint. In my experience, I don’t see anything else to attribute it to myself. Thoughts welcome.
Indexes can be tricky, and for some, it looks like a black art. While I am a fan of touting proper data schema design, and that data wants to be organized independently of the application(s) it serves, I think that once you get to indexing, it is imperative to understand how the application(s) use the data and interact with the database. There isn’t some generic set of rules for indexing that will result in good performance regardless of the application. You also don’t have data integrity issues to concern yourself with when developing an index strategy. One question that arises often enough to warrant further discussion is “hey, this column is indexed, and I’m querying on that column, so why isn’t the index being used?”
The answer is diversity. If you’re running one of those crazy high performance web 2.0 bohemuth web sites, one thing you’ve no doubt tossed around is the idea of sharding your data. This means that, instead of having a table with 400,000,000 rows on one server, you’re going to break up that data along some kind of logical demarcation point in the data to make it smaller so it can be more easily spread across multiple servers. In doing so, you might create 100 tables with 4,000,000 rows apiece. However, a common problem with figuring out how to shard the data deals with “hot spots”. For example, if you run Flickr, and your 400,000,000 row table maps user IDs to the locations of their photos, and you break up the data by user ID (maybe a “user_1000-2000″ for users with IDs between 1000 and 2000), then that can cause your tables to be contain far less diverse data than you had before, and could potentially cause *worse* performance than you had before. I’ve tested this lots of times, and found that MySQL tends to make the right call in these cases. Perhaps it’s a bit counterintuitive, but if you test it, you’ll find the same thing.
For example, say that user 1000 has 400,000 photos (and therefore, 400,000 rows in the user_1000-2000 table), and the entire table contains a total of 1,000,000 rows. That means that user 1000 makes up 40% of the rows in the table. What should MySQL do? Should it perform 400,000 multi-step “find the index value, get the pointer to the data, go get the data” operations, or should it just perform a single pass over the whole table? At some point there must be a threshold at which performing a table scan becomes more efficient than using the index, and the MyISAM engine seems to set this threshold at around 30-35%. This doesn’t mean you made a huge mistake sharding your data — it just means you can’t assume that a simple index on ‘userID’ that worked in the larger table is going to suffice in the smaller one.
But what if there just isn’t much diversity to be had? Well, perhaps clustered indexing can help you, then. If you switch engines to InnoDB, it’ll use a clustered index for the primary key index, and depending on what that index consists of, and how that matches up with your queries, you may find a solution there. What I’ve found in my testing is that, presumably due to the fact that data is stored, in order, along with the index, the “table scan” threshold is much higher, because the number of IO operations MySQL has to perform to get at the actual data is lower. If you have index-covered queries that are covered by the primary key index, they should be blazing fast, where in MyISAM you’d be doing a table scan and lots of random I/O.
For the record, and I’m still investigating why this is, I’ve also personally found that secondary indexes seem to be faster than those in MyISAM, though I don’t believe there’s much in the way of an advertised reason why this might be. Input?
Joins, and Denormalization
For some time, I read about how sites like LiveJournal, Flickr, and lots of other sites dealt with scaling MySQL with my head turned sideways. “Denormalize?! Why would you do that?!” Sure enough, though, the call from on high at all of the conferences by all of the speakers seemed to be to denormalize your data to dramatically improve performance. This completely baffled me.
Then I learned how MySQL does joins. There’s no magic. There’s no crazy hashing scheme or merging sequence going on here. It is, as I understand it (I haven’t read the source), a nested loop. After learning this, and twisting some of my own data around and performing literally hundreds, if not thousands of test queries (I really enjoy devising test queries), I cringed, recoiled, popped up and down from the ceiling to the floor a couple of times like that Jekkyl and Hyde cartoon, and started (carefully, very carefully) denormalizing the data.
I cannot stress how carefully this needs to be done. It may not be completely obvious which data should be denormalized/duplicated/whatever. Take your time. There are countless references for how to normalize your data, but not a single one that’ll tell you “the right way” to denormalize, because denormalization itself is not considered by any database theorists to be “right”. Ever. In fact, I have read some great theorists, and they will admit that, in practice, there is room for “lack of normalization”, but they just mean that if you only normalize to 3NF (3rd Normal Form), that suits many applications’ needs. They do *NOT* mean “it’s ok to take a decently normalized database and denormalize it”. To them, normalization is a one way street. You get more normalized – never less. These theorists typically do not run highly scalable web sites. They seem to talk mostly in the context of reporting on internal departmental data sets with a predictable and relatively slow growth rate, with relatively small amounts of data. They do not talk about 10GB tables containing tens or hundreds of millions of rows, growing at a rate of 3-500,000 rows per day. For that, there is only anecdotal evidence that solutions work, and tribal war stories about what doesn’t work.
My advice? If you cannot prove that removing a join results in a dramatic improvement in performance, I’d rather perform the join if it means my data is relatively normalized. Denormalization may appear to be something that “the big boys” at those fancy startups are doing, but keep in mind that they’re doing lots of stuff they’d rather not do, and probably wouldn’t do, if they had the option (and if MySQL didn’t have O(n2 or 3) or similar performance with regard to joins).
Do You Have an IO Bottleneck?
This is usually pretty easy to determine if you’re on a UNIX-like system. Most UNIX-like systems come with an ‘iostat’ command, or have one readily available. Different UNIX variants show different ‘iostat’ output, but the basic data is the same, and the number you’re looking for is “iowait” or “%iowait”. On Linux systems, you can run ‘iostat -cx 2′ and that’ll print out, every 2 seconds, the numbers you’re looking for. Basically, %iowait is the percentage of time (over the course of the last 2-second interval) that the CPU had to hang around waiting for I/O to complete so it would have data to work with. Get a read of what this number looks like when there’s nothing special going on. Then take a look at it on a moderately loaded server. Use these numbers to gauge when you might have a problem. For example, if %iowait never gets above 5% on a moderately loaded server, then 25% might raise an eyebrow. I don’t personally like when those numbers go into double-digits, but I’ve seen %iowait on a heavily loaded server get as high as 98%!
Ok, time for bed
I find database-related things to be really a lot of fun. Developing interesting queries that do interesting things with data is to me what crossword puzzles are to some people: a fun brain exercise, often done with coffee. Performance tuning at the query level, database server level, and even OS level, satisfies my need to occasionally get into the nitty-gritty details of how things work. I kept this information purposely kind of vague to focus on high-level concepts with little interruption for drawn out examples, but if you’re reading this and have good examples that support or refute anything here, I’m certainly not above being wrong, so please do leave your comments below!
Blogged with Flock