Brain Fried Over NoSQL

So, I’m working on a pet project. It’s in stealth mode. Just kidding — I don’t believe in stealth mode πŸ˜‰

It’s a twitter analytics dashboard that actually does useful things with the mountains of data available from the various Twitter APIs. I’m writing it in Python using Tornado. Here’s the first mockup I ever did for it, just like 2 nights ago:

It’s already a lot of fun. I’ve worked with Tornado before and like it a lot. I have most of the base infrastructure questions answered, because this is a pet project and they’re mostly easy and in some sense “don’t matter”. But that’s what has me stuck.

It Doesn’t Matter

It’s true. Past a certain point, belaboring choices of what tools to use where is pointless and is probably premature optimization. I’ve been working with startups for the past few years, and I’m painfully aware of what happens when a company takes too long to react to their popularity. I want to architect around that at the start, but I’m resisting. It’s a pet project.

But if it doesn’t matter, that means I can choose tools that are going to be fun to dig into and learn about. I’ve been so busy writing code to help avoid or buffer impact to the database that I haven’t played a whole lot with the NoSQL choices out there, and there are tons of them. And they all have a different world view and a unique approach to providing solutions to what I see as somewhat different problems.

Why NoSQL?

Why not? I’ve been working with relational database systems since 1998. I worked on large data reporting projects, a couple of huge data warehousing projects, financial transaction systems, I worked for Sybase as a consulting DBA and project manager for a while, I was into MySQL and PostgreSQL by 2000, used them in production environments starting around 2001-02… I understand them fairly well. I also understand BDB and other “flat-file” databases and object stores. SQLite has become unavoidable in the past few years as well. It’s not like I don’t understand the compromises I’m making going to a NoSQL system.

There’s a good bit of talk from the RDBMS camp (seriously, why do they need their own camp?) about why NoSQL is bad. Lots of people who know meΒ  would put me in the RDBMS camp, and I’m telling you not to cry yourself to sleep out of guilt over a desire to get to know these systems. They’re interesting, and they solve some huge issues surrounding scalability with greater ease than an RDBMS.

Like what? Well, cost for one. If I could afford Oracle I’d sooner use that than go NoSQL in all likelihood. I can’t afford it. Not even close. Oracle might as well charge me a small planet for their product. It’s great stuff, but out of reach. And what about sharding? Sharding a relational database sucks, and to try to hide the fact that it sucks requires you to pile on all kinds of other crap like query proxies, pools, and replication engines, all in an effort to make this beast do something it wasn’t meant to do: scale beyond a single box. All this stuff also attempts to mask the reality that you’ve also thrown your hands in the air with respect to at least 2 letters that make up the ACID acronym. What’s an RDBMS buying you at that point? Complexity.

And there’s another cost, by the way: no startup I know has the kind of enormous hardware that an enterprise has. They have access to commodity hardware. Pizza boxes. Don’t even get me started on storage. I’ve yet to see SSD or flash storage at a startup. I currently work at MyYearbook.com, and there are some pretty hefty database servers there, but it can hardly be called a startup anymore. Hell, they’re even profitable! πŸ˜‰

Where Do I Start?

One nice thing about relationland is I know the landscape pretty well. Going to NoSQL is like dropping me in a country I’ve never heard of where I don’t really speak the language. I have some familiarity with key-value stores from dealing with BDB and Memcache, and I’ve played with MongoDB a bit (using pymongo), but that’s just the tip of the iceberg.

I heard my boss mention Tokyo Tyrant a few times, so I looked into it. It seems to be one of the more obscure solutions out there from the standpoint of adoption, community, documentation, etc., but it does appear to be very capable on a technical level. However, my application is going to be number-heavy, and I’m not going to need to own all of the data required to provide the service. I can probably get away with just incrementing counters in Memcache for some of this work. For persistence I need something that will let me do aggregation *FAST* without having to create aggregation tables, ideally. Using a key/value store for counters really just seems like a no-brainer.

That said, I think what I’ve decided to do, since it doesn’t matter, is punt on this decision in favor of getting a working application up quickly.

MySQL

Yup. I’m going to pick one or two features of the application to implement as a ‘first cut’, and back them with a MySQL database. I know it well, Tornado has a built-in interface for it, and it’s not going to be a permanent part of the infrastructure (otherwise I’d choose PostgreSQL in all likelihood).

To be honest, I don’t think the challenge in bringing this application to life are really related to the data model or the engine/interface used to access it (though if I’m lucky that’ll be a major part of keeping it alive). No, the real problem I’m faced with is completely unrelated to these considerations…

Twitter’s API Service

Not the API itself, per se, but the service providing access to it, and the way it’s administered, is going to be a huge challenge. It’s not just the Twitter website that’s inconsistent, the API service goes right along. Not only that, but the type of data I really need to make this application useful isn’t immediately available from the API as far as I can tell.

Twitter maintains rate limits on the API. You can only make so many calls over so short a period of time. That alone makes providing an application like this to a lot of people a bit of a challenge. Compounding the issue is that, when there are failwhales washing up on the shores, those limits can be dynamically decreased. Ugh.

I guess it’s not a project for the faint of heart, but it’ll drive home some golden rules that are easy to neglect in other projects, like planning for failure (of both my application, and Twitter). Also, it’ll be a lot of fun.

  • Tim

    Have you had a look at what they are doing with FluidDB and twitter. http://blogs.fluidinfo.com/fluidDB/2010/05/06/tunkrank-scores-added-to-fluiddb/

  • Hans

    I’d be interested in learning which nosql databases have non-blocking clients. (We’ve also done a little with Tornado and MongoDB, but haven’t used Tornado with anything other than memory stores or httpclient thus far.) Presumably, the MySQL client that comes bundled w/ Tornado is non-blocking?

    It *sounds* like the data you’re trying to work with here is actually fairly relational. But it also sounds like you have enough experience in the RDBMS camp to know what an RDBMS gives you that these NoSQL databases do not in that regard. I haven’t worked much with MongoDB; we used it on a project where we had a bunch of somewhat loosely structured data (or, rather, lots of contributors that had their own data formats); it was fantastic for that, but our query needs were really basic (not much more complex than pulling things about by primary key).

    One thing I was musing to a coworker is how it takes a differnet kind of discipline to work with the NoSQL store. Our “schemas” kept changing in the beginning (even now) and so we end up with a situation where old provided data had different formats than new data. I don’t really know much about migrating data in these stores, but suffice it to say that I’m much more familiar with the more rigid RDBMS paradigm.

    Anyway, fun decisions. I hope you get to try out a fun project. I’m definitely curious on the non-blocking side … Database access from non-blocking servers (Twisted, Tornado, etc.) seems like the White Elephant in the room.