Multisourced Production Infrastructure: History, and a stab at the Future

Startups are pretty fascinating. I work for a startup, and one of my good friends works for another startup. I’ve also worked for 2 other startups, one during the first “bubble”, and another one a few years later. Oh my, how the world of web startups has changed in that time!

1999: You must have funding

The first startup I was ever involved in was a web startup. It was an online retailer. They were starting from nothing. My friend (a former coworker from an earlier job) had saved for years to get this idea off the ground. He was able to get a few servers, some PCs for the developers he hired, and he got the cheapest office space in all of NYC (but it still managed to be a really cool space, in a way that only NYC can pull off), and he hosted every single service required to run the web site in-house. If I recall correctly, he had a web and database server on one machine, and I believe the primary DNS server was on an old desktop machine he had lying around the house. This gave him the ability to build a complete, 100%-functional prototype, and use it to shop for funding.

It worked. They got funding, they bought more and bigger servers. They got UPSes for them (yay!), they got more cooling, a nicer office, and they launched the site, pretty much exactly as the prototype existed, and things were pretty stable. Unfortunately, the VCs who took seats on the board after the first round of financing didn’t understand the notion of “The Long Tail”, so the company eventually went under, but that’s not the point.

The point is, that was 8 or 9 years ago. It cost him quite a good bit of his hard-earned savings just to get to a place where he could build a prototype. A prototype! He only really knew Microsoft products, and buying licenses for Microsoft SQL Server and the developers’ tools (I forget what they were using as an IDE, but they were a ColdFusion shop) was quite a chunk of money. My friend really only had enough money to put together a prototype, and they were playing “beat the clock”: trying to get a prototype done, and shop for (and get) funding, before the money ran out, because they couldn’t afford the hardware, power, cooling, big-boy internet connection, and the rest of what goes into a production infrastructure. The Prototype->VC->Production methodology was pretty typical at the time.

2003: Generate Some Revenue

In 2003, a couple of years after the bubble burst, I was involved in another startup. This one was 100% self funded, but has been rather successful since. By this time, dedicated hosting was just affordable enough that it was doable for a startup that had some revenue being generated, and that’s what my friend did. He also outsourced DNS completely (through his registrar, if memory serves), but he still hosted his own email, backup, and some other services in-house. He had plenty of hiccups and outages in the first year, but overall it ran pretty well considering all of the things he *didn’t* have to be concerned with, like power, cooling, internet uplinks, cable trays, etc. The world was becoming a friendlier place for startups.

2008: Do it all, on the cheap

Nowadays, the world is a completely different place for startups, and a lot of this is due to the rich set of free (or very cheap) resources available to startups that make it possible for them to do a production launch without the VC funding that used to be required just to get the hardware purchased.

In 2008 you can outsource DNS for relatively little money, and it’ll have the benefit of being globally distributed and redundant beyond what you’re likely to build yourself. You can get Google Apps to host your email and share calendars and documents. You can store backups on Amazon’s S3. You can use something like Eclipse, Komodo Edit, or a number of language-specific environments like Wing IDE or Zend Studio to do “real development” (whatever your definition of that is) for free or relatively cheap. You can also get a free database that is reasonably well-stocked with features and capabilities, a free web server that runs 65%+ of the internet sites in existence, and if you have the know-how (or can get it), you can actually host anything you want, including your entire production infrastructure (within reason, and subject to some caveats) on Amazon’s EC2, for a cost which is tied to what you use, which is cheaper in a lot of cases than either buying or leasing a dedicated server. Multisourcing has arrived!

In looking at this progression from “you must have funding”, to “you’re going to need to generate a little revenue”, to “do it all, on the cheap”, the really obvious question this all raises is:

“Now what?”

Well, this whole 2008 situation is making things better, but… how do I put this… “It’s not soup yet”.

First of all, there is no single platform where you can realistically do everything. Google’s AppEngine is nice, but it has its limitations. For example, you don’t have any control over the web servers that run it, so you can’t, say, add an Apache mod_rewrite rule, or use a 301 redirect, or process your log files, etc. Troubleshooting an application based solely on input from the people who are having issues with it would be difficult.

Amazon’s service gives you far more control, and if you need it, that’s great, but it completely changes how you architect a solution. I think that some of these things are good changes, and are things we should all be thinking about anyway, but Amazon forces you to make decisions about how to plan for failure from the moment you decide to go this route, even if it’s for prototyping, because until persistent storage on EC2 is a reality available to the entire user base, whenever an EC2 instance disappears, so does every bit of the data you added to it. You’ll have to start from scratch when you bring up another instance. You’re also going to have to add more scripts and utilities to your toolbelt to manage the instances. What happens when one disappears? How do you fail over to another instance? How can you migrate an IP address to the running instance from the failed one? How you do all of these things, in addition to just building and installing systems, is different, and that means a learning curve, and that means overhead in the form of time (and expense, since you have to pay for the Amazon services to use them to learn on).
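To make those failover questions a little more concrete, here is a minimal sketch of the kind of script you end up adding to the toolbelt. It is only a simulation in plain Python: `Instance`, `AddressTable`, and `fail_over` are names I made up for illustration, and none of it calls the real EC2 API. It also assumes that something else is already doing the health checks and rebuilding the replacement instance from scratch, since the failed instance's data is gone.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Instance:
    """Hypothetical stand-in for a running instance (not an EC2 API object)."""
    instance_id: str
    healthy: bool = True


class AddressTable:
    """Hypothetical record of which instance answers for each public address."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}

    def assign(self, address: str, instance_id: str) -> None:
        self._owner[address] = instance_id

    def owner(self, address: str) -> Optional[str]:
        return self._owner.get(address)


def fail_over(address: str, pool: List[Instance], table: AddressTable) -> Optional[str]:
    """Keep the address where it is if its owner is healthy; otherwise move it.

    Returns the id of the instance that owns the address afterwards,
    or None if no healthy instance is available.
    """
    current = table.owner(address)
    by_id = {inst.instance_id: inst for inst in pool}

    # Nothing to do while the current owner is still up.
    if current in by_id and by_id[current].healthy:
        return current

    # Otherwise hand the address to the first healthy instance in the pool.
    for candidate in pool:
        if candidate.healthy and candidate.instance_id != current:
            table.assign(address, candidate.instance_id)
            return candidate.instance_id

    return None


if __name__ == "__main__":
    pool = [Instance("web-1"), Instance("web-2")]
    table = AddressTable()
    table.assign("203.0.113.10", "web-1")

    pool[0].healthy = False  # pretend web-1 just disappeared
    print(fail_over("203.0.113.10", pool, table))
```

In real life the `assign` step would be whatever your provider offers for remapping an address, and the hard part is everything this sketch skips: noticing the failure quickly, and getting the replacement instance back into a usable state.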

There are also now “grid” solutions that virtualize and abstract all of your infrastructure, but give you familiar interfaces through which to manage them. One that I’ve used with some success is AppLogic, but other services like GoGrid and MediaTemple have offerings that emphasize different aspects of this niche “Infrastructure-as-a-service” market. Choose very carefully, and really think about what you’ll want to do with your infrastructure, how you want to manage it, monitor it, in addition to how you’ll deliver your application, and also think about how you’ll be flexible within the confines of a grid solution before you commit, because the gotchas can be kind of big and hairy.

None of these are whole solutions. However, any of them could, potentially, some day, become what we would now call a “perfect solution”. But it still wouldn’t be perfect in the eyes of the people who are building and deploying applications that are having to scale into realms known seemingly only inside some brain vault that says “Google” on it. What those of us outside of that vault would like is not only Google-like scalability, but:

Global distribution, without having to pledge our souls in exchange for Akamai services. It’s great that I can build an infrastructure on EC2 or GoGrid, but I’d like to deploy it to 10 different geographic locations and still control it centrally.
The ability to tightly integrate things like content delivery network services with the rest of our infrastructure (because CDNs are great at serving, but not so much at metrics).
SAN-like (not NFS-like) access to all storage from any/all systems, without sacrificing the IO performance needed to scale a database properly.
Access, as an admin, to all logs from all services I outsource, no matter who hosts them. I don’t believe I can access, for example, our Google Apps logs, but maybe I’ve forgotten to click a tab somewhere.
A *RELATIONAL* database that scales like BigTable or SimpleDB.

There’s more to it than this, even, but I’ve stopped short to make a point that needs making. Namely, that these are hard problems. These are problems that PhD candidates in computer science departments do research on. I understand that. The database issue is one that is of particular interest to me, and which I think is one of the hardest issues (not only because of its relationship to the storage issue, by the way). Data in the cloud, for the masses, as we’ve seen, involves making quite a few rather grandiose assumptions about how your schema looks. Since that’s not realistic, the alternative is to flatten the look of the data, and take a lot of the variables out of the equation, so they don’t have to make *any* assumptions about how you’ll use/organize the data. “Let’s make it not matter”. Genius, even if it causes me pain. But I digress…

The idea here is just to give some people a clue what direction (I think) people are headed in.

These are also very low-level wants. At a much, much, much higher level, I’d like to see one main, major thing happen with all of these services:

Get systems administrators involved in defining how these things are done

I’m not saying that because I want everything to stay the same and think a system administrator will be my voice in that or something. I do *NOT* want things to stay the same, believe me. I’m saying it because it seems pretty obvious to me that the people putting these things together are engineers, and not systems administrators. Engineers are the people you call when you want to figure out how to make something that is currently 16GB fit into 32MB of RAM. They are not the people you call when you want to provide a service/interface/grid/offering/whatever that allows systems folks to build what amounts to a company’s IT infrastructure on a grid/instance/App/whatever.

Here are a couple of examples:

When I first launched an AppLogic grid, a couple of things jumped out at me. The partitions on the components I launched were 90% full upon first boot, they had no swap partition, and there was no consistency between OS builds, so you can’t assume that a fix on one machine can be blown out via dsh or clusterssh to the rest. The components were designed to be as small as possible, so as to use as little of the user’s allotted resources as possible. In addition, mount points created in the GUI management interface and then mapped to a component… don’t cause the component to have any clue what you just did, which raises the question “umm… why did I bother using the GUI to map this thing to this component if I just have to edit /etc/fstab and mount it in the running instance myself anyway?” Back to consistency: this is unlike what happens if you, say, allocate more RAM or storage, or define a new network interface on the device in the GUI.

There is no part of EC2 or S3 that looks like a sysadmin was involved in it. It’s a programmer’s platform, from what I can tell. For programmers, by programmers. Luckily, I have enough background in programming that I kind of “get it”, but while I might be able to convince myself that there are parallels between how I approach programming and building infrastructures, it still represents a non-trivial context switch for me to move from working deeply at one to working deeply at the other, so mixing the two as a necessity for moving forward is less than nice.

There is no “database in the cloud” service that looks remotely like there was a database systems person involved at all, that I can tell. I’ll confess to not having used BigTable or SimpleDB, but the reason is that I can’t figure out how to make them useful to me at the moment. These tools are not relational, and my data, though it’s been somewhat denormalized for scaling purposes (compromises taken carefully and begrudgingly – I’d rather change database products, but it’s not in the cards), is nonetheless relational. I’ve considered all kinds of object schemes for data in the past, and I still think that there’s some data for which that can work well, but it’s not currently a solution for me. Once you look at the overhead in managing something like EC2, S3, AppLogic, etc., the very last thing you need is the overhead of a changing data storage/manipulation paradigm.
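For anyone wondering what that paradigm change looks like in practice, here is a rough sketch. The first half asks a relational database (SQLite, from Python's standard library) for per-customer order totals with a join. The second half answers the same question against flat items held in plain dictionaries, which stand in for a non-relational store. To be clear, the dictionaries are not the BigTable or SimpleDB API, and the table names, attribute names, and string-typed values are all made up for illustration.

```python
import sqlite3

# Relational version: the database does the join and the aggregation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0), (12, 2, 40.0);
""")
relational_totals = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# Flat version: hypothetical items keyed by id, each a bag of attributes.
# Nothing in the store knows that an order's "customer" points at a customer.
customer_items = {"1": {"name": "Ann"}, "2": {"name": "Bob"}}
order_items = {
    "10": {"customer": "1", "total": "25.0"},
    "11": {"customer": "1", "total": "5.0"},
    "12": {"customer": "2", "total": "40.0"},
}


def totals_by_customer(customers, orders):
    """Do the join and the sum in application code instead of in the database."""
    totals = {}
    for order in orders.values():
        name = customers[order["customer"]]["name"]
        totals[name] = totals.get(name, 0.0) + float(order["total"])
    return sorted(totals.items())


flat_totals = totals_by_customer(customer_items, order_items)

if __name__ == "__main__":
    print(relational_totals)
    print(flat_totals)
```

Both halves produce the same answer. The difference is who owns the relationship: in the first the schema does, in the second your code does, along with every other query you ever want to ask.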

Should I be hiring systems folks, or developers? Both? Ugh. Just when I thought you could finally get away with building a startup with nothing more than an idea, a sysadmin and a coder, here they go roping me back into hiring a team of developers… to manage the systems… and the data. No good (and I mean *NO GOOD*) can come of developers managing data. I know, I’ve seen ’em do it.

All of that said, I use all of this stuff. Multisourcing is here to stay – at least until someone figures a whole bunch of stuff out to make unisourcing a viable alternative for systems folks, or they collectively redefine what a “systems person” is, which is an extremely real possibility, but is probably quite a ways off. My $.02. Flame at will 😉
