Multisourced Production Infrastructure: History, and a stab at the Future

Startups are pretty fascinating. I work for a startup, and one of my good friends works for another startup. I’ve also worked for 2 other startups, one during the first “bubble”, and another one a few years later. Oh my, how the world of web startups has changed in that time!

1999: You must have funding

The first startup I was ever involved in was a web startup. It was an online retailer. They were starting from nothing. My friend (a former coworker from an earlier job) had saved for years to get this idea off the ground. He was able to get a few servers and some PCs for the developers he hired, he got the cheapest office space in all of NYC (though it still managed to be a really cool space, in a way that only NYC can pull off), and he hosted every single service required to run the web site in-house. If I recall correctly, he had a web and database server on one machine, and I believe the primary DNS server was on an old desktop machine he had lying around the house. This gave him the ability to build a complete, 100%-functional prototype and use it to shop for funding.

It worked. They got funding, and they bought more and bigger servers. They got UPSes for them (yay!), they got more cooling and a nicer office, and they launched the site, pretty much exactly as the prototype existed, and things were pretty stable. Unfortunately, the VCs who took seats on the board after the first round of financing didn’t understand the notion of “The Long Tail”, so the company eventually went under, but that’s not the point.

The point is, that was 8 or 9 years ago. It cost him quite a good bit of his hard-earned savings just to get to a place where he could build a prototype. A prototype! He only really knew Microsoft products, and buying licenses for Microsoft SQL Server and the developer tools (I forget what they were using as an IDE, but they were a ColdFusion shop) took quite a chunk of money. My friend really only had enough money to put together a prototype, and they were playing “beat the clock”: trying to get a prototype done, and shop for (and get) funding, before the money ran out, because they couldn’t afford the hardware, power, cooling, big-boy internet connection, and the rest of what goes into a production infrastructure. The Prototype->VC->Production methodology was pretty typical at the time.

2003: Generate Some Revenue

In 2003, a couple of years after the bubble burst, I was involved in another startup. This one was 100% self-funded, and has been rather successful since. By this time, dedicated hosting was just affordable enough that it was doable for a startup that had some revenue being generated, and that’s what my friend did. He also outsourced DNS completely (through his registrar, if memory serves), but he still hosted his own email, backup, and some other services in-house. He had plenty of hiccups and outages in the first year, but overall it ran pretty well considering all of the things he *didn’t* have to be concerned with, like power, cooling, internet uplinks, cable trays, etc. The world was becoming a friendlier place for startups.

2008: Do it all, on the cheap

Nowadays, the world is a completely different place for startups, and a lot of this is due to the rich set of free (or very cheap) resources available to startups that make it possible for them to do a production launch without the VC funding that used to be required just to get the hardware purchased.

In 2008 you can outsource DNS for relatively little money, and it’ll have the benefit of being globally distributed and redundant beyond what you’re likely to build yourself. You can get Google Apps to host your email and share calendars and documents. You can store backups on Amazon’s S3. You can use something like Eclipse, Komodo Edit, or a number of language-specific environments like Wing IDE or Zend Studio to do “real development” (whatever your definition of that is) for free or relatively cheap. You can also get a free database that is reasonably well-stocked with features and capabilities, a free web server that runs 65%+ of the internet sites in existence, and if you have the know-how (or can get it), you can actually host anything you want, including your entire production infrastructure (within reason, and subject to some caveats) on Amazon’s EC2, for a cost which is tied to what you use, which is cheaper in a lot of cases than either buying or leasing a dedicated server. Multisourcing has arrived!

In looking at this progression from “you must have funding”, to “you’re going to need to generate a little revenue”, to “do it all, on the cheap”, the really obvious question this all raises is:

“Now what?”

Well, this whole 2008 situation is making things better, but… how do I put this… “It’s not soup yet”.

First of all, there is no single platform where you can realistically do everything. Google’s AppEngine is nice, but it has its limitations. For example, you don’t have any control over the web servers that run it, so you can’t, say, add an Apache mod_rewrite rule, or use a 301 redirect, or process your log files, etc. Troubleshooting an application based solely on input from people who are having issues with it would be difficult.
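One caveat on the redirect example: as a commenter points out below, a redirect doesn’t strictly need web-server configuration, because the application itself can answer with a 301 status and a Location header. Here is a minimal sketch of that idea using plain WSGI from the Python standard library. It is not AppEngine’s own API, and the `REDIRECTS` table, the paths, and the `call` helper are all hypothetical names made up for illustration.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical mapping of old paths to new ones, for illustration only.
REDIRECTS = {"/old-page": "/new-page"}


def app(environ, start_response):
    """Answer moved paths with a 301 and a Location header; serve the rest."""
    path = environ.get("PATH_INFO", "/")
    target = REDIRECTS.get(path)
    if target is not None:
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


def call(path):
    """Invoke the app in-process and return (status, headers, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

Calling `call("/old-page")` exercises the redirect branch without starting a server. This covers redirects only; it says nothing about the log-file or troubleshooting concerns above.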

Amazon’s service gives you far more control, and if you need it, that’s great, but it completely changes how you architect a solution. I think that some of these things are good changes, and are things we should all be thinking about anyway, but Amazon forces you to make decisions about how to plan for failure from the moment you decide to go this route, even if it’s for prototyping, because until persistent storage on EC2 is a reality available to the entire user base, whenever an EC2 instance disappears, so does every bit of the data you added to it. You’ll have to start from scratch when you bring up another instance. You’re also going to have to add more scripts and utilities to your toolbelt to manage the instances. What happens when one disappears? How do you fail over to another instance? How can you migrate an IP address to the running instance from the failed one? How you do all of these things, in addition to just building and installing systems, is different, and that means a learning curve, and that means overhead in the form of time (and expense, since you have to pay for the Amazon services to use them to learn on).
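To make those questions a bit more concrete, here is a rough sketch of the shape a failover script tends to take: notice that the instance behind an address is gone, launch a replacement, rebuild its state from a backup kept off the instance, then repoint the address. Everything in it is an assumption for illustration. `FakeCloud` is an in-memory stand-in, not Amazon’s API, and the class, method, and function names are made up.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True
    data: dict = field(default_factory=dict)  # local state, lost with the instance


class FakeCloud:
    """In-memory stand-in for a provider API; every method here is hypothetical."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.instances = {}
        self.address_owner = {}  # public address -> instance id

    def launch(self):
        inst = Instance(f"i-{next(self._ids):04d}")
        self.instances[inst.instance_id] = inst
        return inst

    def assign_address(self, address, instance_id):
        self.address_owner[address] = instance_id


def restore_from_backup(instance, backup):
    """Rebuild local state from a backup that lives outside the instance."""
    instance.data = dict(backup)


def fail_over(cloud, address, backup):
    """If the instance behind `address` is missing or unhealthy, replace it."""
    current = cloud.instances.get(cloud.address_owner.get(address))
    if current is not None and current.healthy:
        return current  # nothing to do
    replacement = cloud.launch()
    restore_from_backup(replacement, backup)
    cloud.assign_address(address, replacement.instance_id)
    return replacement
```

The point of the sketch is the ordering: state comes back from somewhere other than the dead instance, and the address moves only after the replacement has been rebuilt. A real version would also need health checks, retries, and cleanup of the failed instance, which is exactly the extra tooling described above.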

There are also now “grid” solutions that virtualize and abstract all of your infrastructure, but give you familiar interfaces through which to manage them. One that I’ve used with some success is AppLogic, but other services like GoGrid and MediaTemple have offerings that emphasize different aspects of this niche “Infrastructure-as-a-service” market. Choose very carefully, and really think about what you’ll want to do with your infrastructure, how you want to manage and monitor it, and how you’ll deliver your application. Also think about how you’ll be flexible within the confines of a grid solution before you commit, because the gotchas can be kind of big and hairy.

None of these are whole solutions. However, any of them could, potentially, some day, become what we would now call a “perfect solution”. But it still wouldn’t be perfect in the eyes of the people who are building and deploying applications that are having to scale into realms known seemingly only inside some brain vault that says “Google” on it. What those of us outside of that vault would like is not only Google-like scalability, but:

  • global distribution, without having to pledge our souls in exchange for Akamai services. It’s great that I can build an infrastructure on EC2 or GoGrid, but I’d like to deploy it to 10 different geographic locations, but still control it centrally.
  • the ability to tightly integrate things like content delivery network (CDN) services with the rest of our infrastructure (because CDNs are great at serving, but not so much at metrics)
  • SAN-like (not NFS-like) access to all storage from any/all systems, without sacrificing the IO performance needed to scale a database properly.
  • As an admin, I want access to all logs from all services I outsource, no matter who hosts them. I don’t believe I can access, for example, our Google Apps logs, but maybe I’ve forgotten to click a tab somewhere.
  • A *RELATIONAL* database that scales like BigTable or SimpleDB

There’s more to it than this, even, but I’ve stopped short to make a point that needs making. Namely, that these are hard problems. These are problems that PhD candidates in computer science departments do research on. I understand that. The database issue is one that is of particular interest to me, and which I think is one of the hardest issues (not only because of its relationship to the storage issue, by the way). Data in the cloud, for the masses, as we’ve seen, involves making quite a few rather grandiose assumptions about how your schema looks. Since that’s not realistic, the alternative is to flatten the look of the data, and take a lot of the variables out of the equation, so they don’t have to make *any* assumptions about how you’ll use/organize the data. “Let’s make it not matter”. Genius, even if it causes me pain. But I digress…

The idea here is just to give some people a clue what direction (I think) people are headed in.

These are also very low-level wants. At a much, much, much higher level, I’d like to see one main, major thing happen with all of these services:

  • Get systems administrators involved in defining how these things are done

I’m not saying that because I want everything to stay the same and think a system administrator will be my voice in that or something. I do *NOT* want things to stay the same, believe me. I’m saying it because it seems pretty obvious to me that the people putting these things together are engineers, and not systems administrators. Engineers are the people you call when you want to figure out how to make something that is currently 16GB fit into 32MB of RAM. They are not the people you call when you want to provide a service/interface/grid/offering/whatever that allows systems folks to build what amounts to a company’s IT infrastructure on a grid/instance/App/whatever.

Here are a couple of examples:

When I first launched an AppLogic grid, a couple of things jumped out at me. The partitions on the components I launched were 90% full upon first boot, they had no swap partition, and there was no consistency between OS builds, so you can’t assume that a fix on one machine can be blown out via dsh or clusterssh to the rest. The components were designed to be as small as possible, so as to use as little of the user’s allotted resources as possible. In addition, mount points created in the GUI management interface and then mapped to a component… don’t cause the component to have any clue what you just did, which raises the question “umm… why did I bother using the GUI to map this thing to this component if I just have to edit /etc/fstab and mount it in the running instance myself anyway?” Back to consistency: this is unlike what happens when you, say, allocate more RAM or storage, or define a new network interface on the device in the GUI.

There is no part of EC2 or S3 that looks like a sysadmin was involved in that. It’s a programmer’s platform, from what I can tell. For programmers, by programmers. Luckily, I have enough background in programming that I kind of “get it”, but while I might be able to convince myself that there are parallels between how I approach programming and building infrastructures, it still represents a non-trivial context switch for me to move from working deeply at one to working deeply at the other, so mixing the two as a necessity for moving forward is less than nice.

There is no “database in the cloud” service that looks remotely like there was a database systems person involved at all, that I can tell. I’ll confess to not having used BigTable or SimpleDB, but the reason is that I can’t figure out how to make them useful to me at the moment. These tools are not relational, and my data, though it’s been somewhat denormalized for scaling purposes (compromises taken carefully and begrudgingly – I’d rather change database products, but it’s not in the cards), is nonetheless relational. I’ve considered all kinds of object schemes for data in the past, and I still think that there’s some data for which that can work well, but it’s not currently a solution for me. Once you look at the overhead in managing something like EC2, S3, AppLogic, etc., the very last thing you need is the overhead of a changing data storage/manipulation paradigm.

Should I be hiring systems folks, or developers? Both? Ugh. Just when I thought you could finally get away with building a startup with nothing more than an idea, a sysadmin and a coder, here they go roping me back into hiring a team of developers… to manage the systems… and the data. No good (and I mean *NO GOOD*) can come of developers managing data. I know, I’ve seen ‘em do it.

All of that said, I use all of this stuff. Multisourcing is here to stay – at least until someone figures a whole bunch of stuff out to make unisourcing a viable alternative for systems folks, or they collectively redefine what a “systems person” is, which is an extremely real possibility, but is probably quite a ways off. My $.02. Flame at will ;-)

  • http://taint.org Justin Mason

    Great post, thanks!

    To be honest, I find EC2 much more of a sysadmin-friendly platform than the “pure code” approach of Google AppEngine, where there isn’t even a command line.

    In my experience as a sysadmin (which I was for a good chunk of the ’90s), the experience of rolling out UNIX deployments maps pretty closely to rolling out a grid of EC2 instances — because they _are_ just Linux instances, after all. That’s pretty sysadmin-friendly in my book.

  • m0j0

    Hi Justin,

    While I’ll agree that EC2 is more admin-friendly than AppEngine, I don’t think EC2 really makes it a ‘no-brainer’ for a sysadmin either. Rollout is only a tiny fraction of what a sysadmin does, of course, and you don’t address how EC2 has made it really easy to perform ongoing management, maintenance, disaster recovery, backup, etc. I would guess that the reason for this is… well… that the model for doing these things isn’t exactly “same old same old”.

    I’m really excited about working within a computing paradigm that has yet to clearly define itself and is therefore subject to constant, drastic change. I’m also hoping that persistent storage for EC2 makes its way to the rest of the user community sooner rather than later, because I think that solves *one* of the problems with managing EC2, and can make probably *all* of the rest of the issues a little bit easier to deal with. Static IPs were also nice, but still left a couple of issues unsolved. From what I’ve seen, EC2 seems to me to be the frontrunner out of all of these solutions if you’re coming from a sysadmin background, and it’s progressing rapidly toward being a no-brainer, but it’s not soup yet, and issues like global distribution and tighter integration amongst the various services one might use are still up in the air.

    I hate to sound like I’m complaining, because I’m really more interested in talking about how these solutions might look or where they might come from. I’m just trying to start a dialog here that I haven’t seen discussed elsewhere.

    Thanks for posting!!

  • http://scottj.info/ Scott Johnson

    Regarding 301s and AppEngine: It can be done. The program running on AppEngine is handed a request. It is the responsibility of that program to provide a response. That response can have a code of 301 and a Location header. But that’s just a minor detail. AppEngine is positioned to be a major competitor to EC2. Especially as new features are added to it.

    Also, you mentioned that in 2008 you can get redundant, globally-distributed DNS for “relatively little money”. Care to share where one can get this?

  • m0j0

    Hi Scott,

    First, regarding DNS, the problem isn’t so much getting distributed, redundant DNS, it’s getting it in the numbers you need it. I think the providers have their services set up wrong, but that’s a problem I don’t have the entrepreneurial spirit to solve at the moment :)

    Anyway, you can get something like 100,000 requests for $100 per month from ultradns. You can get it from securityspace.com for less, and there’s nothing on their site about number of requests that I’ve found, but it can be had. Some dedicated hosting providers also let you use their dns, but I’ve never ventured down that path, because I haven’t found one that gives you the control I’d like to have (some require a support ticket to add/change a DNS record – blech).

    As for 301s, it was kind of an example, and a bad one, as you point out. Maybe everything that any Apache module can do can be emulated in app code.

    I don’t doubt that AppEngine changes the game, and to that extent, it can be a competitor by creating a new playing field, which is The Google Way. But I think they target two very different types of development teams and systems groups. It’s going to be interesting to see how it plays out. I think system administration will be forever changed as a result of these disruptive paradigms, and I think it’s long overdue, to be honest.

  • http://blog.gogrid.com Michael Sheehan

    This is one of the more insightful posts that I have come across that truly articulates the pain that startups went through (and continue to go through but in different ways) while discussing it from the technology perspective. I too have been through a few startup booms and busts and you have nailed it in terms of the shift from “get lots of cash” to “get a technologically sound product that has an actual business model”.

    I see that you mention GoGrid alongside EC2, Media Temple, AppLogic and the like. First off, I’m the guy tasked with evangelizing the GoGrid technology so I’m pleased that you and others have realized what a powerful offering GoGrid brings to startups.

    To dive into the tech a bit, it is rather fascinating to read how people view these offerings as purely for sys-admins or developers and that they might be conceived as being mutually exclusive. In my mind, whichever technology solution you decide on should be one that completes the 80/20 rule, but for all parties involved, with 80 meeting the needs and 20 they couldn’t care less about. GoGrid tries to fill holes that other providers may have (e.g., root/admin access, full Win/Linux images, persistence, free load balancing). Obviously, there may be deficiencies with any offering but all the different providers are working to fill those holes.

    I’m curious about the “gotchas” that you talk about. Any product has gotchas, thus the idea about 80/20…hopefully 80 percent of your needs will be met.

    I encourage anyone who is interested in looking at GoGrid as a fundamental offering for startups to contact me (through the GoGrid blog) as I can help evaluate needs and clarify options. What might be good for one, won’t be good for another.

    Again, thanks for the comprehensive read!

    -Michael

  • http://www.3tera.com Peter

    m0j0,

    thanks for the great post – it is a good overview of where we as an industry are and how we got there. And more importantly, what you can do today if you need to get a service up. Some problems are solved, others are in the works (and maybe some are still overlooked).

    BTW, also thanks for all your input on AppLogic (you’re in the top 5 posters). Just a few notes: we’re working on a global application control panel (allowing you to have a common view between all your apps, no matter where they are, and move them around as you please); we have, in fact, added auto-mounting of new volumes (straight per your request). We still leave little disk space and, by default, have no swap (except in the VPS templates) — but have added a simple and safe way to resize volumes.

    While we are a ways from BigTable-like scalability for relational databases, clustered MySQL, with a little bit of help and some understanding, should be able to take people from start to at least some level of scaling. (We’ll be looking to see what the PhDs, or the open source community, turn out in the way of more scalable databases. It is the 21st century, after all!)

    Scott, how is AppEngine competitive to EC2? They appear to provide completely different levels of abstraction… this is a bit like saying that PDAs compete with desktops (i.e., while to a degree you can do some things equally well on both, they are still apples and oranges). AppEngine provides an application stack (and in the future, more than one); EC2 provides VMs on demand… Now, if you are saying that AppEngine is more attractive to developers than EC2 because they don’t have to deal with infrastructure, images, /etc/fstabs, etc., I would agree. However, sysadmins are unlikely to get on the AppEngine bandwagon — or at least I haven’t seen anything that will make them do that.

    Regards,
    – Peter