I watched with interest this weekend as Amazon S3 went down yet again, and I thought to myself, "there but for the grace of God go I."
My company is currently developing a cloud-based data service called KloudShare. Although KloudShare is basically unrelated to what S3 does, and probably has more in common with Google's Bigtable or Amazon SimpleDB, they are all data services, so the outage got me thinking about how one might architect systems to avoid such messes. People might forgive Amazon, but I don't think a tiny startup like mine is going to get the same latitude.
First, it is probably instructive to look at Amazon's computing service, the Elastic Compute Cloud (EC2). We have been using EC2 and have had an instance running for well over six months (probably approaching a year) without a failure. So it is clear that there are ways to design really massive systems that do not have a single choke point. Amazon gets it right with EC2 and less so with S3.
The big question is why S3 is structured in such a way that so many of its problems seem to bring the entire system down. Things inevitably fail. You cannot avoid it no matter how smart you think you are. But what you can often do is limit the collateral damage of failure by compartmentalizing your design. Clearly EC2 instances are quite compartmentalized.
While the data in S3 is clearly stored across separate, distinct systems, I would imagine that security and access rules, and perhaps other elements, are centralized, though I really have no idea what their internal architecture is like. What is clear is that as all of us in the cloud computing business think about our designs, sharding or federating every service within the cloud into separate operational silos is critical. As best we can, we must avoid allowing one failure somewhere to bring down the whole system. Strategies to keep failure localized are critical.
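To make the idea of operational silos concrete, here is a minimal sketch in Python. It is not how S3 or KloudShare is built; the names Shard, ShardedStore and ShardUnavailable are hypothetical, and the in-memory dictionaries stand in for real storage. The assumption it illustrates is that each shard keeps its own access rules next to its data, so an outage in one shard only affects the keys that shard owns.

```python
import hashlib


class ShardUnavailable(Exception):
    """Raised when the shard that owns a key is down (hypothetical error type)."""


class Shard:
    """One operational silo: holds its own data AND its own access rules."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.data = {}  # key -> stored value
        self.acl = {}   # key -> owner; kept per shard, not in a central service

    def put(self, key, value, owner):
        if not self.healthy:
            raise ShardUnavailable(self.name)
        self.data[key] = value
        self.acl[key] = owner

    def get(self, key, user):
        if not self.healthy:
            raise ShardUnavailable(self.name)
        if self.acl.get(key) != user:
            raise PermissionError(key)
        return self.data[key]


class ShardedStore:
    """Routes each key to exactly one shard using a stable hash of the key."""

    def __init__(self, shard_count):
        self.shards = [Shard(f"shard-{i}") for i in range(shard_count)]

    def shard_for(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def put(self, key, value, owner):
        self.shard_for(key).put(key, value, owner)

    def get(self, key, user):
        return self.shard_for(key).get(key, user)
```

If one shard is marked unhealthy, reads for keys on the other shards still succeed, because no request touches more than one shard. The router is the one shared piece here; since it is a pure function of the key, every client can compute it locally rather than calling a central service.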
One of the keys in our design has been replication, and eschewing what is known in the database world as normalization. In a normalized database design you are very careful not to store data in more than one place. You want to reference data in its existing place rather than replicate it everywhere. We avoid the principles of normalization because it is impossible to provide massively scalable systems that are normalized. But what we had never considered is that our "anti-normalization" design principle also relates to stability of design.
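A small sketch of the difference, in Python. This is an illustration under assumptions, not our actual design: UserDirectory is a hypothetical central table, and the document fields are made up. In the normalized read, every request depends on the central directory; in the denormalized version, the owner's name is copied into the document when it is written, so reads keep working while the directory is down.

```python
class Unavailable(Exception):
    """Raised when the central directory cannot be reached (hypothetical)."""


class UserDirectory:
    """Hypothetical central table holding the single, normalized copy of user names."""

    def __init__(self):
        self.names = {}
        self.up = True

    def lookup(self, user_id):
        if not self.up:
            raise Unavailable("user directory is down")
        return self.names[user_id]


def read_normalized(doc, directory):
    # Normalized: the document stores only a reference (owner_id), so every
    # read depends on the central directory being reachable.
    return {"title": doc["title"], "owner": directory.lookup(doc["owner_id"])}


def write_denormalized(title, user_id, directory):
    # Denormalized: copy the owner's name into the document at write time.
    return {"title": title, "owner_id": user_id, "owner": directory.lookup(user_id)}


def read_denormalized(doc):
    # No lookup needed: the document carries everything the read requires.
    return {"title": doc["title"], "owner": doc["owner"]}
```

The price of the copy is that it can go stale: if a user's name changes, every document that carries it has to be updated separately.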
I think what the S3 issue demonstrates is that distributed design is critical not just for performance but for reducing the impact of failure. Of course, I am not saying that we have figured all of this out yet, and I suspect that without more thought we too still have vulnerabilities in our design of the type that brought Amazon down. So this is not an attack on Amazon but, as I see it, a teachable moment for all of us working on how to bring the real vision of the cloud to the world. And while it is, of course, impossible to avoid all centralized services in a cloud architecture, Amazon is clearly demonstrating the critical importance of limiting your dependence on them.
Monday, July 21, 2008

7 comments:
"The big question is why S3 is structured in such a way that so many problems they have seem to bring the entire system down."
Um, because S3 is a service and EC2 is simply a data center? If you wanted to compare apples to apples, you would compare the actual EC2 web service API to S3, but even that's not a good comparison, seeing as how the EC2 API is a simple thing, isn't doing the kind of work S3 does, and certainly isn't anywhere near as loaded as S3.
So you have some machines that haven't failed for 6 months. That's great. S3 isn't a bunch of machines. It's a singular service which can fail for a variety of reasons, none of which actually require machine failure, network failure, etc.
That's why the two services have different characteristics. They're completely different things.
"again"?
You are totally right.
A point I keep making is that every web service you depend on is part of a product chain for your SLA. For example, if you depend on two services that each have an SLA of 99.9%, then the combined uptime across the two of them is only about 99.8%.
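The arithmetic behind that figure can be sketched in a few lines of Python. This assumes the services fail independently and that you need all of them up at once; the function names are made up for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every service must be up at once.

    Assumes failures are independent, so the probabilities multiply.
    """
    return prod(availabilities)


def downtime_hours_per_year(availability, hours_per_year=24 * 365):
    """Expected hours of downtime per (non-leap) year for a given availability."""
    return (1 - availability) * hours_per_year


combined = serial_availability([0.999, 0.999])
print(f"combined availability: {combined:.4%}")
print(f"expected downtime: {downtime_hours_per_year(combined):.1f} hours/year")
```

Two 99.9% services multiply out to 0.998001, or roughly 17.5 hours of expected downtime a year, against about 8.8 hours for either service on its own. Each additional dependency in the chain lowers the product further.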
Things go down. I've been developing sites on the web for years. Key network paths go down, ISPs go down, server facilities have trucks run into their A/C units. Nothing is perfect, and yet we all expect a service that costs 1/10 of what others do to outperform them.
We expect 100% uptime, yet Amazon hasn't even achieved that with their own site. Everyone is trying to hold them to a higher standard than the one they've set.
Jason,
I'm not trying to hold anybody to any standard. The question is whether, architecturally, there are ways to compartmentalize failure. I do not think that you can read this article as bashing Amazon. In fact, given the first sentence, it is amazing that you could read it that way. That said, I *do* think it is possible to design an S3 system where failure is much less likely to impact the entire system all at once. This is not about cost of operation, but about system design.
Hal,
EC2, as you eventually note, is not just a data center. But that would be irrelevant anyway. The real question is whether every single storage system in S3 must be deeply connected to every other. I don't think it is really necessary. I think that you *can* compartmentalize the storage into much smaller units that are very weakly interconnected. As I said, this is not a bash of Amazon, but a great example of what we in the cloud computing business need to be thinking about. And I am.
Despite what many pundits have to say, reliability issues will not be the downfall of cloud computing. Using cloud computing does not mean you can neglect to architect solutions that meet your business requirements, including reliability requirements.
I wrote more about this idea here:
Cloud Computing and Reliability
http://faseidl.com/public/item/212584