Wednesday, May 21, 2008

A Detailed Five Step Twitter Scaling Plan

I don't want to suggest that the Twitter team doesn't know what they are doing; I am sure they are, and always have been, very smart folks.

And yet, the system keeps falling down (it seemed to be down most of yesterday), and there really is no good reason I can imagine for this to keep happening. I would offer myself up to help them solve the problem, but I suspect they probably have folks smarter than me working on it.

I say all this because I am sorry if I am insulting your intelligence, Twitter engineering team. It is not my intent. But I am going to suggest a simple structure that would help solve your problem. If you have already thought of it, or are implementing something better, Godspeed.

The essence of the concept is in the following diagram:

[Diagram: users are split into clusters; each cluster has its own "sent tweets" database, which fans each tweet out to the "received tweets" databases of the recipients' clusters.]

1. Automated CPU provisioning
First, in our work at Kloudshare we use Amazon Web Services (AWS). You do not have to use AWS, but what you really *do* need is the ability to provision resources more or less on demand, or at least to have enough extra server capacity lying around when you need it. The key is automated software that detects certain kinds of load and brings on more computing capacity when needed. So, ideally, you should be able to provision CPUs on demand in software.
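
To make that concrete, here is a minimal sketch (in Java) of the kind of load-watching loop I have in mind. The ServerProvisioner interface and the thresholds are placeholders for whatever your hosting provider actually offers, not a real AWS API; the point is only that the decision to add capacity is made by software rather than by a person watching graphs.

    // Minimal sketch of automated capacity provisioning. The provisioning API is hypothetical.
    import java.util.ArrayList;
    import java.util.List;

    interface ServerProvisioner {                // stand-in for whatever your host exposes
        String launchServer();                   // returns an identifier for the new machine
        void terminateServer(String serverId);
        double averageLoad();                    // e.g. average CPU utilization across the fleet, 0.0 to 1.0
    }

    class AutoScaler {
        private static final double SCALE_UP_AT = 0.75;    // illustrative thresholds
        private static final double SCALE_DOWN_AT = 0.25;

        private final ServerProvisioner provisioner;
        private final List<String> servers = new ArrayList<>();

        AutoScaler(ServerProvisioner provisioner) {
            this.provisioner = provisioner;
        }

        // Called periodically (say, once a minute) by a scheduler.
        void tick() {
            double load = provisioner.averageLoad();
            if (load > SCALE_UP_AT) {
                servers.add(provisioner.launchServer());    // add capacity before users notice
            } else if (load < SCALE_DOWN_AT && servers.size() > 1) {
                provisioner.terminateServer(servers.remove(servers.size() - 1));
            }
        }
    }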

2. Denormalization
The critical concept behind most scaling is denormalization. If you are a database guy/gal you already know what that means, but if you are not I will just explain it in the context of Twitter. In my suggested architecture, every user's outbound tweets should be stored in a separate database from their inbound tweets. So when I look for all the stuff I have sent, it is in one place. When I look at all the things I have received, it is in another place.

In a typical normalized database you would want to store every message only once, not in two places as I suggest above. Perfectly normalized databases may be "cleaner", but they cannot scale. Database purists often speak of the merits of normalization, and certainly some normalization is generally necessary. But in the new database world, a perfectly normalized database is an unscalable database.
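
To make the write path concrete, here is a minimal sketch in Java. The TweetStore and TweetWriter names are invented for illustration; the only point is that the same tweet is deliberately written in more than one place: once into the sender's "sent" store and once into each recipient's "received" store.

    // Sketch of a denormalized write: one tweet is stored in several places on purpose.
    // TweetStore is a made-up abstraction over whatever database holds a user's tweets.
    import java.util.List;

    interface TweetStore {
        void append(long userId, String tweet);   // append a tweet to this user's list
    }

    class TweetWriter {
        private final TweetStore sentStore;       // holds each user's outbound tweets
        private final TweetStore receivedStore;   // holds each user's inbound tweets

        TweetWriter(TweetStore sentStore, TweetStore receivedStore) {
            this.sentStore = sentStore;
            this.receivedStore = receivedStore;
        }

        void post(long senderId, String tweet, List<Long> followerIds) {
            sentStore.append(senderId, tweet);            // one copy in the sender's "sent" database
            for (long followerId : followerIds) {
                receivedStore.append(followerId, tweet);  // one copy per follower, redundant on purpose
            }
        }
    }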

3. Sharding
In the above diagram you see that we have separated users into clusters. So we have not just stored "sent tweets" in a different database from "received tweets", we have broken each of these storage types into smaller groups of users. In our simple example, we store user data in clusters of 100 users. This is called sharding.

The purpose of this sharding is that when a user sends a tweet, s/he hits no bottleneck in writing that data, because only 100 other users are trying to write to that physical server, so it is very hard to overload. That server is then responsible for writing the sent tweet to all the users who subscribe to the given sender's tweets, as well as to the @ recipients (Twitter jargon, for those of you not yet initiated).

The beauty of this is that no database holding "received tweets" is overburdened either. It only receives the messages aimed at it, and it is only queried by the users whose inbound tweets are stored on that server.

What is great about this scheme is that it takes the burden off any single physical server and it scales out more or less without limit. You could have a hundred or 10,000 such servers, and performance scales close to linearly with the addition of physical machines.
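
Here is a minimal sketch of the routing piece, assuming the simple "clusters of 100 users" numbering from the example above. In a real system the user-to-shard mapping would live in a lookup table (see step 5) rather than in arithmetic, but the idea is the same.

    // Sketch of user-to-shard routing for the "clusters of 100 users" example.
    // The host arrays and the users-per-shard constant are illustrative only.
    class ShardRouter {
        private static final int USERS_PER_SHARD = 100;    // real shards would hold far more users

        private final String[] sentShardHosts;             // host for each "sent tweets" shard
        private final String[] receivedShardHosts;         // host for each "received tweets" shard

        ShardRouter(String[] sentShardHosts, String[] receivedShardHosts) {
            this.sentShardHosts = sentShardHosts;
            this.receivedShardHosts = receivedShardHosts;
        }

        String sentShardFor(long userId) {
            return sentShardHosts[(int) (userId / USERS_PER_SHARD) % sentShardHosts.length];
        }

        String receivedShardFor(long userId) {
            return receivedShardHosts[(int) (userId / USERS_PER_SHARD) % receivedShardHosts.length];
        }
    }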

4. Shard Splitting
To be able to add CPUs smoothly, you have to be able to do something I call shard splitting. In the example above, every database holds data for exactly 100 users. But in the real world each database should continue to grow until it starts to become overburdened. Then you "split the shard" into two databases. Each shard should know how to "split itself" when it becomes too busy. So if a database had gotten too slow holding the inbound tweets for 500 users, after the shard split there are two databases that each hold inbound tweets for 250 users. This self-splitting of shards allows for a very organic and automated kind of scaling.
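
Here is a rough sketch of what "split itself" might mean in code. The Shard class and its threshold are invented for illustration; the real work, copying data between physical servers and updating the lookup table described in step 5, is where the engineering effort goes.

    // Sketch of a self-splitting shard. A real split would also copy the affected users'
    // tweets to the new server and repoint the user-lookup table before taking traffic.
    import java.util.ArrayList;
    import java.util.List;

    class Shard {
        private static final int MAX_USERS = 500;    // illustrative "too busy" threshold

        private final List<Long> userIds = new ArrayList<>();

        void addUser(long userId) {
            userIds.add(userId);
        }

        boolean isOverloaded() {
            return userIds.size() > MAX_USERS;        // a real check would look at latency or load, not just a count
        }

        // Move the second half of this shard's users onto a new shard and return it.
        Shard split() {
            Shard newShard = new Shard();
            int half = userIds.size() / 2;
            List<Long> moving = new ArrayList<>(userIds.subList(half, userIds.size()));
            userIds.subList(half, userIds.size()).clear();
            for (long userId : moving) {
                newShard.addUser(userId);
            }
            return newShard;
        }
    }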

5. Distributed User Lookup
Finally, in order to know which machines hold which users' data, there is a bit of overhead: a lookup table that indicates, for each user, where their inbound and outbound data servers are. This is necessary so that each server storing "sent tweets" knows which "received tweets" servers to send each tweet to. I suggest storing this table in memory, or on disk, across all the servers that need it.

This is a fairly light burden even if the table holds entries for tens of millions of users. I would suggest keeping it synchronized across all the servers that need it using something like Terracotta, a kind of "plug in" for the Java Virtual Machine that lets standard Java objects be kept in memory and automatically synced across multiple servers. In my estimation it is a very powerful tool in the scaling arsenal.
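
For illustration, here is a bare-bones version of that lookup table as a plain in-memory map. In the scheme above, something like Terracotta (or any other replication layer) would be responsible for keeping this one object consistent across the servers that need it; the sketch only shows the data structure and how a "sent tweets" server would consult it. The names are invented.

    // Sketch of the per-user location table: which servers hold a user's sent and received tweets.
    // Kept as an ordinary in-memory map; replication across servers is assumed to be handled
    // by something like Terracotta or another synchronization layer.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class UserLocation {
        final String sentTweetsServer;
        final String receivedTweetsServer;

        UserLocation(String sentTweetsServer, String receivedTweetsServer) {
            this.sentTweetsServer = sentTweetsServer;
            this.receivedTweetsServer = receivedTweetsServer;
        }
    }

    class UserDirectory {
        private final Map<Long, UserLocation> locations = new ConcurrentHashMap<>();

        void register(long userId, String sentServer, String receivedServer) {
            locations.put(userId, new UserLocation(sentServer, receivedServer));
        }

        // A "sent tweets" server uses this to find where a recipient's inbound tweets live.
        String receivedServerFor(long userId) {
            UserLocation location = locations.get(userId);
            return location == null ? null : location.receivedTweetsServer;
        }
    }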

And so, that's it. Twitter scaling in five easy steps. Twitter in a box :)

20 comments:

  1. Hank-
    I love the post and it is something that they seriously need to consider... do you mind if I put some of this post on my blog (linking back, of course)???

    My blog is http://ryanagraves.com/blog/

    Talk to you soon!

  2. LOL!

    C'mon Twitter team, just throw the quick-n-easy denormalize/shard switch (it's been available since Rails 4.1)!!!

    I'm sure they know this stuff; the question is: how do you write the shiny new code, migrate the existing data, and switch over to it, all while maintaining continuity for an existing user base and ongoing activity that is already more than your data store can handle? Not a trivial job, by any stretch.

    It seems they've taken the smart route, and done something similar for their real time infrastructure (SMS, Jabber) first. They're probably working feverishly on something that looks kinda like your chart, only real.

  3. Hank,

    One minor clarification on Terracotta...

    It's not actually a special JVM. Rather, it uses bytecode manipulation to inject clustering behavior at runtime, making threads in different JVMs able to interact with each other as if there were no JVM boundaries. It works with off-the-shelf hotspot and the IBM JDK.

    BTW, here's a (less detailed) post by Ari Zilka on Twitter and Terracotta

  4. You do realize that doing #1 and #4 is actually more work than building Twitter, right?

  5. Very informative post! Hope they will read this and learn a thing or two.

  6. ryan,

    No problem. Just don't take the whole thing. Excerpts are fine.

    jchris,

    Building a scalable infrastructure is not easy, but they have had a long time to work on this. That said, I think I made it clear that I know they have smart people there and have likely thought of much of this.

    Toby,

    err... regarding #1, using Amazon is not harder than doing Twitter. Regarding shard splitting, it is not that hard. Or perhaps I should say it's not that much code. I did not say "my suggestions are so easy even an idiot could do it." The bottom line is they need to invest in the intellectual capital to do this kind of stuff if they want to succeed, whether it is hard or not. They have the venture bucks to build a solid infrastructure.

  7. Orion,

    Yeah, I will fix that. I know it's not true, but when you are using Terracotta it just feels like a smarter JVM.

  8. How would this system handle:
    * deleting tweets
    and
    * tracking terms

    To delete a tweet you would need to have references to every denormalized appearance of it. It seems like it would be easy to just look up who is following the twitter user who is deleting his tweet, but this won't work because some users may have stopped following him between the time he tweeted and deleted the tweet. I'm not sure how you can do it without a meta layer of data keeping track of which users saw which tweet. Although, now that I think of it, that doesn't sound so bad. As long as you denormalize it (e.g., by storing a comma-separated list of recipient user ids for each tweet) and then gracefully handle cases where those user ids have since become invalid, such as where a user deletes his account.

    Tracking adds further complexity because each tweet that comes in no longer has a pre-defined recipient list. The recipient list changes depending on the content of that tweet. This was the case for awhile because of the potential for @replies in the tweet, but those were always a manageable number (at 140 characters, probably no room for more than about 20 possible additional recipients due to @replies). If a tweet includes a popular term, however, the recipient list expands greatly, and in ways that are hard to anticipate ahead of time.

    I'm not sure how twitter is doing the tracking, but I assume they must be using some of the concurrency benefits of erlang and message queues, which I'm guessing is why tracking only happens over IM/txt right now. (Or maybe that's a shortcut that they take to avoid having to build a web UI for showing you tweets with terms you are tracking.)

  9. bantic,

    I don't know how Twitter is doing tracking either, but it is a very scalable task. You enqueue each tweet at the "tracking engine", which examines each tweet to see who is interested and sends them messages. As many machines as necessary can read the queue and dispatch tracking alerts.

  10. The only problem I see is that unless each shard can hold a sizable number of users, you're going to saturate your network traffic with all of the denormalization.

  11. Anonymous,

    This is a good point, though with the right network architecture it shouldn't be an issue. We use Amazon Web Services, for example, and I don't think machine-to-machine traffic would be an issue there. Also, my users-per-server numbers are low in this example; they are likely to be much higher in practice.

  12. I just figured out what was bothering me about this post: The "Sent Messages DBs" don't need to be DBs at all. They can just be queues/switches that propagate the tweets to their final destinations. If you do this, you now have the architecture I started on my Twitter clone about 5 days ago ;-)

  13. Toby,

    This is certainly possible with one caveat. How does a given user find the items that s/he has sent? It seems to me that every user wants to be able to see those items.

  14. Easy: SELECT * FROM userXXXX WHERE user_id=XXXX ORDER BY date LIMIT 20;

    Getting your own tweets is just a special case of getting all your tweets. You can optimize it if you need to, but it's not fundamentally special.

  15. Toby,

    Yes, of course you can do it by just querying every database. But that is an expensive bottleneck operation. The purpose of the sent-items DB is to prevent the query for a given user's sent tweets from being sent to every server in the system. Those queries would become very expensive. Of course, if people don't run them much it won't be too bad.

  16. It is worth pointing out that according to Blaine Cook (the recently departed Architect at Twitter), they are denormalizing "a lot".

    http://www.slideshare.net/Blaine/scaling-twitter/

  17. JockM,

    Thanks for that. It's not shocking, as I certainly didn't invent the concept and, as I said, they have probably thought of and are doing much of what I suggested.

  18. Bantic says:

    "Tracking adds further complexity because each tweet that comes in no longer has a pre-defined recipient list. The recipient list changes depending on the content of that tweet. This was the case for awhile because of the potential for @replies in the tweet, but those were always a manageable number (at 140 characters, probably no room for more than about 20 possible additional recipients due to @replies). If a tweet includes a popular term, however, the recipient list expands greatly, and in ways that are hard to anticipate ahead of time."

    That's not how @replies work. A tweet is tracked as a reply if and only if the start of the string is @somevalidusername. @names elsewhere in the tweet do not count.

  19. Who said anything about "every DB in the system"? Users are sharded onto a particular DB and so you only hit one DB instance to perform the above query. Same goes for replies and direct messages, too.

  20. Michael Rose,
    Oh, I wasn't aware of that. That should make things even simpler then, right?
