Wednesday 20 May 2009

How To Fix Twitter - it came to me in the shower…

UPDATE: One Sentence Summary - It's possible to know in advance who will need to receive messages and therefore to structure the Twitter application and tweet data in a way that makes it much faster to deliver them.

So I got to thinking about Twitter and the ongoing problems they have keeping the service up and running smoothly. This line of thought was triggered by Twitter removing the ability to see all @ replies. This follows a long history of removing features to "streamline" the service (Google it if you care).

It's worth remembering that Twitter started out as a 'plain vanilla' Ruby On Rails app. Which is great, 'cuz RoR is great. But it means that Twitter was conceived as a database-backed, single-instance app. There are tons of articles out there about the architecture you need to scale such an app. Some of them were written by Twitter people who have since been ejected. (Again, Google it if you care).

The other thing to remember is that Twitter are only keeping a few weeks of tweets online (6-8 at last reporting). This may be a practical measure but it's also insane! There is huge value in all those old Tweets. I suspect they are doing this to limit the size of their databases. Which is a clue that they are still using a database (or probably a number of sharded databases) as the back-end.

Here's the thing though: Twitter is not a database app. It's a messaging platform. This is not an insight but it is important. We (the IT industry) know how to run messaging platforms at scale. We know how to run huge email services. We know how to run huge IM platforms. We know how to run huge IRC instances.

Of course Twitter is not exactly like any of those things. It's an asynchronous, asymmetric, instant micro-message stream. It's asynchronous because messages are simply pushed out (like email). It's asymmetric because there is no way to guarantee or confirm receipt (like IRC). But it's the instant streaming aspect that is key. That's what makes the experience unique.

My "fix" is based on the following observation: Twitter usage forms naturally into cliques. My wife tried out Twitter and found it boring. She didn't find a tribe that she connected with. I, on the other hand, love it because I can talk trash about Bikes, Business Intelligence and Data Warehousing all day long. What could be better?

Here's the architecture:
  • Load all of the data into a huge data warehouse (MPP of course!).
  • Cluster users into their natural cliques using data mining algorithms.
  • The cliques I follow might be:
    • BI-DW (~2,000)
    • UK Mountain Biking (~1,000)
    • Web 2.0 (~5,000)
    • Twitter Celebs (~1,000)
    • Of course cliques wouldn't really have names…
  • The backend database only contains users info, not tweets.
    • Following, Followers, Clique memberships, Bloom filter of following, etc.
  • Tweets are stored in "clique streams": all tweets for a clique in reverse order.
    • New tweets are added to the top/front of the stream.
    • Tweets can exist in multiple streams as required.
    • Streams have a maximum message age.
  • To provide an update the system only has to filter a small number of streams.
    • This has got to be a 1000x reduction. (60m users down to 60k possible authors)
  • The system stores a Bloom filter of the people a user follows as the first filter for streams.
    • Probably another 10x reduction, removes bulk of non-following clique messages.
  • The detailed filter should now be running over a very small dataset.
  • Final step is to combine the filtered streams and remove any duplicates.
  • It should go without saying that all tweets are added to the data warehouse in real time. ;-)
  • This also answers the question of how Twitter can make money: sell access to the data in that killer data warehouse.
{I have refrained from naming any specific technologies or products in the post because that's not really what it's about. Very restrained of me, don't you think?

I also haven't talked about DMs, mentions, etc. because I think that they can easily fit in this architecture and this post doesn't need to be any longer.}
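
To make the read path in the architecture above a bit more concrete, here is a minimal Python sketch of it: clique streams kept newest-first with a maximum age, a Bloom filter of the accounts a user follows as the cheap first pass, an exact check as the detailed filter, and a final merge that drops duplicates. Everything named here (`Tweet`, `BloomFilter`, `CliqueStream`, `timeline`, the filter size and hash count) is made up for illustration; it is an in-memory toy under those assumptions, not how Twitter does or should implement it.

```python
import hashlib
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    author: str
    text: str
    posted_at: float  # seconds since some epoch


class BloomFilter:
    """Tiny hand-rolled Bloom filter (sizes are arbitrary assumptions)."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely not followed"; True means "maybe".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class CliqueStream:
    """All tweets for one clique, newest first, with a maximum age."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self.tweets = deque()

    def add(self, tweet):
        # New tweets go on the front; assumes they arrive in time order.
        self.tweets.appendleft(tweet)

    def expire(self, now):
        # The oldest tweets sit at the back, so trim from there.
        while self.tweets and now - self.tweets[-1].posted_at > self.max_age_seconds:
            self.tweets.pop()


def timeline(following, follow_filter, streams, limit=20):
    """Filter a user's few clique streams, then merge and de-duplicate."""
    seen = {}
    for stream in streams:
        for tweet in stream.tweets:
            if not follow_filter.might_contain(tweet.author):
                continue  # first filter: drops most non-followed authors
            if tweet.author in following:  # detailed filter: no false positives
                seen[tweet.tweet_id] = tweet  # same tweet in two streams kept once
    merged = sorted(seen.values(), key=lambda t: t.posted_at, reverse=True)
    return merged[:limit]
```

The point of the two-stage filter is that the Bloom filter is small enough to keep next to the user record, so the exact follow list only has to be consulted for the few tweets that survive the first pass.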

UPDATE 2: This approach also makes it a lot easier to spot spam accounts. Someone may *actually* want to follow 4,000 people, but those people will fall into only a few cliques. A spam account would be following people across too many different cliques.
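
A rough sketch of that spam check, assuming clique memberships are already available as a mapping from account name to a set of clique ids. The function names, the `user_cliques` mapping and the threshold of 20 cliques are all hypothetical; a real cut-off would have to come from the data.

```python
def cliques_followed(following, user_cliques):
    """Return the set of clique ids covered by the accounts a user follows."""
    covered = set()
    for account in following:
        covered |= user_cliques.get(account, set())
    return covered


def looks_like_spam(following, user_cliques, max_cliques=20):
    """Flag accounts whose follows are spread over too many cliques.

    max_cliques is an arbitrary illustrative threshold, not a measured one.
    """
    return len(cliques_followed(following, user_cliques)) > max_cliques


# 4,000 followed accounts that all sit in just three cliques...
user_cliques = {f"user{i}": {i % 3} for i in range(4000)}
# ...versus 400 unrelated accounts, each in a clique of its own.
user_cliques.update({f"rand{i}": {100 + i} for i in range(400)})

keen_reader = {f"user{i}" for i in range(4000)}
spammer = {f"rand{i}" for i in range(400)}

print(looks_like_spam(keen_reader, user_cliques))
print(looks_like_spam(spammer, user_cliques))
```

Note that the count of followed accounts never enters the test; only the spread across cliques does, which is the whole idea.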
