@joeharris76: May 2009

NOTE: Don't expect me to be doing multiple posts per day. I don't know what's come over me!

Everyone seems to agree that newspapers are dead. Even here in the UK they're not doing great, although our papers seem to 'get' the web a lot more. One of the things that I hear quite a bit from the pundits is that they should make the physical paper free as well as the online version.

I was just reading a post on Tim Ferriss' blog about Alan Webber and his "RULE #24 - If you want to change the game, change the economics of how the game is played." In it he mentions the free paper theory.

This triggered a thought for me that giving the paper away is nowhere near a bold enough strategy. The problem with the paper is not that it costs too much (except on Sundays - £2! who are they kidding?). For a lot of people, especially the core newspaper market, the cost is not an issue. The issue is having to go get the damn thing, cart it around all day and then filter through the ads just to find a few interesting tidbits.

So here is my "fix": force people to take the paper. Stick it through *everyone's* mailbox every single day. Become *the* alternative delivery provider. I haven't bought a paper in ages but I can *guarantee* that if it came through my door I would look at it.

In the UK (and most of Europe) we have fairly strong opt-out regulation against so-called junk mail. However, there is a huge loophole called the "door drop". Marketers are still allowed to put whatever they want through all of the doors in a given area. This allows a lot of room for targeting. Millionaires all live in the same neighborhood, right? There is a big business around this. When I was involved (~2yrs ago) it cost about £0.05 per door. Now I get 3 or 4 drops a week, about 20 pieces in total. Hmm… that sounds like £1 of revenue per house per week (20 pieces × £0.05), minus delivery costs. Seems workable.

Now you wouldn't want to push your paper on literally everyone. You would target the exact slice of the population that already reads you. Plus your economics are now much more predictable. You know exactly how many papers to print and you can streamline your distribution arm. In fact you'd want to buy or partner with someone like DHL or TNT who are already doing alternative deliveries. You also need to get your deliveries done *very* early to catch the commuters.

This is a winner-takes-all play. There is only room for a handful of players in a market like this. Once they have your paper in their hands, why would they buy a competing paper? If you get it right it should pay back in spades.

I don't really see anyone brave enough to make the switch right now. But they'll get more adventurous (desperate) as time goes on and profits dwindle.

Perhaps TNT should think about buying a newspaper group to beef up the delivery pipeline…

UPDATE: One Sentence Summary - It's possible to know in advance who will need to receive messages and therefore to structure the Twitter application and tweet data in a way that makes it much faster to deliver them.

So I got to thinking about Twitter and the ongoing problems they have keeping the service up and running smoothly. This line of thought was triggered by Twitter removing the ability to see all @ replies. This follows a long history of removing features to "streamline" the service (Google it if you care).

It's worth remembering that Twitter started out as a 'plain vanilla' Ruby on Rails app. Which is great, 'cuz RoR is great. But it means that Twitter was conceived as a database-backed, single-instance app. There are tons of articles out there about the architecture you need to scale such an app. Some of them were written by Twitter people who have since been ejected. (Again, Google it if you care.)

The other thing to remember is that Twitter are only keeping a few weeks of tweets online (6-8 at last reporting). This may be a practical measure but it's also insane! There is huge value in all those old Tweets. I suspect they are doing this to limit the size of their databases. Which is a clue that they are still using a database (or probably a number of sharded databases) as the back-end.

Here's the thing though: Twitter is not a database app. It's a messaging platform. This is not an insight but it is important. We (the IT industry) know how to run messaging platforms at scale. We know how to run huge email services. We know how to run huge IM platforms. We know how to run huge IRC instances.

Of course Twitter is not exactly like any of those things. It's an asynchronous, asymmetric, instant micro-message stream. It's asynchronous because messages are simply pushed out (like email). It's asymmetric because there is no way to guarantee or confirm receipt (like IRC). But it's the instant streaming aspect that is key. That's what makes the experience unique.

My "fix" is based on the following observation: Twitter usage forms naturally into cliques. My wife tried out Twitter and found it boring. She didn't find a tribe that she connected with. I, on the other hand, love it because I can talk trash about Bikes, Business Intelligence and Data Warehousing all day long. What could be better?

Here's the architecture:

Load all of the data into a huge data warehouse (MPP of course!).
Cluster users into their natural cliques using data mining algorithms.
The cliques I follow might be:
- BI-DW (~2,000)
- UK Mountain Biking (~1,000)
- Web 2.0 (~5,000)
- Twitter Celebs (~1,000)
- Of course cliques wouldn't really have names…
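
To make the clustering step a bit more concrete, here is a deliberately crude sketch. The post doesn't name an algorithm, so this swaps in the simplest stand-in I can think of: treat mutual follows as links and collect the connected groups. The `find_cliques` helper and the toy `follows` data are hypothetical, and a real data mining pass over a warehouse would be far more sophisticated.

```python
from collections import defaultdict


def find_cliques(follows):
    """Return groups of users linked by mutual follows, largest group first.

    `follows` maps each user to the set of users they follow.
    This is a crude stand-in for a real clustering algorithm.
    """
    # Keep only mutual follows: a one-way follow (say, of a celebrity)
    # is weak evidence that two users belong to the same clique.
    mutual = defaultdict(set)
    for user, followed in follows.items():
        mutual[user]  # touch the key so users with no mutual follows still appear
        for other in followed:
            if user in follows.get(other, set()):
                mutual[user].add(other)
                mutual[other].add(user)

    # Walk the mutual-follow graph to collect each connected group.
    seen = set()
    cliques = []
    for start in sorted(mutual):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            user = stack.pop()
            if user in group:
                continue
            group.add(user)
            stack.extend(mutual[user] - group)
        seen |= group
        cliques.append(sorted(group))
    return sorted(cliques, key=lambda g: (-len(g), g))


# Toy data (made up): ann, bob and cat all follow each other,
# dan and eve follow each other, and nobody is followed back by zed.
follows = {
    "ann": {"bob", "cat", "zed"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"eve", "zed"},
    "eve": {"dan"},
    "zed": set(),
}
cliques = find_cliques(follows)
print(cliques)
```

Ignoring one-way follows is a design choice made for this sketch: it stops a single popular account from gluing unrelated groups together.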

The backend database only contains user info, not tweets: following, followers, clique memberships, a Bloom filter of following, etc.

Tweets are stored in "clique streams": all tweets for a clique in reverse order.

New tweets are added to the top/front of the stream.

Tweets can exist in multiple streams as required.

Streams have a maximum message age.
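
A minimal in-memory sketch of a clique stream, assuming tweets arrive roughly in time order. The `Tweet` and `CliqueStream` names, the field names and the one-hour maximum age are all made up for illustration; nothing here is how Twitter actually stores anything.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    author: str
    text: str
    posted_at: float  # seconds since some epoch


@dataclass
class CliqueStream:
    max_age_seconds: float
    tweets: deque = field(default_factory=deque)

    def add(self, tweet):
        """Put a new tweet at the front, then drop anything too old."""
        self.tweets.appendleft(tweet)
        self.expire(now=tweet.posted_at)

    def expire(self, now):
        # The oldest tweets sit at the back, so trimming is cheap.
        cutoff = now - self.max_age_seconds
        while self.tweets and self.tweets[-1].posted_at < cutoff:
            self.tweets.pop()


# Hypothetical streams with a one-hour maximum message age.
bi_dw = CliqueStream(max_age_seconds=3600)
mtb = CliqueStream(max_age_seconds=3600)

first = Tweet(1, "joe", "Data warehousing on two wheels", 0.0)
bi_dw.add(first)
mtb.add(first)  # the same tweet can live in more than one stream

bi_dw.add(Tweet(2, "ann", "New cube design", 1800.0))
bi_dw.add(Tweet(3, "bob", "Star schemas forever", 4000.0))

# Newest first; tweet 1 has aged out of bi_dw but is still in mtb.
print([t.tweet_id for t in bi_dw.tweets])
print([t.tweet_id for t in mtb.tweets])
```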

To provide an update the system only has to filter a small number of streams.

This has got to be a 1000x reduction (60m users down to 60k possibles).

The system stores a Bloom filter of the people a user follows as the first filter for streams.

Probably another 10x reduction, since this removes the bulk of the clique messages from people the user doesn't follow.
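
For anyone who hasn't met one, here is a small Bloom filter sketch showing why it works as a cheap first pass: it can say "maybe" for someone you don't follow, but it never says "no" for someone you do. The class, its sizes (1024 bits, 3 hashes) and the sample names are assumptions picked for the example, not tuned values.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Hypothetical user who follows three people.
following = BloomFilter()
for name in ("ann", "bob", "cat"):
    following.add(name)

# First pass over the authors seen in a clique stream.
stream_authors = ["ann", "zed", "bob", "dan"]
candidates = [a for a in stream_authors if a in following]
print(candidates)
```

Anything that survives this pass still needs the exact check against the real following list, because a Bloom filter will occasionally let a stranger through.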

The detailed filter should now be running over a very small dataset.

Final step is to combine the filtered streams and remove any duplicates.
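
Putting the read path together, a rough sketch under a few assumptions: each clique stream is a list sorted newest first, the Bloom filter is represented by any cheap `might_follow` check that may give false positives, and the exact following list is a plain set. `build_timeline`, the `Tweet` tuple and the sample data are all hypothetical.

```python
import heapq
from collections import namedtuple

Tweet = namedtuple("Tweet", "tweet_id author posted_at text")


def build_timeline(streams, might_follow, following, limit=20):
    """Merge newest-first clique streams into one de-duplicated timeline.

    streams      -- lists of Tweet, each already sorted newest first
    might_follow -- cheap first-pass check (a Bloom filter in the post);
                    may return false positives, never false negatives
    following    -- the exact set of authors the user follows
    """
    filtered = []
    for stream in streams:
        # First filter: cheap and approximate.
        maybe = (t for t in stream if might_follow(t.author))
        # Detailed filter: exact, but now running over far fewer tweets.
        filtered.append([t for t in maybe if t.author in following])

    # Combine the streams, keeping newest-first order.
    merged = heapq.merge(*filtered, key=lambda t: t.posted_at, reverse=True)

    timeline, seen_ids = [], set()
    for tweet in merged:
        # A tweet can sit in several streams; keep one copy.
        if tweet.tweet_id in seen_ids:
            continue
        seen_ids.add(tweet.tweet_id)
        timeline.append(tweet)
        if len(timeline) == limit:
            break
    return timeline


# Made-up streams; tweet 4 appears in both.
bi_dw = [Tweet(4, "ann", 400, "MPP!"), Tweet(3, "zed", 300, "spam"),
         Tweet(1, "bob", 100, "cubes")]
mtb = [Tweet(5, "cat", 500, "mud"), Tweet(4, "ann", 400, "MPP!"),
       Tweet(2, "dan", 200, "tyres")]
following = {"ann", "bob", "cat"}


def might_follow(author):
    # Stand-in for the Bloom filter, with "zed" as a deliberate false positive.
    return author in following or author == "zed"


timeline = build_timeline([bi_dw, mtb], might_follow, following)
print([t.tweet_id for t in timeline])
```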

It should go without saying that all tweets are added to the data warehouse in real time. ;-)
This also answers the question of how Twitter can make money: sell access to the data in that killer data warehouse.

{I have refrained from naming any specific technologies or products in the post because that's not really what it's about. Very restrained of me, don't you think?

I also haven't talked about DMs, mentions, etc. because I think that they can easily fit in this architecture and this post doesn't need to be any longer.}

UPDATE 2: This approach also makes it a lot easier to spot spam accounts. Someone may *actually* want to follow 4,000 people but they will only be in a few cliques. A spam account would be following too many different cliques.
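
A quick sketch of that spam check, assuming every account has already been assigned to a single clique. The `clique_of` mapping, the helper names and the `max_cliques=10` threshold are invented for the example; a real cut-off would have to come from the data.

```python
def clique_spread(following, clique_of):
    """Count how many distinct cliques the followed accounts fall into."""
    return len({clique_of[user] for user in following if user in clique_of})


def looks_like_spam(following, clique_of, max_cliques=10):
    # max_cliques is a hypothetical threshold, not a figure from the post.
    return clique_spread(following, clique_of) > max_cliques


# Made-up population: 50 cliques of 1,000 users each.
clique_of = {f"user{i}": i // 1000 for i in range(50_000)}

# Follows 4,000 people, but they all sit in a handful of cliques.
keen_fan = {f"user{i}" for i in range(4000)}

# Follows fewer people, but sprayed across every clique.
spammer = {f"user{i}" for i in range(0, 50_000, 25)}

print(clique_spread(keen_fan, clique_of), looks_like_spam(keen_fan, clique_of))
print(clique_spread(spammer, clique_of), looks_like_spam(spammer, clique_of))
```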

Wednesday, 20 May 2009

How to Fix the Newspaper Industry - everybody else is doing it…

How To Fix Twitter - it came to me in the shower…
