The twitter problem

I am a patient person, so it’s only now that Twitter’s perennial scaling problems are bothering me enough to blog about them. That probably makes me the last person to do so, ever.

However, lately, it’s hurt. It hurt most when I tried to implement a twitter bot at Mashed08. With API calls throttled down to 20 per hour, the best I could hope to do (via polling, and with IM shut out, that was the only obvious path) was to be a bot for one person making no more than one request every 10 minutes. So for the demo, the twitter connection was really baling wire and duct tape (or, ipython console and cut-and-paste into twitter’s web form).

Last month, I read Tim Bray’s Twitterbucks entry with interest. When I last checked in, nobody seemed to be interested in where the real scaling problems were, so the comment thread didn’t come up with any real revelations.

Today, as I tried to reflect on why I use twitter, I came upon another potential solution: pay for what is the hardest to scale: disk access. When any high-volume application has to hit the spindles, it takes a massive performance hit. Twitter’s response to its recent outages seems to address that at least partially: paging backwards in your personal + friends timeline is scaled back, as is examining replies.

Seems to me that much of what Twitter covers well is the “now” and recent past. Going back in time on a merged timeline makes for increasingly expensive queries, and reaching further back in history goes beyond the memory caches. If Twitter didn’t try to keep all its posts accessible, it could be a much more efficient messaging platform, always living in an amnesiac present. By having a web-accessible memory, with persistent tweets, it becomes a lot more difficult to predict where the database is going to be hit.

So take a look at Twitter as it stands right now. With the buttons that are disabled, which ones are the biggest pains? Single-user pagination? Friend pagination? Replies? Seems to me that the biggest omission is in having zero reply-page functionality, but it makes sense to limit the database hit from the complex query (user > friends > friends’ updates that can be seen > merged and sorted in time). Why not cull functionality for all users such that it’s either a complex query that hits memcached exclusively (the pages representing what’s happening now and in the recent past), or a very trivial query that is allowed to hit the disks (a single, permalinked tweet or any user’s front page of recent tweets)? A twitter caught in the present, and exhibiting some memory when specifically prodded.

From there, you could charge for more archival access. I imagine this not as a monetisation move, or even one that could directly cover additional costs, but one that would allow serious users with serious needs to self-select, not unlike what Flickr has done with their paid accounts. A paid user could access their archives as a continual stream of tweets on a blog-like page. They could access a more comprehensive memory of their friends’ replies. They might even be given persistent past per-day or per-month archive pages.
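
To make the split concrete, here is a rough sketch in Python of the routing policy I have in mind. It is not how Twitter works internally; the request kinds, the `CACHED_PAGES` cutoff and the `route` function are all names I made up for illustration. The idea is only that trivial lookups may touch the disks, complex queries are served from memory while they stay in the recent past, and anything further back is reserved for paid accounts.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    MEMORY = "memory"    # served from the in-memory cache only
    DISK = "disk"        # allowed to hit the database on disk
    REFUSED = "refused"  # not offered to free accounts


@dataclass(frozen=True)
class Request:
    # Hypothetical kinds: "permalink", "user_timeline",
    # "merged_timeline" (you + friends) and "replies".
    kind: str
    page: int = 1  # page 1 is the newest


# Assumption: how many pages of the recent past fit in the cache.
CACHED_PAGES = 3

KNOWN_KINDS = {"permalink", "user_timeline", "merged_timeline", "replies"}


def route(req: Request, paid: bool = False) -> Source:
    """Decide where a request may be served from."""
    if req.kind not in KNOWN_KINDS:
        raise ValueError(f"unknown request kind: {req.kind}")

    # Trivial queries: a single permalinked tweet, or the front page
    # of one user's recent tweets. These may hit the disks.
    if req.kind == "permalink":
        return Source.DISK
    if req.kind == "user_timeline" and req.page == 1:
        return Source.DISK

    # Everything else is paginated or merged. Serve it from memory
    # while it is still within the recent past.
    if req.page <= CACHED_PAGES:
        return Source.MEMORY

    # Archival access beyond the cache: paid accounts only.
    return Source.DISK if paid else Source.REFUSED
```

With this sketch, `route(Request("merged_timeline", page=10))` is refused for a free account, while the same call with `paid=True` is allowed through to the disks. The cutoff of three pages is arbitrary; the real number would depend on how much of the recent past actually fits in cache.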

I’ll admit that I don’t fully appreciate the particular scaling problems presented by heavy users like Scoble. Perhaps there are payment thresholds to pass once you follow 500 and 5000 users?

What do people think? I know I can’t be the first to suggest it, but it’s the first I’ve heard of it.