Sriram Krishnan (Moved to http://www.sriramkrishnan.com/blog)

Search. Usability. Virtual machines. Geek stuff

Creative Commons Licence
This work is licensed under a Creative Commons License.


Solving the blog bandwidth problem using Bittorrent

Last evening, I sat around with a couple of friends (Balakrishnan and Anand) to brainstorm ideas for my college-finishing project. After a lot of mucking around, we got this idea that seemed cool at first, not so cool afterwards and then cool again.

 

The Problem

As anyone reading Scoble's blog would know, blogs today are growing at a phenomenal rate - and so are their bandwidth requirements. Earlier this year, we saw blogs.msdn.com's front page go from full text to snippets. A lot of blogs still give snippets in their RSS instead of the full text. And the problem gets worse with Podcasting (and possibly Videoblogging in the future). Each Podcast weighs in at around 40-50 MB - and bandwidth bills can easily skyrocket. If Dave Winer's podcasts ever get Slashdotted, he could be facing some serious server problems. HTTP as a transport medium for blogging clearly doesn't scale. The web interfaces are still important, of course - but once you throw in frequently-updating aggregators, podcasting and the lot, you hit some serious scalability issues.

Some history

As you might have figured out from the title, we plan to use Bittorrent to solve this bandwidth issue. Honestly, when we thought of it, we were excited and thought this was something totally new and revolutionary. It was a major letdown later on to see that this idea has already been thought of - but for some strange reason, not worked upon a lot.

The idea (called Broadcatching - a name that I really don't like) seems to have started with Dan Gillmor's article and later spread through the blogosphere. There also seem to be some early software releases by Andrew Grumet. The Wikipedia page on broadcatching has a few other links as well - including to plugins/Bittorrent clients that support RSS, etc. Scott Raymond also has an interesting blog post on a variation of the idea. Also, our idea isn't that far from the Coral network on nyud.net.

I've spent the last few hours reading through all of these, and they seem to be reasonably different from what we have in mind. In case one of you knows of someone doing similar work, do leave a comment.

Frankly, we don't know how good this idea is. There is a huge possibility that it may all be a pile of crap and we need to start over tomorrow. If you think this is crap, do leave a comment as to why you think so - and how it could be fixed, if at all. There is a fair amount of handwaving over minor and some not-so-minor details. We decided to fill in the blanks later rather than get distracted early on.

The Solution (we hope :-))

Assumptions
1. The blog server's bandwidth is costly. We need to avoid hitting the original server as much as possible.

2. It is acceptable for the download of small files to be relatively slow (from the user's perspective). We're guessing this is acceptable as, in most cases, it doesn't matter whether a 100 KB RSS feed is downloaded in 30 sec or 60 sec by your aggregator. Bittorrent doesn't handle small files very well - so this is an important assumption (and hopefully not a handwave).

3. Blog RSS feeds are small enough to keep around for a long time. Both client-side and server-side hard disk space is cheap.

The basic idea is to use Bittorrent to take the load off the server. But unlike the others, we want to do this transparently - i.e. with no effort from the bloggers in terms of software and minimal effort from the aggregator developers. This is probably the main difference between our implementation and the others - we wanted to make sure that every single blogger can automatically use this without having to lift a finger. This is pretty important - as people on BlogSpot, LiveJournal or even DotNetJunkies don't have control over their blogging platforms.

Our implementation (which we're calling Smoke for now in an attempt to avoid the name BlogTorrent :)) takes care of this by moving the job of maintaining the tracker from the blogging platform to a separate, known site.

We have 2 basic components:

1. The Smoke Client
2. The Smoke Server

Smoke Client

The client runs on every user's machine. When a blog's URL is given to it (e.g. scoble.weblogs.com), the client (Client A) contacts the Smoke server for the .torrent file for that particular blog (each blog is uniquely identified by the URL of its feed). Once it gets that, it connects to the other peers having the same content and starts downloading/uploading in typical Bittorrent fashion. If the Smoke server doesn't have a .torrent file, the client contacts the blog server (userland.com for example) and downloads the feed. It also tells the Smoke server that it now has the feed on its local disk - thereby registering itself as a seed.

Now, let's say a client on another machine (Client B) wants scoble.weblogs.com. It contacts the Smoke server and gets a .torrent file. It also gets told by the tracker that Client A has the feed. So here's the important thing - instead of hitting poor Scoble's server again, this time the content comes from Client A. This keeps multiplying - Client C would download from both Client A and B, D would download from A, B and C, and so on (typical Bittorrent).
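The flow described above can be sketched as a small in-memory simulation. Everything in it is hypothetical: `SmokeServer`, `SmokeClient`, `fake_origin` and the method names are made up for illustration, the .torrent file is reduced to a plain list of peers, and no real Bittorrent or network traffic is involved. It only shows who ends up serving whom.

```python
# Hypothetical in-memory sketch of the Smoke flow; not a real Bittorrent client.

class SmokeServer:
    """Maps feed URLs to the peers known to hold that feed (the tracker role)."""

    def __init__(self):
        self.peers_by_url = {}

    def lookup(self, url):
        # Returns the peers holding the feed, or an empty list if none are known.
        return list(self.peers_by_url.get(url, []))

    def register_seed(self, url, client):
        self.peers_by_url.setdefault(url, []).append(client)


class SmokeClient:
    def __init__(self, name, server, fetch_from_origin):
        self.name = name
        self.server = server
        self.fetch_from_origin = fetch_from_origin  # callable: url -> bytes
        self.store = {}  # url -> feed bytes on the "local disk"

    def get_feed(self, url):
        peers = self.server.lookup(url)
        if peers:
            # Stand-in for a swarm download: copy the data from the first peer.
            data = peers[0].store[url]
            source = "peers:" + ",".join(p.name for p in peers)
        else:
            # Nobody has the feed yet, so hit the blog server once.
            data = self.fetch_from_origin(url)
            source = "origin"
        self.store[url] = data
        self.server.register_seed(url, self)
        return source


origin_hits = []

def fake_origin(url):
    # Pretend blog server that records every request it receives.
    origin_hits.append(url)
    return b"<rss>...</rss>"

server = SmokeServer()
url = "http://scoble.weblogs.com/rss.xml"  # illustrative feed URL
sources = [SmokeClient(n, server, fake_origin).get_feed(url) for n in "ABCD"]
print(sources)
print(len(origin_hits))
```

In this toy run only Client A touches the blog server; B, C and D are each handed a growing list of peers instead.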

In the case of podcasting, the client contacts the server with the mp3/wma file's URL and from then on proceeds as above.

Now, I said the client 'is given a blog URL'. Don't take this to mean we're developing our own aggregator - the idea is to avoid doing that. Our idea is to develop an API which people like Nick Bradbury and Dare Obasanjo can then use in their aggregators. We could keep it very simple - the Smoke client could just dump the feeds into a folder which FeedDemon or any other aggregator can pick up and display using its UI. We want to act as the middle man between the aggregator and the server.

We would probably be developing a bare-bones aggregator as a proof of concept though - as without that, we really wouldn't know where our client sucks.
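One way the "dump the feeds into a folder" hand-off might look, purely as a sketch: nothing about the folder layout has been decided, so the file naming (a SHA-1 hash of the feed URL) and the function names `feed_path` and `save_feed` are assumptions made up for this example.

```python
import hashlib
import os
import tempfile

def feed_path(folder, feed_url):
    # File name derived from the feed URL, so each feed maps to one stable file.
    digest = hashlib.sha1(feed_url.encode("utf-8")).hexdigest()
    return os.path.join(folder, digest + ".xml")

def save_feed(folder, feed_url, data):
    """Write the feed via a temp file so an aggregator never sees a half-written file."""
    os.makedirs(folder, exist_ok=True)
    final = feed_path(folder, feed_url)
    fd, tmp = tempfile.mkstemp(dir=folder, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, final)  # rename within the same folder replaces any old copy
    return final

folder = tempfile.mkdtemp()  # stand-in for the shared feeds folder
path = save_feed(folder, "http://example.com/rss.xml", b"<rss/>")
with open(path, "rb") as f:
    saved = f.read()
```

An aggregator would then only need to know the folder and the naming rule to find a given feed.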

Smoke Server

Our initial idea was to develop some blogging software which also ran trackers. But the problem with that approach is that there is no way people can change/update their blogging platform easily. Quite frankly, I wouldn't have been able to use it as I don't have control over the .Text engine running on DotNetJunkies.

The Smoke server is like the bastard child of weblogs.com and Bittorrent tracker sites like Suprnova. This server maintains a big table of sorts, mapping blog URLs to their .torrent files, and does the tracker work too. This would consume a fair amount of hard disk space though - each torrent file could be 20-30 KB, so the size on disk would be at least No_Of_Blogs_Accessed * 20 KB, which is pretty big. Whenever a client contacts the server with a blog URL, the server serves out a .torrent file.
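To put a rough number on the No_Of_Blogs_Accessed * 20 KB estimate, here is a back-of-the-envelope calculation. The 20 KB figure is the lower end of the range above; the one million blogs is an arbitrary illustrative count, not a measurement, and `table_size_gb` is a made-up helper name.

```python
TORRENT_SIZE_KB = 20  # lower end of the 20-30 KB estimate above

def table_size_gb(blogs_accessed, torrent_kb=TORRENT_SIZE_KB):
    # No_Of_Blogs_Accessed * torrent size, converted from KB to GB
    # (1 GB = 1024 * 1024 KB).
    return blogs_accessed * torrent_kb / (1024 * 1024)

estimate = table_size_gb(1_000_000)  # one million blogs: illustrative only
print(round(estimate, 1))
```

With these assumed numbers the .torrent files alone come to roughly 19 GB, before counting any tracker state.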

We haven't bothered with how the client and the server talk - since it is so limited it could be anything - XML-RPC/SOAP/anything. We can figure that out later.

One problem here is blog updates - how do we check whether a blog has been updated? One primitive idea I've had is for the Smoke server to contact all the blogs on its list (which theoretically could be all the blogs in the world) periodically and see whether they've been updated (using HTTP commands rather than downloading the entire feed). If the blog has changed, the process is reset - and the next client to request that blog would hit the blog server.
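One possible reading of "HTTP commands" is a HEAD request whose ETag and Last-Modified headers are compared with the values seen on the previous check. That choice is an assumption for this sketch, not something we have settled on, and not every blog server sends these headers. The names `validators`, `has_changed` and `head_validators` are hypothetical.

```python
import urllib.request

def validators(headers):
    """Pick out the headers that identify a particular version of the feed."""
    return (headers.get("ETag"), headers.get("Last-Modified"))

def has_changed(previous, current):
    # With no validators at all we cannot tell, so assume the feed changed.
    if current == (None, None):
        return True
    return previous != current

def head_validators(url, timeout=10):
    """Issue a HEAD request (no body is transferred) and return the validators."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return validators(response.headers)

# Offline demonstration using hand-written header dictionaries.
old = validators({"ETag": '"abc"', "Last-Modified": "Mon, 03 Jan 2005 10:00:00 GMT"})
same = validators({"ETag": '"abc"', "Last-Modified": "Mon, 03 Jan 2005 10:00:00 GMT"})
new = validators({"ETag": '"def"', "Last-Modified": "Tue, 04 Jan 2005 04:23:00 GMT"})
```

Here `has_changed(old, same)` is false and `has_changed(old, new)` is true; a server that sends neither header is always treated as changed, which falls back to hitting the blog server.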

Now here's the big question - how does the Smoke client know where the server is? The answer to that isn't that elegant - but hopefully it should work. The idea right now is to have one huge Smoke server run at someplace like weblogs.com or bloglines.com or anyone else who can provide the bandwidth required. Also, how can one server handle so many blogs? Frankly, I don't know the answer to that - perhaps we could hold only the blogs updated/requested in the last couple of days, for example. This way, once the initial flash crowd is gone, the original blog server can go back to serving files. Keeping track of a blog on the server for a period of time since it was last requested would take care of the mini-Slashdot effect when someone like Doc Searls updates his blog.
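The "keep a blog only for a while after it was last requested" idea could look something like the sketch below. The two-day window stands in for "a couple of days", and `RecentBlogTable` and its methods are names invented for this example.

```python
import time

class RecentBlogTable:
    """Tracks blogs only while they have been requested within `ttl_seconds`."""

    def __init__(self, ttl_seconds=2 * 24 * 3600, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the behaviour can be checked without waiting
        self.last_requested = {}  # feed URL -> time of the most recent request

    def touch(self, url):
        # Called whenever a client asks for this blog.
        self.last_requested[url] = self.clock()

    def is_tracked(self, url):
        seen = self.last_requested.get(url)
        return seen is not None and self.clock() - seen <= self.ttl_seconds

    def purge(self):
        """Drop every blog whose last request is older than the window."""
        now = self.clock()
        stale = [u for u, t in self.last_requested.items()
                 if now - t > self.ttl_seconds]
        for u in stale:
            del self.last_requested[u]
        return stale

now = [0.0]  # fake clock, in seconds
table = RecentBlogTable(clock=lambda: now[0])
table.touch("http://example.com/rss.xml")
now[0] = 24 * 3600        # one day later: still tracked
tracked_after_one_day = table.is_tracked("http://example.com/rss.xml")
now[0] = 3 * 24 * 3600    # three days later: past the window
purged = table.purge()
```

Once a blog is purged, the next request for it would simply go back to the original blog server, as described above.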

If people are worried about this being a single point of failure, one solution could be to run several known servers and then update the list from the Smoke software's site (wherever we put it up for download).

This is obviously a huge handwave - but I hope that someone throws their weight behind this. What I have in mind isn't that different from weblogs.com being pinged. Obviously, we need to do more work here - and I'm hoping that this post will get me some of that required help.
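With several known servers, the client side could be as simple as trying each one in turn. This is only a sketch: the host names are placeholders, `first_responding` is a made-up helper, and `ask` stands for whatever client-server call we end up with.

```python
def first_responding(servers, ask):
    """Try each known Smoke server in order; return (server, reply) for the first that answers."""
    errors = {}
    for server in servers:
        try:
            return server, ask(server)
        except OSError as exc:  # connection refused, timeout, DNS failure, ...
            errors[server] = exc
    raise ConnectionError("no Smoke server reachable: %r" % errors)

def fake_ask(server):
    # Stand-in for a real network call; the first server is "down".
    if server == "smoke1.example.org":
        raise OSError("connection refused")
    return "torrent-for-feed"

chosen, reply = first_responding(
    ["smoke1.example.org", "smoke2.example.org"], fake_ask
)
```

In this run the first server fails and the client quietly falls through to the second one.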

Miscellaneous

1. This would probably be open source - though I don't know how other people developing the code would sit with our college administration. Anyhow, the source would be given freely.

2. As for platform and language, I was tempted initially to write the first ever Bittorrent client in .NET. But I think Python would be a better choice for a multitude of reasons. Initially, we want to do a Windows client for the sole reason that we're comfortable with it. But it should be easy to port to any OS.

Do post comments/suggestions. I hope this doesn't go the way of my earlier big projects :-)

So tell me everybody - what do you think?

posted on Tuesday, January 04, 2005 4:23 AM by sriram




