Sriram Krishnan (Moved to http://www.sriramkrishnan.com/blog)

Search. Usability. Virtual machines.Geek stuff

<October 2008>
SuMoTuWeThFrSa
2829301234
567891011
12131415161718
19202122232425
2627282930311
2345678


Navigation

Subscriptions

News

Link blog
Technorati Profile
The Blogs I read
Creative Commons Licence
This work is licensed under a Creative Commons License.


How to write a search engine in 16 hours

When I did part of the TechEd keynote in Bangalore along with Somasegar, I remember cracking a silly joke about the movie 'How to lose a guy in 10 days'. For some strange reason, memories of that came back to me as I was writing this post.

My college wanted my batch to do a mini-project so me and a bunch of guys decided to do a search engine(in reality, I didn't give my friends much choice :-) ). As usual, we dilly-dallied and due to other work (exams, labs,etc), I suddenly found myself with 24 hours in which to write a search engine as I had a review the next day. 2 months work compressed into 24 hours. Sounded like fun - and it was!

I would have preferred to use Python but as my friends already knew a bit of .Net and Vb 6, I stuck to Vb.Net. I fired up VS.Net and opened up a blank solution and started twiddling my thumb. Maybe this search engine thing *is* tough after all.

Crawler

I started coding up the crawler , deciding to go step-by-step ( a part of this experience led to this post on crawlers). At this point,Murphy's law jumped in and my internet connection wasn't working.This meant I couldn't look at any sample code, regular expressions,etc. Great. I already had a design in mind based on the Mercator crawler - so my work was to mainly to translate the design in my head into code.

Since this was meant as a college mini-project, quality assurance and performance flew out the window. My code is filled with comments such as 'Fix this- this would break under..blah..blah..' or 'performance sucks when..blah..blah.'. I think I've been reading too much of Rico Mariani's posts - whenever I code, I see visions of objects dancing in my head. And I mentally cringe to myself whenever I create a new object or do any string manipulation.

After reading a few of  Rico's posts, you'll find two little versions of yourself pop-up over your shoulders. One has a halo and is dressed in white while the other looks like an ad for the FreeBsd devil. The angel tells you "Look at all those string allocations..tsk..tsk..and you haven't measured your performance. Unacceptable!". The devil says "Just keep coding - you have a review tomorrow". After a few hours, I stopped listening to the angelic guy and the FreeBSD devil mutilated him with his pitchfork. But I digress...

I hit my first obstacle pretty soon - the cocktail of .Net's ThreadPool and the WebRequest suite of classes makes for an explosive combo as the Webrequest classes use the Threadpool themselves. This means that you have a significantly lesser number of threads to play with. Not having the time to write my own thread pool, I wrote a hack to limit the number of threads I used.

My second obstacle was on Html. You see - websites have really,really bad Html. And writing regular expressions to match them all is a pain. As I didn't have an internet connection, I hacked together my own set of regular expressions instead of downloading a library from the internet. I tried to handle all the special cases I could think of - but my crawler is still horribly broken. But it works perfectly for the sites I plan to crawl for my review - so I'm happy with it.

 

Time elapsed so far : 6 hours (mostly taken in thread pool mucky-muck  and regular expression grime)

Indexing

Indexing is the process of building a table in memory so that you can find search terms pretty quickly. Search engines normally use a reverse-lookup index. A reverse lookup index is one , which for every word, stores a list of documents in which the word appears (as opposed to a normal index which maps every document to all the words it contains).

I started coding up my own reverse-index implementation but then I realized that I was fighting a losing battle. There is no way I would have been able to implement boolean queries, fuzzy queries, porter stemming and an efficient index lookup in under a day. Luckily for me, I had a copy of NLucence (the .Net port of the excellent Apache Lucene indexing engine which even Lookout uses) stashed away on my system.

However, using Lucene isn't that straightforward. It feels like Java code - you have abstract classes all over the places and you have factory-like classes all over the place. Leaves a Java taste in your mouth. After some hacking around, I could figure out how to get the thing working and soon Lucene was on its happy way indexing. Doug Cutting has done an awesome job.

Time elapsed so far : 12 hours

Retrieval

So far so good. I've a crawler up and running and I'm getting everything indexed. What I need now is some way to display the results. Now, for those of you familiar with how India's college syllabi works might be wondering as to how I got permission for such a 'different' project. The reason is - I lied. I told my faculty that our search engine would crawl only educational sites and have lot of educational usage. What I ..err..forgot to tell them was that apart from educational sites, I also planned to crawl the rest of the web.

But one implication of this was that every reviewer would ask us 'How are you different from Google?". So we decided to do a search engine whose only external interface is a web service (similar to the Google API). I planned to write a simple ASP.NET client for a Google-like interface, a Winforms client to tie it in with something like Winamp,etc.

I quickly wrote a web-service wrapper around my indexing engine and then wrote a Google-like ASP.NET interface. By now, I was running out of time so I grabbed the Html for Google's search results page and stole it for my interface. Hey - why reinvent the wheel when Google's UI engineers have gone to so much trouble for me? Seriously though, I plan on replacing it sometime with my own UI once I have the time. But for now, it is chugging merrily along displaying results in a Google-like interface.

 

Btw, the search engine is called 'Agni' which is the Hindu God of Fire for those of you not familiar with Hindu mythology. As for the reasons behind the name, well, my team came up with it as they thought it sounded cool.:-)

 

After 14 hours of hard work, I was finally done. I set up a few sample sites on my own web server and ran a few tests. The next day, the reviewers were blown away and my batch got the highest scores overall (Yippee!)

Yesterday, I went back to the code and hacked around for a few hours adding stuff like the snippets that Google shows from each page,etc. I now have tremendous respect for the people at Google - they've taken care of so many minute details that you wouldn't notice until you tried writing your own.

Here's a screenshot of how my Asp.net interface looks like - familiar , isn't it? Hope I don't get sued or something :-)

 

 

 

posted on Tuesday, October 26, 2004 1:43 PM by sriram





Powered by Dot Net Junkies, by Telligent Systems