Sriram Krishnan (Moved to http://www.sriramkrishnan.com/blog)

Search. Usability. Virtual machines.Geek stuff

<July 2008>
SuMoTuWeThFrSa
293012345
6789101112
13141516171819
20212223242526
272829303112
3456789


Navigation

Subscriptions

News

Link blog
Technorati Profile
The Blogs I read
Creative Commons Licence
This work is licensed under a Creative Commons License.


Tuesday, October 26, 2004 - Posts

How to write a search engine in 16 hours

When I did part of the TechEd keynote in Bangalore along with Somasegar, I remember cracking a silly joke about the movie 'How to lose a guy in 10 days'. For some strange reason, memories of that came back to me as I was writing this post.

My college wanted my batch to do a mini-project so me and a bunch of guys decided to do a search engine(in reality, I didn't give my friends much choice :-) ). As usual, we dilly-dallied and due to other work (exams, labs,etc), I suddenly found myself with 24 hours in which to write a search engine as I had a review the next day. 2 months work compressed into 24 hours. Sounded like fun - and it was!

I would have preferred to use Python but as my friends already knew a bit of .Net and Vb 6, I stuck to Vb.Net. I fired up VS.Net and opened up a blank solution and started twiddling my thumb. Maybe this search engine thing *is* tough after all.

Crawler

I started coding up the crawler , deciding to go step-by-step ( a part of this experience led to this post on crawlers). At this point,Murphy's law jumped in and my internet connection wasn't working.This meant I couldn't look at any sample code, regular expressions,etc. Great. I already had a design in mind based on the Mercator crawler - so my work was to mainly to translate the design in my head into code.

Since this was meant as a college mini-project, quality assurance and performance flew out the window. My code is filled with comments such as 'Fix this- this would break under..blah..blah..' or 'performance sucks when..blah..blah.'. I think I've been reading too much of Rico Mariani's posts - whenever I code, I see visions of objects dancing in my head. And I mentally cringe to myself whenever I create a new object or do any string manipulation.

After reading a few of  Rico's posts, you'll find two little versions of yourself pop-up over your shoulders. One has a halo and is dressed in white while the other looks like an ad for the FreeBsd devil. The angel tells you "Look at all those string allocations..tsk..tsk..and you haven't measured your performance. Unacceptable!". The devil says "Just keep coding - you have a review tomorrow". After a few hours, I stopped listening to the angelic guy and the FreeBSD devil mutilated him with his pitchfork. But I digress...

I hit my first obstacle pretty soon - the cocktail of .Net's ThreadPool and the WebRequest suite of classes makes for an explosive combo as the Webrequest classes use the Threadpool themselves. This means that you have a significantly lesser number of threads to play with. Not having the time to write my own thread pool, I wrote a hack to limit the number of threads I used.

My second obstacle was on Html. You see - websites have really,really bad Html. And writing regular expressions to match them all is a pain. As I didn't have an internet connection, I hacked together my own set of regular expressions instead of downloading a library from the internet. I tried to handle all the special cases I could think of - but my crawler is still horribly broken. But it works perfectly for the sites I plan to crawl for my review - so I'm happy with it.

 

Time elapsed so far : 6 hours (mostly taken in thread pool mucky-muck  and regular expression grime)

Indexing

Indexing is the process of building a table in memory so that you can find search terms pretty quickly. Search engines normally use a reverse-lookup index. A reverse lookup index is one , which for every word, stores a list of documents in which the word appears (as opposed to a normal index which maps every document to all the words it contains).

I started coding up my own reverse-index implementation but then I realized that I was fighting a losing battle. There is no way I would have been able to implement boolean queries, fuzzy queries, porter stemming and an efficient index lookup in under a day. Luckily for me, I had a copy of NLucence (the .Net port of the excellent Apache Lucene indexing engine which even Lookout uses) stashed away on my system.

However, using Lucene isn't that straightforward. It feels like Java code - you have abstract classes all over the places and you have factory-like classes all over the place. Leaves a Java taste in your mouth. After some hacking around, I could figure out how to get the thing working and soon Lucene was on its happy way indexing. Doug Cutting has done an awesome job.

Time elapsed so far : 12 hours

Retrieval

So far so good. I've a crawler up and running and I'm getting everything indexed. What I need now is some way to display the results. Now, for those of you familiar with how India's college syllabi works might be wondering as to how I got permission for such a 'different' project. The reason is - I lied. I told my faculty that our search engine would crawl only educational sites and have lot of educational usage. What I ..err..forgot to tell them was that apart from educational sites, I also planned to crawl the rest of the web.

But one implication of this was that every reviewer would ask us 'How are you different from Google?". So we decided to do a search engine whose only external interface is a web service (similar to the Google API). I planned to write a simple ASP.NET client for a Google-like interface, a Winforms client to tie it in with something like Winamp,etc.

I quickly wrote a web-service wrapper around my indexing engine and then wrote a Google-like ASP.NET interface. By now, I was running out of time so I grabbed the Html for Google's search results page and stole it for my interface. Hey - why reinvent the wheel when Google's UI engineers have gone to so much trouble for me? Seriously though, I plan on replacing it sometime with my own UI once I have the time. But for now, it is chugging merrily along displaying results in a Google-like interface.

 

Btw, the search engine is called 'Agni' which is the Hindu God of Fire for those of you not familiar with Hindu mythology. As for the reasons behind the name, well, my team came up with it as they thought it sounded cool.:-)

 

After 14 hours of hard work, I was finally done. I set up a few sample sites on my own web server and ran a few tests. The next day, the reviewers were blown away and my batch got the highest scores overall (Yippee!)

Yesterday, I went back to the code and hacked around for a few hours adding stuff like the snippets that Google shows from each page,etc. I now have tremendous respect for the people at Google - they've taken care of so many minute details that you wouldn't notice until you tried writing your own.

Here's a screenshot of how my Asp.net interface looks like - familiar , isn't it? Hope I don't get sued or something :-)

 

 

 

posted Tuesday, October 26, 2004 1:43 PM by sriram

Congrats Vivek!

The Hindu, India's leading newspaper, published an article today about a paper done by Vivek Srikumar and another classmate of mine. Vivek(http://svivek.blogspot.com), apart from being a  close friend and the funniest guy I've ever met...also happens to be the smartest guy I've ever met. And the guy with the second weirdest hair-do (the first being Deepak Gulati).It's kind of weird seeing an article about someone you roamed the city with a few days ago.

The article talks about Vivek's paper on robotics using genetic algorithms. Vivek and Prashanth are traveling to Massey University,New Zealand in mid-December to present their paper at an international conference on robotics. To quote Vivek from the article

...The challenge is to get a set of robots to learn to do a task without prior human training, and become efficient according to the terrain. We have used a technique called genetic algorithm-based training.

Normally, when a baby tries to walk, he or she does not know about the behavior it needs to walk the first time. The baby tries some movements, makes some trials, falls, gets up and after four or five times learns to walk. The robots do the same thing - start with little or no knowledge, follow the same methodology as the baby, refine their behavior according to the terrain and figure out the most efficient way of doing it.But to speed up the learning process, we simulate the trials, the robots and everything around them, including the environment. The virtual robots make trials in the virtual environment that is almost similar to the actual thing and learn to map it. Once the virtual robots learn , we copy the `learnt behavior's into a real robot....

 

 Mind-blowing stuff.

I hate to hijack a post about Vivek like this but I couldn't help but thinking of how this really shows that smart people aren't present at only the IITs. I've seen a series of articles this year about students from 'normal' colleges showcasing their ideas abroad.

Scoble had a post on Microsoft recruiting recently. You would think Microsoft would want to hire people like Vivek for its IDC at Hyderabad. However, the truth is Vivek can't get into Microsoft after leaving college even if he wants to - as our college isn't considered 'good enough'. People at the IDC tell me there are various valid internal reasons as to why Microsoft does this(choose only a few colleges for recruitment) . I would love to know what those reasons are as the situation looks pretty sad at the moment.

And before I forget, Vivek is also a Microsoft Student Champ and has done tons of amazing work in colleges around Chennai and at various UG meets.
 
Microsoft, are you listening? Do you want to lose Vivek to Google?
 
Congrats Vivek - rock on!


posted Tuesday, October 26, 2004 12:36 PM by sriram




Powered by Dot Net Junkies, by Telligent Systems