Sriram Krishnan (Moved to http://www.sriramkrishnan.com/blog)

Search. Usability. Virtual machines. Geek stuff



Creative Commons Licence
This work is licensed under a Creative Commons License.


Thursday, October 28, 2004 - Posts

More Agni hacking

Since my review is fast coming up (it's on Tuesday), I've been doing some late-night hacking on Agni, my pet search engine. First, I added cached page support the same way Google does it. The cool thing is that it supports the same syntax (cache://) as Google does. It's not as if we can compete with Google or anything, but it's a nice feeling to tell people 'Google does that? We do it too'. :-) Here are a couple of screenshots.

And here's what a cached page looks like.

 

What you may have noticed is that I'm now getting results from the external web, whereas before I limited myself to crawling local sites on my computer. Well, last night, after quite a bit of hacking, I gingerly slipped my spider out into the wild and started crawling a few specific sites (like Roshan's site). Nothing exposes bugs in your crawler faster than actually letting it loose on the web, and I had to fix quite a few bugs yesterday in the crawler's logic for deciding which links to follow.
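To give a feel for the "which links to follow" decision, here is a rough Python sketch. It is not Agni's actual code: the `links_to_follow` function, the `ALLOWED_HOSTS` allow-list and the example.com host names are all made up for illustration. It assumes the crawler resolves relative links, drops fragments, skips non-HTTP schemes and stays on a fixed set of sites.

```python
from urllib.parse import urldefrag, urljoin, urlsplit

# Hypothetical allow-list: the "few specific sites" the spider may visit.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def links_to_follow(page_url, hrefs, allowed_hosts=ALLOWED_HOSTS):
    """Return the absolute URLs from hrefs that the crawler should queue."""
    result = []
    seen = set()
    for href in hrefs:
        # Resolve relative links against the page and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, ftp: and friends
        if parts.hostname not in allowed_hosts:
            continue  # stay on the sites we were asked to crawl
        if absolute in seen:
            continue  # same link repeated on the page
        seen.add(absolute)
        result.append(absolute)
    return result
```

Calling it with a page URL and the raw href values pulled out of that page gives back the de-duplicated, absolute URLs worth queueing. A real crawler would also want to check robots.txt, which this sketch leaves out.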

One problem I'm dealing with is 'crawler-bias'. Since I'm using a simplistic data structure to store the list of URLs in the queue, my crawler tends to stick to one site for quite some time, and this kind of pounding could get quite a few webmasters pissed off. Once my review is over, I'm ripping out this implementation and replacing it with a far more robust mechanism.
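One common way around this kind of bias is to keep a separate queue per host and rotate between hosts, so no single site gets hit many times in a row. The sketch below is only an illustration of that idea in Python, not the mechanism I will necessarily end up with. The `HostRoundRobinFrontier` class and its method names are hypothetical.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class HostRoundRobinFrontier:
    """URL queue that rotates between hosts instead of draining one site."""

    def __init__(self):
        self._queues = OrderedDict()  # host -> deque of pending URLs
        self._seen = set()            # every URL ever accepted

    def add(self, url):
        """Queue url unless it was seen before; return True if queued."""
        if url in self._seen:
            return False
        host = urlsplit(url).netloc.lower()
        if not host:
            return False  # not an absolute URL
        self._seen.add(url)
        self._queues.setdefault(host, deque()).append(url)
        return True

    def next_url(self):
        """Pop one URL from the host at the front, then send that host to the back."""
        if not self._queues:
            return None
        host, queue = next(iter(self._queues.items()))
        url = queue.popleft()
        del self._queues[host]
        if queue:
            self._queues[host] = queue  # re-inserted at the end of the rotation
        return url

    def __len__(self):
        return sum(len(q) for q in self._queues.values())
```

With three URLs queued for one host and one for another, `next_url()` alternates between the two hosts before returning to the first. A fuller version would also enforce a minimum delay between requests to the same host, which this sketch does not do.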

I'm not a Lucene expert and I've been learning my way around a bit. For example, I got some tremendous perf wins yesterday by tweaking a few knobs in Lucene. Before the tweaks, indexing 150 documents took close to 2 minutes. Today, the same 150 documents take 20 seconds to index. Woohoo!

posted Thursday, October 28, 2004 3:26 AM by sriram



