Since my review is coming up fast (it's on Tuesday), I've been doing some late-night hacking on Agni, my pet search engine. First, I added cached page support the same way Google does it. The cool thing is that it supports the same syntax (cache://) as Google does. It's not as if we can compete with Google or anything, but it's a nice feeling to tell people 'Google does that? We do it too'. :-) Here are a couple of screenshots.
And here's what a cached page looks like
What you may have noticed is that I'm now getting results from the external web, whereas before I limited myself to crawling local sites on my computer. Well, last night, after quite a bit of hacking, I gingerly slipped my spider out into the wild and started crawling a few specific sites (like Roshan's site). Nothing exposes bugs in your crawler faster than actually letting it loose on the web, and I had to fix quite a few bugs yesterday in the crawler's logic for deciding which links to follow.
One problem I'm dealing with is 'crawler bias'. Since I'm using a simplistic data structure to store the list of URLs in the queue, my crawler tends to stick to one site for quite some time, and this kind of pounding could lead to quite a few webmasters getting pissed off. Once my review is over, I'm ripping out this implementation and replacing it with a far more robust mechanism.
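To make the 'crawler bias' problem concrete, here is a rough sketch of one common way around it: keep a separate queue per host and hand out URLs by rotating across hosts, so no single site gets drained in one burst. This is not Agni's actual code or the replacement I have planned; the class name HostRoundRobinFrontier and its methods are made up for illustration, and a real crawler would probably also want a per-host delay and robots.txt handling on top of this.

```python
from collections import deque
from urllib.parse import urlsplit


class HostRoundRobinFrontier:
    """Hypothetical URL frontier that rotates across hosts
    instead of draining one site at a time."""

    def __init__(self):
        self._queues = {}      # host -> deque of pending URLs for that host
        self._hosts = deque()  # hosts with pending URLs, in rotation order
        self._seen = set()     # every URL ever added, to avoid re-queueing

    def add(self, url):
        """Queue a URL; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        host = urlsplit(url).netloc.lower()
        if host not in self._queues:
            self._queues[host] = deque()
            self._hosts.append(host)
        self._queues[host].append(url)
        return True

    def next_url(self):
        """Return the next URL to crawl, or None if the frontier is empty."""
        if not self._hosts:
            return None
        host = self._hosts.popleft()
        queue = self._queues[host]
        url = queue.popleft()
        if queue:
            # Host still has work: send it to the back of the rotation.
            self._hosts.append(host)
        else:
            del self._queues[host]
        return url

    def __len__(self):
        return sum(len(q) for q in self._queues.values())
```

With three URLs queued for one host and one for another, the frontier alternates between the two hosts for as long as both have work, rather than fetching all three pages from the first site back to back.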
I'm not a Lucene expert and I've been learning my way around a bit. For example, I got some tremendous perf wins yesterday by tweaking a few knobs in Lucene. Before the tweaks, indexing 150 documents took close to 2 minutes. Today, the same 150 documents take 20 seconds to index, roughly a 6x speedup. Woohoo!