Sriram Krishnan (Moved to http://www.sriramkrishnan.com/blog)

Search. Usability. Virtual machines.Geek stuff

<May 2008>
SuMoTuWeThFrSa
27282930123
45678910
11121314151617
18192021222324
25262728293031
1234567


Navigation

Subscriptions

News

Link blog
Technorati Profile
The Blogs I read
Creative Commons Licence
This work is licensed under a Creative Commons License.


Writing a web crawler

Over the last 2 months, I've written 3 web crawlers of varying degrees of complexity and in 3 different languages (C#, VB.NET and Python) and all of them are at various stages of completion. I thought I'd put together a small piece on what I've learnt so far on writing your own crawler for those who are interesting in going beyond simplistic web crawlers. Though mine are far from being production quality and wouldn't really compete with Google, they are better than most of the sample crawlers out there and compare favorably to a lot of open source crawlers (such as the one in www.nutch.org). I've been writing these crawlers for other search-related stuff I've been doing (which I can't reveal right now) and its been a great deal of fun.

The problems in writing a crawler are numerous - you are faced with dynamic sites, broken html, spider traps and some other weird stuff which really show you how strange and magical a place the world wide web actually is.

1. Robots.txt :I thought I'd put this at no.1 though I usually implement this later in my crawlers. Always, always respect robots.txt. Failing to follow robots.txt is a sure way for webmasters to get really annoyed with you. Be polite and do the right thing - but don't reinvent the wheel. Don't go to the trouble of writing your own parser - there are several available on the net. Download one for your language/platform of choice and use it. This is actually a good area to put in all those unit tests and feel good about yourself.

2. Broken HTML : This is the norm rather than the exception. I've seen unfinished closing tags, weird capitalization and some creative uses of spelling. It is a matter of debate whether you should try and accept broken HTML - but do be mindful of the fact that there is a lot of it out there. Make sure whatever regular expressions you use are upto it.

3. Threading :And now my favorite topic. Over the past 2 months, I've done enough multithreaded debugging to give me nightmares for months. Before you even think of using multiple threads in your crawler, make sure you understand the following things in the same order.

  •  How your operating system implements threads (for e.g FreeBSD thread support is not the same as most others)
  •  How your platform/runtime uses threads internally (for e.g Python's interpreter has a GIL - Global Interpreter Lock which a thread must own before running)
  • How your platform/language exposes threading to you (the various namespaces in Java, .NET. thread and threading in Python).
  • Synchronization primitives exposed by your language and how they map to OS primitives ( this is true for threading too. When you create a Thread in .NET, are you creating an OS thread? Make sure you know the answer).
There are several nice papers online on how to do multi-threading programming - make sure you know them backwards before even contemplating multiple threads. Also, make sure that your application can actually benefit from multiple threads. Now, I'm sure you're going to say that as web crawlers spend a lot of time waiting for content to download, they're perfect for threads. I'll explain in the next section why that may not necessarily be true.
 
Threading is like burning your fingers in the kitchen. Till you've done it, you really won't realize how painful it is. And once you've done it,  you'll never forget to take safety precautions again.
 
And one last tip on threading support in .NET - the WebRequest suite of classes use the ThreadPool class to do their stuff. The ThreadPool has a max limit of 25 threads in v1.1 - this means that if you start using up threads, you can soon start hitting InvalidOperationExceptions as the ThreadPool class refuses to give you more threads. This is something you'd want to keep in mind.If you're using Whidbey, you can increase this number to something much higher.
 
In Python, make sure you understand what the GIL (Global Interpreter Lock) is and why true multi-threading isn't possible in Python. It is going to be interesting to see whether IronPython can fill in here someday
 
4. Networking. Joel once wrote a nice long post on leaky abstractions. If you want to look at some, just take a look a the networking namespaces of any major language today. If you don't follow, read Ian's excellent post on why using threads for networking may not be necessary. A lot of your code is going to be spent in waiting for stuff to be sent to you over the net  - so you better understand what this code is actually doing.
 
For example, DNS lookups can kill you performance-wise on a lot of systems. Like the people behind the Mercator crawler found out, DNS lookups done by gethostbyname() block till they return. Just replacing this with your own implementation might give you a huge difference. Python aficionados - a lot of this is more simple with the excellent 'twisted' suite of networking libraries
 
5. HTTP. There are enough HTTP issues distinct from networking issues that I thought I'd devote a separate bullet to them. Since your crawler is fundamentally issuing HTTP requests, the HTTP protocol is something you should look at. Especially true when sites start throwing gzipped content at you or issuing you strange errors. Python people have to jump through some small hoops to get cookies and redirection to work (cookie support is supposed to be there in Python 2.4 - but there are so many bugs in the urllib2 implementation that I'd advice waiting for the final Python 2.4 release)
 
6. To see or not to see: Your crawler is working perfectly and sites are being downloaded. You're happy and your significant other is impressed with your hacking prowess. Now, comes the tricky part of keeping your boyfriend/girlfriend happy. How frequently do you visit URLS - and how do you decide whether a URL has already been seen or not.
 
There is no easy answer to this. Different crawlers handle this differently - having separate queues for each spider, having content-seen fingerprinting,etc. Read up on some of the papers on the subject and figure out which one works for you.  If you're too lazy, then just put all the seen URLs in one big list and only visit them periodically
 
7. Performance:  As Rico Mariani would say, "Measure , measure , measure!". There are a lot of places for potential bottlenecks and you want to make sure you know exactly where they are. For example, instead of creating a file on the hard-disk for every HTML document, you might want to write several documents at one shot or use a database for storage.
 
8. Seed list: The seed list is the initial list of URLs you feed your hungry spiders. Choose this wisely as this can 'bias' the crawling. If you're too lazy to come up with your own list, just use some directory like dmoz.org as a good starting point.
 
Ok guys - that's about it for now. There's a lot of stuff I haven't covered , mostly due to the fact that I'm not sure how to do it myself. For example, how do you handle dynamic sites which expect user input to go back to its database and retrieve content?
 
Make sure you make your crawler flexible. At a minimum , you should be able to restrict the depth of the crawl and the domains under which the crawler will look - this will take care of spider traps whose sole purpose of existence is to keep spiders running infinitely. Have extensive logging facilities so that if your regex is failing to match some URLs, you know exactly what URLs are being missed.
 
And most of all, have fun. This is one program where there is no dearth of sample data.:-)

posted on Sunday, October 10, 2004 10:00 PM by sriram





Powered by Dot Net Junkies, by Telligent Systems