I’m often asked how to make a spider to crawl the web, a specific site, or a directory, to look for hubs, and so on. They’re not hard to make and there are loads of them out there. There is really no need to reinvent the wheel and write more spiders that all do the same thing. That’s bad engineering; always reuse code. The following presentation is one I use a lot, and it will point you to plenty of places to get code, tutorials, information and so on. I also dish out some basic advice.
I hope it’s useful, lots of love, cj


CJ, that’s a great primer on how search engine spiders work and where to start if you want to build your own.
I have to say that I learned quite a bit about technical SEO after I wrote my first spider. It worked great on valid HTML, but failed miserably on even the smallest malformed HTML comment.
The most basic spider is simple to write, but dealing with all those errors, malformed code, and spider traps … that’s what makes up a large part of the code base.
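To make that concrete, here is a minimal sketch in Python, not code from the presentation. It assumes a caller-supplied `fetch` function that returns a page's HTML as a string, and the names `LinkParser`, `extract_links`, `crawl`, `max_pages` and `max_depth` are hypothetical. It uses the standard library's tolerant `html.parser` so sloppy markup does not stop link extraction, and it guards against simple spider traps with a seen-set, a depth limit and a page budget.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags, even in sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in html, with fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    links = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=50, max_depth=3):
    """Breadth-first crawl; fetch(url) is assumed to return HTML text."""
    seen = {start_url}              # never queue the same URL twice
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue                # one bad page should not kill the crawl
        visited.append(url)
        if depth >= max_depth:
            continue                # depth limit: do not follow endless link chains
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

A real spider would also need to respect robots.txt, throttle its requests, and normalise URLs more carefully. Those parts are left out here, which is exactly why the basic version looks so short.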
It’s been a few years since I wrote my first spider, and I have to say that I’ve been very particular about writing clean HTML and JavaScript ever since. Whether or not that has an impact on search rankings remains to be seen, but at least I’m dreadfully aware of the potential spider traps when developing new sites.
If you’re a technical SEO, I’d highly recommend dabbling with spiders, just to gain some additional perspective.