Corpus for nasty web spam
Researchers who study web spam are limited by the lack of available corpora. One that gets used quite often is “WEBSPAM-UK2007”, released by Yahoo! (there’s also a 2006 version). It’s really useful, but as the authors themselves say, it was generated to aid their research, so it’s biased towards their needs. And you can’t compare results unless they’re tested on the same collection.
The University of Milan downloaded loads of documents for the collection, starting from a set of hosts listed in DMOZ under the .uk domain and following links recursively in breadth-first order. A large group of volunteers then tagged the hosts.
Among the things they found that identified a spam host were the number of keywords in the URL, the anchor text of links, sponsored links, and content copied from search engine results.
- 8123 tagged as “normal”
- 2113 tagged as “spam”
- 426 tagged as “undecided”
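The spam indicators mentioned above (keywords stuffed into the URL, keyword-laden anchor text) can be sketched as a simple feature extractor. This is just an illustration: the keyword list and the feature names here are my own assumptions, not the ones the corpus annotators actually used.

```python
from urllib.parse import urlparse

# Hypothetical keyword list, purely for illustration.
SPAM_KEYWORDS = {"cheap", "free", "viagra", "loan", "casino"}

def url_keyword_count(url: str) -> int:
    """Count how many known spam keywords appear in the URL's host and path."""
    parsed = urlparse(url.lower())
    text = parsed.netloc + parsed.path
    # Split on common URL separators to get word-like tokens.
    tokens = text.replace("-", "/").replace(".", "/").replace("_", "/").split("/")
    return sum(1 for t in tokens if t in SPAM_KEYWORDS)

def host_features(url: str, anchor_texts: list) -> dict:
    """Bundle two simple signals: URL keywords and keyword-laden anchor text."""
    spammy_anchors = sum(
        1 for a in anchor_texts
        if any(k in a.lower() for k in SPAM_KEYWORDS)
    )
    return {
        "url_keywords": url_keyword_count(url),
        "spammy_anchor_links": spammy_anchors,
    }
```

For example, `host_features("http://cheap-loans.example.co.uk/free/casino", ["buy cheap pills"])` flags three URL keywords and one spammy anchor.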
This is also a good resource, listing the characteristics of these nasty spam techniques.
It’s really interesting to research web spam because, at the end of the day, it’s one of the most crippling things for a search engine. It ruins index quality and takes up valuable resources; it also ruins the experience for users, spreading a lot of pain through our information-seeking community. It’s by no means an easy problem to solve. Most detection work looks at links, using classifiers such as SVMs. Maybe it’s time to look beyond links?
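To make the SVM-on-link-features idea concrete, here is a minimal sketch in pure Python: a linear SVM trained by stochastic sub-gradient descent on the hinge loss. The feature vectors (out-degree, URL keyword count) and the synthetic data are stand-ins of my own invention, not the actual WEBSPAM-UK2007 features.

```python
import random

random.seed(42)

def make_host(is_spam):
    """Toy feature vector: [out-degree, URL keyword count] (illustrative only)."""
    if is_spam:
        # Imagined spam profile: many outgoing links, keyword-stuffed URLs.
        return [random.gauss(80, 10), random.gauss(6, 1)], 1
    return [random.gauss(20, 5), random.gauss(0.5, 0.5)], -1

data = [make_host(i % 2 == 0) for i in range(200)]

def train_svm(samples, epochs=50, lam=0.01):
    """Minimise hinge loss + L2 penalty by stochastic sub-gradient descent."""
    w, b, t = [0.0, 0.0], 0.0, 0
    for _ in range(epochs):
        random.shuffle(samples)
        for x, y in samples:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Shrink weights (regularisation), then step on margin violations.
            w = [(1 - eta * lam) * wi for wi in w]
            if margin < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

w, b = train_svm(data)

def predict(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

accuracy = sum(predict(x) == y for x, y in data) / len(data)
```

On this cleanly separated toy data the classifier fits almost perfectly; real link features are far noisier, which is part of why looking beyond links is tempting.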