A very useful and eye-opening paper crossed my desk: “Nullification test collections for web spam and SEO” by Jones, Ramesh, Hawking and Craswell from Canberra University in Australia.
They want to encourage the compilation of a large corpus for adversarial IR research. CMU are building one right now called web09-bst, but the authors think it needs to be improved. Their method is about nullifying sites rather than removing them from the index.
I have always been acutely aware of the issue of good, informative sites that are not optimised being steamrolled by ones that are.
This is not to say that all optimised sites are spam due to over-optimisation (especially compared to non-optimised sites), but that they affect rankings and they may not always be the best results.
The bad techniques we all know about, such as link spam, keyword stuffing and so on, are classed as web spam. SEO is classified as positive when the practice involves streamlining pages, but negative when it involves over-optimisation. It is not easy to make that distinction though.
They mention the Stanford WebBase Project which conducted monthly crawls in 2008/2009 ranging from 61 million to 81 million pages. Web09-bst has a 25 terabyte dataset of about 1 billion web pages crawled in November, 2008. Both contain spam.
They discuss the performance of PageRank, Robust PageRank, TrustRank and Anti-TrustRank. They also discuss the use of standard IR metrics such as MAP, NDCG and infAP.
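For readers who have not met these metrics, here is a minimal sketch of two of them, average precision (the per-query value that MAP averages) and NDCG. This is my own illustration, not code from the paper. The function names are made up for the example, and `average_precision` assumes every relevant document for the query appears somewhere in the ranked list.

```python
import math


def average_precision(relevances):
    """Average precision for one ranked list of binary labels (1 = relevant).

    Assumption: every relevant document for the query is in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant result
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs):
    """MAP: the mean of average precision over several queries."""
    return sum(average_precision(r) for r in runs) / len(runs)


def dcg(gains):
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the ideal (sorted) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    print(average_precision([1, 0, 1, 0]))
    print(mean_average_precision([[1, 0], [0, 1]]))
    print(ndcg([0, 1]))
```

A spam page that pushes a relevant result from position 1 to position 2 lowers both scores, which is why these metrics can be used to measure the damage spam does to a ranking.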
Here are some snippets from the paper. It’s freely available, so you can benefit from it with minimum involvement from me:
“To motivate the idea of nullification as opposed to removal, and to demonstrate that not all content that complicates ranking is also spam”
“…achieving good search results requires the nullification of the thousands of template-driven links and their anchor text.”
“…research into nullifying the negative effect of spam or excessive search engine optimisation (SEO) on the ranking of non-spam pages is not well supported…”
“We introduce the term nullification which we see as preventing problem pages from negatively affecting search results.”
“Research oriented toward measuring the adverse effect of spam and excessive SEO on search engine users cannot be conducted in the absence of sets of realistic queries and corresponding judgments. When selecting queries for evaluation of spam nullification, it is important to select queries of high interest to spammers”.
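To make the difference between removal and nullification concrete, here is a toy sketch on a tiny link graph. It is only one possible reading of the idea and not the authors' method: I assume that "nullifying" a page means keeping it in the index while ignoring the links it casts, and I use a plain power-iteration PageRank as the ranking function. The graph, the page names and the helper functions `remove_pages` and `nullify_pages` are all hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict {page: [outlinked pages]}."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Page with no outlinks: spread its rank evenly over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank


def remove_pages(links, bad):
    """Removal: drop the bad pages and every link pointing at them."""
    return {p: [t for t in ts if t not in bad]
            for p, ts in links.items() if p not in bad}


def nullify_pages(links, bad):
    """Nullification (assumed reading): keep the pages, ignore their outlinks."""
    return {p: ([] if p in bad else list(ts)) for p, ts in links.items()}


# Hypothetical graph: two spam pages exist only to boost "target".
links = {
    "good": ["hub"],
    "hub": ["good"],
    "spam1": ["target"],
    "spam2": ["target"],
    "target": ["spam1", "spam2"],
}
bad = {"spam1", "spam2"}

original = pagerank(links)
removed = pagerank(remove_pages(links, bad))
nullified = pagerank(nullify_pages(links, bad))

if __name__ == "__main__":
    print("target before:", round(original["target"], 3))
    print("target after nullification:", round(nullified["target"], 3))
    print("spam1 still indexed after nullification:", "spam1" in nullified)
    print("spam1 still indexed after removal:", "spam1" in removed)
```

In this sketch the spam pages can still be found after nullification, but they no longer push "target" up the rankings, whereas removal takes them out of the collection altogether.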
This last comment would also point towards using highly popular and competitive search terms for SEOs. While I am in total agreement that over-optimisation is a serious problem for rankings, I am also of the opinion, as the authors are, that sensible SEO which improves pages for the user as well as the engines is beneficial. The sites that do not get on board need to, and this is simply a natural development of life on the web.