I thought I’d share “Countering Web Spam with Credibility-Based Link Analysis” by James Caverlee (Texas A&M University) and Ling Liu (Georgia Institute of Technology) at PODC’07 today.
PageRank, TrustRank and HITS all couple link credibility and page quality, which isn’t ideal because having good links doesn’t necessarily mean that you have a quality page. I think page authority and quality are very important areas of research right now.
So, these guys used a credibility-based link analysis and called it “CredibleRank”. The credibility of information is directly used in the quality assessment of each page. It proves to be way more spam-resilient than both PageRank and TrustRank. These two algorithms rely on the assumption that the quality of a page and the quality of a page’s links correlate. This unfortunately leaves them open to spam.
CredibleRank incorporates credibility information directly into the quality assessment of each page on the Web.
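To make that idea concrete, here is a minimal sketch of a PageRank-style calculation in which each page's vote is scaled by a credibility score. This is not the paper's exact formulation: the function name `credibility_weighted_rank`, the toy graph, the credibility values and the choice to spread the withheld score uniformly are all assumptions made for illustration.

```python
def credibility_weighted_rank(outlinks, credibility, damping=0.85, iterations=100):
    """PageRank-like iteration where a page's votes are scaled by its credibility.

    Assumption: score that a low-credibility page is not allowed to pass on
    is redistributed uniformly, so the scores always sum to 1.
    """
    pages = list(outlinks)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p, targets in outlinks.items():
            for t in targets:
                # A page with credibility 0 passes nothing to its targets.
                new[t] += damping * credibility[p] * rank[p] / len(targets)
        # Whatever was not passed along links is shared evenly.
        leftover = 1.0 - sum(new.values())
        for p in pages:
            new[p] += leftover / n
        rank = new
    return rank


# Hypothetical, deliberately symmetric graph: plain PageRank would score
# "good" and "spam" identically.
toy_graph = {
    "honest": ["good"],
    "shill": ["spam"],
    "good": ["honest"],
    "spam": ["shill"],
}
toy_credibility = {"honest": 1.0, "shill": 0.0, "good": 1.0, "spam": 1.0}
toy_rank = credibility_weighted_rank(toy_graph, toy_credibility)
```

In this toy setup the only difference between "good" and "spam" is the credibility of the page linking to each, so "good" ends up with the higher score.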
They found that a page’s link quality should depend on its own outlinks and that it is related to the quality of the outlinks of its neighbours. So they use the local characteristics of pages and their place in the Web graph, as opposed to the global properties of the entire Web that the other algorithms use.
Relying on a whitelist (a set of known good pages) isn’t very useful because spammers can camouflage their rubbish outlinks to spam pages by also linking to known whitelist pages. They advocate the use of a blacklist (known spam pages) instead, where a page is judged by its proximity to spam pages, so pages are penalised for low-quality outlinks.
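A rough way to picture the blacklist approach is to score each page by the fraction of its outlinks that avoid known spam. The sketch below only looks one hop out and is a simplification, not the credibility measure defined in the paper; `link_credibility`, the example domains and the treatment of pages with no outlinks are assumptions.

```python
def link_credibility(outlinks, blacklist):
    """Score each page by the fraction of its outlinks that avoid the blacklist."""
    scores = {}
    for page, targets in outlinks.items():
        if not targets:
            # Assumption: a page with no outlinks has nothing to be penalised for.
            scores[page] = 1.0
            continue
        bad = sum(1 for t in targets if t in blacklist)
        scores[page] = 1.0 - bad / len(targets)
    return scores


# Hypothetical pages: the camouflaged page links to good pages as cover,
# but its spam outlinks still drag its score down.
outlinks = {
    "honest.example": ["news.example", "wiki.example"],
    "camouflaged.example": [
        "news.example", "spam1.example", "spam2.example", "wiki.example",
    ],
    "spam1.example": ["spam2.example"],
}
blacklist = {"spam1.example", "spam2.example"}
scores = link_credibility(outlinks, blacklist)
```

Linking to whitelisted pages does not rescue the camouflaged page here, because the score is driven by what share of its links reach blacklisted pages.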
“First, the initial score distribution for the iterative PageRank calculation (which is typically taken to be a uniform distribution) can be seeded to favor high credibility pages. While this modification may impact the convergence rate of PageRank, it has no impact on ranking quality since the iterative calculation will converge to a single final PageRank vector regardless of the initial score distribution.”
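The point in that quote, that seeding only changes how the calculation starts and not where it ends up, can be checked on a toy graph. The sketch below runs a standard PageRank power iteration from a uniform start and from a start that heavily favours one page. The graph, the seed values and the names `pagerank`, `uniform_rank` and `seeded_rank` are made up for illustration.

```python
def pagerank(outlinks, init, damping=0.85, tol=1e-12, max_iter=1000):
    """Power-iteration PageRank starting from the score distribution `init`."""
    pages = list(outlinks)
    n = len(pages)
    total = sum(init[p] for p in pages)
    rank = {p: init[p] / total for p in pages}  # normalise the starting scores
    iterations = 0
    for iterations in range(1, max_iter + 1):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, targets in outlinks.items():
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its score evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank, iterations


# Hypothetical four-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

uniform_start = {p: 1.0 for p in graph}
seeded_start = {"a": 100.0, "b": 1.0, "c": 1.0, "d": 1.0}  # favours page "a"

uniform_rank, uniform_steps = pagerank(graph, uniform_start)
seeded_rank, seeded_steps = pagerank(graph, seeded_start)
```

Both runs settle on the same final scores (up to the tolerance), even though the number of iterations they take may differ.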
They found that CredibleRank does not negatively impact good sites: they compared the ranking of each whitelist site under PageRank against its ranking under CredibleRank, and the fluctuation was only 26 spots, so it isn’t unfairly treating clean sites.
So far it proves to be spam-resilient and efficient, and it outperforms TrustRank and PageRank. Excellent stuff.