A really interesting method, aimed in part at combating search result manipulation, is described in “Dr. Searcher and Mr. Browser: A unified hyperlink-click graph” by Poblete (Universitat Pompeu Fabra), Castillo and Gionis (Yahoo).
They worked on a unified graph representation of the web that includes both structural and usage information, modelled as the union of the web’s hyperlink and click graphs. The hyperlink graph shows us the structure of the web, whilst the click graph contains both queries and documents and shows users’ search behaviour. Together, these graphs capture both the browsing (exploring) and the searching (seeking) behaviour of users: if you browse, you’re following edges on the hyperlink graph, and if you search, you’re following edges on the click graph.
“Our most important motivation is to model in a unified way the two main activities of users on the Web: searching and browsing, and at the same time to analyze the effects of random walks on this new graph. The intuition behind this task is to measure how the combination of link structure and usage data provide additional information to that contained in these structures independently.”
Both the hyperlink graph and the click graph have serious drawbacks, though. Google’s PageRank uses the hyperlink graph, but it is wide open to spam and manipulation. Click graphs are sparse (for a page to be clicked, it first needs to appear in the search results for that specific query, and there are many results), and they favour pages that are already ranked highly in the search engine. The authors also note that search engines often emphasise precision over recall, so only results matching the query exactly (via keyword analysis of the text) are returned. This means that many results which are in fact just as relevant, or more so, do not appear for those queries when they ought to.
“The union of these two graphs combines the traditional hyperlink graph, based on connectivity structure, and the click graph, based on search engine usage information. The purpose of this graph is to extend the traditional hyperlink graph into a graph which reflects more accurately users’ natural behavior in the Web.”
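As a rough illustration of that union, here is a minimal Python sketch. The node names and edges are invented for illustration, and the paper of course builds these graphs from real crawl and query-log data rather than hard-coded sets.

```python
from collections import defaultdict

# Hedged sketch: combine hyperlink edges (document -> document) with
# click edges (query -> document) into one unified edge set.
# All nodes and edges below are invented for illustration.

hyperlink_edges = {("doc_a", "doc_b"), ("doc_b", "doc_c"), ("doc_c", "doc_a")}
click_edges = {("query_x", "doc_a"), ("query_x", "doc_b"), ("query_y", "doc_c")}

# The unified hyperlink-click graph is simply the union of the two edge sets:
# documents stay connected by structure, and queries join in via usage data.
unified_edges = hyperlink_edges | click_edges

# Build an adjacency view of the combined graph.
adjacency = defaultdict(set)
for source, target in unified_edges:
    adjacency[source].add(target)

for node in sorted(adjacency):
    print(node, "->", sorted(adjacency[node]))
```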
What they found:
They observed that ranking with the combined hyperlink-click graph was about as good as the best-performing standalone click graph. The two graphs complement each other: incorporating clicks makes the graph much harder to spam, while incorporating the hyperlink structure makes it less sparse, improving the click graph’s connectivity. They say this also lets them account for pages users might visit after issuing particular queries.
The random walk model refers to the idea that a user follows hyperlinks at random from the current webpage; Sergey Brin and Larry Page used it in the PageRank algorithm. In the random walk on the hyperlink-click graph, the walk moves between queries and documents with certain probabilities, and the exact behaviour depends on whether the current state is a document or a query. If it’s a document, there is a certain probability that the next state will be a query for which there are clicks to that document; if it’s a query, there’s a probability that the next state is a document which received clicks from that query; and at each step there is some probability that the next state is random.
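To make that concrete, here is a toy simulation of such a walk in Python. Everything in it (the node names, the edges, and the TELEPORT and TO_QUERY probabilities) is an assumption made for illustration; the paper defines its own transition model, so treat this as a sketch of the idea rather than their algorithm.

```python
import random
from collections import Counter, defaultdict

# Hedged sketch: simulate a random walk over a combined hyperlink-click graph.
# Edges, node names and the TELEPORT / TO_QUERY probabilities are invented.

hyperlinks = {                  # document -> documents it links to
    "doc_a": ["doc_b", "doc_c"],
    "doc_b": ["doc_a"],
    "doc_c": ["doc_b"],
}
clicks = {                      # query -> documents clicked for that query
    "query_x": ["doc_a", "doc_b"],
    "query_y": ["doc_c"],
}

# Reverse click edges: document -> queries whose clicks led to it.
clicked_from = defaultdict(list)
for query, docs in clicks.items():
    for doc in docs:
        clicked_from[doc].append(query)

nodes = list(hyperlinks) + list(clicks)
TELEPORT = 0.15   # probability of jumping to a random state at any step
TO_QUERY = 0.3    # from a document, probability of jumping to a related query

def step(state):
    """One transition; the rule depends on whether we're at a query or a document."""
    if random.random() < TELEPORT:
        return random.choice(nodes)                # random jump
    if state in clicks:                            # at a query: follow a click edge
        return random.choice(clicks[state])
    if clicked_from[state] and random.random() < TO_QUERY:
        return random.choice(clicked_from[state])  # back to a query with clicks here
    if hyperlinks.get(state):
        return random.choice(hyperlinks[state])    # follow an outgoing hyperlink
    return random.choice(nodes)                    # dangling page: restart

def walk_scores(steps=100_000):
    """Approximate the walk's stationary visit frequencies by simulation."""
    counts, state = Counter(), random.choice(nodes)
    for _ in range(steps):
        state = step(state)
        counts[state] += 1
    return {node: count / steps for node, count in counts.items()}

print(sorted(walk_scores().items(), key=lambda kv: -kv[1]))
```

The visit frequencies here play the role of ranking scores; in practice such scores would be computed by iterating over the full transition matrix rather than by simulation.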
The combined graph worked well overall. The scores generated by the random walks were good and stable, tolerant of noisy click-through data, and consistently close to the best hyperlink-graph results. They conclude that this method could be used in search engines to produce less biased results, and that it helps address the issue of manipulation, or “Adversarial IR”.
Next, they are going to look at getting over the bias which is inevitably present when working with usage mining (for example, highly ranked pages tend to be clicked on more often). They’re also going to test their method on other tasks, such as link and click spam detection, as well as similarity search.
Why should you care?
This paper is written by some prominent people in the field of search. It describes a way of dealing with the manipulation of search results, which is what SEOs do when it comes down to it. Even when sites are optimised using white-hat techniques advised by the search engines themselves, they negatively affect the results, simply because many highly relevant sites are not optimised at all.
If a method such as this were widely implemented, it would not necessarily mean the end of SEO, but it would clearly change the practice.
Computer scientists working in IR will like this method because it stands to help clean up their rankings.