A useful method is presented in “Categorizing and Ranking Search Engine’s Results by Semantic Similarity” by Tianyong Hao, Zhi Lu, Shitong Wang, Tiansong Zou, Shenhua Gu, and Liu Wenyin (City University of Hong Kong). Google ranks results according to page relevance and importance, where importance is determined by a variety of factors all mixed into the ranking. It can be argued that “importance” is a hairy topic, seeing as it is so subjective. The authors’ solution is to rank results by semantic similarity instead.
“We first obtain nouns and verbs from snippets obtained from search engine using Name Entity Recognition and part-of speech. A semantic similarity algorithm based on WordNet is proposed to calculate the similarity of each snippet to each of the pre-defined categories. A balanced similarity ranking method combined with Google’s rank and timeliness of the pages is proposed to rank these snippets. Preliminary experiments with 500 labeled questions from TREC03 show that 72.7% are correctly categorized.”
Snippet = paragraph (Title) + paragraph (Abstract) — that is, a result’s title plus its short summary text.
In broad strokes, the method is:
- Get results from Google
- Acquire the snippets
- Extract named entities (names, company/organization names, locations, dates & times, percentages and monetary amounts, that sort of thing)
- Parse (adding part-of-speech tags – all stop words are dropped)
- Calculate similarity between the results (using WordNet)
- Classify each result by topic
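The extraction and classification steps above can be sketched roughly as follows. This is a toy stand-in, not the paper’s implementation: real NER and POS tagging would come from an NLP toolkit, and the exact-match `word_similarity` is a placeholder for the WordNet-based measure; the function names and stop-word list are hypothetical.

```python
# Hypothetical sketch of the pipeline: clean each snippet, then assign it
# to the topic with the highest average word-to-word similarity.

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "is"}

def extract_terms(snippet: str) -> list[str]:
    """Crude stand-in for NER + POS tagging: keep alphabetic non-stop-words."""
    tokens = [t.strip(".,!?").lower() for t in snippet.split()]
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

def word_similarity(w1: str, w2: str) -> float:
    """Placeholder for the WordNet-based measure: exact match only."""
    return 1.0 if w1 == w2 else 0.0

def classify(snippet: str, topics: dict[str, list[str]]) -> str:
    """Pick the topic whose terms are most similar, on average, to the snippet."""
    terms = extract_terms(snippet)
    def score(topic_terms):
        pairs = [(w, t) for w in terms for t in topic_terms]
        return sum(word_similarity(w, t) for w, t in pairs) / len(pairs)
    return max(topics, key=lambda name: score(topics[name]))

topics = {"pets": ["dog", "cat"], "finance": ["stock", "market"]}
print(classify("The dog chased a cat.", topics))  # -> pets
```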
When measuring similarity between strings, edit distance is the usual tool. The authors instead look at concepts rather than surface words alone. The following two ideas are used together:
“…the density of two words on the semantic path is employed because it reflects the weight of categorizing in WordNet. That is, the deeper a word lies, the less it weighs.”
“Since the similarity between the two words is affected by the Information Content of the two concepts, that is, in WordNet taxonomy, the similarity of two concepts should be smaller when the information content of the least common subsumer node (LCS) is smaller. For example, in Fig 2, the word “Carnivore” is the LCS, suppose we change LCS to “Entity”. The instances of “Entity” are much bigger than that of “Carnivore”, thus, the information content of “Entity” is smaller. Therefore, the similarity between “Dog” and “Cat” will be smaller”.
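The depth and LCS ideas in the two quotes can be illustrated with a toy taxonomy. The scoring below is a Wu–Palmer-style measure, not the paper’s exact formula, and the miniature hypernym tree is invented for illustration: the deeper the least common subsumer, the higher the similarity.

```python
# Toy hypernym tree: each word maps to its parent concept; "entity" is the root.
parent = {
    "carnivore": "animal",
    "animal": "entity",
    "dog": "carnivore",
    "cat": "carnivore",
    "plant": "entity",
}

def path_to_root(node: str) -> list[str]:
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path  # e.g. ["dog", "carnivore", "animal", "entity"]

def depth(node: str) -> int:
    return len(path_to_root(node)) - 1  # the root "entity" has depth 0

def lcs(a: str, b: str) -> str:
    """Least common subsumer: the deepest shared ancestor of a and b."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return node

def similarity(a: str, b: str) -> float:
    """Wu-Palmer-style: scores rise as the LCS sits deeper in the tree."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))

print(lcs("dog", "cat"))            # carnivore
print(similarity("dog", "cat"))     # high: the LCS "carnivore" is deep
print(similarity("dog", "plant"))   # 0.0: the LCS is the root "entity"
```

This matches the quote’s example: swap the LCS “Carnivore” for the shallower, less informative “Entity” and the similarity between “Dog” and “Cat” drops.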
The snippet-to-topic score is then the average of these pairwise word similarities: Avg_Sim(Snippet, Topic).
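A minimal sketch of that averaging step, assuming a word-level similarity function is already available (the toy exact-match `word_sim` below stands in for the WordNet-based measure):

```python
def word_sim(w1: str, w2: str) -> float:
    """Toy stand-in: 1.0 for identical words, else 0.0."""
    return 1.0 if w1 == w2 else 0.0

def avg_sim(snippet_terms: list[str], topic_terms: list[str]) -> float:
    """Average similarity over all (snippet word, topic word) pairs."""
    pairs = [(s, t) for s in snippet_terms for t in topic_terms]
    return sum(word_sim(s, t) for s, t in pairs) / len(pairs)

# 2 exact matches out of 6 pairs:
print(avg_sim(["dog", "cat", "food"], ["dog", "cat"]))  # 0.333...
```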
They use Google’s original ranking results and re-rank using:
- semantic similarity of the snippet to the current topic (as described above)
- Timeliness (the difference between the page’s cached time and the current time)
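One way to picture the re-ranking is as a weighted combination of the three signals. The weights and the exact combination below are illustrative assumptions, not the paper’s balanced ranking formula:

```python
import time

def rerank_score(google_rank: int, similarity: float,
                 cached_ts: float, now: float,
                 w_rank: float = 0.3, w_sim: float = 0.5,
                 w_time: float = 0.2) -> float:
    """Blend original rank, semantic similarity, and freshness (illustrative)."""
    rank_score = 1.0 / google_rank       # higher for results Google ranked on top
    age_days = (now - cached_ts) / 86400
    freshness = 1.0 / (1.0 + age_days)   # decays as the cached copy ages
    return w_rank * rank_score + w_sim * similarity + w_time * freshness

now = time.time()
# A fresh, on-topic page ranked 8th by Google...
fresh_but_low = rerank_score(8, similarity=0.9, cached_ts=now - 86400, now=now)
# ...versus a stale, off-topic page Google ranked 1st:
stale_but_top = rerank_score(1, similarity=0.2, cached_ts=now - 30 * 86400, now=now)
print(fresh_but_low > stale_but_top)  # True: similarity and freshness win out
```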
Over 70% of the results were correctly classified, and the overall ranking proved effective. The true evaluation here, however, is human evaluation, or user evaluation more specifically. The main conclusion, though: it works.
Why should you care?
New ranking algorithms are showing up all the time, and more and more of them are doing better and better. PageRank has its flaws, as we all know, and those flaws have been a popular topic in the research community. New ways of ranking information, like the one presented here, remind us that there is more than one way to do it, and they call existing techniques into question. Clearly PageRank is great, a wonderful way of creating some order, but it would be good to see it extended as well.
To the SEO, this adds more dimensions to the site-placement issue, and more ways than one of measuring success in the rankings. The nice thing is that you get to collect a lot more information about the semantics of a site and where in the world it fits in relation to other sites. What makes one site closer in similarity to another? That becomes an important question.