Ranking by semantic similarity

A useful method is presented in “Categorizing and Ranking Search Engine’s Results by Semantic Similarity” by Tianyong Hao, Zhi Lu, Shitong Wang, Tiansong Zou, Shenhua Gu and Liu Wenyin (University Hong Kong). Google ranks results according to page relevance and importance, and importance is determined by a variety of factors that all feed into the ranking mix. It can be argued that “importance” is a hairy concept, seeing as it’s so subjective. The solution proposed here is to rank results by semantic similarity instead.

“We first obtain nouns and verbs from snippets obtained from search engine using Name Entity Recognition and part-of speech. A semantic similarity algorithm based on WordNet is proposed to calculate the similarity of each snippet to each of the pre-defined categories. A balanced similarity ranking method combined with Google’s rank and timeliness of the pages is proposed to rank these snippets. Preliminary experiments with 500 labeled questions from TREC03 show that 72.7% are correctly categorized.”

Snippet = paragraph (Title) + paragraph (Abstraction)

In very basic terms, the method is:

- Get results from Google

- Acquire the snippets

- Extract named entities (names, company/organization names, locations, dates and times, percentages, monetary amounts, that sort of thing)

- Stem

- Parse (adding part-of-speech tags – all stop words are dropped)

- Calculate similarity between the results (Using WordNet)

- Classify each result by topic

- Re-rank
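The preprocessing steps in the list above can be sketched in a few lines. This is a rough illustration only, not the authors' code: the stop-word list, the `crude_stem` suffix stripper and the `preprocess` function are all made up for this example, and a real system would use a proper named-entity recognizer, part-of-speech tagger and stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def crude_stem(word):
    """Strip a few common English suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(title, abstract):
    """Build the snippet (title + abstract), drop stop words, stem the rest."""
    snippet = f"{title} {abstract}".lower()
    tokens = re.findall(r"[a-z]+", snippet)
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


terms = preprocess("Ranking search results", "Dogs and cats are carnivores")
print(terms)
```

The output of `preprocess` is the list of stemmed content words that the similarity step would then compare against each topic.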

Similarity measure:

Similarity between pieces of text is often measured at the string level, for example with edit distance. The authors look at concepts rather than words alone. The following two ideas are used together:

“…the density of two words on the semantic path is employed because it reflects the weight of categorizing in WordNet. That is, the deeper a word lies, the less it weighs.”

“Since the similarity between the two words is affected by the Information Content of the two concepts, that is, in WordNet taxonomy, the similarity of two concepts should be smaller when the information content of the least common subsumer node (LCS) is smaller. For example, in Fig 2, the word “Carnivore” is the LCS, suppose we change LCS to “Entity”. The instances of “Entity” are much bigger than that of “Carnivore”, thus, the information content of “Entity” is smaller. Therefore, the similarity between “Dog” and “Cat” will be smaller”.
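The “Dog” and “Cat” example in the quote can be tried out on a toy taxonomy. The sketch below uses a Resnik-style measure, where the similarity of two concepts is simply the information content of their least common subsumer. That is a swapped-in simplification: the paper combines information content with path density, which is not shown here. The taxonomy and the instance counts are invented for the example.

```python
import math

# Hypothetical mini-taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "carnivore": "animal",
    "dog": "carnivore",
    "cat": "carnivore",
    "plant": "entity",
}

# Made-up instance counts; each count includes everything below the concept.
COUNT = {"entity": 1000, "animal": 400, "carnivore": 50,
         "dog": 20, "cat": 15, "plant": 300}


def ancestors(concept):
    """Return the path from a concept up to the root, concept first."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def least_common_subsumer(a, b):
    """Deepest concept that is an ancestor of both a and b."""
    seen = set(ancestors(a))
    for concept in ancestors(b):
        if concept in seen:
            return concept
    raise ValueError("concepts share no ancestor")


def information_content(concept):
    """IC = -log p(concept), with p estimated from the toy counts."""
    return -math.log(COUNT[concept] / COUNT["entity"])


def resnik_similarity(a, b):
    """Resnik-style similarity: the information content of the LCS."""
    return information_content(least_common_subsumer(a, b))


print(least_common_subsumer("dog", "cat"))
print(resnik_similarity("dog", "cat"), resnik_similarity("dog", "plant"))
```

With these counts the LCS of “dog” and “cat” is “carnivore”, which is rare and therefore informative. The LCS of “dog” and “plant” is the root “entity”, whose information content is zero, so that pair scores as less similar.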

So each snippet gets an average similarity score against each topic: Avg_Sim(Snippet, Topic)
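The exact formula is not reproduced here, so the following is only one plausible reading of an average snippet-to-topic similarity: for every snippet word take its best match among the topic words, then average those scores. The function names, the toy score table and the topic definitions are all assumptions for illustration; in the paper the word-to-word scores come from WordNet.

```python
def avg_sim(snippet_words, topic_words, word_sim):
    """Average, over snippet words, of the best similarity to any topic word."""
    if not snippet_words or not topic_words:
        return 0.0
    best = [max(word_sim(s, t) for t in topic_words) for s in snippet_words]
    return sum(best) / len(best)


# Toy word-to-word scores standing in for a WordNet-based measure.
TOY_SCORES = {
    ("dog", "cat"): 0.8,
    ("dog", "car"): 0.1,
    ("bone", "cat"): 0.2,
    ("bone", "car"): 0.1,
}


def toy_word_sim(a, b):
    """Symmetric lookup in the toy table; identical words score 1.0."""
    if a == b:
        return 1.0
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))


def classify(snippet_words, topics, word_sim):
    """Pick the topic with the highest average similarity."""
    return max(topics, key=lambda name: avg_sim(snippet_words, topics[name], word_sim))


topics = {"animals": ["cat", "dog"], "vehicles": ["car"]}
print(classify(["dog", "bone"], topics, toy_word_sim))
```

Taking the best match per word rather than averaging over all pairs keeps a long topic word list from dragging the score down, but that is a design choice of this sketch, not something the source states.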

Ranking:

They use Google’s original ranking results and re-rank using:

- Semantic similarity of the snippet to the current topic (as described above)

- Impact over time (the difference between cached time and current time)
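To make the re-ranking idea concrete, here is a minimal sketch that blends the three signals into one score. The weights, the half-life decay for timeliness and the linear mapping of Google's rank are all hypothetical choices made for this example; the paper's actual balancing formula is not given in this post.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    google_rank: int    # 1 = top result in the original list
    similarity: float   # snippet-to-topic similarity in [0, 1]
    age_days: float     # current time minus cached time, in days


def rank_score(result, total):
    """Map the original rank to (0, 1], with the best rank scoring 1.0."""
    return (total - result.google_rank + 1) / total


def freshness(result, half_life_days=30.0):
    """Decay from 1.0 toward 0 as the cached copy gets older."""
    return 0.5 ** (result.age_days / half_life_days)


def rerank(results, w_sim=0.5, w_rank=0.3, w_time=0.2):
    """Sort results by a weighted blend of similarity, original rank and freshness."""
    total = len(results)

    def score(r):
        return (w_sim * r.similarity
                + w_rank * rank_score(r, total)
                + w_time * freshness(r))

    return sorted(results, key=score, reverse=True)


results = [
    Result("a.example", 1, 0.2, 90),
    Result("b.example", 2, 0.9, 10),
    Result("c.example", 3, 0.5, 300),
]
for r in rerank(results):
    print(r.url)
```

In this toy run the second Google result moves to the top because it is both the most similar to the topic and the most recently cached.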

So…

In the preliminary experiments 72.7% of the 500 test questions were correctly categorized, and the overall ranking was found to be effective. That said, the true test for a method like this is human evaluation, or user evaluation more specifically. The main conclusion though: it works.

Why should you care?

New ranking algorithms show up all the time, and more and more of them perform well. PageRank has its flaws, as we all know, and they have been a popular topic in the research community. New ways of ranking information like the one presented here remind us that there is more than one way to do it. They also call existing techniques into question. Clearly PageRank is great, it’s a wonderful way of creating some order, but it would be good to see it extended as well.

For the SEO this adds more dimensions to the site placement issue, and more than one way of measuring success in the rankings. The nice thing is that you get to collect a lot more information about the semantics of the site and where it fits in relation to other sites. What makes one site closer in similarity to another? This would be an important question.

© 2009-2013 Science for SEO All Rights Reserved
