I came across a cool paper called "Measuring the Similarity between Implicit Semantic Relations using Web Search Engines" by Bollegala, Matsuo and Ishizuka from the University of Tokyo (WSDM 09).
It's all about calculating the implicit semantic relatedness between word pairs using search engines: you enter two words and it returns the relation that holds between them.
[Google, YouTube] – here the relation is ACQUIRER-ACQUIREE. A similar pair would be [Yahoo, Inktomi], for example. You could find all the word pairs that share this relation if you wanted to.
[ostrich, bird] and [lion, cat] – an ostrich is a large bird and a lion is a large cat, so the implicit relation is LARGE.
[Muslim church] should return “mosque”
[Hindu bible] should return “the Vedas”
Existing keyword-based search engines can't really do this: they retrieve documents that match the user's query, not the relationships between the keywords provided.
How it works:
1. A query is entered.
2. A web search is run to find the context of the word pairs.
3. Lexical patterns are extracted from the search results.
4. The patterns are clustered using feature vectors.
5. Inter-cluster correlation is computed.
6. The relational similarity score is calculated.
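The extraction and clustering steps (3–4) can be sketched in a few lines. This is my own simplification, not the paper's exact algorithm: each snippet has the word pair replaced by X/Y placeholders, the words in between become the pattern, and the pattern frequencies form the feature vector.

```python
import re
from collections import Counter

def extract_patterns(snippet, a, b, window=4):
    """Replace the word pair with X/Y placeholders and keep up to
    `window` words between them as a lexical pattern (a simplified
    version of the paper's pattern extraction)."""
    regex = re.compile(
        rf"\b{re.escape(a)}\b((?:\W+\w+){{0,{window}}}?)\W+\b{re.escape(b)}\b",
        re.IGNORECASE,
    )
    patterns = []
    for m in regex.finditer(snippet):
        middle = m.group(1).strip()
        patterns.append(f"X {middle} Y" if middle else "X Y")
    return patterns

def pattern_vector(snippets, a, b):
    """Aggregate pattern frequencies over all snippets for one word
    pair -- the feature vector that pattern clustering operates on."""
    counts = Counter()
    for s in snippets:
        counts.update(extract_patterns(s, a, b))
    return counts

# Toy snippets; in the real system these come from the search engine.
snippets = [
    "Google acquires YouTube for $1.65 billion in stock.",
    "Google officially acquires YouTube.",
]
print(pattern_vector(snippets, "Google", "YouTube"))
```

Two word pairs that produce similar pattern vectors are likely to share a semantic relation, which is what the clustering step exploits.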
The lexical patterns are automatically extracted, and the similarity between different semantic relations is measured using an inter-cluster correlation matrix.
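The point of the correlation matrix is that two pairs can share a relation even when they never share a surface pattern. Here is a toy stand-in for that idea (the cluster labels, counts, and correlation value are all invented, and the paper's actual measure differs in detail):

```python
def correlated_similarity(fa, fb, corr):
    """Correlation-weighted inner product between two pattern-cluster
    frequency vectors: sum over i, j of fa[i] * corr(i, j) * fb[j],
    with corr(i, i) = 1. A toy stand-in for the paper's measure."""
    keys = set(fa) | set(fb)
    score = 0.0
    for i in keys:
        for j in keys:
            c = 1.0 if i == j else corr.get((i, j), corr.get((j, i), 0.0))
            score += fa.get(i, 0) * c * fb.get(j, 0)
    return score

# One pair is expressed via "acquires", the other via "buys". Without
# the correlation entry the similarity is 0; with it, the two pattern
# clusters are allowed to overlap.
fa = {"X acquires Y": 4}
fb = {"X buys Y": 3}
corr = {("X acquires Y", "X buys Y"): 0.8}
print(correlated_similarity(fa, fb, corr))   # 9.6
print(correlated_similarity(fa, fb, {}))     # 0.0
```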
More on the web-search method:
They use text snippets returned by a Web search engine as an approximation of the context of two words.
They also issue multiple queries per word pair, each inducing a different ranking; since rankings differ with the number of wildcards used, they aggregate the search results across queries.
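Generating those multiple queries might look something like this. The exact query forms the paper uses differ; this sketch just shows the idea of varying the wildcard count and word order:

```python
def build_queries(a, b, max_wildcards=2):
    """Generate phrase queries with a varying number of '*' wildcards
    between the two words, in both orders. Illustrative only -- the
    paper's actual query set is different."""
    queries = []
    for n in range(max_wildcards + 1):
        for first, second in ((a, b), (b, a)):
            parts = [first] + ["*"] * n + [second]
            queries.append('"' + " ".join(parts) + '"')
    return queries

print(build_queries("Google", "YouTube", max_wildcards=1))
# ['"Google YouTube"', '"YouTube Google"', '"Google * YouTube"', '"YouTube * Google"']
```

Each query is sent to the search engine separately and the returned snippets are pooled before pattern extraction.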
Similarities between words:
Attributional similarity measure: if two words show a high degree of attributional similarity, they are called synonyms.
Relational similarity measure: word pairs that show a high degree of relational similarity are considered analogies.
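The contrast between the two measures can be shown with cosine similarity over feature vectors. The vectors below are toy counts I made up: attributional similarity compares the context features of single words, while relational similarity compares the pattern features of word pairs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Attributional: context features of single words (invented counts).
ostrich = {"feathers": 3, "egg": 2, "runs": 1}
emu     = {"feathers": 2, "egg": 3, "runs": 2}

# Relational: pattern features of word *pairs* (invented counts).
google_youtube = {"X acquires Y": 5, "X buys Y": 2}
yahoo_inktomi  = {"X acquires Y": 3, "X buys Y": 1}

print(cosine(ostrich, emu))                   # high -> near-synonyms
print(cosine(google_youtube, yahoo_inktomi))  # high -> analogous pairs
```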
Why it’s difficult to do:
1. Relational similarity is a dynamic phenomenon: relations between companies, people and so on change constantly.
2. All relations between the two words in each word pair have to be extracted before similarity can be measured.
3. There can be more than one way a particular semantic relation is expressed in text.
4. WordNet does not cover all the named entities (nouns, proper nouns) that occur in queries.
Why it’s good:
It performs really well. It “significantly outperforms the state-of-the-art relational similarity measure in a relation classification task”.
It doesn't require NLP processing to complete the task.
It’s language independent.
Why should you care?
It allows you to find the relationships between different entities, words, or whatever on the web. It gives real insight into how search engines function, or could function, when it comes to putting queries in context. As an SEO professional, it gives you a further method to look into your keywords; as a computing professional, it gives you an interesting idea to build on. It also fits nicely into a lot of different systems.