I’m a big word fan, a pattern spotter, and an enthusiast of this paper: “Measuring Semantic Similarity between Words Using Web Search Engines” by Bollegala, Matsuo and Ishizuka (University of Tokyo & AIST).
It’s a way to measure how closely related words are using web search engines. The method exploits both the page counts and the text snippets a search engine returns. Given two words P and Q, the page counts for the queries P, Q and “P AND Q” feed co-occurrence-based similarity scores, while lexico-syntactic patterns are extracted from the text snippets. These scores are then integrated using SVMs (Support Vector Machines – explained shortly), with similarity measured against WordNet synsets.
One of their examples is “apple”, which can be associated with the fruit or with the computer company. Most machine-readable thesauri and dictionaries don’t list the computer company, though. New words and expressions are coined all the time, and existing ones pick up new senses. The web tends to reflect this quite quickly, whereas it can take quite some time before they get added to existing thesauri and dictionaries.
P AND Q = Global measure of co-occurrence of P and Q (“Apple” AND “computer”)
“For example, the page count of the query “apple” AND “computer” in Google is 288,000,000, whereas the same for “banana” AND “computer” is only 3,590,000. The more than 80 times more numerous page counts for “apple” AND “computer” indicate that apple is more semantically similar to “computer” than is “banana.”
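The page-count scores the paper combines are standard co-occurrence measures computed over these counts. Here is a minimal sketch of one of them, a Jaccard coefficient over page counts, using the co-occurrence figures quoted above; the individual page counts for “apple”, “banana” and “computer” are illustrative assumptions, not numbers from the paper.

```python
def web_jaccard(count_p, count_q, count_p_and_q, c=5):
    """Jaccard coefficient over page counts.

    Returns 0 when the co-occurrence count is at or below a small
    threshold c, damping words that co-occur only by chance on a
    noisy web.
    """
    if count_p_and_q <= c:
        return 0.0
    return count_p_and_q / (count_p + count_q - count_p_and_q)

# Co-occurrence counts are from the quote above; the single-word
# counts are assumed for illustration.
apple_computer = web_jaccard(977_000_000, 2_450_000_000, 288_000_000)
banana_computer = web_jaccard(113_000_000, 2_450_000_000, 3_590_000)

print(apple_computer > banana_computer)  # "apple" scores closer to "computer"
```

The same counts can feed other measures (overlap, Dice, PMI); the paper uses four such scores side by side rather than picking one.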
Issues with this:
- it ignores the position of the words within a page (even if two words occur on the same page, they might not be related).
- the page count of a polysemous word (a word with multiple senses) conflates all of its senses.
- because the web is so noisy, some words co-occur on pages purely by chance.
Snippets, on the other hand, give us a snapshot of what a page is about without our having to download and process the whole thing. The problem with using them is that, at the scale of the web, we can only efficiently process the top results. There’s also no guarantee that the semantic information we need will actually appear in the snippets.
“…let us consider the following snippet from Google for the query Jaguar AND cat.
“The Jaguar is the largest cat in Western Hemisphere and can subdue larger prey than can the puma”
Here, the phrase is the largest indicates a hypernymic relationship between the Jaguar and the cat. Phrases such
as also known as, is a, part of, is an example of all indicate various semantic relations. Such indicative phrases
have been applied to numerous tasks with good results, such as hyponym extraction and fact extraction. From the previous example, we form the pattern X is the largest Y, where we replace the two words Jaguar and cat by two wildcards X and Y.”
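The wildcard step above is mechanical enough to sketch directly: replace the two query words in a snippet with X and Y and keep the words between them. This is a simplified assumption of the paper’s extraction (which also keeps a limited window of context around the pattern); the function name is mine.

```python
import re

def extract_pattern(snippet, word_p, word_q):
    """Replace the two query words with wildcards X and Y and
    return the subsequence running from X to Y, if any."""
    text = snippet.lower()
    text = re.sub(r'\b' + re.escape(word_p.lower()) + r'\b', 'X', text)
    text = re.sub(r'\b' + re.escape(word_q.lower()) + r'\b', 'Y', text)
    match = re.search(r'X\b(.*?)\bY', text)
    return 'X' + match.group(1) + 'Y' if match else None

snippet = ("The Jaguar is the largest cat in Western Hemisphere "
           "and can subdue larger prey than can the puma")
print(extract_pattern(snippet, "Jaguar", "cat"))
# → X is the largest Y
```

Run over many snippets for many word pairs, these patterns become the features the method ranks and feeds to the classifier.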
Jaguar is a cat, a car brand and also an operating system for computers. A user who searches for Jaguar on the web may be interested in any one of these, yet WordNet lists only one of them.
- Extract snippets
- Process them to extract lexico-syntactic patterns
- Rank the extracted patterns
The authors integrate the page-counts-based similarity scores with the lexico-syntactic patterns using support vector machines (SVMs).
A two-class SVM is used to find the optimal combination of the page-counts-based similarity scores and the top-ranked patterns. The SVM classifies word pairs as synonymous or non-synonymous, and is trained to do so using WordNet synsets. The output from the SVM is converted into a posterior probability, and the semantic similarity of two words is then the posterior probability that they belong to the synonymous-words class.
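That last conversion step — turning a raw SVM output into a posterior probability — is typically done with Platt scaling: a sigmoid fitted over the SVM’s decision value. A minimal sketch of just that step; the parameters A and B below are placeholder assumptions, in practice they are fitted on held-out data.

```python
import math

def platt_posterior(decision_value, A=-1.0, B=0.0):
    """Map an SVM decision value to P(synonymous | features)
    via a sigmoid (Platt scaling). A and B are illustrative."""
    return 1.0 / (1.0 + math.exp(A * decision_value + B))

# A word pair far on the "synonymous" side of the margin gets a
# posterior near 1; a pair far on the other side gets one near 0.
print(platt_posterior(2.5))   # high similarity
print(platt_posterior(-2.5))  # low similarity
```

The appeal of the probabilistic output is that it gives a similarity score on a fixed [0, 1] scale, directly comparable across word pairs.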
Their method was more accurate than every other method they evaluated against: it showed a high correlation with human ratings and outperformed plain WordNet-based approaches. Their next step is “to apply the proposed semantic similarity measure in automatic synonym extraction, query suggestion and name alias recognition”.
Why should you care?
To a degree, it’s a new way to do keyword research: you can find which sites share the same types of keywords and, at the same time, discover new keywords related to the topic area. It’s not rocket science, but it is certainly a very useful method, so there is nothing stopping anyone from implementing this and experimenting with it (all credit given where it is due, of course).