I’m a big word fan, a pattern spotter, and an enthusiast of this paper: “Measuring Semantic Similarity between Words Using Web Search Engines” by Bollegala, Matsuo and Ishizuka (University of Tokyo & AIST).
It’s a way to measure how closely related words are using web search engines. The method exploits the page counts and text snippets returned in the search results. Given two words P and Q, it uses the page counts for the queries P, Q and P AND Q. It also extracts lexico-syntactic patterns from the text snippets as further evidence of semantic similarity. The two kinds of scores are then integrated using SVMs (Support Vector Machines, explained shortly), and the similarity measure makes use of WordNet synsets.
One of their examples is “apple”, which can be associated with the fruit or with the computer company. Most machine-readable thesauri and dictionaries don’t list the computer company though. New words and expressions are coined, and existing ones change, all the time. The web tends to reflect this quite quickly, whereas it can take quite some time before they get added to existing thesauri and dictionaries.
Co-occurrence measure:
P AND Q = Global measure of co-occurrence of P and Q (“Apple” AND “computer”)
“For example, the page count of the query “apple” AND “computer” in Google is 288,000,000, whereas the same for “banana” AND “computer” is only 3,590,000. The more than 80 times more numerous page counts for “apple” AND “computer” indicate that apple is more semantically similar to “computer” than is “banana.””
Issues with this:
- ignores the location of the words in the pages (even if 2 words occur in a page, they might not be related).
- page count of a polysemous word (a word with multiple senses) might contain a combination of all its senses.
- because the web is so noisy, some words might occur arbitrarily on some pages.
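A Jaccard-style coefficient is one natural way to turn the three page counts into a single score. The sketch below is only an illustration under assumptions: the joint counts are the ones quoted above, but the single-word counts are made up, and the `min_cooccurrence` cutoff is a hypothetical parameter meant to blunt the noise problem from the list above.

```python
def web_jaccard(count_p, count_q, count_p_and_q, min_cooccurrence=5):
    """Jaccard-style similarity computed from search-engine page counts."""
    # Very small joint counts are treated as noise and scored as zero.
    if count_p_and_q <= min_cooccurrence:
        return 0.0
    union = count_p + count_q - count_p_and_q
    return count_p_and_q / union if union > 0 else 0.0


# Joint page counts quoted in the example above.
apple_and_computer = 288_000_000
banana_and_computer = 3_590_000
ratio = apple_and_computer / banana_and_computer  # a little over 80

# Hypothetical single-word page counts, invented for illustration only.
apple, banana, computer = 1_000_000_000, 200_000_000, 3_000_000_000

apple_score = web_jaccard(apple, computer, apple_and_computer)
banana_score = web_jaccard(banana, computer, banana_and_computer)
print(apple_score > banana_score)
```

Normalising by the single-word counts matters because a very common word will co-occur with almost anything, so the raw joint count on its own can mislead.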
Using Snippets:
Snippets give us a snapshot of what a page is about without having to download and process the whole page. The problem is that, because of the scale of the web, we can only efficiently process the top results. There’s also no guarantee that the semantic information we need will be available in the snippets.
An example:
“…let us consider the following snippet from Google for the query Jaguar AND cat.
“The Jaguar is the largest cat in Western Hemisphere and can subdue larger prey than can the puma”
Here, the phrase is the largest indicates a hypernymic relationship between the Jaguar and the cat. Phrases such as also known as, is a, part of, is an example of all indicate various semantic relations. Such indicative phrases have been applied to numerous tasks with good results, such as hyponym extraction [12] and fact extraction [27]. From the previous example, we form the pattern X is the largest Y, where we replace the two words Jaguar and cat by two wildcards X and Y.”
Jaguar is a cat, a car brand and also an operating system for computers. A user who searches for Jaguar on the web may be interested in any one of these. WordNet lists only one of these senses.
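To make the wildcard idea concrete, here is a minimal sketch that turns the Jaguar snippet into the pattern quoted above. It is a simplification of my own, not the authors' extraction algorithm: the function name `snippet_to_pattern` is hypothetical, and it simply keeps the text from the first wildcard through the last one.

```python
import re


def snippet_to_pattern(snippet, word_x, word_y):
    """Replace the two query words with wildcards X and Y and keep the
    text running from the first wildcard through the last one."""
    marked = []
    for token in re.findall(r"\w+", snippet):
        lowered = token.lower()
        if lowered == word_x.lower():
            marked.append("X")
        elif lowered == word_y.lower():
            marked.append("Y")
        else:
            marked.append(lowered)
    positions = [i for i, t in enumerate(marked) if t in ("X", "Y")]
    # Both words must appear in the snippet, otherwise there is no pattern.
    if {marked[i] for i in positions} != {"X", "Y"}:
        return None
    return " ".join(marked[positions[0]:positions[-1] + 1])


snippet = ("The Jaguar is the largest cat in Western Hemisphere "
           "and can subdue larger prey than can the puma")
pattern = snippet_to_pattern(snippet, "Jaguar", "cat")
print(pattern)  # X is the largest Y
```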
Method:
- Extract snippets
- Process them to extract lexico-syntactic patterns
- Rank the patterns extracted
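The ranking step isn't spelled out here, so the sketch below swaps in a deliberately simple stand-in: score each pattern by how much more often it shows up in snippets for synonymous pairs than for non-synonymous ones. The scoring rule, the function name `rank_patterns` and the toy pattern lists are all assumptions for illustration, not the paper's ranking criterion.

```python
from collections import Counter


def rank_patterns(synonym_patterns, other_patterns):
    """Rank patterns by relative frequency among synonymous pairs minus
    relative frequency among non-synonymous pairs (highest first)."""
    syn_counts = Counter(synonym_patterns)
    other_counts = Counter(other_patterns)
    syn_total = sum(syn_counts.values()) or 1
    other_total = sum(other_counts.values()) or 1
    scores = {
        pattern: syn_counts[pattern] / syn_total
        - other_counts[pattern] / other_total
        for pattern in syn_counts
    }
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores, key=lambda p: (-scores[p], p))


# Toy data, invented for illustration only.
from_synonyms = ["X also known as Y"] * 3 + ["X is a Y"] * 2 + ["X and Y"]
from_others = ["X and Y"] * 4 + ["X is a Y"]

ranked = rank_patterns(from_synonyms, from_others)
print(ranked)
```

A pattern like "X and Y" ends up at the bottom because it turns up between all sorts of unrelated words, which is the kind of pattern a ranking step should push down.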
SVM:
The authors integrate page-count-based similarity scores with lexico-syntactic patterns using support vector machines (SVMs).
They use two-class SVMs to find the optimal combination of page-count-based similarity scores and top-ranking patterns. The SVM is trained to classify word pairs as synonymous or non-synonymous, using WordNet synsets as training data. The output from the SVM is converted into a posterior probability, and the semantic similarity is then the posterior probability that the 2 words belong to the synonymous-words class.
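One common way to turn a raw SVM output into a posterior probability is to pass it through a sigmoid (often called Platt scaling). The sketch below assumes that approach; the parameters `a` and `b` are placeholders that would normally be fitted on held-out data, and the decision values are made up.

```python
import math


def posterior_from_svm_output(decision_value, a=-1.0, b=0.0):
    """Map a raw SVM decision value to a probability with a sigmoid.

    a and b are placeholder parameters here; in practice they are
    fitted on held-out labelled word pairs.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Made-up decision values: positive means "looks synonymous".
for value in (-2.0, 0.0, 2.0):
    print(value, round(posterior_from_svm_output(value), 3))
```

With these placeholder parameters, a word pair sitting exactly on the decision boundary gets a similarity of 0.5, and pairs further on the synonymous side get scores closer to 1.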
Results:
Their method is more accurate than any of the other methods they evaluated it against. They found a high correlation with human ratings, and they managed to outperform plain WordNet-based approaches. Their next step is “to apply the proposed semantic similarity measure in automatic synonym extraction, query suggestion and name alias recognition”.
Why should you care?
To a degree, it’s a new way to do keyword research: you can find which sites have the same types of keywords and at the same time find new ones related to the topic area. It’s not rocket science, but it is certainly a very useful method, and there is nothing stopping anyone from implementing it and experimenting with it (all credit given where it is due, of course).


A very interesting article. I’m curious about your thoughts on how it could be applied to keyword research.
The way keyword research is done today is very one-dimensional for the most part. The end goal is usually to be as expansive as possible, and many times thesaurus or WordNet expansions can suffice.
With the proposed methodology, the researchers are discovering analogous relationships between keyword pairs. I would imagine that it could prove useful to generate new ideas on topics to pursue or articles to write.
What are your thoughts on the additional applications?
I definitely find the concept interesting. Filtering for related keywords, such as seasonal or negative-match ones, would be compelling and time-saving, especially if done in an automated fashion.
That said, even the “faster” speed of 100 keyword pairs in 6 hours still seems like a lengthy time, especially when I’ve been working with millions of distinct keyword phrases.
Lastly, I’m curious whether you think the results of the research would have been improved or diminished if they had used Google, Microsoft, or Ask snippets instead of just Yahoo! BOSS.