How does a search engine know what words mean?

 


Word sense disambiguation (WSD) belongs to the field of computational linguistics. It’s the research area dedicated to finding ways for machines to work out the meaning of words. More precisely, it’s about determining which sense of a word is intended in a given context. This matters because without it, search engines, machine translation systems, dialogue systems, speech systems and many other applications struggle to function properly (or at all).

How hard is it?

It’s described as an “AI-complete” problem, which means that solving it is at least as hard as the most difficult problems in artificial intelligence. Researchers began looking into it in 1949 in the field of machine translation. It’s so hard for machines to deal with this problem that it still hasn’t been resolved. SENSEVAL evaluation workshops take place regularly, where systems are evaluated and findings are shared and discussed.

WSD draws on a lot of the work done in cognitive linguistics, psychology, linguistics and artificial intelligence. Some are surprised to hear that, when we approach language problems like this, the philosophy of language is also researched and discussed.

Problems:

We don’t even know enough about how humans process world knowledge and link it to grammar, performing all types of linguistic manipulations to communicate and understand language. Some believe we learn it from a young age, others that it is built into us, and there is always the possibility that we don’t use such complex methods and we abstract everything. Steven Pinker, a very well-known linguist, has written a nice overview. You can find a list of linguists categorised under their particular theory on Wikipedia.

Another issue is that words can have more than one meaning (polysemy), can be used as metaphors, or can stand in for something closely related to them (metonymy).

Language is very ambiguous (an expression can have more than one meaning, which can be misleading), and different people do not always assign the same senses. Because of this variance, it is not always possible for the machine to give a very precise solution. In fact, humans themselves have been shown to be about 90% accurate.

Examples of ambiguity in language:

- “Drunk gets 10 years in violin case”

- “The lady hit the man with an umbrella” 

- “Green” and “Green”

- “Chair” and “Chair”

Possible solutions:

Knowledge-based approach: Dictionaries and thesauri (like WordNet and the Collins dictionary) can be used to try to narrow down the possibilities. Using these you can find similarities between words and their definitions. It is also possible to find which semantic network the word belongs to. No manually annotated corpus is needed; the text is “raw”.

The Lesk algorithm uses the resources available (dictionary and thesaurus) to find overlaps between the definitions of the words in a context, which allows them to be disambiguated. All of the possible definitions are retrieved for the words, the definition overlaps for all of the possible combinations of senses are computed, and the combination with the highest overlap indicates the correct word senses. The simplified version works on a single word at a time rather than a group of words, comparing each of its definitions with the surrounding context, which reduces the search space.
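To make the simplified version concrete, here is a minimal sketch in Python. The two-sense inventory for “bank”, its glosses and the stopword list are made up for illustration; a real system would pull its definitions from a resource such as WordNet.

```python
import re

# Hypothetical mini sense inventory; the glosses are written for illustration only.
SENSES = {
    "bank": {
        "bank_finance": "an institution that accepts deposits of money and lends money to customers",
        "bank_river": "sloping land beside a body of water such as a river or lake",
    }
}

# Small illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "and", "as", "at", "by", "he", "her", "his", "in",
             "of", "on", "or", "she", "such", "that", "the", "to"}


def tokens(text):
    """Lowercase a string and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(word, context, inventory=SENSES):
    """Return the sense whose gloss shares the most words with the context."""
    context_words = tokens(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in inventory[word].items():
        overlap = len(tokens(gloss) & context_words)
        # Strict comparison: on a tie, the sense listed first is kept.
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "She sat on the bank of the river and watched the water"))
print(simplified_lesk("bank", "He went to the bank to deposit money"))
```

Note that the overlap is exact word matching, so “deposit” in the second sentence does not match “deposits” in the gloss; only “money” counts. Stemming or lemmatising would be an obvious refinement.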

Deep approach: A vast amount of information is fed to the machine, which actually seeks to derive meaning from the body of knowledge it is given. It’s not very practical because of all the data that needs to be gathered and processed first. It demands very sophisticated artificial intelligence techniques to work well, and we have not yet perfected (or in some cases invented) these.

Shallow approach: This method bypasses the whole idea of getting the machine to understand the text. Instead it uses natural language processing techniques like n-grams (word groupings), frequency counts, conditional probabilities and other techniques which basically attach extra information to the words provided. A number of texts are prepared by a human who tags everything up correctly. This information is then given to the machine, which identifies patterns and uses those to derive meaning. This is achieved using machine learning techniques such as decision trees or naive Bayes classifiers.
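As a rough illustration of the shallow approach, the sketch below trains a naive Bayes classifier on a handful of hand-tagged sentences. The training sentences, the sense labels “fish” and “music”, and the function names are all invented for this example; a real system would need far more tagged data and richer features than single words.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-tagged examples for the ambiguous word "bass".
TRAINING = [
    ("he caught a huge bass in the lake", "fish"),
    ("the bass swam under the boat", "fish"),
    ("grilled bass with lemon for dinner", "fish"),
    ("she plays bass in a jazz band", "music"),
    ("turn up the bass on the speakers", "music"),
    ("the bass guitar player joined the band", "music"),
]


def train(examples):
    """Count how often each context word appears with each sense."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for sentence, sense in examples:
        sense_counts[sense] += 1
        for word in sentence.split():
            word_counts[sense][word] += 1
            vocabulary.add(word)
    return sense_counts, word_counts, vocabulary


def classify(sentence, model):
    """Pick the sense with the highest log-probability under naive Bayes."""
    sense_counts, word_counts, vocabulary = model
    total = sum(sense_counts.values())
    scores = {}
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log prior P(sense)
        denominator = sum(word_counts[sense].values()) + len(vocabulary)
        for word in sentence.split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[sense][word] + 1) / denominator)
        scores[sense] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("the band needs a bass player", model))
print(classify("a bass swam in the lake", model))
```

The frequency counts and conditional probabilities mentioned above are exactly what `train` and `classify` compute: how often each word co-occurs with each sense, turned into a smoothed probability.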

“Bootstrapping” is a method where the machine is given a small amount of tagged-up data and then a large amount of raw data. It is also equipped with classifiers which proceed to improve on the original classification by finding patterns in the raw data.
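One simple way to picture bootstrapping is as a self-training loop, sketched below. This is an illustrative toy rather than any particular published algorithm: the seed sentences, the raw sentences, the stopword list and the word-voting “classifier” are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical data: two hand-tagged seeds and a pool of untagged contexts for "plant".
SEEDS = [
    ("the plant needs water and sunlight", "living"),
    ("workers at the car plant went on strike", "factory"),
]
RAW = [
    "water the plant every morning",
    "the plant closed and workers lost jobs",
    "the plant grows new leaves every morning",
    "jobs were lost when the plant shut down",
    "the plant is over there",
]
STOPWORDS = {"a", "and", "at", "on", "the", "to", "were", "when"}


def features(sentence, target):
    """Context words of a sentence, minus stopwords and the target word itself."""
    return {w for w in sentence.lower().split() if w not in STOPWORDS and w != target}


def bootstrap(seeds, raw, target, max_rounds=10):
    """Grow the tagged set by repeatedly tagging the raw sentences we are sure about."""
    labelled = list(seeds)
    unlabelled = list(raw)
    for _ in range(max_rounds):
        # Re-learn which context words go with which sense.
        votes = defaultdict(Counter)
        for sentence, sense in labelled:
            for word in features(sentence, target):
                votes[word][sense] += 1
        newly_labelled, still_unlabelled = [], []
        for sentence in unlabelled:
            tally = Counter()
            for word in features(sentence, target):
                tally.update(votes.get(word, {}))
            ranked = tally.most_common(2)
            # Accept only clear decisions: some evidence and no tie.
            if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
                newly_labelled.append((sentence, ranked[0][0]))
            else:
                still_unlabelled.append(sentence)
        if not newly_labelled:
            break
        labelled.extend(newly_labelled)
        unlabelled = still_unlabelled
    return labelled[len(seeds):], unlabelled


tagged, leftover = bootstrap(SEEDS, RAW, "plant")
for sentence, sense in tagged:
    print(sense, "<-", sentence)
print("still untagged:", leftover)
```

In the first round only the sentences sharing a word with a seed get tagged. Those new tags then supply extra clue words (“morning”, “jobs”) that let the second round tag sentences the seeds alone could not reach, which is the pattern-finding improvement described above.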

The problem with methods requiring an initial tagged-up corpus is that such corpora are not readily available. They are expensive and hugely time-consuming to create. The work has to be repeated for every language as well, which is really not efficient. This method does work very well, but in order to work across the board, in every context, you would need millions of tagged-up words.

(Part-of-speech tagging is different to WSD because it doesn’t tag words with senses but rather with grammatical classes.)

WSD on the web:

The data on the web ranges from websites to journals to blogs and many other types of document structure. This makes the whole corpus of the web an “unstructured” one. Traditionally search engines use lexico-syntactic analysis, which is not deep enough to actually determine meaning in context. It can’t deal sufficiently well with the range of ambiguity in language.

In search engines, for example, precision is reduced because queries are typically sparse (they don’t contain enough information).

Systems, both supervised and knowledge-based, have performed with high precision but low recall (they are usually right about the words they tag, but leave many untagged), even when very fine-grained sense distinctions were present. The question is really how many words in a text a search engine needs to disambiguate to determine what it’s about. Clearly queries need to be disambiguated and extended because they are sometimes so sparse, but how much of an entire webpage or site? Is disambiguating less actually disambiguating more?
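For readers unfamiliar with the two measures, here is a small worked example. In WSD evaluation, precision is usually the share of attempted words that were tagged correctly, and recall is the share of all words in the test set that were tagged correctly. The data and sense labels below are invented for illustration.

```python
def precision_recall(gold, predicted):
    """gold maps every word id to its true sense; predicted covers attempted words only."""
    correct = sum(1 for key, sense in predicted.items() if gold.get(key) == sense)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical test set of 10 words; the system attempts only 4 and gets 3 right.
gold = {i: "s1" for i in range(10)}
predicted = {0: "s1", 1: "s1", 2: "s1", 3: "s2"}

precision, recall = precision_recall(gold, predicted)
print(precision, recall)  # 0.75 0.3
```

A cautious system like this one scores well on precision simply by skipping the words it is unsure about, which is why the two numbers need to be read together.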
Let’s see how much of a difference the semantic web could make:

To get around the need for a huge corpus, researchers are increasingly using the web, and as more and more tagging appears they can make use of it to help resolve the problem of WSD. Ontologies available in OWL, human tagging, RDF and many other forms are incredibly useful for computers. Ontologies are very useful because as more and more are made, they form a huge body of knowledge represented in a format that a machine can read. The cool thing about them is that you can just plug them all in, and they can extend each other as well.
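The “plug them all in” idea can be pictured with plain subject-predicate-object triples, which is the shape RDF data takes. The sketch below uses ordinary Python sets instead of a real RDF library, and the three tiny ontologies and the predicate names `is_a` and `has_sense` are hypothetical.

```python
# Three hypothetical mini ontologies as sets of (subject, predicate, object) triples.
ANIMALS = {
    ("jaguar_animal", "is_a", "big_cat"),
    ("big_cat", "is_a", "mammal"),
}
VEHICLES = {
    ("jaguar_car", "is_a", "car"),
    ("car", "is_a", "vehicle"),
}
LEXICON = {
    ("jaguar", "has_sense", "jaguar_animal"),
    ("jaguar", "has_sense", "jaguar_car"),
}


def ancestors(node, triples):
    """Follow is_a links upwards and collect every broader concept."""
    found = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == "is_a" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


# Merging ontologies is just a set union of their triples.
merged = ANIMALS | VEHICLES | LEXICON

senses = {o for s, p, o in merged if s == "jaguar" and p == "has_sense"}
for sense in sorted(senses):
    print(sense, "->", sorted(ancestors(sense, merged)))
```

Neither the lexicon nor the two domain ontologies can say on their own that “jaguar” may be a mammal or a vehicle; once merged, that knowledge falls out of a simple traversal.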

Why should you care?

When website copy is written for SEO purposes, it helps to understand how the search engines figure out what you are writing about, based on the words, word groups and the distinctions between these. It also helps to understand how keywords and keyword phrases could be interpreted to discover query intent. Using historical user queries helps narrow this down. Finding out what the possible historical queries could be in relation to a single query is far more thorough than looking at volumes of searches.

For computer scientists the use of such technology is valuable in a great many ways. It is possible that the techniques being developed for words could be adapted to other areas of science, and not only in computing.

Some further resources: 

Learning extraction patterns using WordNet (Mark Stevenson and Mark A. Greenwood, Sheffield University)

WordNet::Similarity (Ted Pedersen)

Word sense ambiguation: clustering related senses (William Dolan, Microsoft Research)

Word Sense Disambiguation and Information Retrieval (Mark Sanderson, University of Glasgow)

Meaningful clustering of senses helps boost word sense disambiguation performance (Roberto Navigli, University of Rome)

“I don’t believe in word senses” (Adam Kilgarriff, University of Brighton)

Using Wikipedia for Automatic Word Sense Disambiguation (Rada Mihalcea, University of North Texas)



Comments

  1. 1

    Nice write up! I thought you might be interested in how we (at Duck Duck Go) handle “sparse” queries. We help the user disambiguate up front in these cases, e.g. http://duckduckgo.com/?q=jaguar

    Take care,

    Gabriel Weinberg
    Founder & CEO, Duck Duck Go

  2. 2

    It’s a nice article.

    i will use these things on my website named

  3. 3

    Excellent post!

    We would LOVE to republish this on AltSearchEngines.com – with full attribution, of course.

    Charles Knight, editor
    AltSearchEngines.com

    The definitive blog for alternative search engines.

  4. 4

    I like the use of ambiguous to describe language, especially the English language. Use of LSI principles when creating content works not only from the relevancy standpoint but also for user satisfaction. The most challenging part is all the new slang that gets thrown into the mix at a rapid pace. Without context it is all just noise.

  5. 5

    Great post and survey of various approaches to WSD. Do you have a personal preference from among the approaches you’ve listed?

    By the way, I found out about your website from your interview at Marketing Pilgrim, which I really liked.

  6. 6

    Search engines don’t really need to know what words mean, they’re more interested in the statistics of the word than its meaning. How often it occurs in the page, where it occurs in the page, how often it occurs in all pages, etc. Some search engines may do NLP on a very small subset of words, but in general, meaning is mostly useless for indexing & querying.

  7. CJ #
    7

    What you’re talking about is statistical NLP Jacob – it’s still NLP. It uses probability, stochastic methods and statistical methods. You still have to turn the words and patterns into numbers to do anything with them, and so before that you have to assess their properties individually or as a group. “Meaning” is a great example of language being ambiguous. In this context it means “meaning according to the machine” not in the way that a human perceives meaning in words.

  8. CJ #
    8

    Hey Michael, I have had good results using the shallow approach. These days I like to limit the amount of training data I need, but I use bootstrapping too. I think it does depend on what level of granularity you need. Different tools for different jobs. Thank you John! and Charles… go ahead! Thanks all.

  9. 9

    I recently came across your blog and have been reading along. I thought I would leave my first comment. I don’t know what to say except that I have enjoyed reading. Nice blog. I will keep visiting this blog very often.

    Joannah

    http://2gbmemory.net

  10. 10

    Great stuff. Nice to read some well written posts. A long way between them.

  11. 11

    Once again an excellent written post from you. Keep it up!

  12. 12

    Enjoying reading your blog. Hard work always pays off.

  14. CJ #
    14

    Thank you.

  15. CJ #
    15

    Thanks!

  16. 16

    Gotta love the effort you put into this blog :)

  17. 17

    It’s amazing how Artificial Intelligence has developed in the past half-century. Stunning that word interpretation is still to be mastered by machines and success looks unlikely in the coming few years. Gives a great example of how capable the human brain actually is.

  18. 18

    I don’t think the AI technology is that far off. I just bought a Droid with an app called Google Goggles. You can take pictures of books, stores, logos etc and it will actually do a search from that image. This is only the beginning.

  19. 19

    I believe that though the article is quite correct, it does not properly address the issue of WHY search engines should be able to understand the meaning of words.

    There are basically two possibilities here: One is for language identification and one is for improving search due to context.

    On the first one, I recently wrote in my blog (http://www.seo-translator.com) a 4-part series on how search engines identify languages and what to do about it. The recognition of meaning to identify a language is overkill.

    On the other hand, word recognition so as to achieve the semantic web would be indeed a great achievement, *but* would imply recognition of not just individual words but complete sentences and grasping the meaning of such sentences. A word outside context may have literally thousands of links in many different fields…

    Unfortunately, though the first one is reality, the second will remain for many years a problem that escapes current technology. When solved, however, it will be a revolution that will change the Internet forever.

  20. CJ #
    20

    Thanks for your comment. #2 is actually very possible and a large part of my thesis. You would be surprised at what is possible.

  21. 21

    Hello, CJ

    Thanks for the reply. As I also have a degree in computer Sciences, I have had some contact with this topic, though I admit that it was some time ago, so my know-how on this subject might be slightly outdated.

    However, when I looked at it the tendency was not so much about recognizing the meaning -you very correctly pointed out that even today we do not know exactly how the human mind works- but rather about tagging, so as to convey the meaning by means of tags that would provide the means of classification. The greatest problem is obviously that manual tagging is hard work, and is therefore unlikely to be done massively.

    On the other hand, I guess that you could use for example n-grams to identify series of words that are likely to appear in a specific category, and perform a probabilistic tagging using AI.

    Yet I am still skeptical because even today machine translation (which in theory is simpler than language understanding) is still far away from being capable of correctly translating texts, except in a few very specialized domains and even so with some limitations.

    I first tested machine translation software in 1986 (on a mainframe!) and was still able to translate a) quicker than the machine (writing by hand) and b) far more accurately. There has been improvement, of course, but far slower than expected.

    That is the reason why I do not think that the semantic web will be mainstream on search machines before, say, 10 years. The only exception could be perhaps Google, which already uses n-grams for language recognition.

    In any case, it is a fascinating subject… thanks for sharing your experiences.









© 2009-2013 Science for SEO All Rights Reserved
