Why PLSI is prefered over LSI

Here is quick explanation of why “Latent semantic indexing” LSI is unlikely to be used as is in modern search engines like Google.  It’s an old technique that has limitations, and solutions to these have been developed over the years since its introduction to IR research.

First, very quickly, a no-maths definition of LSI as a refresher:
“Semantic” = meaning
“Latent” = present but hidden
LSI = The analysis of the hidden meaning of words and how often they occur in a document.
Its advantage over simple keyword analysis (Boolean search = True or False) is that it can infer meaning from words which is not evident, and match words which would not normally happen with other methods.  For example, “computer”, “PC”, “laptop” are all connected.  Documents are put together even if it is not obvious that they are connected, because a “latent semantic space” is created.  
It uses vectors in a a high-dimensional vector space (lots of them).  It creates a term-document matrix from all the documents.  Then 3 matrices are created using SVD (singular value decomposition) (also the second matrix houses the singular values of the original matrix in a diagonal matrix).  This means that sets of terms or documents can be represented as d-dimensional vectors. Using the cosine of the angle between these vectors, there is  now an easy-to-calculate similarity measure between any two sets of terms and/or documents.     It can be used in any language because of the way that it’s constructed.
But…there are big problems with LSI:
  1. The resulting dimensions can be very difficult to interpret so there are mistakes.  It’s unclear what the resulting similarities between terms really means.  
  2. The input is a bag-of-words so we don’t have any text structure information.
  3. A compound term (bull-headed) is treated as 2 terms.
  4. Ambiguous terms create noise in the vector space
  5. There’s no way to define the optimal dimensionality of the vector space
  6. There’s a time complexity for SVD in dynamic collections
So instead we use Probabilistic latent semantic indexing (PLSI).
It is better:
  1. It has a more robust statistical foundation and provides a proper generative data model
  2. It uses the EM algorithm (Expectation maximization to avoid over-fitting (nodes too specific to noise)) – this makes it far more flexible
  3. It can deal with domain specific synonymy and polysemous words
LSI is different to PLSI:
  • It analyses co-occurrences of words and documents
  • It gives us what terms give us a topic (term by topic matrix – topic is also called an “aspect”), and which topics are in which documents term by document matrix)
  • It uses aspects to create queries
  • LSI uses the Gaussian model (normal distribution) that can generate negative values and you can’t have a negative number of words in a document – PLSI uses a multi-nominal model (probability distribution) which works better.
In short:
  • The order of the words is lost (but results are still good due to word co-occurrence)
  • Documents can be represented by numeric vectors in a space of words
  • It’s an unsupervised method (no human involvement necessary)
  • It retrieves topics
  • It uses aspects of LSI 
  • Each query uses the cosine similarity metric to find the similarity between vectors.
Thomas Hoffman invented it, you can read his SIGIR paper here mini rdf Why PLSI is prefered over LSIif you want more information.
It is currently used not only in standard IR but in collaborative filteringmini rdf Why PLSI is prefered over LSI for example and blog searchmini rdf Why PLSI is prefered over LSI.
LSI is still a fundamental standard techniques used, but PLSI is being preferred for obvious reasons.  Saying that a search engine is using LSI is like saying that a car is using petrol/gas.  It’s not environmentally friendly and it’s expensive.  LSI also has drawbacks and newer techniques are being used to counter these.

Post to Twitter Tweet This Postmini rdf Why PLSI is prefered over LSI

Related Posts:


1 Trackbacks/Pingbacks

  1. More LSI amusement | Science for SEO 12 04 09

Your Comment






© 2009-2010 Science for SEO All Rights Reserved -- Copyright notice by Blog Copyright

SEO Powered by Platinum SEO from Techblissonline

Twitter links powered by Tweet This v1.6.1, a WordPress plugin for Twitter.