With the help of some very cool tweeters (@dpn and @Mindicott), I found some interesting facts about LSI and SEO.
For a simple idea of what LSI/LSA is, please read the Wikipedia entry
on it. The original paper is here
We will look at Susan Dumais here because she’s still actively publishing.
Unsurprisingly, all her recent research is in HCI and personalisation, just like Google, and Microsoft, and… well, everyone:
“The Web changes everything: Understanding the dynamics of Web content”. (WSDM 2009)
“The Influence of Caption Features on Clickthrough Patterns in Web Search” (SIGIR 08)
“To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent” (SIGIR 08)
“Supporting searchers in searching”. (ACL keynote 08)
“Large scale analysis of Web revisitation patterns” (CHI 08)
“Here or There: Preference judgments for relevance”. (ECIR 08)
“The potential value of personalizing search”. (SIGIR 07)
“Information Retrieval In Context” (IUI 07)
Hmm… no LSI here.
LSI papers since its introduction:
“Adaptive Label-Driven Scaling for Latent Semantic Indexing” – Quan/Chen/Luo/Xiong (USTC/Rutgers) => exploiting category labels to extend LSI (SIGIR 08)
“Model-Averaged Latent Semantic Indexing”- Efron => Extended with Akaike information criterion (SIGIR 07)
“MultiLabel Informed Latent Semantic Indexing”- Yu/Tresp => using the multi-label informed latent semantic indexing (MLSI) algorithm (SIGIR 05)
“Polynomial Filtering in Latent Semantic Indexing for Information Retrieval” – Kokiopoulou/Saad => LSI based on polynomial filtering (SIGIR 04)
“Unitary Operators for Fast Latent Semantic Indexing (FLSI)” – Hoenkamp => introduces alternatives to SVD that use far fewer resources, yet preserve the advantages of LSI. (SIGIR 01)
“A Similarity-based Probability Model for Latent Semantic Indexing” – Ding => checks the statistical significance of the semantic dimensions (SIGIR 99)
“Probabilistic Latent Semantic Indexing” – Hofmann => “In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model” – (SIGIR 99)
“A Semidiscrete Matrix Decomposition for Latent Semantic Indexing in Information Retrieval” – Kolda/O’Leary => replacing the truncated-SVD low-rank approximation with a semidiscrete decomposition (ACM 1998)
The initial theory of LSI and its methodology has been extended a great deal over the years. The basic LSI method is still important as a great way to introduce topic detection and the like, but there is a lot more built on top of it.
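To make the basic method concrete, here is a minimal sketch of that classic LSI pipeline: build a term-document matrix, take a truncated SVD, and compare documents in the reduced “topic” space. The toy vocabulary and counts are invented purely for illustration (real systems would at least use tf-idf weighting), and this is the base model discussed above, not what any search engine actually runs.

```python
# Minimal LSI sketch: truncated SVD over a toy term-document matrix.
import numpy as np

# Rows = terms, columns = documents (raw counts, invented for illustration).
terms = ["search", "engine", "ranking", "cat", "dog", "pet"]
docs = np.array([
    [2, 1, 0, 0],  # search
    [1, 2, 0, 0],  # engine
    [1, 1, 0, 0],  # ranking
    [0, 0, 2, 1],  # cat
    [0, 0, 1, 2],  # dog
    [0, 0, 1, 1],  # pet
], dtype=float)

# Truncated SVD: keep k latent "topic" dimensions.
k = 2
U, s, Vt = np.linalg.svd(docs, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # each row: one document in latent space

def cosine(a, b):
    """Cosine similarity between two latent document vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Docs 0 and 1 share "search engine" terms; docs 2 and 3 share "pet" terms,
# so in the latent space the first pair should be similar and the
# cross-topic pair should not be.
print(cosine(doc_vecs[0], doc_vecs[1]))  # close to 1
print(cosine(doc_vecs[0], doc_vecs[2]))  # close to 0
```

The point of the truncation is that documents sharing no literal keywords can still land near each other if their terms co-occur across the collection; the extensions listed above mostly change how this low-rank approximation is computed or weighted.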
There are many more; other related methods include the Generalized Hebbian Algorithm, partial least squares analysis, Latent Dirichlet Allocation…
@Mindicott reports that “SEO” first appeared in Google in 1998; “search engine optimisation + latent semantic indexing” appeared in 2005.
@dnp quite rightly says that “SVD on huge datasets is BS”.
It appears to me that the LSI the SEO community refers to is in fact the base model, which has been extended, changed, and improved quite a bit since 1988. This is to be expected, so when you say “Oh, I’m using LSI”, you would be asked which method, or whether you’ve extended it yourself, etc.
The current focus on keywords, which is what LSI operates on, isn’t quite right anymore. I’ve seen a lot of recent research (and so have many of you) about semantics, and there is a lot of work on semantic units which are not always keywords anyway.
The questions should be “What multitude of methods is Google using?” and “I wonder which LSI method is being used, though I know it is just one factor in a very, very large system”. Not “How should I optimise my site for LSI?” (to which I’d ask: which type?). I believe Matt Cutts was being very generic when he said Google used LSI.