With the advent of Bing, and the fact that it works really well, people are starting to take Microsoft seriously in the search domain. Its previous offerings were not as strong as Google's, and with Google being such a behemoth in search, it was going to take a fair bit to cause even a tremor. Several interesting engines have appeared over the last couple of years, and although some of them were genuinely good, they weren't able to compete successfully, at least not yet. Then Bing was released, and to be frank, I think it's very good.
The image search is phenomenal and makes Google’s image search interface look very dated. I was never a fan of the Google interface as a whole, although I understand that minimalism is good. I prefer the polished layout and look and feel of Bing.
Actually to poke a little fun, it’s like Google is the PC and Bing is the Mac.
Google has a long and illustrious research history, and so does Microsoft; in fact, Microsoft's is longer. Some of the brightest minds in computing work in those labs, and they really do deliver. The results on Google and Bing are not dissimilar, but the two engines function differently. One obvious difference is the aggressive stopword removal on Bing (a little too vicious sometimes); there are others too. Anyway, we all know that the two are designed differently, and the how and what of that is out of the scope of this post. What I want to do is draw some attention to some of the SIGIR '09 papers from Microsoft that I've finally got round to reading.
Before I continue, I will stress that the fact that particular methods have been researched by Microsoft does not mean they were implemented in Bing, so let's not get ahead of ourselves. These papers are interesting because they show what direction (scientifically) the company is moving in. Also, it's interesting if you're into search. All of the papers are freely available, so you can read them in full if you so wish.
The idea presented here is that engines usually treat all anchor text as equal, whereas this paper suggests that anchor text from unrelated domains should carry more value, and anchor text from related sites less. They specifically look at the relationships between the source pages of the anchor texts.
There are two models:
Site-independent model: when several links from one site carry the same anchor text, they are considered duplicates and counted only once.
Site relationship model: looks at the relationship between the source site and the target site, and assigns a lower weight to a link if the two are related. Links from unrelated websites are considered more valuable.
“The site-independent model assumes that different hyperlinks coming from the same Web site are identical; while the site relationship model further considers the relationships between Web sites (including the relationship between source site and destination site, and the relationship between different sources sites). The weight assigned to an anchor text is adjusted accordingly. Our experimental results show that these two new models can outperform the baseline model, which uses hyperlinks as if they are independent. In addition, the site relationship model performs the best.”
(They tested this using BM25 btw)
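To make the two models concrete, here's a rough sketch in Python of how this kind of anchor-text weighting could work. Everything here (the site names, the 0.3 down-weight for related sites) is invented for illustration; the paper's actual weighting scheme is more involved.

```python
from collections import defaultdict

# Toy in-links to one target page: (source_site, anchor_text) pairs.
# All site names and anchors are made up for illustration.
in_links = [
    ("blog-a.example", "best pizza recipe"),
    ("blog-a.example", "best pizza recipe"),   # duplicate from the same site
    ("blog-a.example", "best pizza recipe"),
    ("forum-b.example", "best pizza recipe"),
    ("news-c.example", "pizza dough guide"),
]

# Sites judged "related" to the target site; their votes count for less.
related_sites = {"blog-a.example"}
RELATED_WEIGHT = 0.3   # assumed down-weight, not a value from the paper

def anchor_weights(links, related):
    """Site-independent step: collapse duplicate (site, anchor) pairs so one
    site can only 'vote' once per anchor text.
    Site-relationship step: down-weight anchors coming from related sites."""
    seen = set()
    weights = defaultdict(float)
    for site, anchor in links:
        if (site, anchor) in seen:      # same site, same text -> count once
            continue
        seen.add((site, anchor))
        weights[anchor] += RELATED_WEIGHT if site in related else 1.0
    return dict(weights)

print(anchor_weights(in_links, related_sites))
# e.g. {'best pizza recipe': 1.3, 'pizza dough guide': 1.0}
```

The resulting weights would then feed into the retrieval function in place of raw anchor counts (the authors evaluated against a BM25 baseline, as noted above).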
This area of research is valuable because identifying named entities in queries allows for more accurate query disambiguation, which in turn means that user intent can be better identified. NERQ is a hard task because queries are usually short and the same entities do not necessarily occur frequently in queries.
“In this paper, we propose a new probabilistic approach to NERQ using query log data. Without loss of generality, a query having one named entity is represented as a triple(e, t, c), where e denotes named entity, t context of e, and c class of e. Note that t can be empty (i.e. no context), e.g. “harry potter”. Then the goal of NERQ here becomes to find the triple (e, t, c) for a given query q, which has the largest joint probability Pr(e, t, c). The joint probability is factorized and then estimated by using query log and LDA.”
“Classes of named entities can be, for instance,“Book”, “Movie”, “Game”, and “Music”. Given query “harry potter walkthrough”, we detect “harry potter” as a named entity and assign “Game” to it as the most likely class, “Movie” and “Book” as less likely classes, and “Music” as unlikely class. This is because the context “walkthrough” strongly indicates that “harry potter” here is more likely to mean the Harry Potter game. (If the query is only “harry potter”, then “Book” and “Movie” will be more plausible.)”
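As a toy illustration of the Harry Potter example above, here's a minimal Python sketch that scores every class for a query's (entity, context) split and returns the most likely triple. I've assumed one natural factorization, Pr(e, t, c) = Pr(e) · Pr(c|e) · Pr(t|c); the probability tables below are completely made up, whereas in the paper they are estimated from query logs and LDA.

```python
# Toy NERQ: pick the (entity, context, class) triple for a query that
# maximizes Pr(e, t, c), factorized here as Pr(e) * Pr(c|e) * Pr(t|c).
# All probability values are invented for illustration, not from the paper.

p_entity = {"harry potter": 1.0}                       # Pr(e)
p_class_given_entity = {                               # Pr(c|e)
    "harry potter": {"Book": 0.4, "Movie": 0.35, "Game": 0.2, "Music": 0.05},
}
p_context_given_class = {                              # Pr(t|c), "" = no context
    "Book":  {"": 0.6, "walkthrough": 0.0},
    "Movie": {"": 0.6, "walkthrough": 0.0},
    "Game":  {"": 0.3, "walkthrough": 0.5},
    "Music": {"": 0.4, "walkthrough": 0.0},
}

def nerq(query):
    """Try each known entity found in the query, treat the remaining words
    as context t, and return the (e, t, c) triple with the largest joint
    probability."""
    best, best_p = None, 0.0
    for e in p_entity:
        if e not in query:
            continue
        t = query.replace(e, "").strip()   # words left over = context
        for c, p_ce in p_class_given_entity[e].items():
            p = p_entity[e] * p_ce * p_context_given_class[c].get(t, 0.0)
            if p > best_p:
                best, best_p = (e, t, c), p
    return best

print(nerq("harry potter walkthrough"))  # -> ('harry potter', 'walkthrough', 'Game')
print(nerq("harry potter"))              # -> ('harry potter', '', 'Book')
```

The context "walkthrough" drags the class towards "Game" even though "Book" is the more likely class for the bare entity, which is exactly the behaviour described in the quote.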
This is a series of slides in the shape of a tutorial, which I found really interesting and easy to digest. This is Microsoft's "this is what we've been doing, this was great, this was rubbish, we're gonna do more of this and that". Good stuff; I urge you to read it. It was published on the 17th of June '09.
One of the slides sums it all up nicely:
“IR Increasingly Relies on ML
• General shift from heuristics to formal probabilistic models.
• More recent shift to discriminative models where previous models serve as input features.
• Salient computational features:
–Massive amounts of documents.
–Nearly infinite variety in expressing an information need.
–Huge amount of user-generated data.”
And so I conclude….
By reading about the research that Microsoft is doing in IR, and by finding out about the sorts of things they find interesting, we can start to understand what their values are, where their goals lie, and what sort of thing they are looking to achieve. This may not give us insight into how to get a site to rank #1 in Bing, but I think it does something more important for us than that. It educates, enlightens, and ultimately opens our eyes to what is up ahead. We can never be sure how one of these big engines works, but by understanding the science, we can begin to join some dots, which lets us think about the web and how our sites are structured far more than if we read a post telling us what to do to rank.