Microsoft and Information Retrieval

With the advent of Bing and the fact that it works really well, people are starting to take Microsoft seriously in the search domain. Its previous offerings were not as strong as Google’s, and with Google being such a behemoth in search, it was going to take a fair bit to even cause a tremor. Several interesting engines have appeared this year and last. Although some of them were really good, they weren’t able to compete successfully, at least not yet. Then Bing was released and, to be frank, I think it’s very good.

The image search is phenomenal and makes Google’s image search interface look very dated. I was never a fan of the Google interface as a whole, although I understand that minimalism is good. I prefer the polished layout and look and feel of Bing.

Actually, to poke a little fun, it’s like Google is the PC and Bing is the Mac.

Google has a long and illustrious research history and so does Microsoft; in fact, Microsoft’s is longer. Some of the brightest minds in computing work in those labs and they really do deliver. The results on Google and Bing are not dissimilar, but the two engines function differently. One obvious difference is the aggressive stopword removal on Bing (a little too vicious sometimes), and there are others too. Anyway, we all know that the two are designed differently, and the how and what of that is outside the scope of this post. What I want to do is draw attention to some of the SIGIR ’09 papers from Microsoft that I’ve finally got round to reading.

Before I continue, I will stress that the fact that particular methods have been researched by Microsoft does not mean that they were implemented in Bing, so let’s not get ahead of ourselves. These papers are interesting because they show what direction (scientifically) the company is moving in. They’re also interesting if you’re into search. All of the papers are freely available, so you can read them in full if you wish :)

Using Anchor Texts with Their Hyperlink Structure for Web Search

The idea presented here is that engines usually treat all anchor text as equal, whereas the authors suggest that anchor text coming from different sites carries more value, especially when those sites are unrelated to the target. They specifically look at relationships between the source pages of the anchor texts.

There are two models:

Site-independent model: When there are several links from one site with the same anchor text, they are considered duplicates and counted only once.

Site relationship model: Looks at how the source site relates to the target site and assigns a lower weight to anchor text from related sites. Links from unrelated websites are considered more valuable.

“The site-independent model assumes that different hyperlinks coming from the same Web site are identical; while the site relationship model further considers the relationships between Web sites (including the relationship between source site and destination site, and the relationship between different sources sites). The weight assigned to an anchor text is adjusted accordingly. Our experimental results show that these two new models can outperform the baseline model, which uses hyperlinks as if they are independent. In addition, the site relationship model performs the best.”

(They tested this using BM25 btw)
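To make the two models a bit more concrete, here is a minimal Python sketch of how the counting could differ. This is not the paper’s actual formulation: the function names, the `related_sites` set and the `related_weight` of 0.5 are all assumptions I made for illustration, and the sketch ignores relationships between different source sites.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_of(url):
    """Crude stand-in for 'site': the host name of the URL."""
    return urlparse(url).netloc.lower()


def baseline_counts(links):
    """Baseline: every (source_url, anchor_text) link counts once."""
    counts = defaultdict(float)
    for _source_url, anchor in links:
        counts[anchor.lower()] += 1.0
    return dict(counts)


def site_independent_counts(links):
    """The same anchor text repeated from one site is counted only once."""
    seen = set()
    counts = defaultdict(float)
    for source_url, anchor in links:
        key = (site_of(source_url), anchor.lower())
        if key not in seen:
            seen.add(key)
            counts[anchor.lower()] += 1.0
    return dict(counts)


def site_relationship_counts(links, related_sites, related_weight=0.5):
    """Deduplicate per site, then down-weight anchors from related sites.

    related_sites and related_weight are hypothetical inputs; the paper
    derives its weights from the data rather than from a fixed constant.
    """
    seen = set()
    counts = defaultdict(float)
    for source_url, anchor in links:
        site = site_of(source_url)
        key = (site, anchor.lower())
        if key in seen:
            continue
        seen.add(key)
        counts[anchor.lower()] += related_weight if site in related_sites else 1.0
    return dict(counts)


# Made-up links pointing at one target page.
links = [
    ("http://a.example/page1", "cheap tea"),
    ("http://a.example/page2", "cheap tea"),
    ("http://b.example/", "cheap tea"),
    ("http://c.example/", "import tax"),
]
```

With these toy links, the baseline counts "cheap tea" three times, the site-independent version counts it twice (a.example only gets one vote), and the site relationship version gives it less again if a.example is flagged as related to the target.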

Named Entity Recognition in Query

This area of research is valuable because identifying named entities in queries allows for more accurate query disambiguation, which in turn means that user intent can be better identified. Named Entity Recognition in Query (NERQ) is a hard task because queries are usually short and the same entities do not necessarily occur often in queries.

“In this paper, we propose a new probabilistic approach to NERQ using query log data. Without loss of generality, a query having one named entity is represented as a triple (e, t, c), where e denotes named entity, t context of e, and c class of e. Note that t can be empty (i.e. no context), e.g. “harry potter”. Then the goal of NERQ here becomes to find the triple (e, t, c) for a given query q, which has the largest joint probability Pr(e, t, c). The joint probability is factorized and then estimated by using query log and LDA.”

“Classes of named entities can be, for instance, “Book”, “Movie”, “Game”, and “Music”. Given query “harry potter walkthrough”, we detect “harry potter” as a named entity and assign “Game” to it as the most likely class, “Movie” and “Book” as less likely classes, and “Music” as unlikely class. This is because the context “walkthrough” strongly indicates that “harry potter” here is more likely to mean the Harry Potter game. (If the query is only “harry potter”, then “Book” and “Movie” will be more plausible.)”
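Here is a small Python sketch of the “find the triple with the largest joint probability” idea. The quote only says that the joint probability is factorized, so I am assuming one plausible factorization, Pr(e, t, c) = Pr(e) · Pr(c | e) · Pr(t | c). The probability tables are hand-made numbers rather than estimates from a query log and LDA, and writing the context with a “#” placeholder for the entity is just a convention chosen for this example.

```python
def candidate_splits(query):
    """Yield (entity, context) pairs for a query.

    Every contiguous run of words is a candidate entity; the context is
    the query with that run replaced by the placeholder '#'.
    """
    words = query.split()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            entity = " ".join(words[i:j])
            context = " ".join(words[:i] + ["#"] + words[j:])
            yield entity, context


def best_triple(query, p_entity, p_class_given_entity, p_context_given_class):
    """Return the (entity, context, class) triple with the highest score.

    Score = Pr(e) * Pr(c | e) * Pr(t | c), an assumed factorization.
    """
    best, best_score = None, 0.0
    for entity, context in candidate_splits(query):
        if entity not in p_entity:
            continue  # not a known entity, skip this split
        for cls, p_c in p_class_given_entity[entity].items():
            p_t = p_context_given_class.get(cls, {}).get(context, 0.0)
            score = p_entity[entity] * p_c * p_t
            if score > best_score:
                best, best_score = (entity, context, cls), score
    return best, best_score


# Hand-made toy probabilities (hypothetical, for illustration only).
p_entity = {"harry potter": 0.01}
p_class_given_entity = {
    "harry potter": {"Book": 0.45, "Movie": 0.35, "Game": 0.15, "Music": 0.05},
}
p_context_given_class = {
    "Book": {"#": 0.3, "# walkthrough": 0.001},
    "Movie": {"#": 0.3, "# walkthrough": 0.001},
    "Game": {"#": 0.05, "# walkthrough": 0.2},
    "Music": {"#": 0.1},
}

with_context, _ = best_triple(
    "harry potter walkthrough", p_entity, p_class_given_entity, p_context_given_class
)
without_context, _ = best_triple(
    "harry potter", p_entity, p_class_given_entity, p_context_given_class
)
```

With these numbers the context “# walkthrough” is much more likely under “Game” than under the other classes, so it outweighs the lower prior for “Game”. Without any context, the class prior for the entity decides.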

Machine Learning in IR: Recent Successes and New Opportunities

This is a set of tutorial slides which I found really interesting and easy to digest. It’s the Microsoft “This is what we’ve been doing, this was great, this was rubbish, we’re gonna do more of this and that”. Good stuff, and I urge you to read it. It was published on 17 June 2009.

One of the slides sums it all up nicely:

“IR Increasingly Relies on ML
• General shift from heuristics to formal probabilistic models.
• More recent shift to discriminative models where previous models serve as input features.
• Salient computational features:
– Massive amounts of documents.
– Nearly infinite variety in expressing an information need.
– Huge amount of user-generated data.”
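The second bullet, where older models become input features for a discriminative model, can be sketched roughly like this in Python. The choice of logistic regression, the two feature names (a BM25 score and an anchor text score) and all of the numbers are my own assumptions for illustration; the slide does not prescribe any of them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability that a document with feature vector x is relevant."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(examples, labels, lr=0.1, epochs=500):
    """Plain stochastic gradient descent on the logistic loss."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            error = predict(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Each row is [bm25_score, anchor_text_score] for one query-document pair.
# The scores and relevance labels are made up.
examples = [[2.1, 0.9], [1.8, 0.2], [0.4, 0.8], [0.3, 0.1], [1.9, 0.7], [0.2, 0.3]]
labels = [1, 1, 0, 0, 1, 0]

weights, bias = train_logistic(examples, labels)

# Scores for two unseen documents that differ only in their BM25 score.
high_bm25 = predict(weights, bias, [2.0, 0.5])
low_bm25 = predict(weights, bias, [0.3, 0.5])
```

The point is that the ranking function is no longer hand-tuned: the scores from the older models go in as features and the learner decides how much to trust each one.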

And so I conclude….

By reading about the research that Microsoft is doing in IR and by finding out about the sorts of things it finds interesting, we can start to understand what its values are, where its goals lie and what it is looking to achieve. This may not give us insight into how to get a site to rank #1 in Bing, but I think that it does something more important than that for us. It educates, enlightens and ultimately opens our eyes to what is up ahead. We can never be sure of how one of these big engines works, but by understanding the science, we can begin to join some dots, which lets us think about the web and how our sites are structured far more than if we read a post telling us what to do to rank.

8 Comments

  1.

    At the Bing launch, there were several things I noted that they are taking advantage of for improved IR. There was also eye candy (mouse hover instant video playback, no click!). Still, I think that with machine learning, and tweaking to better handle being a target after critical mass (assuming they get there), they will have a dangerous search engine. I love that!

  2.

    What I find most interesting about this is the “Site relationship model”. I am having a hard time understanding why a site that isn’t related to a target site would carry more relevant anchor text. It goes by simple logic that a site that is relevant to the target site would “understand” said topic better and would thus apply better quality anchor text.

    Are they assuming that a site that is less relevant would have better anchor text because they would be introducing general concepts to their audience that might not otherwise be familiar?

    Anyways, excellent post!

  3. CJ

    I think it’s exactly what you said. It’s important to think laterally about problems in computing. I’m always surprised how many times common sense lets me down! The solution is usually more subtle. I guess that if a site about Tea links to your blog post about Import tax, it would be related thematically but not from a vocab point of view so there’s that to consider as well.

  4. CJ

    I love it too! It shakes things up.

  5.

    >The image search is phenomenal

    I agree – I’ve always thought the UX of the image search from “Live” had been better. I do remember an interesting talk from Nick Craswell of MSFT’s research group in 2007 regarding it but seem to have lost the slides.

    Guy

  6.

    >but seem to have lost the slides

    I ‘Binged’ (???) for them and have possibly found them, though PowerPoint won’t let me view them.
    http://bit.ly/qXM7L

  7.

    Try this one:
    http://bit.ly/kEuZf

  8. CJ

    Thank you matey!


© 2009-2013 Science for SEO. All Rights Reserved.