Since we talked about long-tail queries earlier in the week, I was inspired to look a little more at query analysis and how search engines could deal with the troublesome long-tail ones.
You know how every time you see a pair of shoes you really like, it’s always the same designer? Well, that’s like me and Bruce Croft. Every time I read a paper that I find really exciting, it’s him! This one is called “Analysis of Long Queries in a Large Scale Search Log” and it’s by Bendersky and Croft (UMass).
They looked at a huge amount of data from an MSN Search query log excerpt: 15 million queries and their associated clicks, sampled over a period of one month. That is a substantial amount of data.
They already knew from previous work that long natural language queries are far more information rich, allowing the user to express complex requests more easily. The thing is, as we know from lots of past research, search engines don’t do very well with those. Bendersky and Croft wanted to look at what other ways of processing long queries there were and how search engines might be able to deal with them better.
The reason search engines are not good at retrieval on long queries is that they don’t have enough natural language parsing capability, they can’t pick out the key concepts from the complementary ones, and they struggle with term redundancy.
They defined short queries as those with fewer than 4 terms and long queries as those with more than 5 but fewer than 12 terms. Queries of more than 12 terms were full of noise, bot-created queries for example. 90.3% of the queries in the data were short, which is in line with all the other work done in this area.
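To make the bucketing concrete, here’s a minimal sketch (my own illustration, not the authors’ code) that splits a query log by term count using plain whitespace tokenization and the thresholds above:

```python
def bucket_by_length(queries, short_max=4, long_max=12):
    """Split raw query strings into short, long and noise buckets.

    Uses naive whitespace tokenization and illustrative thresholds;
    the paper's exact preprocessing isn't reproduced here.
    """
    short, long_, noise = [], [], []
    for q in queries:
        n_terms = len(q.split())
        if n_terms <= short_max:
            short.append(q)
        elif n_terms <= long_max:
            long_.append(q)
        else:
            noise.append(q)  # e.g. bot-generated strings
    return short, long_, noise

buckets = bucket_by_length([
    "persian rugs",
    "persian rug dealers in austin texas",
    "who was the first person to walk on the moon and when did it happen exactly please",
])
print([len(b) for b in buckets])  # [1, 1, 1]
```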
They looked at the long ones specifically and classified them into categories:
Questions (QE)
Operators (OP): and, or, not
Composite (CO): compositions of sub-queries, e.g. “Persian rug dealers in austin texas”
Non-composite noun phrases (NC_NO), e.g. “Temple of the full-moon”
Non-composite verb phrases (NC_VE), e.g. “detecting a leak in the pool”
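Just to illustrate what such a categorisation might look like in code, here’s a crude heuristic classifier; the rules are my own guesses for illustration and nowhere near the paper’s actual methodology:

```python
def classify_long_query(query: str) -> str:
    """Very rough heuristic classifier for the query types above.

    The paper's categorisation was done properly on the log data; this
    is only a sketch of the idea, with made-up rules of thumb.
    """
    terms = query.lower().split()
    question_words = {"who", "what", "when", "where", "why", "how",
                      "is", "are", "can", "does"}
    operators = {"and", "or", "not"}
    if terms[0] in question_words or query.strip().endswith("?"):
        return "QE"      # question
    if any(t in operators for t in terms):
        return "OP"      # operator query
    if terms[0].endswith("ing"):
        return "NC_VE"   # non-composite verb phrase
    if " in " in f" {query.lower()} " or "," in query:
        return "CO"      # composite: sub-queries glued together
    return "NC_NO"       # non-composite noun phrase

print(classify_long_query("persian rug dealers in austin texas"))  # CO
print(classify_long_query("detecting a leak in the pool"))         # NC_VE
print(classify_long_query("temple of the full moon"))              # NC_NO
```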
They took into consideration the mean (meanRR) and maximum (maxRR) reciprocal ranks of the clicks for a query instance, and also the mean click position for the query.
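In code, my reading of those click metrics looks roughly like this (a sketch based on the standard definition of reciprocal rank, not the paper’s exact formulation):

```python
def click_metrics(click_positions):
    """meanRR, maxRR and mean click position for one query.

    `click_positions` is a list of clicked result ranks (1 = top result)
    pooled over the instances of the query. This is a simplified reading
    of the paper's metrics, not their exact definitions.
    """
    if not click_positions:
        return {"meanRR": 0.0, "maxRR": 0.0, "mean_click_pos": None}
    rrs = [1.0 / pos for pos in click_positions]
    return {
        "meanRR": sum(rrs) / len(rrs),
        "maxRR": max(rrs),
        "mean_click_pos": sum(click_positions) / len(click_positions),
    }

# A query whose clicks landed at ranks 1, 3 and 10:
print(click_metrics([1, 3, 10]))
# meanRR ≈ 0.48, maxRR = 1.0, mean click position ≈ 4.67
```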
They found that their click data showed that, indeed, the longer the query, the less effective the retrieval. They didn’t, however, believe that the decrease in effectiveness was solely due to the query length.
Operator queries (OP) are longer than the composite queries and noun phrases (CO, NC_NO), but their performance in terms of meanRR(q) and maxRR(q) is the same (where q is the query).
“The performance of the questions (QE) and verb phrases (NC_VE) (in terms of maxRR(q)) is worse by, respectively, 6.7% and 4.5% from the average performance of the queries of the same length.”
Abandonment rate is lower for the short (SH) and composite (CO) queries, but it is higher for operator queries (OP) (because it’s either right or wrong). Questions (QE) have a relatively low abandonment rate which means that people are finding something useful. Non-composite queries (NC_NO, NC_VE) have around 60% abandonment rate.
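For reference, abandonment is usually measured as the share of query instances that get no click at all; here’s a tiny sketch under that assumed definition:

```python
def abandonment_rate(clicks_per_instance):
    """Fraction of query instances that received no click at all.

    `clicks_per_instance` holds one click count per instance of the
    query. This follows the usual definition of abandonment; it's an
    assumption here, not a quote from the paper.
    """
    if not clicks_per_instance:
        return 0.0
    abandoned = sum(1 for c in clicks_per_instance if c == 0)
    return abandoned / len(clicks_per_instance)

# Five instances of the same query, two of which received no click:
print(abandonment_rate([3, 0, 1, 0, 2]))  # 0.4
```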
For the longer queries they found that composite queries and noun phrases are more effective than verb phrases and questions.
The authors then looked at a number of techniques that could help search engines handle long queries better:

Query Reduction: eliminate redundancy (the bits that make the query difficult for the engine); there’s a toy sketch of this idea after the list below.
Query Expansion: this one is well known, but making queries longer means the engine finds them even harder to deal with. As the authors note: “the initial retrieval is vital for the success”.
Query Reformulation: terms are substituted by synonyms or more contextual terms, spelling is corrected, and the query terms are “translated” using some form of term association.
Term and Concept Weighting: one approach uses a Poisson query generation model for information retrieval that allows for term-specific smoothing based on collection statistics. The authors also proposed “a supervised method for weighting concepts (determined by noun phrase extraction) in verbose natural language queries. Both methods were shown to be more effective than a standard query-likelihood model for the long queries.”
Query Segmentation: breaking the query up into atomic concepts rather than terms. This means that the techniques above can be applied more effectively.
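To give a feel for what query reduction might look like, here’s the toy sketch mentioned above: it keeps only the highest-IDF terms of a query. It’s a generic illustration using collection statistics, not the specific reduction method the authors evaluated:

```python
import math

def reduce_query(query, doc_freq, n_docs, keep=3):
    """Toy query reduction: keep only the highest-IDF terms.

    `doc_freq` maps a term to its document frequency in the collection.
    This is a generic illustration of the reduction idea, not the
    specific methods evaluated by Bendersky and Croft.
    """
    terms = query.lower().split()
    def idf(term):
        return math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1))
    kept = set(sorted(terms, key=idf, reverse=True)[:keep])
    # Preserve the original word order of the surviving terms
    return " ".join(t for t in terms if t in kept)

doc_freq = {"a": 900, "in": 950, "the": 990,
            "detecting": 40, "leak": 60, "pool": 120}
print(reduce_query("detecting a leak in the pool", doc_freq, n_docs=1000))
# -> "detecting leak pool"
```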
“In the longer term, specific retrieval methods can be developed, targeting the most common types of queries identified in the search logs. Arguably, this would yield better results than the existing “one size fits all” approach to retrieval. For instance, a natural language processing approach should be more suitable for the verb phrases and questions, while noun phrase queries might be better served by a syntax agnostic query segmentation.”
Why should you care?
For an SEO, how the search engines deal with queries is very important. If they started using abstractions of the query and then applied other techniques on top, rather than working with the raw terms, the task of SEO would become far more complex than it currently is. When there is substantial research in a particular area, it’s a good hint that enough people think it’s important enough to work on at length. The search engines do need to improve, and it all starts with the user and the query.