Today we have quite a wide choice of information retrieval systems: Google, Bing, Kosmix, Duck Duck Go, Cognition and many more. We each have one engine that we use, I’d estimate, 90% of the time, but is it really the best engine? It might be the best for us personally or for a particular task, but is any one engine better than all the others? That is very subjective and not easy to measure. The IR community has been busy with this problem for a while now, so with all of the movement in the search space at the moment I thought it would be relevant to look at how we can determine which IR system is better.
Measures:
Typically we use “Recall” (% of relevant documents that are retrieved) and “Precision” (% of retrieved documents that are relevant) to determine how good a system is. Loosely, recall tells us how much of the information need was covered, and precision tells us how relevant the returned results are.
Precision = P(relevant | retrieved) = |relevant ∩ retrieved| / |retrieved|
Recall = P(retrieved | relevant) = |relevant ∩ retrieved| / |relevant|
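Reading these as set ratios (relevant retrieved documents divided by all retrieved documents for precision, and by all relevant documents for recall), here is a minimal Python sketch of the calculation. The function name and the document ids are made up for illustration, and it assumes we already have a complete set of relevance judgements, which in practice is the hard part.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents judged relevant (assumed to be known)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved

    # Guard against empty sets so we never divide by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 5 documents judged relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Two of the four returned documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).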
These measures are not straightforward though. Relevance, for example, is influenced by context, intention, and the documents previously read or seen, so its fluidity leaves it unmeasurable to a large degree. It is highly subjective, and because precision rests on relevance judgements, precision is very subjective too.
Additionally, I find it interesting that the classical evaluation measure requires the system to retrieve all, and only, the relevant documents. Our information seeking behaviour has changed enough for us to recognise already that related information can sometimes be even more useful than what we originally set out to find. Search is increasingly exploratory, and cognitive research into search intention is helping this area develop.
How far are the evaluations we carry out valid? Putting users in a lab and asking them to use a system for a set amount of time is an unnatural situation for both the user and the system. Using logs requires knowing when the user has judged the search successful, which isn’t as straightforward as it sounds.
Usually recall increases as precision decreases, but not always.
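To see how precision and recall move against each other, we can walk down a ranked result list and compute both at every cutoff. This is only a sketch: the function name, the ranking and the relevance judgements are invented for the example.

```python
def precision_recall_at_k(ranking, relevant):
    """Return a list of (k, precision, recall) for every cutoff k in a ranked list."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("need at least one relevant document to compute recall")

    hits = 0
    curve = []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        # precision: share of the top k that is relevant
        # recall: share of all relevant documents found in the top k
        curve.append((k, hits / k, hits / len(relevant)))
    return curve


# Hypothetical ranked results and relevance judgements for one query.
ranking = ["d2", "d1", "d4", "d3", "d7"]
relevant = {"d2", "d4", "d7"}

for k, p, r in precision_recall_at_k(ranking, relevant):
    print(f"k={k} precision={p:.2f} recall={r:.2f}")
```

In this toy ranking recall never goes down as we read further, while precision drops from 1.00 to 0.50 at the second result and then climbs back to 0.67 at the third, so the trade-off is a tendency rather than a rule.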
The interface:
The search interface is another factor that is sometimes forgotten. How far does it influence the user? It’s possible that a less reliable system is deemed better by the user simply because its interface is more intuitive. A good interface helps you express your information need and understand the results.
Test collections:
We can evaluate systems using test collections, which also need to be evaluated themselves. The TREC conference takes place every year and more collections are made available. The purpose of the conference is to establish good evaluation methods for search systems. Often system performance is collection dependent, meaning that a system might do a lot better on some collections than on others, just as it might do better with certain types of queries than with others. Test collections include TDT, CACM, ISI and more.
Human evaluation:
We each use different engines and have some conviction as to which one works best for us. Human judgement is useful, but it can’t be applied to a large test set (billions of documents) and everyone judges differently, so the comparison is unfair and its accuracy unclear. What about all the documents that aren’t labelled? There are a lot of results I skip before finding the exact right one a page or so down. I found exactly what I was looking for but I had to work for it. Is this successful? I think so, because my information need was answered, but a system that gave me only this result would have been 100% effective.
Most researchers use significance tests. Empirical research says that there should be at least 25 queries of the same type to test whether system A is better than system B for that particular kind of query on that particular subject. TREC tries to go for at least 50 queries. It’s hard to make this consistent, and it requires a lot of time as well. The t-test, the Wilcoxon signed-rank test and the sign test are examples of significance tests; each works on pairs of measurements, one from each system for the same query.
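As a rough illustration of the paired setup, the sketch below takes one score per query for system A and system B and computes an exact two-sided sign test p-value plus the paired t statistic, using only the Python standard library. The function names and the scores are invented, and eight queries is far fewer than the 25 to 50 mentioned above; it is only meant to show the mechanics. If SciPy is installed, `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` cover the t-test and the Wilcoxon signed-rank test, including p-values.

```python
from math import comb, sqrt
from statistics import mean, stdev


def sign_test_p_value(scores_a, scores_b):
    """Exact two-sided sign test on paired scores (ties are dropped)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0  # no query separates the systems
    wins = sum(1 for d in diffs if d > 0)
    k = min(wins, n - wins)
    # Probability of a split at least this lopsided if A and B were equally good.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-query differences (n - 1 degrees of freedom)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))


# Hypothetical per-query scores (for example average precision) for two systems.
system_a = [0.50, 0.60, 0.40, 0.70, 0.55, 0.65, 0.45, 0.60]
system_b = [0.40, 0.55, 0.45, 0.60, 0.50, 0.50, 0.40, 0.60]

print("sign test p-value:", sign_test_p_value(system_a, system_b))
print("paired t statistic:", round(paired_t_statistic(system_a, system_b), 3))
```

Here A beats B on six queries, loses on one and ties on one, which gives a sign test p-value of 0.125. That is not significant at the usual 0.05 level even though A looks better at a glance, and it is exactly why a larger set of queries is needed.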
More info on evaluation:
EvaluatIR.org provides an online framework for evaluation.
Check out CIKM video lectures for loads more about evaluating search systems.




