The Lemur toolkit is a natural language processing and information retrieval toolkit. Having a go on this is a nice way of seeing some IR technologies functioning first hand, rather than guessing on a major SE to observe the phenomenon.
It supports all major languages, performs stemming using Porter and Krovetz, indexes loads of file formats, uses part-of-speech tagging and named entity recognition, and has an API of course (C++, C# and Java).
- Supports major language modeling approaches such as Indri and KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
- Relevance- and pseudo-relevance feedback
- Wildcard term expansion (using Indri)
- Passage and XML element retrieval
- Cross-lingual retrieval
- Smoothing via Dirichlet priors and Markov chains
- Supports arbitrary document priors (e.g., Page Rank, URL depth)
Best of all, it’s free! There’s a set of tutorials to get you started here.