Short Information Retrieval glossary

When you read papers or patents on information retrieval, the authors sometimes assume you already know the concepts they use and don’t explain them.

I have written tutorials on the Search Engine Index, on Search Engine Spiders and on Clustering.  Those should help you out too.

 

Here is a short glossary of terms that show up quite a lot in computing papers about IR (designed to be super simple):

 

Corpus (pl. corpora): a collection of documents, used either for testing or for real use.

Inverted index: an index that has terms as keys, each mapped to the documents in which that term occurs.
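To make the idea concrete, here is a minimal sketch of an inverted index in Python. The function name build_inverted_index, the document ids and the whitespace tokenisation are assumptions made for illustration; real search engines tokenise and store postings far more carefully.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenisation: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical two-document corpus.
docs = {
    "d1": "search engines index documents",
    "d2": "spiders crawl documents",
}
index = build_inverted_index(docs)
print(index["documents"])  # ['d1', 'd2']
```

Looking up a term is then a single dictionary access, which is why this structure is used for retrieval.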

Inverted-vector index: an index that maps terms to document URI objects, which represent the documents in which the terms occur.

Deterministic algorithm: an algorithm that behaves in a predictable way. Given the same input, it always goes through the same sequence of states and produces the same output.

Non-deterministic algorithm: it has one or more points where it can branch to any of several possible next states, and which one it takes is not specified. Its behaviour is therefore unpredictable.

Probabilistic algorithm (also called a randomized algorithm): it makes random choices as it runs, so the result (or the running time) depends on chance.
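A classic example of a probabilistic algorithm is a Monte Carlo estimate of pi. The sketch below is only an illustration: the function name estimate_pi and the seed parameter are assumptions, and the seed is there purely so that a run can be repeated.

```python
import random


def estimate_pi(num_points, seed=0):
    """Estimate pi by throwing random points at the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_points):
        x, y = rng.random(), rng.random()
        # Count the points that land inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the square.
    return 4.0 * inside / num_points


print(estimate_pi(100_000))
```

Changing the seed changes the answer slightly, which is exactly the "depends on chance" part of the definition.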

Complexity: the resources needed to execute a program or an algorithm, usually expressed in terms of the size of the input.  The resource could be time or memory, for example.

Dichotomic search: a search method that chooses between 2 distinct alternatives at each stage.  Binary search is the best-known example.
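Binary search over a sorted list is probably the most familiar dichotomic search: at each stage it chooses between the lower half and the upper half. A minimal sketch, where the function name dichotomic_search is an assumption made for illustration:

```python
def dichotomic_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        # Two alternatives at each stage: keep the upper or the lower half.
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1


print(dichotomic_search([1, 3, 5, 7, 9], 7))  # 3
```

Because half of the remaining items are discarded at every step, the number of steps grows with the logarithm of the list length.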

Directed graph: a graph in which each edge is an ordered pair of vertices, so it can only be followed in one direction, from one vertex to the next. An “edge” is the connection that exists between 2 vertices and a “vertex” (pl. vertices) is an item in the graph.  “In-degree” refers to the number of edges coming into a vertex.
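As a small illustration, a directed graph can be stored as a list of (source, target) pairs, and the in-degree of each vertex counted from it. The vertex names and the function name in_degrees below are made up for the example.

```python
def in_degrees(edges):
    """Count the edges coming into each vertex of a directed graph."""
    degrees = {}
    for source, target in edges:
        # Make sure vertices with no incoming edges still appear, with 0.
        degrees.setdefault(source, 0)
        degrees[target] = degrees.get(target, 0) + 1
    return degrees


# Each pair is ordered: ("A", "B") is an edge from A to B, not from B to A.
edges = [("A", "B"), ("A", "C"), ("B", "C")]
print(in_degrees(edges))  # {'A': 0, 'B': 1, 'C': 2}
```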

Weighted graph: a graph in which each edge has a weight assigned to it.  The weight of a path is the sum of the weights of its edges.  It’s basically a network, which can be directed or undirected.
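Here is a short sketch of how the weight of a path could be computed, assuming a directed weighted graph stored as a dictionary from (source, target) pairs to weights. The names weights and path_weight are hypothetical.

```python
def path_weight(path, weights):
    """Sum the weights of the consecutive edges along a path."""
    return sum(weights[(u, v)] for u, v in zip(path, path[1:]))


# Hypothetical graph with two weighted edges: A -> B and B -> C.
weights = {("A", "B"): 2.0, ("B", "C"): 3.5}
print(path_weight(["A", "B", "C"], weights))  # 5.5
```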

Sparse matrix: a matrix in which most of the entries are zero, so there are hardly any non-zero entries.
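Since most entries are zero, a sparse matrix is usually stored by keeping only the non-zero ones. The sketch below uses a plain dictionary keyed by (row, column), which is one of several possible layouts; the function name to_sparse is an assumption made for illustration.

```python
def to_sparse(dense):
    """Keep only the non-zero entries, keyed by (row, column)."""
    return {
        (r, c): value
        for r, row in enumerate(dense)
        for c, value in enumerate(row)
        if value != 0
    }


dense = [
    [0, 0, 3],
    [0, 0, 0],
    [1, 0, 0],
]
sparse = to_sparse(dense)
print(sparse)  # {(0, 2): 3, (2, 0): 1}
```

Term-document matrices in IR are a typical case: most terms do not occur in most documents.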

Random sampling: you select a sample of data at random to analyse in order to solve a problem involving the entirety of the data.
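For instance, the mean of a large collection can be estimated from a small random sample instead of from every item. This is only a sketch: the function name estimate_mean, the population and the sample size are all made up for illustration.

```python
import random


def estimate_mean(population, sample_size, seed=0):
    """Estimate the population mean from a random sample drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so that the run can be repeated
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size


population = list(range(1, 10001))
estimate = estimate_mean(population, 100)
true_mean = sum(population) / len(population)
print(estimate, true_mean)
```

The estimate will not match the true mean exactly, but it only needs 100 of the 10,000 values.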

Zipf’s law: the frequency of a word is roughly inversely proportional to its rank in the frequency table. A few words occur very often while many others occur rarely.
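One way to see this in practice is to count the words in a text and list them by rank. The sketch below does that on a toy sentence; the function name rank_frequency is an assumption, and a text this short only hints at the pattern that shows up on a real corpus.

```python
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(text.lower().split())
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


text = "the cat and the dog and the bird"
table = rank_frequency(text)
for rank, word, freq in table:
    print(rank, word, freq)
```

Here "the" comes first with 3 occurrences, "and" second with 2, and the remaining words occur once each.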

E-measure: an evaluation metric that combines recall (the fraction of relevant documents that are actually retrieved) and precision (the fraction of retrieved documents that are relevant) into a single score.
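A common formulation, due to van Rijsbergen, is E = 1 - (1 + b²)PR / (b²P + R), where P is precision, R is recall and b sets their relative importance. With this formulation lower values are better. The sketch below assumes that formulation; the function name e_measure and the document ids are made up for illustration.

```python
def e_measure(retrieved, relevant, beta=1.0):
    """E = 1 - (1 + beta^2) * P * R / (beta^2 * P + R); lower is better."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 1.0  # nothing relevant was retrieved: worst possible score
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    b2 = beta ** 2
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return 1.0 - f_score


retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}
# Precision is 2/4 and recall is 2/3 here.
print(e_measure(retrieved, relevant))
```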

Entropy: the term comes from physics, where it’s the “degree of chaos”.  In computing it measures randomness or uncertainty: the higher the entropy, the less pattern or organisation there is in the data.
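In information theory this is usually made precise with Shannon entropy, H = -Σ p·log2(p), where the sum runs over the distinct symbols and p is the probability of each one. A minimal sketch, assuming the probabilities are estimated from the symbol counts in a string; the function name shannon_entropy is an assumption.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy, in bits per symbol, of a sequence of symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(shannon_entropy("abab"))  # 1.0
print(shannon_entropy("abcd"))  # 2.0
```

A string made of a single repeated character has an entropy of zero, since there is no uncertainty about the next symbol.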

Regular expression (regex): a pattern made up of particular symbols (like *, ^, .) that describes a set of strings.  It helps with pattern matching in text documents, for example.  This captures the content of an HTML tag named TAG: <TAG\b[^>]*>(.*?)</TAG>
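Here is how the tag pattern could be tried out with Python’s re module, assuming the pattern is meant to have a word boundary (\b) after the tag name and using the b tag as a concrete stand-in for TAG. The sample HTML string is made up for illustration.

```python
import re

# \b stops <b ...> from also matching tags such as <body>.
pattern = re.compile(r"<b\b[^>]*>(.*?)</b>", re.IGNORECASE | re.DOTALL)

html = '<p>Plain</p> <b class="x">bold one</b> and <body> <b>bold two</b>'
matches = pattern.findall(html)
print(matches)  # ['bold one', 'bold two']
```

The (.*?) group is lazy, so each match stops at the first closing tag rather than running on to the last one. A regex like this is fine for quick extraction, but it does not cope with nested tags.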

© 2009-2010 Science for SEO All Rights Reserved -- Copyright notice by Blog Copyright
