When you read papers or patents about information retrieval, the authors sometimes assume you already know the terms they use and don’t explain them.
I have written a tutorial on the Search Engine Index, and also on Search Engine Spiders and Clustering. Those should help you out too.
Here is a short glossary of terms that show up quite a lot in computing papers about IR (designed to be super simple):
Corpus (pl. Corpora): a collection of documents used for testing or for real use.
Inverted index: an index that has terms as keys, each mapped to the documents in which that term occurs.
Inverted-vector index: an index that maps terms to document URI objects, which represent the documents in which the terms occur.
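To make the inverted index a bit more concrete, here is a minimal sketch in Python. The function name `build_inverted_index` and the two tiny documents are made up for illustration; real indexes also handle tokenising, stemming, term positions and so on.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenisation: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical two-document corpus.
docs = {1: "search engines index pages", 2: "spiders crawl pages"}
index = build_inverted_index(docs)
print(sorted(index["pages"]))  # [1, 2]
```

Looking up a term is then a single dictionary access, which is why this structure suits keyword search.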
Deterministic algorithm: an algorithm that behaves in a predictable way. Given the same input, it always goes through the same sequence of states and gives the same result.
Non-deterministic algorithm: an algorithm with one or more points where it can branch off to any of several possible next states, and which one it takes is not specified. Its behaviour is therefore unpredictable.
Probabilistic algorithm (also called randomized algorithm): an algorithm that makes random choices as it runs, so the result (or the running time) depends on chance.
Complexity: a measure of the resources needed to execute a program or an algorithm, usually as a function of the input size. The resource could be, for example, time or memory.
Dichotomic search: a search method that chooses between 2 distinct alternatives at each stage. Binary search is the classic example.
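Binary search over a sorted list is probably the best-known dichotomic search, so here is a small sketch of it. The function name `dichotomic_search` is my own choice, and the list is assumed to be sorted in ascending order.

```python
def dichotomic_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        # Two alternatives at each stage: keep the right half or the left half.
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1

print(dichotomic_search([1, 3, 5, 7, 9], 7))  # 3
```

Each step halves the remaining range, so the number of comparisons grows with the logarithm of the list length.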
Directed graph: a graph in which each edge has a direction, so it can only be followed from one vertex to the next, and each edge is an ordered pair of vertices. An “edge” is the connection that exists between 2 vertices, and a “vertex” is an item in the graph. “In-degree” refers to the number of edges coming into a vertex.
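As a rough illustration, a directed graph can be stored as a list of ordered (source, target) pairs, and the in-degree of a vertex is then a simple count. The vertex names and the helper `in_degrees` below are hypothetical.

```python
from collections import Counter

# Each edge is an ordered pair: (source vertex, target vertex).
edges = [("A", "B"), ("A", "C"), ("B", "C")]

def in_degrees(edge_list):
    """Count how many edges come into each vertex."""
    return Counter(target for _source, target in edge_list)

degrees = in_degrees(edges)
print(degrees["C"])  # 2
print(degrees["A"])  # 0 (no edge points at A)
```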
Weighted graph: a graph in which each edge has a weight assigned to it. The weight of a path is the sum of the weights of its edges. It’s basically a network, which can be directed or undirected.
Sparse matrix: a matrix in which most entries are zero, so there are hardly any non-zero entries.
Random sampling: you select a sample of the data at random and analyse it in order to solve a problem involving the entirety of the data.
Zipf’s law: a word’s frequency is roughly inversely proportional to its rank in the frequency list. A few words occur very often while many others occur rarely.
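Here is a small sketch of how the weight of a path can be computed. The graph is stored as a dictionary from (source, target) pairs to weights; the vertex names, weights and the function `path_weight` are all invented for the example.

```python
# Hypothetical weighted, directed graph: (source, target) -> weight.
weights = {("A", "B"): 2.0, ("B", "C"): 3.5, ("A", "C"): 7.0}

def path_weight(path, edge_weights):
    """Sum the weights of the edges between consecutive vertices in path."""
    return sum(edge_weights[(u, v)] for u, v in zip(path, path[1:]))

print(path_weight(["A", "B", "C"], weights))  # 5.5
print(path_weight(["A", "C"], weights))       # 7.0
```

In this toy graph, going from A to C through B is "cheaper" than the direct edge, which is the kind of comparison shortest-path algorithms make.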
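Because most entries are zero, a sparse matrix is usually stored by keeping only the non-zero ones. One simple way to do that (a sketch, not the only scheme) is a dictionary keyed by (row, column); the name `to_sparse` is just for illustration.

```python
def to_sparse(dense):
    """Keep only the non-zero entries, keyed by (row, column)."""
    return {
        (r, c): value
        for r, row in enumerate(dense)
        for c, value in enumerate(row)
        if value != 0
    }

dense = [
    [0, 0, 3],
    [0, 0, 0],
    [1, 0, 0],
]
sparse = to_sparse(dense)
print(sparse)                 # {(0, 2): 3, (2, 0): 1}
print(sparse.get((1, 1), 0))  # 0, missing entries are treated as zero
```

Term-document matrices in IR are a typical case: most terms do not appear in most documents.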
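A quick sketch of the idea: estimate the average of a large collection by looking at a random sample only. The function `estimate_mean`, the population and the sample size are assumptions for the example, and the seed is fixed just to make the run repeatable.

```python
import random

def estimate_mean(population, sample_size, seed=0):
    """Estimate the mean of population from a random sample of it."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size

population = list(range(1, 10001))      # true mean is 5000.5
estimate = estimate_mean(population, 500)
print(estimate)  # close to 5000.5, but not exactly equal
```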
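You can see the shape of this on any text by counting words and sorting them by frequency. The sentence below is a toy example (far too short to follow the law closely), and the variable names are my own.

```python
from collections import Counter

text = "the cat and the dog and the bird"
counts = Counter(text.split())

# most_common() returns (word, count) pairs, most frequent first.
ranked = counts.most_common()
for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count)
```

On a large corpus, the count at rank r tends to be roughly the top count divided by r.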
E-measure: an evaluation metric combining recall (the fraction of relevant documents that are actually retrieved) and precision (the fraction of retrieved documents that are relevant).
Entropy: it comes from physics, where it’s the “degree of chaos”. In computing it measures how random or unpredictable data is: high entropy means there is little pattern or organisation.
Regular expression (Regex): a pattern made up of particular symbols (like *, ^, .), written in a context-independent syntax. It helps with pattern matching in text documents, for example. This captures the contents of HTML tags: <TAG\b[^>]*>(.*?)</TAG>
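As a sketch, here is the usual van Rijsbergen form of the E-measure, E = 1 - (1 + b²)PR / (b²P + R), where P is precision, R is recall and b sets how much recall matters relative to precision. The function names and the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from two sets of document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E: 0 is a perfect result, 1 is the worst."""
    if precision == 0 or recall == 0:
        return 1.0
    f = (1 + b**2) * precision * recall / (b**2 * precision + recall)
    return 1 - f

# Hypothetical result set: 4 documents retrieved, 2 of them relevant.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 6, 8})
print(p, r)             # 0.5 0.5
print(e_measure(p, r))  # 0.5
```

With b = 1, E is simply 1 minus the F1 score, so lower values mean better retrieval.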
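In information theory this is usually made precise as Shannon entropy, measured in bits. The sketch below computes it for the symbols in a string; the function name `shannon_entropy` is my own.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy in bits per symbol: sum of p * log2(1 / p)."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

print(shannon_entropy("aaaa"))  # 0.0, fully predictable
print(shannon_entropy("abab"))  # 1.0, one bit per symbol
print(shannon_entropy("abcd"))  # 2.0, two bits per symbol
```

The more evenly spread the symbols are, the higher the entropy and the less pattern there is to exploit.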
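Here is how a pattern of that shape might be used with Python's `re` module. The helper `tag_contents` and the sample HTML are hypothetical, and a regex like this is only good for simple, non-nested tags; it is not a real HTML parser.

```python
import re

def tag_contents(html, tag):
    """Return the text captured between <tag ...> and </tag>."""
    name = re.escape(tag)
    # \b stops "p" from also matching "pre"; (.*?) is a non-greedy capture.
    pattern = re.compile(
        rf"<{name}\b[^>]*>(.*?)</{name}>",
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.findall(html)

html = '<p class="intro">Hello</p><p>World</p>'
print(tag_contents(html, "p"))  # ['Hello', 'World']
```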

