Information Extraction is not Information Retrieval

Figure: the GATE 5 main window (the GATE interface)

Here we will mostly cover what information extraction (IE) is, because it isn’t given nearly as much attention as information retrieval (IR). The differences are highlighted, but for more in-depth information on IR check Manning’s online book.

I’ve provided an IR glossary and also a tutorial on document clustering, which you might find useful.

The difference:

“Information Extraction is not Information Retrieval: Information Extraction differs from traditional techniques in that it does not recover from a collection a subset of documents which are hopefully relevant to a query, based on key-word searching (perhaps augmented by a thesaurus). Instead, the goal is to extract from the documents (which may be in a variety of languages) salient facts about prespecified types of events, entities or relationships. These facts are then usually entered automatically into a database, which may then be used to analyse the data for trends, to give a natural language summary, or simply to serve for on-line access.” (GATE)

What is information extraction:

Information Extraction (IE) systems analyse unrestricted text in order to extract information about pre-specified types of events, entities or relationships. In other words, information extraction is all about deriving structured factual information from unstructured text. It uses techniques currently applied to text mining. It works by combining Natural Language Processing tools, lexical resources and semantic constraints, and can be extremely effective.
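To make the contrast concrete, here is a minimal sketch in Python. The toy documents, the function names and the date pattern are all assumptions made for illustration; real IR and IE systems are far more sophisticated. The first function behaves like IR and returns documents, the second behaves like IE and returns structured records.

```python
import re

# Toy collection; the texts are illustrative assumptions.
DOCS = {
    "doc1": "Miró married Pilar Juncosa in Palma de Mallorca on October 12, 1929.",
    "doc2": "Miró painted for most of his life.",
    "doc3": "The GATE interface shows annotated text.",
}

def retrieve(query, docs):
    """IR-style: return the ids of documents that contain every query keyword."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in docs.items()
            if all(term in text.lower() for term in terms)]

# A prespecified fact type: dates written like "October 12, 1929".
# The pattern is deliberately naive (any capitalised word counts as a month).
DATE = re.compile(r"\b([A-Z][a-z]+) (\d{1,2}), (\d{4})\b")

def extract_dates(docs):
    """IE-style: return structured records rather than whole documents."""
    records = []
    for doc_id, text in docs.items():
        for month, day, year in DATE.findall(text):
            records.append({"doc": doc_id, "month": month,
                            "day": int(day), "year": int(year)})
    return records

print(retrieve("miró", DOCS))   # a list of document ids
print(extract_dates(DOCS))      # a list of fact records
```

The retrieval call leaves the reader to open the documents and find the answer, while the extraction call hands back fields that could go straight into a database.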

Unstructured text:

Unstructured data includes web pages, text documents, office documents, presentations, emails and so on. It doesn’t have a data model, so it can’t be easily processed by a machine. In contrast, structured data is either annotated or stored in databases. The semantic web aims, amongst other things, to make all of this data machine readable by tagging it up appropriately.

“Over 95 percent of the digital universe is unstructured data. In organizations, unstructured data accounts for more than 80 percent of all information.” (Jonathan Martin, HP)

It’s also referred to as “dark matter”.

“Most of the stuff in clusters of galaxies is invisible and, since these are the largest structures in the Universe held together by gravity, scientists then conclude that most of the matter in the entire Universe is invisible. This invisible stuff is called ‘dark matter’.” (NASA)

Most of the stuff on the web is invisible and, since these unstructured documents are the largest data type in the web Universe held together by links, scientists then conclude that most of the data in the entire web is invisible. This invisible stuff is called ‘dark matter’.

How Information Extraction works:

Documents are tagged, and each one is processed to find (extract) entities and relationships (facts or events) that are likely to be meaningful and content-bearing. This information is more concise and more precise for use in the mining process. Using relationships provides more meaningful information related to the domain of the documents.

Example of a tagged sentence (Brill):

Miró married Pilar Juncosa in Palma de Mallorca on October 12, 1929;

Miró/NNP married/VBD Pilar/NNP Juncosa/NNP in/IN Palma/NNP de/FW Mallorca/NNP on/IN October/NNP 12/CD ,/, 1929/CD ;/:

The meaning of the tags can be found here.
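The word/TAG format above is easy to read back into a program. The following is a small sketch, assuming only that items are separated by whitespace and that the tag follows the last slash; the function name is made up for illustration.

```python
TAGGED = ("Miró/NNP married/VBD Pilar/NNP Juncosa/NNP in/IN Palma/NNP de/FW "
          "Mallorca/NNP on/IN October/NNP 12/CD ,/, 1929/CD ;/:")

def parse_tagged(sentence):
    """Split whitespace-separated 'word/TAG' items into (word, tag) pairs."""
    pairs = []
    for item in sentence.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains a slash keeps its text intact.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(TAGGED)
print(pairs[:4])
```

Once the sentence is a list of pairs, later steps can filter or group tokens by tag.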

Named entity recognition (NER):

This is about identifying textual information relating to people, organisations, places, brands, products and so on. These are typically nouns and proper nouns. This sounds pretty easy, but it’s not, because some named entities are not obvious, like the brand “Orange” for example.
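A very naive way to see why this is harder than it sounds is to treat every run of consecutive proper-noun tags (NNP) as a candidate entity. This is only a sketch under that assumption, not how GATE or any other tool does it, and the function name is hypothetical.

```python
from itertools import groupby

# The tagged example sentence as (word, tag) pairs.
TAGGED_TOKENS = [
    ("Miró", "NNP"), ("married", "VBD"), ("Pilar", "NNP"), ("Juncosa", "NNP"),
    ("in", "IN"), ("Palma", "NNP"), ("de", "FW"), ("Mallorca", "NNP"),
    ("on", "IN"), ("October", "NNP"), ("12", "CD"), (",", ","),
    ("1929", "CD"), (";", ":"),
]

def candidate_entities(tagged_tokens):
    """Join each run of consecutive NNP tokens into one candidate name."""
    candidates = []
    runs = groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP")
    for is_proper, run in runs:
        if is_proper:
            candidates.append(" ".join(word for word, _ in run))
    return candidates

print(candidate_entities(TAGGED_TOKENS))
```

On this sentence the heuristic finds “Pilar Juncosa” but splits “Palma de Mallorca” in two (because “de” is tagged FW) and offers “October” as a candidate, which is exactly the kind of mistake a real NER system has to avoid.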

Feature extraction can be used to improve NER.  Every word can have many different features.
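As an illustration, here is a sketch of the kind of features a word might be given. The feature names and the example sentence are assumptions chosen for this post; a real system would pick its own set and feed them to a trained classifier.

```python
def token_features(tokens, i):
    """Return a small, illustrative feature dictionary for tokens[i]."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        "is_first": i == 0,
        # Neighbouring words give context that the word alone lacks.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_lower": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
    }

tokens = "I switched my phone contract to Orange last week".split()
features = token_features(tokens, tokens.index("Orange"))
print(features)
```

The capital letter in mid-sentence and the surrounding words are the sort of clues that can help tell the brand “Orange” from the fruit or the colour.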

Relation extraction:

Relation extraction helps IR systems to answer particular information-seeking queries. These systems run into trouble when the data is complex and a multitude of variables are involved. Combinations of different variables can be used to get around this, and techniques such as LSA (latent semantic analysis) are put to good use.

Relations can be:

Implicit: they are not stated directly, so recovering them requires some understanding of the text

Explicit: they are explicitly spelled out in the text
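An explicit relation can sometimes be picked up with a simple pattern. The sketch below assumes a hand-written regular expression for sentences of the form “X married Y in PLACE on DATE”; the pattern, the field names and the second test sentence are illustrative assumptions, and real systems rely on much richer linguistic analysis.

```python
import re

# Hand-written pattern for one explicit way of stating a marriage.
MARRIAGE = re.compile(
    r"(?P<person1>\S+) married (?P<person2>.+?) in (?P<place>.+?) "
    r"on (?P<date>\w+ \d{1,2}, \d{4})"
)

def extract_marriage(sentence):
    """Return the relation as a dict of named fields, or None if no match."""
    match = MARRIAGE.search(sentence)
    return match.groupdict() if match else None

explicit = "Miró married Pilar Juncosa in Palma de Mallorca on October 12, 1929;"
implicit = "Pilar Juncosa became Miro's wife in 1929."

print(extract_marriage(explicit))
print(extract_marriage(implicit))
```

The second sentence expresses the same relation without the word “married”, so the pattern misses it. Catching that kind of implicit statement is where some understanding of the text is needed.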

Check the GATE example to see what it looks like when it’s tagged up.

Output:

The output is structured information which can be stored in a database for further processing or used directly in another system. This is very useful when you consider the amount of “dark matter”!
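As a rough sketch of that last step, the snippet below stores one extracted fact in an in-memory SQLite database using Python’s standard sqlite3 module. The table layout, column names and the hard-coded fact are assumptions for illustration; in practice the records would come from the extraction step.

```python
import sqlite3

# Hypothetical output of an extraction step, with the date normalised.
FACTS = [
    {"person1": "Miró", "person2": "Pilar Juncosa",
     "place": "Palma de Mallorca", "wedding_date": "1929-10-12"},
]

def store_facts(facts):
    """Create an in-memory table and insert one row per extracted fact."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE marriage "
        "(person1 TEXT, person2 TEXT, place TEXT, wedding_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO marriage VALUES (:person1, :person2, :place, :wedding_date)",
        facts,
    )
    conn.commit()
    return conn

conn = store_facts(FACTS)
rows = conn.execute(
    "SELECT person2, place FROM marriage WHERE wedding_date LIKE ?", ("1929%",)
).fetchall()
print(rows)
```

Once the facts are rows in a table, questions such as “who got married in 1929?” become ordinary queries instead of reading tasks.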

It’s used in information retrieval to make the system more precise.  It can also be used in summarization systems and also to auto-fill databases from text.

In short:

Most of our data is unstructured

IR is there to find relevant documents

IE is there to extract relevant information from the documents

Software:

GATE

LingPipe

OpenCalais

Stanford NER

© 2009-2013 Science for SEO. All Rights Reserved.
