The GATE interface
Here we will mostly cover what information extraction (IE) is, because it isn’t given nearly as much attention as information retrieval (IR). The differences are highlighted, but for more in-depth information on IR, check Manning’s online book.
I’ve provided an IR glossary and also a tutorial on document clustering, which you might find useful.
The difference:
“Information Extraction is not Information Retrieval: Information Extraction differs from traditional techniques in that it does not recover from a collection a subset of documents which are hopefully relevant to a query, based on key-word searching (perhaps augmented by a thesaurus). Instead, the goal is to extract from the documents (which may be in a variety of languages) salient facts about prespecified types of events, entities or relationships. These facts are then usually entered automatically into a database, which may then be used to analyse the data for trends, to give a natural language summary, or simply to serve for on-line access.” (GATE)
What is information extraction:
Information Extraction (IE) systems analyse unrestricted text in order to extract information about pre-specified types of events, entities or relationships. In other words, information extraction is all about deriving structured factual information from unstructured text. It uses techniques currently applied to Text Mining. It works by combining Natural Language Processing tools, lexical resources and semantic constraints, and can be extremely effective.
Unstructured text:
Unstructured data includes web pages, text documents, office documents, presentations, emails, and so on. It doesn’t have a data model, so it can’t be easily processed by a machine. In contrast, structured data is either annotated or held in databases. The semantic web aims, amongst other things, to make all of this data machine readable by tagging it up appropriately.
“Over 95 percent of the digital universe is unstructured data. In organizations, unstructured data accounts for more than 80 percent of all information.” (Jonathan Martin, HP)
It’s also referred to as “dark matter”.
“Most of the stuff in clusters of galaxies is invisible and, since these are the largest structures in the Universe held together by gravity, scientists then conclude that most of the matter in the entire Universe is invisible. This invisible stuff is called ‘dark matter’.” (NASA)
Most of the stuff on the web is invisible and, since these unstructured documents are the largest data type in the web Universe held together by links, scientists then conclude that most of the data in the entire web is invisible. This invisible stuff is called ‘dark matter’.
How Information Extraction works:
Documents are tagged, and each one is processed to find (extract) entities and relationships (facts or events) that are likely to be meaningful and content-bearing. This information is more concise and more precise for use in the mining process. Using relationships provides more meaningful information related to the domain of the documents.
Example of a tagged sentence (Brill):
Miró married Pilar Juncosa in Palma de Mallorca on October 12, 1929;
Miró/NNP married/VBD Pilar/NNP Juncosa/NNP in/IN Palma/NNP de/FW Mallorca/NNP on/IN October/NNP 12/CD ,/, 1929/CD ;/:
The meaning of the tags can be found here.
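To make the tagged format concrete, here is a minimal Python sketch that splits the word/TAG tokens into pairs and groups consecutive proper nouns (NNP) into candidate names. The function names `parse_tagged` and `proper_noun_runs` are made up for illustration; this is not how the Brill tagger or GATE works internally, only a way to read their kind of output.

```python
def parse_tagged(tagged):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged.split():
        # rpartition splits on the LAST slash, so ",/," and ";/:" parse correctly
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def proper_noun_runs(pairs):
    """Group consecutive NNP tokens into candidate names (a naive heuristic)."""
    runs, current = [], []
    for word, tag in pairs:
        if tag == "NNP":
            current.append(word)
        elif current:
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs


tagged = ("Miró/NNP married/VBD Pilar/NNP Juncosa/NNP in/IN Palma/NNP de/FW "
          "Mallorca/NNP on/IN October/NNP 12/CD ,/, 1929/CD ;/:")
pairs = parse_tagged(tagged)
candidates = proper_noun_runs(pairs)
print(candidates)
```

With this sentence the heuristic returns "Miró" and "Pilar Juncosa" intact, but it breaks "Palma de Mallorca" into two pieces because "de" is tagged FW, and it treats "October" as a name. Part-of-speech tags alone are a starting point for finding entities, not a complete solution.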
Named entity recognition (NER):
This is about identifying textual information relating to people, organisations, places, brands, products and so on. These are typically nouns and proper nouns. This sounds pretty easy, but it’s not, because some named entities are not obvious, like the brand “Orange” for example.
Feature extraction can be used to improve NER. Every word can have many different features.
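As a rough illustration of what "features" can mean here, the sketch below builds a small feature dictionary for one word in a sentence. The feature set and the name `word_features` are assumptions chosen for the example; real NER systems use many more features and feed them to a trained classifier, which is not shown.

```python
def word_features(tokens, i):
    """Return a dict of simple, hand-picked features for tokens[i]."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        "is_first": i == 0,
        # Neighbouring words give context that the word itself lacks
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


tokens = "I switched my phone contract to Orange last year".split()
features = word_features(tokens, tokens.index("Orange"))
print(features)
```

For "Orange", the capital letter in the middle of a sentence and the preceding word "to" are the kind of clues that can help a model tell the brand from the fruit or the colour.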
Relation extraction:
This helps IR systems to answer particular information-seeking queries. They run into trouble when the data is complex and a multitude of variables are involved. Combinations of different variables can be used to get around this, and techniques such as latent semantic analysis (LSA) are put to good use.
Relations can be:
Implicit: they are not stated outright, so extracting them requires an understanding of the text
Explicit: they are explicitly spelled out in the text
Check the GATE example to see what it looks like when it’s tagged up.
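For an explicit relation, even a hand-written pattern can pull out structured facts. The sketch below applies a regular expression to the Miró sentence from the tagging example. The pattern and the name `extract_marriage` are assumptions made for this one sentence shape; this is not GATE's method, and a real system would rely on tagged entities rather than raw text.

```python
import re
from typing import Optional

# Hypothetical pattern for sentences shaped like "X married Y in PLACE on DATE"
MARRIAGE = re.compile(
    r"(?P<subject>.+?) married (?P<object>.+?) in (?P<place>.+?) "
    r"on (?P<date>\w+ \d{1,2}, \d{4})"
)


def extract_marriage(sentence: str) -> Optional[dict]:
    """Return the fields of a marriage relation, or None if the pattern fails."""
    match = MARRIAGE.search(sentence)
    return match.groupdict() if match else None


sentence = "Miró married Pilar Juncosa in Palma de Mallorca on October 12, 1929;"
relation = extract_marriage(sentence)
print(relation)
```

A pattern like this only covers relations that are spelled out in a predictable form. Implicit relations, or the same fact phrased differently, would slip straight past it.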
Output:
The output is structured information which can be stored in a database for further processing or used directly in another system. This is very useful when you consider the amount of “dark matter”!
It’s used in information retrieval to make the system more precise. It can also be used in summarization systems and to auto-fill databases from text.
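As a small sketch of the "auto-fill a database" idea, the code below stores a few extracted facts as subject/relation/object rows in an in-memory SQLite database and queries them back. The table layout and the relation names (`married`, `married_in`, `married_on`) are assumptions made for the example, with the facts taken from the Miró sentence.

```python
import sqlite3

# Facts as (subject, relation, object) triples; relation names are illustrative
facts = [
    ("Miró", "married", "Pilar Juncosa"),
    ("Miró", "married_in", "Palma de Mallorca"),
    ("Miró", "married_on", "October 12, 1929"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
conn.execute("CREATE TABLE facts (subject TEXT, relation TEXT, object TEXT)")
conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", facts)
conn.commit()

# Once the facts are structured, a plain SQL query can answer questions about them
rows = conn.execute(
    "SELECT relation, object FROM facts WHERE subject = ? ORDER BY relation",
    ("Miró",),
).fetchall()
print(rows)
conn.close()
```

Once the facts sit in a table like this, they can be filtered, joined and counted like any other structured data, which is what makes the extraction step worthwhile.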
In short:
Most of our data is unstructured
IR is there to find relevant documents
IE is there to extract relevant information from the documents
Software:



