What is information extraction in text mining

Text mining - acquiring knowledge from texts

“Text mining is the art and technology to extract knowledge from text”

Gao, Chang and Han

Text mining applies data mining methods to unstructured data. This chapter first distinguishes this still young research area from related research fields. It then defines the term and describes the text mining process. Next, the KDD process and the text mining process are compared, showing that the two differ mainly in the preprocessing phase. For that reason, the following section describes methods for structuring and analyzing free text. Finally, the individual tasks of text mining are described.


The boundary between text mining and related areas such as information retrieval (IR), information extraction (IE), text classification, machine learning, web mining and data mining is not always clear in the literature or in practice.

As already mentioned, IR deals with retrieving entire documents from a text collection in order to satisfy a specific information need (see chapter Information Retrieval on the Web). Typically, users enter keywords that describe the documents they want. In return, they receive a list of the matching documents, sorted by relevance. The central function of an IR system is to store texts in such a way that they can be found again (cf. Ferber, 2003, p. 21). An IR system can serve as the basis for selecting and subsequently processing texts within a text mining process. Found texts or text parts can already be transferred into a predefined database structure within the IR system, e.g. by IE.
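
The keyword query and the relevance-sorted result list described above can be illustrated with a minimal Python sketch. It is not taken from the cited literature: the function names, the sample documents and the scoring (a plain count of keyword occurrences) are assumptions chosen for illustration, and a real IR system would use an index and a more refined relevance measure.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(query, documents):
    """Return (doc_id, score) pairs for matching documents, best first.

    The score is the number of times any query keyword occurs in the
    document. This is a deliberately naive stand-in for a relevance measure.
    """
    keywords = set(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[word] for word in keywords)
        if score > 0:  # documents without any keyword are not returned
            scored.append((doc_id, score))
    # Sort by descending score, then by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical mini collection, used only for this example.
documents = {
    "d1": "Text mining extracts knowledge from text.",
    "d2": "Data mining analyzes structured data in databases.",
    "d3": "Information retrieval finds documents in a collection.",
}

ranking = rank_documents("text mining", documents)
print(ranking)
```

Note that the result consists of whole documents. Nothing inside a document is extracted, which is the difference from IE discussed next.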

While IR systems find relevant texts for a query, an IE system works at a finer granularity. On the basis of defined rules, IE systems analyze the texts of a document collection and extract specific words or parts of text (cf. Cunningham, 2004, p. 3).
Cunningham (2004, p. 1ff) describes IE as “the process deriving disambiguated quantifiable data from natural language texts” for a precisely predefined information requirement.

IE requires a relatively large amount of preparatory work to describe the desired data or text parts (also called "snippets"). This can be done by creating formal rules. Another possibility is to first mark up (annotate) the desired words or phrases manually. Rules are then generated automatically from these annotations and can be applied to other documents (cf. Mooney and Nahm, 2003, p. 142).
IE converts unstructured texts into tabular form and usually stores the result in a database. The elements to be extracted are clearly defined and geared towards a specific information need. Typical examples are name, location and date for a listing of events, or name, telephone number and address for a collection of addresses.
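
A rule-based extraction of this kind can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the rule `EVENT_RULE`, the sentence pattern it expects ("The ... takes place in ... on YYYY-MM-DD") and the sample text are hypothetical, and real IE systems use far richer rule sets and linguistic preprocessing.

```python
import re

# Hypothetical hand-written rule: it only matches sentences of the form
# "The <Name> takes place in <Location> on <YYYY-MM-DD>".
EVENT_RULE = re.compile(
    r"The (?P<name>[A-Z][\w ]+?) takes place in (?P<location>[A-Z][a-z]+)"
    r" on (?P<date>\d{4}-\d{2}-\d{2})"
)


def extract_events(text):
    """Apply the rule to free text and return one table row (dict) per match."""
    return [match.groupdict() for match in EVENT_RULE.finditer(text)]


# Invented sample text, used only for this example.
text = (
    "The Text Mining Workshop takes place in Berlin on 2005-03-14. "
    "Registration is open. "
    "The Retrieval Forum takes place in Vienna on 2005-06-02."
)

rows = extract_events(text)
for row in rows:
    print(row["name"], row["location"], row["date"], sep=" | ")
```

Each dictionary corresponds to one row of the target table with the columns name, location and date. The sentence "Registration is open." produces no row because it does not match the rule. In practice, the rows would then be written to a database.
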
As discussed, data mining mainly analyzes structured data from databases. Text mining, by contrast, tries to convert unstructured data or texts into a structure (cf. Weiss et al., 2005, p. 3). Web mining can be understood as the application of data mining or text mining to the web (cf. Mehler and Wolff, 2005, p. 7).
