Package jcolibri.extensions.textual.IE

This package stores the extension for Textual CBR.


Class Summary
IEutils Utility functions for the IE extension.
 

Package jcolibri.extensions.textual.IE Description

The extension follows the layered model proposed by Mario Lenz, which divides the processing of texts into several stages:


Representation

jCOLIBRI provides the generic Text object to store texts in cases. These objects are managed by the methods of the jcolibri.extensions.textual package.

This object is not enough to manage the information required by the Lenz stages, so this package defines a subclass named IEText (text for Information Extraction).

An IEText receives its content as a String, and a later method organizes this content. A text is thus composed of paragraphs, paragraphs of sentences, and sentences of tokens:

Each token represents a word in the text. These objects store information such as:

The organization in paragraphs, sentences and tokens is performed by specific methods depending on the chosen implementation.
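
As a rough illustration of this organization, the following Java sketch splits a String into paragraphs, sentences and tokens. The class TextStructureSketch and its nested types are hypothetical names chosen for this example, not the jCOLIBRI classes, and the splitting rules (blank lines, end-of-sentence punctuation, whitespace) are simplifying assumptions. The OpenNLP and GATE splitters are not reproduced here and may behave differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the paragraph/sentence/token hierarchy (not the jCOLIBRI API). */
final class TextStructureSketch {

    /** A token holds one word of the text. */
    static final class Token {
        final String word;
        Token(String word) { this.word = word; }
    }

    /** A sentence is an ordered list of tokens. */
    static final class Sentence {
        final List<Token> tokens = new ArrayList<>();
    }

    /** A paragraph is an ordered list of sentences. */
    static final class Paragraph {
        final List<Sentence> sentences = new ArrayList<>();
    }

    /**
     * Naive organizer (assumption): blank lines separate paragraphs,
     * '.', '!' or '?' followed by whitespace ends a sentence,
     * and whitespace separates tokens. Punctuation stays attached to the word.
     */
    static List<Paragraph> organize(String content) {
        List<Paragraph> paragraphs = new ArrayList<>();
        for (String rawParagraph : content.split("\\n\\s*\\n")) {
            Paragraph paragraph = new Paragraph();
            for (String rawSentence : rawParagraph.split("(?<=[.!?])\\s+")) {
                Sentence sentence = new Sentence();
                for (String word : rawSentence.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        sentence.tokens.add(new Token(word));
                    }
                }
                if (!sentence.tokens.isEmpty()) {
                    paragraph.sentences.add(sentence);
                }
            }
            if (!paragraph.sentences.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        List<Paragraph> paragraphs = organize("The cat sat. It purred.\n\nA new paragraph.");
        for (Paragraph paragraph : paragraphs) {
            for (Sentence sentence : paragraph.sentences) {
                System.out.println(sentence.tokens.size() + " tokens");
            }
        }
    }
}
```

Keeping the raw String and deriving the structure in a separate step mirrors the description above, where the text first receives its content and a later method organizes it.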

The information that the methods extract from the text is stored in the IEText object. Dedicated methods obtain different kinds of information:

Phrases and Features are stored using the objects implemented in the representation.info subpackage. That package stores three objects that aid in the representation of the extracted information:

The following picture illustrates the whole organization:


Methods

jCOLIBRI includes several implementations of the Lenz layers. Some methods have been implemented in a general way. Other methods use the Maximum Entropy algorithms implemented in the OpenNLP package. Finally, another group of methods uses the GATE library for text processing.

Each group of methods can only work with certain textual objects. The OpenNLP implementation has its own specialization of IEText named IETextOpenNLP, and the GATE implementation has the IETextGate object:

The following tables organize the available methods. The first lists, for each implementation, the compatible textual objects and the package:

Implementation  Compatible Textual Object          Package
OpenNLP         IETextOpenNLP                      jcolibri.extensions.textual.IE.opennlp
GATE            IETextGate                         jcolibri.extensions.textual.IE.gate
Generic         IETextOpenNLP, IETextGate, IEText  jcolibri.extensions.textual.IE.common

The second lists the methods available for each layer. A dash marks a layer with no method in that implementation:

Layer                   OpenNLP                    GATE                   Generic
Organize Text           OpennlpSplitter            GateSplitter           -
Keyword: StopWords      -                          -                      StopWordsDetector
Keyword: Stemmer        -                          -                      TextStemmer
Keyword: POS tagging    OpennlpPOStagger           GatePOStagger          -
Keyword: Main Names     OpennlpMainNamesExtractor  -                      -
Phrase                  -                          GatePhrasesExtractor   PhrasesExtractor
Glossary                -                          -                      GlossaryLinker
Thesaurus               -                          -                      ThesaurusLinker
Feature Value           -                          GateFeaturesExtractor  FeaturesExtractor
Domain Structure        -                          -                      DomainTopicClassifier
Information Extraction  -                          -                      BasicInformationExtractor

Computing similarity

The methods of the IE extension extract information from texts and store it in the other attributes of the case (see BasicInformationExtractor). These attributes can be compared using normal similarity functions.

Textual attributes can also be compared using specific similarity functions located in the package jcolibri.method.retrieve.KNNretrieval.similarity.local.textual.

Some of them can only be applied to IEText objects (or its subclasses) because they require information stored in the tokens:

Cosine jcolibri.method.retrieve.KNNretrieval.similarity.local.textual.CosineCoefficient
Dice jcolibri.method.retrieve.KNNretrieval.similarity.local.textual.DiceCoefficient
Jaccard jcolibri.method.retrieve.KNNretrieval.similarity.local.textual.JaccardCoefficient
Overlap jcolibri.method.retrieve.KNNretrieval.similarity.local.textual.OverlapCoefficient
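
For orientation, the four coefficients named above have standard textbook definitions over two sets of tokens A and B. The Java sketch below computes those textbook forms; the class TokenSetSimilaritySketch and its method names are hypothetical, and the jCOLIBRI classes listed above may use different inputs or weighting, so this is an illustration of the measures rather than of the library code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of set-based token similarity (not the jCOLIBRI implementation). */
final class TokenSetSimilaritySketch {

    /** Number of tokens that appear in both sets. */
    private static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    /** Jaccard: |A ∩ B| / |A ∪ B|. */
    static double jaccard(Set<String> a, Set<String> b) {
        int common = intersectionSize(a, b);
        int union = a.size() + b.size() - common;
        return union == 0 ? 0.0 : (double) common / union;
    }

    /** Dice: 2 |A ∩ B| / (|A| + |B|). */
    static double dice(Set<String> a, Set<String> b) {
        int total = a.size() + b.size();
        return total == 0 ? 0.0 : 2.0 * intersectionSize(a, b) / total;
    }

    /** Overlap: |A ∩ B| / min(|A|, |B|). */
    static double overlap(Set<String> a, Set<String> b) {
        int smaller = Math.min(a.size(), b.size());
        return smaller == 0 ? 0.0 : (double) intersectionSize(a, b) / smaller;
    }

    /** Cosine (binary set form): |A ∩ B| / sqrt(|A| * |B|). */
    static double cosine(Set<String> a, Set<String> b) {
        double norm = Math.sqrt((double) a.size() * b.size());
        return norm == 0.0 ? 0.0 : intersectionSize(a, b) / norm;
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList("cbr", "text", "retrieval", "case"));
        Set<String> stored = new HashSet<>(Arrays.asList("text", "case", "similarity"));
        System.out.println("Jaccard: " + jaccard(query, stored));
        System.out.println("Dice:    " + dice(query, stored));
        System.out.println("Overlap: " + overlap(query, stored));
        System.out.println("Cosine:  " + cosine(query, stored));
    }
}
```

All four measures return 0 when the sets share no tokens and 1 when they are identical; they differ only in how the shared count is normalized. Empty sets are mapped to 0 here as a design choice of the sketch.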

There is also a similarity function that uses Apache Lucene to compare texts. This function can be applied to any Text subclass because it does not require any kind of extracted information:

LuceneTextSimilarity jcolibri.method.retrieve.KNNretrieval.similarity.local.textual.LuceneTextSimilarity

Examples

Test 13 shows how to use the textual CBR extension of jCOLIBRI2.


GAIA - Group for Artificial Intelligence Applications
http://gaia.fdi.ucm.es