The extension follows the layered model proposed by Mario Lenz, which divides text processing into several stages (layers).
jCOLIBRI has the generic Text object to store texts in cases. These objects are managed by the methods of the jcolibri.extensions.textual package.
This object is not enough to hold the information required by the Lenz stages, so this package defines a subclass named IEText (texts for Information Extraction).
An IEText receives its content as a String, and a method later organizes this content. In this way, a text is composed of paragraphs, paragraphs of sentences, and sentences of tokens.
Each token represents a word in the text. Token objects store information such as:
The organization in paragraphs, sentences and tokens is performed by specific methods depending on the chosen implementation.
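The paragraph, sentence and token hierarchy described above can be pictured with a minimal, self-contained sketch. The class and method names (`TextStructureSketch`, `organize`) are hypothetical and are not part of the jCOLIBRI API. The naive regular-expression splitting only stands in for whatever the chosen implementation (OpenNLP or GATE) actually does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not jCOLIBRI code: a text is organized into
// paragraphs, paragraphs into sentences, and sentences into tokens.
final class TextStructureSketch {

    // Returns paragraphs -> sentences -> tokens for the raw content.
    static List<List<List<String>>> organize(String content) {
        List<List<List<String>>> paragraphs = new ArrayList<>();
        // Assumption: paragraphs are separated by one or more blank lines.
        for (String paragraph : content.trim().split("\\n\\s*\\n")) {
            List<List<String>> sentences = new ArrayList<>();
            // Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
            for (String sentence : paragraph.trim().split("(?<=[.!?])\\s+")) {
                List<String> tokens = new ArrayList<>();
                // Assumption: tokens are maximal runs of letters or digits.
                for (String token : sentence.split("[^\\p{L}\\p{N}]+")) {
                    if (!token.isEmpty()) tokens.add(token);
                }
                if (!tokens.isEmpty()) sentences.add(tokens);
            }
            if (!sentences.isEmpty()) paragraphs.add(sentences);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String raw = "The car is red. It is fast!\n\nSecond paragraph here.";
        System.out.println(organize(raw));
    }
}
```

A real splitter would rely on trained models or grammar rules rather than punctuation alone, which is why the organization step is delegated to implementation-specific methods.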
The information that the methods extract from the text is stored in the IEText object. Different kinds of information are obtained by dedicated methods:
Phrases and features are stored using the objects implemented in the representation.info subpackage. That package contains three objects that aid in the representation of the extracted information:
The following picture illustrates the whole organization:
Methods
jCOLIBRI includes several implementations of the Lenz layers. Some methods have been implemented in a general way. Other methods use the Maximum Entropy algorithms implemented in the OpenNLP package. Finally, another group of methods uses the GATE library for text processing.
Each group of methods can only work with certain textual objects. The OpenNLP implementation has its own specialization of IEText named IETextOpenNLP, and the GATE implementation has the IETextGate object.
This table organizes the available methods:
Implementation | OpenNLP | GATE | Generic |
Compatible Textual Object | IETextOpenNLP | IETextGate | IETextOpenNLP, IETextGate, IEText |
Package | jcolibri.extensions.textual.IE.opennlp | jcolibri.extensions.textual.IE.gate | jcolibri.extensions.textual.IE.common |
Layers | | | |
Organize Text | OpennlpSplitter | GateSplitter | |
Keyword: StopWords | | | StopWordsDetector |
Keyword: Stemmer | | | TextStemmer |
Keyword: POS tagging | OpennlpPOStagger | GatePOStagger | |
Keyword: Main Names | OpennlpMainNamesExtractor | | |
Phrase | | GatePhrasesExtractor | PhrasesExtractor |
Glossary | | | GlossaryLinker |
Thesaurus | | | ThesaurusLinker |
Feature Value | | GateFeaturesExtractor | FeaturesExtractor |
Domain Structure | | | DomainTopicClassifier |
Information Extraction | | | BasicInformationExtractor |
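The layered idea behind the table can be sketched as a sequence of methods that each annotate the tokens of a text, so that later layers can build on earlier ones. Everything in this sketch is an assumption made for illustration: the `Token` class, the method names, the hard-coded stop-word list and the toy stemmer are hypothetical and do not reproduce the jCOLIBRI classes listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not jCOLIBRI code: each layer is a method that reads
// the tokens and adds its own annotation to them.
final class LayerPipelineSketch {

    // Illustrative token: the raw word plus annotations filled in by the layers.
    static final class Token {
        final String word;
        boolean stopWord;   // set by the stop-word layer
        String stem;        // set by the stemming layer
        Token(String word) { this.word = word; }
    }

    // Assumption: a tiny hard-coded stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "is", "of", "and"));

    // Stand-in for the "Organize Text" layer: split a sentence into tokens.
    static List<Token> tokenize(String sentence) {
        List<Token> tokens = new ArrayList<>();
        for (String w : sentence.split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) tokens.add(new Token(w));
        }
        return tokens;
    }

    // Stand-in for a stop-word layer: flag tokens found in the list.
    static void detectStopWords(List<Token> tokens) {
        for (Token t : tokens) {
            t.stopWord = STOP_WORDS.contains(t.word.toLowerCase(Locale.ROOT));
        }
    }

    // Toy stemmer: lower-cases and strips one trailing "s" from longer words.
    // A real stemmer (for example Porter's algorithm) is far more elaborate.
    static void stem(List<Token> tokens) {
        for (Token t : tokens) {
            String w = t.word.toLowerCase(Locale.ROOT);
            t.stem = (w.length() > 3 && w.endsWith("s"))
                    ? w.substring(0, w.length() - 1)
                    : w;
        }
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize("The engines of the car");
        detectStopWords(tokens);  // keyword layer: stop words
        stem(tokens);             // keyword layer: stemming
        for (Token t : tokens) {
            System.out.println(t.word + " stopWord=" + t.stopWord + " stem=" + t.stem);
        }
    }
}
```

Keeping each layer as an independent method over a shared text object is what allows implementations to be mixed, as long as the textual object is compatible with every method applied to it.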
The methods of the IE extension extract information from texts and store it in the other attributes of the case (see BasicInformationExtractor). These attributes can be compared using normal similarity functions.
Textual attributes can also be compared using specific similarity functions located in the package jcolibri.method.retrieve.KNNretrieval.similarity.local.textual.
Some of them can only be applied to IEText objects (or their subclasses) because they require information stored in the tokens:
There is also a similarity function that uses Apache Lucene to compare texts. This function can be applied to any Text subclass because it does not require any kind of extracted information:
LuceneTextSimilarity | jcolibri.method.retrieve.KNNretrieval.similarity.local.textual.LuceneTextSimilarity |
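To give a feel for what a local similarity function over textual attributes computes, the following sketch compares two texts by the Jaccard coefficient of their word sets. It is a stand-in chosen for brevity: it does not use Lucene, it is not how LuceneTextSimilarity scores texts, and the names `TextSimilaritySketch`, `words` and `jaccard` are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not jCOLIBRI or Lucene code: Jaccard similarity
// between the word sets of two texts, returning a value in [0, 1].
final class TextSimilaritySketch {

    // Lower-cased set of the words (runs of letters or digits) in the text.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) result.add(w);
        }
        return result;
    }

    // |A intersect B| / |A union B| over the two word sets.
    static double jaccard(String a, String b) {
        Set<String> wa = words(a);
        Set<String> wb = words(b);
        // Convention chosen here: two texts without words count as identical.
        if (wa.isEmpty() && wb.isEmpty()) return 1.0;
        Set<String> common = new HashSet<>(wa);
        common.retainAll(wb);
        Set<String> all = new HashSet<>(wa);
        all.addAll(wb);
        return (double) common.size() / all.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("red fast car", "red slow car"));
    }
}
```

Because this function works on the raw strings, it needs no extracted information. A token-based variant could instead compare stems and skip stop words, which is the kind of information that is only available in IEText objects.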
Test 13 shows how to use the textual CBR extension of jCOLIBRI2.