Represents a Text that will be processed to extract its information.
A text is composed of paragraphs, paragraphs of sentences and sentences of tokens.
Each token represents a word in the text. These objects store information such as:
- Whether the token is a stop word (a very common word that carries little meaning on its own).
- Whether the token is a main name inside the sentence.
- The stemmed word.
- The Part-Of-Speech tag of the token.
- A list of relations with other similar tokens.
The division into paragraphs, sentences and tokens is performed by specific methods,
depending on the chosen organization.
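The hierarchy described above can be sketched as a handful of nested containers. This is only an illustration under assumed names: `Token`, `Sentence`, `Paragraph`, `Text`, `split_text` and the tiny `STOP_WORDS` set are hypothetical and are not the package's actual classes or methods. The splitting rule shown (blank lines for paragraphs, end punctuation for sentences) is one naive choice of organization.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Tiny illustrative stop-word list (assumption, not the package's list).
STOP_WORDS = {"the", "a", "an", "is", "of", "it"}


@dataclass
class Token:
    word: str
    is_stop_word: bool = False          # word that carries little meaning
    is_main_name: bool = False          # main name inside the sentence
    stem: Optional[str] = None          # stemmed form, filled by a stemmer
    pos_tag: Optional[str] = None       # Part-Of-Speech tag, filled by a tagger
    relations: list = field(default_factory=list)  # relations with similar tokens


@dataclass
class Sentence:
    tokens: List[Token]


@dataclass
class Paragraph:
    sentences: List[Sentence]


@dataclass
class Text:
    paragraphs: List[Paragraph]


def split_text(raw: str) -> Text:
    """Naive organization: blank lines separate paragraphs,
    end punctuation separates sentences, word characters form tokens."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        sentences = []
        for chunk in re.split(r"(?<=[.!?])\s+", block.strip()):
            words = re.findall(r"\w+", chunk)
            if words:
                tokens = [Token(w, is_stop_word=w.lower() in STOP_WORDS) for w in words]
                sentences.append(Sentence(tokens))
        if sentences:
            paragraphs.append(Paragraph(sentences))
    return Text(paragraphs)


text = split_text("The cat sleeps. It is quiet.\n\nA new paragraph.")
first_sentence = text.paragraphs[0].sentences[0]
print([t.word for t in first_sentence.tokens])
```

A different organization would replace only `split_text`; the container classes stay the same, which is presumably why the splitting is delegated to separate methods.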
The information extracted from the text is stored in the IEtext object. There are
different kinds of information that will be obtained by dedicated methods:
- Phrases identified in the text.
- Features: identifier-value pairs extracted from the text.
- Topics: by combining phrases and features, a topic can be associated with a text. A topic is a classification of the text.
Phrases and features are stored using the objects implemented in the info subpackage. That subpackage
provides three objects that aid in the representation of the extracted information:
- PhraseInfo: stores extracted phrases.
- FeatureInfo: stores extracted features.
- WeightedRelation: represents a weighted relation between two tokens. These relations are found by the glossary and thesaurus methods.
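As a rough sketch of how these three objects might look, the following uses plain dataclasses. Only the class names `PhraseInfo`, `FeatureInfo` and `WeightedRelation` come from the description above; every field name, and the `ExtractedInfo` container standing in for the information held by the text object, is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PhraseInfo:
    phrase: str                 # hypothetical field: the extracted phrase


@dataclass
class FeatureInfo:
    identifier: str             # hypothetical field: feature identifier
    value: str                  # hypothetical field: feature value


@dataclass
class WeightedRelation:
    source: str                 # hypothetical field: first token
    target: str                 # hypothetical field: related token
    weight: float               # hypothetical field: strength of the relation


@dataclass
class ExtractedInfo:
    """Hypothetical container for the information extracted from a text."""
    phrases: List[PhraseInfo] = field(default_factory=list)
    features: List[FeatureInfo] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)

    def feature_map(self) -> Dict[str, str]:
        # View features as identifier-value pairs.
        return {f.identifier: f.value for f in self.features}


info = ExtractedInfo()
info.phrases.append(PhraseInfo("red sports car"))
info.features.append(FeatureInfo("color", "red"))
info.topics.append("automotive")

relation = WeightedRelation(source="car", target="automobile", weight=0.9)
print(info.feature_map(), relation)
```

Keeping phrases, features and relations as small value objects makes it easy for the extraction methods to append results independently of one another.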
The following picture illustrates the whole organization: