Toolbox of functions for preprocessing text.
The module contains functions for a variety of preprocessing tasks, such as filtering out words with special characters, stemming, stop-word removal, case folding, and more, as well as functions for splitting text into lists of tokens or sentences. Use preprocess_text() and preprocess_token() for full preprocessing.
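As an illustration of what such a full pipeline typically does, here is a minimal standard-library sketch combining case folding, tokenization, stop-word removal, and length filtering. The function name `preprocess_text_sketch`, the tiny stop-word set, and the `min_size` parameter are assumptions for illustration only; the module itself presumably delegates these steps to NLTK.

```python
import re

# Hypothetical miniature stop-word list; the real module would likely
# use a full list such as NLTK's English stop words.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}

def preprocess_text_sketch(text, min_size=2):
    """Sketch of a full preprocessing pipeline: case folding,
    tokenization, stop-word removal, and short-token filtering."""
    tokens = re.findall(r"[a-z]+", text.lower())          # case folding + tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [t for t in tokens if len(t) >= min_size]      # drop very short tokens

print(preprocess_text_sketch("The parser is a tool for analysis of text."))
# → ['parser', 'tool', 'for', 'analysis', 'text']
```

A real implementation would also apply stemming and special-character filtering, as described above.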
Extraction of within-sentence word dependencies is also available through the extract_dependencies() function, which serves as an interface to the stanford_parser module.
The Natural Language Toolkit (NLTK) is used for most of the heavy lifting.
Author: Kjetil Valle <kjetilva@stud.ntnu.no>
Creates a dictionary with dependency information about the text.
Some sentences may be too long for the parser to handle, exhausting the Java heap space. When this happens, the sentence is skipped and its dependencies are omitted.
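The skip-on-failure strategy described above can be sketched as follows. The function and parameter names are hypothetical, and `parse` stands in for whatever call the stanford_parser module exposes; here a plain exception handler models the parser failing on an overly long sentence.

```python
def extract_dependencies_sketch(sentences, parse):
    """Sketch of the skip-on-failure strategy: sentences the parser
    cannot handle are dropped rather than aborting the whole text."""
    deps = {}
    for i, sentence in enumerate(sentences):
        try:
            deps[i] = parse(sentence)  # `parse` is a stand-in for the real parser call
        except Exception:
            # e.g. Java heap exhaustion surfacing as a parser error:
            # skip this sentence; its dependencies are simply omitted.
            continue
    return deps

# Dummy parser that fails on long sentences, to exercise the skip path.
def dummy_parse(sentence):
    if len(sentence.split()) > 5:
        raise RuntimeError("heap space exhausted")
    return [(w, "dep") for w in sentence.split()]

result = extract_dependencies_sketch(["a short one", "a very very very long sentence indeed"], dummy_parse)
print(sorted(result))  # → [0]  (the long sentence at index 1 was skipped)
```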
Filter list of tokens.
Any token with length below min_size is removed. If special_chars is True, all tokens containing special chars will also be removed.
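The two filtering rules above can be sketched in a few lines. The function name `filter_tokens_sketch` and the default `min_size` value are assumptions; `str.isalnum()` stands in for whatever "special characters" test the module actually applies.

```python
def filter_tokens_sketch(tokens, min_size=3, special_chars=True):
    """Sketch of the token filter: drop tokens shorter than min_size,
    and optionally any token containing non-alphanumeric characters."""
    kept = [t for t in tokens if len(t) >= min_size]
    if special_chars:
        kept = [t for t in kept if t.isalnum()]
    return kept

print(filter_tokens_sketch(["an", "apple", "co-op", "tree"]))
# → ['apple', 'tree']  ("an" is too short, "co-op" contains a special character)
```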