preprocess

Toolbox of functions for preprocessing text.

The module contains functions for a variety of preprocessing tasks, such as filtering out words with special characters, stemming, stop-word removal, case folding, and more, as well as functions for splitting text into lists of tokens or sentences. Use preprocess_text() and preprocess_token() for the full preprocessing pipeline.

Extraction of within-sentence word dependencies is also available through the extract_dependencies() function, which acts as an interface to the stanford_parser module.

The Natural Language Toolkit (NLTK) is used for most of the heavy lifting.

Author: Kjetil Valle <kjetilva@stud.ntnu.no>

preprocess.extract_dependencies(text)

Creates a dictionary with dependency information about the text.

Some sentences may be too long for the parser to handle, leading to exhaustion of the Java heap space. If this happens, the sentence is skipped and its dependencies are omitted.
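For example (the structure of the returned dictionary is not documented here, so the iteration below is only illustrative):

    import preprocess

    text = "The cat sat on the mat. It purred."
    deps = preprocess.extract_dependencies(text)
    # Inspect the result; the parse itself is delegated to the
    # stanford_parser module, and sentences that exhaust the Java
    # heap are silently skipped.
    for key, value in deps.items():
        print(key, value)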

preprocess.filter_tokens(tokens, min_size=0, special_chars=False)

Filter a list of tokens.

Any token with length below min_size is removed. If special_chars is True, all tokens containing special characters are also removed.
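For example, keeping only tokens of length 3 or more and dropping tokens with special characters (assuming punctuation counts as a special character):

    import preprocess

    tokens = ['a', 'cat', 'sat', 'on', 'the', 'mat!']
    filtered = preprocess.filter_tokens(tokens, min_size=3, special_chars=True)
    # e.g. ['cat', 'sat', 'the'] -- 'a' and 'on' are too short, 'mat!' contains '!'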

preprocess.fold_case(tokens)
Fold tokens to lower case.
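For example:

    import preprocess

    preprocess.fold_case(['NLTK', 'Tokens'])
    # -> ['nltk', 'tokens']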
preprocess.is_stop_word(token)
Check whether a particular token is a stop word.
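For example (assuming a typical English stop list, such as NLTK's):

    import preprocess

    preprocess.is_stop_word('the')     # True
    preprocess.is_stop_word('parser')  # False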
preprocess.preprocess_text(text, do_stop_word_removal=True, do_stemming=True, fold=True, specials=True, min_size=3)
Perform preprocessing steps on the input text.
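A typical call with the defaults (stop-word removal, stemming, case folding, special-character filtering, and a minimum token length of 3); the exact output depends on the stemmer and stop list NLTK provides, so the result below is only indicative:

    import preprocess

    tokens = preprocess.preprocess_text("The parsers were parsing sentences quickly.")
    # with a Porter-style stemmer, roughly: ['parser', 'pars', 'sentenc', 'quickli']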
preprocess.preprocess_token(token, do_stop_word_removal=True, do_stemming=True, fold=True, specials=True, min_size=3)
Perform preprocessing on a single token.
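The single-token counterpart. What it returns for a token that gets filtered out (a stop word, a too-short token, or one with special characters) is not documented here, so check the return value in that case:

    import preprocess

    preprocess.preprocess_token('Parsing')
    # -> 'pars' with case folding and a Porter-style stemmer (assumed)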
preprocess.remove_stop_words(tokens)
Remove all stop words from a list of tokens.
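For example (assuming NLTK's English stop list):

    import preprocess

    preprocess.remove_stop_words(['the', 'cat', 'sat', 'on', 'the', 'mat'])
    # -> ['cat', 'sat', 'mat']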
preprocess.stem(tokens)
Return a list of stemmed tokens.
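For example (assuming the stemmer is NLTK's Porter stemmer):

    import preprocess

    preprocess.stem(['running', 'runs', 'easily'])
    # -> ['run', 'run', 'easili']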
preprocess.tokenize_sentences(text)
Return a list of sentences (as strings) from the input text.
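For example (assuming NLTK's sentence tokenizer):

    import preprocess

    preprocess.tokenize_sentences("First sentence. Second one!")
    # -> ['First sentence.', 'Second one!']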
preprocess.tokenize_tokens(text)
Return a list of tokens from the input text string.
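For example (assuming NLTK's default word tokenizer, which splits contractions):

    import preprocess

    preprocess.tokenize_tokens("Don't split me.")
    # e.g. ['Do', "n't", 'split', 'me', '.']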
