Modules

This is a brief overview of the different modules. For more details, see each module’s page.

data

Module for reading and writing case files.

The read_* methods read cases in various formats from a dataset. To convert a dataset between formats, use the appropriate create_dataset_* function. It is also possible to provide custom conversion functions to the create_dataset() function.

The following formats are supported for dataset conversion:

HTML: Expected to be formatted similarly to the AIR dataset reports for conversion to cases. Conversion to text/dependencies should work regardless.
Text: Raw text. If extracted from HTML, anything within <p> tags.
Preprocessed text: Processed using the default parameters from preprocess.preprocess_text().
Dependencies: As defined by the Stanford dependency parser.

The module expects to work with datasets structured so that each category is in a separate subfolder named after the category.
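
To illustrate that layout, here is a minimal sketch of traversing such a dataset with plain Python; read_dataset() below is a hypothetical stand-in for the module's read_* methods, whose actual signatures may differ.

    import os

    def read_dataset(path):
        # Hypothetical stand-in for the module's read_* methods: each
        # subfolder of `path` is a category, each file in it one case.
        cases = {}
        for category in sorted(os.listdir(path)):
            subfolder = os.path.join(path, category)
            if not os.path.isdir(subfolder):
                continue
            cases[category] = []
            for filename in sorted(os.listdir(subfolder)):
                with open(os.path.join(subfolder, filename)) as f:
                    cases[category].append(f.read())
        return cases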

Author: Kjetil Valle <kjetilva@stud.ntnu.no>

report_data

Helper module for data, used to extract problem description and solution parts from cases.

Parses reports formatted in HTML, structured like those in the AIR dataset, and splits them into the problem description and solution parts of textual CBR cases. Solutions are identified based on section titles in the reports: titles matching words such as ‘finding’ or ‘conclusion’ are considered part of the solution.

The remaining parts of the report make up the problem description by default.
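
The split can be pictured as a simple keyword match over section titles. This is a hedged sketch of the heuristic described above; the keyword list and the split_case() helper are assumptions, not the module's actual code.

    SOLUTION_KEYWORDS = ('finding', 'conclusion')  # assumed, likely incomplete

    def split_case(sections):
        # `sections` is a list of (title, text) pairs from one HTML report.
        # Titles matching a solution keyword go to the solution; the rest
        # make up the problem description.
        problem, solution = [], []
        for title, text in sections:
            matched = any(w in title.lower() for w in SOLUTION_KEYWORDS)
            (solution if matched else problem).append(text)
        return ' '.join(problem), ' '.join(solution)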

Author: Gleb Sizov <sizov@idi.ntnu.no>

preprocess

Toolbox of functions for preprocessing text.

The module contains methods for a variety of preprocessing tasks, such as filtering out words with special characters, stemming, stop-word removal, case folding and more, as well as functions for splitting text into lists of tokens or sentences. Use preprocess_text() and preprocess_token() for full preprocessing.

Extraction of within-sentence word dependencies is also available through the extract_dependencies() function, which works as an interface to the stanford_parser module.

The Natural Language Toolkit (NLTK) is used for most of the heavy lifting.
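
As a rough approximation of what preprocess_text() does with its default parameters (an assumption; the actual defaults are documented on the module's page), the core steps can be expressed directly with NLTK:

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    def rough_preprocess(text):
        # Case folding, tokenization, stop-word removal and stemming.
        # Requires the NLTK 'punkt' and 'stopwords' data to be installed.
        stemmer = PorterStemmer()
        stop = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(text.lower())
        return [stemmer.stem(t) for t in tokens
                if t.isalpha() and t not in stop]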

Author: Kjetil Valle <kjetilva@stud.ntnu.no>

stanford_parser

Python interface to the Stanford parser.

The module wraps the edu.stanford.nlp Stanford Parser, which is implemented in Java, using the JPype library.

The StanfordParser class wraps the actual parser, and the parse() function can be used to parse sentences.
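
The JPype bridge is set up along these lines; this is a sketch only, where the jar path is an assumption, and the module's StanfordParser class hides these details behind its own interface:

    import jpype

    # Start a JVM with the parser jar on the classpath (path is an assumption).
    jpype.startJVM(jpype.getDefaultJVMPath(),
                   '-Djava.class.path=stanford-parser.jar')

    # Java packages become accessible as attribute chains through JPype,
    # which is how the wrapped edu.stanford.nlp classes are reached.
    nlp = jpype.JPackage('edu').stanford.nlp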

Author: Gleb Sizov <sizov@idi.ntno.no>

freq_representation

Functions for creating frequency-based feature vectors from text.

The function of interest is text_to_vector(), which creates term frequency (TF) or term frequency-inverse document frequency (TF-IDF) vectors from lists of documents. Results are output in the form of a term-document matrix.
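
The kind of matrix produced can be illustrated with a bare-bones TF-IDF computation; a sketch of the idea, not the module's actual weighting scheme:

    import math
    from collections import Counter

    def tf_idf_matrix(docs):
        # `docs` is a list of tokenized documents. Rows of the returned
        # matrix are terms, columns are documents (a term-document matrix).
        vocab = sorted({t for doc in docs for t in doc})
        df = Counter(t for doc in docs for t in set(doc))
        n = len(docs)
        return vocab, [[doc.count(term) * math.log(n / df[term])
                        for doc in docs] for term in vocab]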

Author: Kjetil Valle <kjetilva@stud.ntnu.no>

graph_representation

Construct graph representations from text.

The module contains functions for creating networks based on text documents, and for converting the networks into feature vectors. Feature vectors are created based on node centrality in the text networks.

The following text representations are supported:

random: Creates a network with all distinct terms in the provided document as nodes. Edges are created at random between the nodes, based on provided probabilities.
co-occurrence: Distinct terms in the document are used as nodes. Edges are created between terms that occur close together in the text.
dependency: Words are used as nodes. Edges represent dependencies extracted from the text using the Stanford dependency parser (see the stanford_parser module).

The module makes heavy use of the graph module.
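
For instance, the co-occurrence representation can be sketched with networkx as follows; the window size here is an assumption, not the module's default:

    import networkx as nx

    def co_occurrence_graph(tokens, window=2):
        # Distinct terms become nodes; terms within `window` positions of
        # each other in the token stream are connected by an edge.
        graph = nx.Graph()
        graph.add_nodes_from(set(tokens))
        for i, term in enumerate(tokens):
            for other in tokens[i + 1:i + window + 1]:
                if other != term:
                    graph.add_edge(term, other)
        return graph

    g = co_occurrence_graph('the cat sat on the cat mat'.split())
    features = nx.degree_centrality(g)  # node centralities as feature values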

Author: Kjetil Valle <kjetilva@stud.ntnu.no>

graph

Toolbox module for working with networkx graphs.

The module contains functions for calculating graph centrality, visualizing graphs and finding various network properties, in addition to other useful functions.

Graph centralities are accessed using the centralities() function, which takes as arguments a graph and the metric to use, given as a constant from the GraphMetrics class.
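
The metrics themselves correspond to standard networkx centrality measures along these lines; the mapping from GraphMetrics constants to specific networkx calls is an assumption here:

    import networkx as nx

    g = nx.karate_club_graph()  # any networkx graph will do
    # Examples of measures centralities() can dispatch to;
    # each returns a dict mapping node -> centrality score.
    degree = nx.degree_centrality(g)
    pagerank = nx.pagerank(g)
    closeness = nx.closeness_centrality(g)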

Author: Kjetil Valle <kjetilva@stud.ntnu.no>

classify

Classification of feature vectors using a KNN classifier.

The KNN class contains the classifier. It can classify() new data points once it has been trained using the train() method. The test() method provides a way to classify many vectors at once, and returns the classifier's accuracy compared to a gold standard.
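
The train/classify/test flow can be illustrated with a minimal k-NN over dense vectors; a sketch of the idea, not the module's implementation:

    import numpy as np

    class SketchKNN:
        def __init__(self, k=3):
            self.k = k

        def train(self, vectors, labels):
            # Store training vectors (one per row) and their labels.
            self.vectors = np.asarray(vectors, dtype=float)
            self.labels = np.asarray(labels)

        def classify(self, vector):
            # Majority vote among the k nearest training vectors.
            dists = np.linalg.norm(self.vectors - vector, axis=1)
            nearest = self.labels[np.argsort(dists)[:self.k]]
            values, counts = np.unique(nearest, return_counts=True)
            return values[np.argmax(counts)]

        def test(self, vectors, gold):
            # Fraction of test vectors classified as in the gold standard.
            hits = sum(self.classify(v) == g for v, g in zip(vectors, gold))
            return hits / float(len(gold))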

Author: Kjetil Valle <kjetilva@stud.ntnu.no>

retrieval

Evaluation method based on case retrieval.

Evaluate lists of cases with evaluate_retrieval(). For each problem description, the remaining descriptions are assessed, and the solution corresponding to the best-matching description is retrieved. The actual solution is compared to the retrieved solution using the cosine similarity of the solution vectors.

The overall evaluation score is equal to the average solution-solution similarity over the case base.
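
A condensed sketch of this leave-one-out procedure, assuming the descriptions and solutions are given as paired numeric vectors:

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieval_score(descriptions, solutions):
        # For each case, retrieve the solution whose problem description
        # best matches this one, then score retrieved vs. actual solution.
        scores = []
        for i in range(len(descriptions)):
            best = max((j for j in range(len(descriptions)) if j != i),
                       key=lambda j: cosine(descriptions[i], descriptions[j]))
            scores.append(cosine(solutions[i], solutions[best]))
        return sum(scores) / len(scores)  # average over the case base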

Author: Kjetil Valle <kjetilva@stud.ntnu.no>

evaluation

Module containing methods for evaluating representations.

This module acts as an interface to the classify and retrieval modules, providing evaluation through the evaluate_classification() and evaluate_retrieval() functions, respectively.
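
Usage then reduces to calls of roughly this shape; a hypothetical sketch where the argument names and exact signatures are assumptions:

    # Hypothetical call pattern inferred from the description above.
    from evaluation import evaluate_classification, evaluate_retrieval

    accuracy = evaluate_classification(vectors, labels)       # via classify
    similarity = evaluate_retrieval(problems, solutions)      # via retrieval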

Author: Kjetil Valle <kjetilva@stud.ntnu.no>

plotter

Utility functions facilitating easy plotting with matplotlib.

Functions of note:

  • plot(): plot a regular plot, given input x,y-coordinates.
  • bar_graph(): plot a horizontal bar graph from x-coordinates and named groups of lists of y-coordinates.
  • histogram(): plot a histogram from a set of samples and a given number of bins.
  • plot_degree_distribution(): plot the degree distribution provided a networkx graph.
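
These utilities facilitate matplotlib calls of roughly this shape, shown here directly in matplotlib; the wrappers' own signatures are documented on the module page:

    import matplotlib.pyplot as plt

    xs = [1, 2, 3, 4]
    plt.plot(xs, [x * x for x in xs])        # what plot() facilitates
    plt.hist([1, 1, 2, 3, 5, 8], bins=4)     # what histogram() facilitates
    plt.show()
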
Author: Kjetil Valle <kjetilva@stud.ntnu.no>

util

Module containing miscellaneous utility functions without a home anywhere else.

Author: Kjetil Valle <kjetilva@stud.ntnu.no>
