data

Module for reading and writing case files.

The read_* functions read cases in various formats from a dataset. To convert a dataset between formats, use the appropriate create_dataset_* function. Custom conversion functions can also be passed to create_dataset().

The following formats are supported for dataset conversion:

HTML: Expected to be formatted similarly to AIR dataset reports for conversion to cases. Conversion to text/dependencies should work regardless.
Text: Raw text. If extracted from HTML, anything within p elements.
Preprocessed text: Processed using the default parameters from preprocess.preprocess_text().
Dependencies: As defined by the Stanford dependency parser.

The module expects to work with datasets structured so that each category is in a separate subfolder named after the category.

Author: Kjetil Valle <kjetilva@stud.ntnu.no>

data.create_dataset(base_path, target_path, processing_fn)

Create a new dataset in target_path from the data in base_path. Every file in base_path is processed using the function processing_fn and then stored under target_path.

base_path is the path to the data to be processed and turned into a new dataset. target_path is the name/path of the new dataset. processing_fn is applied to each document; it must take a string as input and return a string.
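
The description above can be illustrated with a minimal sketch. This is not the module's actual implementation: the name create_dataset_sketch, the use of os.walk, the UTF-8 encoding, and the mirroring of category subfolders under target_path are all assumptions made for illustration.

```python
import os


def create_dataset_sketch(base_path, target_path, processing_fn):
    """Hypothetical stand-in for data.create_dataset(); not the module's code."""
    for dirpath, _dirnames, filenames in os.walk(base_path):
        # Assumption: category subfolders are mirrored under target_path.
        rel_dir = os.path.relpath(dirpath, base_path)
        out_dir = os.path.normpath(os.path.join(target_path, rel_dir))
        os.makedirs(out_dir, exist_ok=True)
        for name in filenames:
            # Assumption: documents are UTF-8 text files.
            with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                text = f.read()
            # processing_fn takes a string and returns a string.
            with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
                f.write(processing_fn(text))
```

Under these assumptions, a call such as create_dataset_sketch("reports_text", "reports_upper", str.upper) would write an upper-cased copy of every document; both paths are placeholders.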

data.create_dataset_html_to_case(base_path, target_path)
Convert dataset: HTML to CBR case
data.create_dataset_html_to_text(base_path, target_path)
Convert dataset: HTML to text
data.create_dataset_text_to_dependencies(base_path, target_path)
Convert dataset: Text to Stanford dependencies
data.create_dataset_text_to_preprocessed_text(base_path, target_path)
Convert dataset: Text to preprocessed text
data.fix_ascii(path)
Test whether documents in dataset are ASCII-encoded
data.get_file_names(in_path)
Retrieve list of file/case names matching dataset path.
data.pickle_from_file(filename, suppress_warning=False)
Read file and unpickle contents
data.pickle_to_file(data, filename)
Pickle contents and dump to file
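
A minimal sketch of the pickle round trip that pickle_to_file and pickle_from_file describe, using only the standard pickle module. The function names below are hypothetical, and the suppress_warning parameter is left out because its behaviour is not documented here.

```python
import pickle


def pickle_to_file_sketch(data, filename):
    """Hypothetical equivalent of pickle_to_file: serialize data to filename."""
    with open(filename, "wb") as f:
        pickle.dump(data, f)


def pickle_from_file_sketch(filename):
    """Hypothetical equivalent of pickle_from_file: load a pickled object."""
    with open(filename, "rb") as f:
        return pickle.load(f)
```

Only unpickle files from a trusted source, since pickle.load can execute arbitrary code while loading.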
data.read_file(file_path, verbose=False)
Fetch contents of a single file from file_path.
data.read_files(in_path, unpickle_content=False, verbose=False)

Read dataset/cases from files.

Files are read recursively from in_path, with subdirectory names used as labels.
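
The recursive read with subdirectory names as labels might look roughly like the sketch below. The function name, the return value (a list of (label, text) pairs), the UTF-8 encoding, and the decision to skip files placed directly in in_path are assumptions; the actual return format of read_files is not specified here.

```python
import os


def read_files_sketch(in_path):
    """Hypothetical illustration: return (label, text) pairs for a dataset."""
    cases = []
    for dirpath, dirnames, filenames in os.walk(in_path):
        dirnames.sort()  # make the traversal order deterministic
        if os.path.relpath(dirpath, in_path) == ".":
            # Assumption: files directly in in_path have no category.
            continue
        # The name of the containing subdirectory serves as the label.
        label = os.path.basename(dirpath)
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                cases.append((label, f.read()))
    return cases
```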

data.test_ascii(path='../data/air/reports_text')
Test whether documents in dataset are ASCII-encoded
data.write_to_file(data, filename, mode='a+')
Dump data to file
