Module for reading and writing case files.
The read_* functions read cases in various formats from a dataset. To convert a dataset between formats, use the appropriate create_dataset_* function. Custom conversion functions can also be passed to the create_dataset() function.
The following formats are supported for dataset conversion:
HTML: | Expected to be formatted similarly to AIR dataset reports for conversion to cases. Conversion to text/dependencies should work regardless. |
---|---|
Text: | Raw text. Anything within p elements if extracted from HTML. |
Preprocessed text: | Processed using the default parameters from preprocess.preprocess_text(). |
Dependencies: | As defined by the Stanford dependency parser. |
The module expects to work with datasets structured so that each category is in a separate subfolder named after the category.
Author: | Kjetil Valle <kjetilva@stud.ntnu.no> |
---|
Create a new dataset in target_path from data in base_path. Every file in base_path is processed using the function processing_fn, and the result is stored under target_path.
base_path is the path to the data to be processed and turned into a new dataset. target_path is the name/path of the new dataset. The processing_fn function is used for processing each document. processing_fn must take a string as input and return a string.
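The behaviour described above can be illustrated with a minimal Python 3 sketch. This is not the module's actual implementation: the name create_dataset_sketch is hypothetical, and the sketch assumes that the category subfolders of base_path are mirrored under target_path and that files are UTF-8 text.

```python
import os


def create_dataset_sketch(base_path, target_path, processing_fn):
    """Hypothetical stand-in for create_dataset(), for illustration only.

    Walks base_path, applies processing_fn (str -> str) to the text of
    every file, and writes the result to the same relative location
    under target_path (assumption: subfolder layout is preserved).
    """
    for root, _dirs, files in os.walk(base_path):
        # Recreate the current subfolder under target_path.
        rel_dir = os.path.relpath(root, base_path)
        out_dir = os.path.normpath(os.path.join(target_path, rel_dir))
        os.makedirs(out_dir, exist_ok=True)
        for name in files:
            # Assumption: every file is UTF-8 encoded text.
            with open(os.path.join(root, name), encoding="utf-8") as src:
                text = src.read()
            with open(os.path.join(out_dir, name), "w", encoding="utf-8") as dst:
                dst.write(processing_fn(text))
```

Any callable that maps a string to a string can serve as processing_fn. For example, passing str.lower would produce a lower-cased copy of the dataset.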
Read dataset/cases from files.
Files are read recursively from in_path, with subdirectory names used as labels.
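As a rough illustration of recursive reading with directory-derived labels, consider the following Python 3 sketch. The function name read_files_sketch and its return shape are assumptions made for this example, not the module's API. It also assumes that the label of a file is the name of the directory that directly contains it, and that files are UTF-8 text.

```python
import os


def read_files_sketch(in_path):
    """Hypothetical reader, for illustration only.

    Returns two parallel lists: the text of each file found below
    in_path, and the name of the directory containing that file
    (assumption: this directory name is the label).
    """
    texts, labels = [], []
    for root, dirs, files in os.walk(in_path):
        dirs.sort()  # visit subdirectories in a deterministic order
        label = os.path.basename(os.path.normpath(root))
        for name in sorted(files):
            # Assumption: every file is UTF-8 encoded text.
            with open(os.path.join(root, name), encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(label)
    return texts, labels
```

With one subfolder per category, as the module expects, each document ends up paired with its category name.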