Experiments
The experiment modules use the rest of the framework to evaluate different versions and aspects of the text network representations.
The various experiments are implemented as functions.
There are four modules.
The co_occurrence_experiments module holds experiments with regular co-occurrence networks, and higher_order_experiments holds experiments with higher order co-occurrence networks.
The dependency network representation is tested in dependency_experiments.
The general experiments module contains experiments concerned with several representations, as well as functions not directly tied to any representation, such as the dataset_stats() function.
Disclaimer: These modules are a mess and probably contain a lot of redundant code.
This is because they contain experiments constructed for specific purposes that are hard to predict ahead of time.
When done, the experiment functions are left as is, to be available for re-runs later if needed.
Many of the experiments led to changes in the representations and/or other parts of the code, but the experiments should hopefully still work as expected.
experiments Module
Module containing methods for experimenting with the various graph representations.
Experiments not particular to any single representation are put here,
e.g. comparisons of the representations, or tests of properties of the datasets.
-
experiments.classification_comparison_graph(dataset='reuters', graph_type='co-occurrence', icc=None)
Experiment used for comparative evaluation of different network
representations on classification.
graph_type = ‘co-occurrence’ | ‘dependency’
icc determines whether to use _inverse corpus centrality_ in the vector representations.
-
experiments.dataset_stats(dataset)
- Print and plot statistics for a given dataset.
A histogram is plotted with the document length distribution of the data.
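The document length distribution behind such a histogram can be sketched as follows. This is an illustration only, not the code of dataset_stats(): the function name document_length_histogram, the bin_size parameter, and the choice to measure length in whitespace-separated tokens are all assumptions.

```python
from collections import Counter


def document_length_histogram(documents, bin_size=100):
    """Count documents per length bin (hypothetical helper).

    Length is taken to be the number of whitespace-separated tokens,
    which is an assumption made for this sketch.
    """
    bins = Counter()
    for text in documents:
        length = len(text.split())
        # Key each bin by its lower bound, e.g. 0, 100, 200, ...
        bins[(length // bin_size) * bin_size] += 1
    return dict(sorted(bins.items()))


docs = ["a b c", "a b c d e f", "a " * 12]
hist = document_length_histogram(docs, bin_size=5)
```

The resulting mapping from bin lower bound to document count could then be handed to any plotting routine.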
-
experiments.do_classification_experiments(dataset='tasa/TASA900', graph_types=['co-occurrence', 'dependency', 'random'], use_frequency=True)
Experiment used for comparative evaluation of different network
representations on classification.
Toggle comparison with frequency-based methods using use_frequency.
-
experiments.do_retrieval_experiments(descriptions='air/problem_descriptions', solutions='air/solutions', graph_types=['co-occurrence', 'dependency', 'random'], use_frequency=True)
Experiment used for comparative evaluation of different network
representations on the retrieval task.
Toggle comparison with frequency-based methods using use_frequency.
-
experiments.plot_sentence_lengths(datafile=None)
- Function for plotting histogram of sentence lengths within a given dataset.
-
experiments.print_network_props()
- Prints a LaTeX table with various properties for networks created from
texts in the datasets.
-
experiments.retrieval_comparison_graph(dataset='air', graph_type='co-occurrence', use_icc=False)
Experiment used for comparative evaluation of different network
representations on retrieval.
graph_type = ‘co-occurrence’ | ‘dependency’
use_icc determines whether to use _inverse corpus centrality_ in the vector representations.
-
experiments.solution_similarity_stats(dataset='air/solutions_preprocessed')
- Plots histogram of solution-solution similarity distribution of a dataset.
co_occurrence_experiments Module
Module containing experiments created to evaluate and test various
incarnations of the co-occurrence network representation.
-
co_occurrence_experiments.complete_network(path='../data/air/problem_descriptions_text')
- Create and pickle to file a giant co-occurrence network for all documents
in the dataset pointed to by path.
-
co_occurrence_experiments.corpus_properties(dataset, context)
- Identify and pickle to file various properties of the given dataset.
These can later be converted to pretty tables using
print_network_props().
-
co_occurrence_experiments.do_context_sentence_evaluation_classification()
- Experiment evaluating performance of sentences as contexts for
co-occurrence networks in the classification task.
-
co_occurrence_experiments.do_context_sentence_evaluation_retrieval()
- Experiment evaluating performance of sentences as contexts for
co-occurrence networks in the retrieval task.
-
co_occurrence_experiments.do_context_size_evaluation_classification()
- Experiment evaluating performance of different context sizes for
co-occurrence networks in the classification task.
-
co_occurrence_experiments.do_context_size_evaluation_retrieval()
- Experiment evaluating performance of different context sizes for
co-occurrence networks in the retrieval task.
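To make the notion of context size concrete, the sketch below builds co-occurrence edges from a token list using a sliding window. It is a minimal illustration under stated assumptions, not the framework's own network construction: the name co_occurrence_edges, the context_size parameter, and the choice of undirected, count-weighted edges are hypothetical.

```python
from collections import defaultdict


def co_occurrence_edges(tokens, context_size=2):
    """Return undirected edge weights for a token sequence (hypothetical helper).

    Two distinct words are linked each time one occurs within
    context_size tokens after the other.
    """
    weights = defaultdict(int)
    for i, word in enumerate(tokens):
        # Look ahead at the next context_size tokens only, so each
        # pair of positions is counted once.
        for other in tokens[i + 1 : i + 1 + context_size]:
            if other != word:
                weights[tuple(sorted((word, other)))] += 1
    return dict(weights)


tokens = "the cat sat on the mat".split()
narrow = co_occurrence_edges(tokens, context_size=1)
wide = co_occurrence_edges(tokens, context_size=2)
```

A larger context links more word pairs, which is the trade-off the context size experiments examine.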
-
co_occurrence_experiments.print_degree_distributions(dataset, context)
Extracts degree distribution values from networks, and prints them to a
CSV file.
Warning: overwrites the file if it already exists.
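A degree distribution export of this kind might look like the following sketch. It assumes a network given as a plain adjacency mapping and writes to any file-like object. The names degree_distribution and write_distribution, and the two-column layout, are assumptions for illustration and do not describe the actual output format.

```python
import csv
import io
from collections import Counter


def degree_distribution(adjacency):
    """Map each degree to the number of nodes with that degree."""
    return Counter(len(neighbours) for neighbours in adjacency.values())


def write_distribution(distribution, handle):
    """Write (degree, count) rows as CSV to an open file-like object."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["degree", "count"])
    for degree, count in sorted(distribution.items()):
        writer.writerow([degree, count])


adjacency = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
buffer = io.StringIO()
write_distribution(degree_distribution(adjacency), buffer)
csv_text = buffer.getvalue()
```

Passing a file opened with mode "w" instead of the in-memory buffer would replace any existing file, matching the overwrite warning above.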
higher_order_experiments Module
Module containing experiments with higher order co-occurrence relations, as part of the
co-occurrence network representation.
-
higher_order_experiments.test_classification(orders=[1, 2, 3], order_weights=[1.0, 1.53, 1.51])
Test classification using different combinations of higher orders and weightings of these.
The list orders defines which higher order relations to include.
The relative importance of the orders is defined by order_weights.
-
higher_order_experiments.test_combinations()
- Test all combinations of higher orders with classification and retrieval.
-
higher_order_experiments.test_retrieval(orders=[1, 2, 3], order_weights=[1.0, 1.53, 1.51])
Test retrieval using different combinations of higher orders and weightings of these.
The list orders defines which higher order relations to include.
The relative importance of the orders is defined by order_weights.
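One possible reading of how orders and order_weights pair up is sketched below. It assumes that an order-n relation links two words whose shortest path in the first-order co-occurrence network has length n, which may differ from the definition used by the framework. The function higher_order_weights and its arguments other than orders and order_weights are hypothetical.

```python
from collections import deque


def higher_order_weights(adjacency, source, orders, order_weights):
    """Weight nodes by their shortest-path distance from source (sketch).

    A node at distance d receives the weight paired with order d.
    Nodes whose distance is not listed in orders are left out.
    """
    weight_of = dict(zip(orders, order_weights))
    max_order = max(orders)
    distance = {source: 0}
    queue = deque([source])
    # Breadth-first search, stopping once max_order is reached.
    while queue:
        node = queue.popleft()
        if distance[node] == max_order:
            continue
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return {n: weight_of[d] for n, d in distance.items() if d in weight_of}


adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
weights = higher_order_weights(
    adjacency, "a", orders=[1, 2, 3], order_weights=[1.0, 1.53, 1.51]
)
skipped = higher_order_weights(adjacency, "a", orders=[1, 3], order_weights=[1.0, 1.51])
```

Leaving an order out of orders drops the corresponding relations entirely, while order_weights only rescales the ones that are kept.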
-
higher_order_experiments.test_vocabulary_size(path='../data/air/problem_descriptions_preprocessed')
- Print vocabulary sizes for documents in dataset.
dependency_experiments Module
Experiments with various aspects of the dependency network representation.
-
dependency_experiments.centrality_weights_classification(weighted=True)
- Evaluate whether edge weights are beneficial to the dependency
network representation for the classification task.
-
dependency_experiments.centrality_weights_retrieval(weighted=True)
- Evaluate whether edge weights are beneficial to the dependency
network representation for the retrieval task.
-
dependency_experiments.corpus_dependency_properties(dataset='air/problem_descriptions')
- Identify and pickle to file various properties of the given dataset.
These can later be converted to pretty tables using
print_network_props().
-
dependency_experiments.corpus_properties(dataset)
- Identify and pickle to file various properties of the given dataset.
These can later be converted to pretty tables using
print_network_props().
-
dependency_experiments.edge_direction_evaluation(direction)
Evaluate impact of using different edge directions on dependency networks.
Values for direction: forward, backward, and undirected.
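The three direction values could translate into edges roughly as in this sketch. The mapping of forward to head-to-dependent edges is an assumption made for illustration, and the helper orient_edges and its input format of (head, dependent) pairs are hypothetical.

```python
def orient_edges(dependencies, direction):
    """Turn (head, dependent) pairs into edges for the given direction.

    Assumes 'forward' means head -> dependent; the framework may define
    the directions the other way round.
    """
    if direction == "forward":
        return [(head, dep) for head, dep in dependencies]
    if direction == "backward":
        return [(dep, head) for head, dep in dependencies]
    if direction == "undirected":
        # Represent an undirected edge as a pair of opposite directed edges.
        edges = set()
        for head, dep in dependencies:
            edges.add((head, dep))
            edges.add((dep, head))
        return sorted(edges)
    raise ValueError("unknown direction: %r" % (direction,))


dependencies = [("sat", "cat"), ("sat", "mat")]
forward = orient_edges(dependencies, "forward")
backward = orient_edges(dependencies, "backward")
undirected = orient_edges(dependencies, "undirected")
```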
-
dependency_experiments.evaluate_dep_type_sets()
Evaluation of various sets of dependency relations.
Each set is excluded from the representation, and the performance recorded.
The best strategy is to exclude those dependencies whose removal leads to the
greatest improvement for the representation.
-
dependency_experiments.evaluate_dep_types()
- Leave-one-out evaluation of the various dependency types from the Stanford parser.
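The leave-one-out scheme can be sketched generically as below. The scoring function is a toy stand-in supplied by the caller, the per-type contributions are made-up illustration values rather than results, and the names leave_one_out and toy_score are hypothetical.

```python
def leave_one_out(dep_types, evaluate):
    """Score the effect of removing each dependency type in turn.

    evaluate is any callable taking a frozenset of included types and
    returning a performance score (higher is better). The result maps
    each type to the change in score when that type is excluded.
    """
    all_types = frozenset(dep_types)
    baseline = evaluate(all_types)
    return {t: evaluate(all_types - {t}) - baseline for t in dep_types}


# Toy scoring function with made-up contributions, for illustration only.
contribution = {"nsubj": 0.25, "dobj": 0.25, "det": -0.125}


def toy_score(types):
    return sum(contribution[t] for t in types)


deltas = leave_one_out(["nsubj", "dobj", "det"], toy_score)
best_to_exclude = max(deltas, key=deltas.get)
```

A positive value means the representation scored better without that type, so it is a candidate for exclusion.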
-
dependency_experiments.plot_exp1()
- Plotting the results of the weight evaluation experiment.
-
dependency_experiments.plot_type_evaluation()
- Plot results from the evaluate_dep_types() experiment.
-
dependency_experiments.plot_type_sets_evaluation()
- Plot results from the evaluate_dep_type_sets() experiment.
-
dependency_experiments.print_common_hub_words(rem_stop_words)
Print a list of the most common hub words in the created networks.
Purpose of experiment was to show that hub words typically are stop words.
The rem_stop_words parameter determines whether stop words are removed before creating
the networks.
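Counting common hub words across networks might be done along these lines. The sketch treats a hub as one of the highest-degree nodes in a network, which is an assumption, and the names hub_words, common_hubs, and top_n are hypothetical.

```python
from collections import Counter


def hub_words(adjacency, top_n=1):
    """Return the top_n highest-degree words of one network.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(adjacency, key=lambda w: (-len(adjacency[w]), w))
    return ranked[:top_n]


def common_hubs(networks, top_n=1):
    """Count how often each word appears among the hubs of the networks."""
    counts = Counter()
    for adjacency in networks:
        counts.update(hub_words(adjacency, top_n))
    return counts


networks = [
    {"the": {"cat", "mat", "sat"}, "cat": {"the"}, "mat": {"the"}, "sat": {"the"}},
    {"the": {"dog", "ran"}, "dog": {"the"}, "ran": {"the"}},
]
hubs = common_hubs(networks, top_n=1)
```

In this toy input the stop word "the" is the hub of both networks, which mirrors the pattern the experiment was meant to show.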
-
dependency_experiments.print_degree_distributions(dataset)
Extracts degree distribution values from networks, and prints them to a
CSV file.
Warning: overwrites the file if it already exists.
-
dependency_experiments.print_hubs()
- Print results from print_common_hub_words() as a LaTeX table.
-
dependency_experiments.stanford_example()
- Example/test of the Stanford parser.
-
dependency_experiments.stop_word_evaluation(rem_stop_words)
- Experiment for determining what effect removing stop words has on
dependency networks.