Learning pattern models from examples
The aim of this thesis is twofold. Firstly, it is a survey of some of the most prevalent pattern models used in motif discovery algorithms. The main goal of the survey is to see how well these models, with all their structural differences and varying levels of complexity and flexibility, are actually able to represent binding site motifs. This is done in an attempt to map the advantages and disadvantages of applying a given pattern model to motif discovery tasks, and to see whether any of the models separates itself from the rest (either positively or negatively). To get fair results, the models are placed within a framework that uses an exhaustive search to find best-case patterns from each model. These patterns are then compared to see if differences can be found in the models' ability to a) separate motif instances from background (separation) and b) predict previously unknown motif instances (prediction).

However, such exhaustive searching usually takes a very long time, and it becomes necessary to find ways to speed up the process. Thus, the second objective of the thesis is to optimize the search for all three pattern models so that the optimal pattern of each model can be found within a reasonable timeframe.

Regarding the first goal, it seems clear that, provided it can be trained correctly, the PWM outshines both the mismatch expression and the IUPAC string. Both the separation and the prediction performance of the PWM were quite good, even though the basic PWM algorithm simply generates a profile of all the instances in the positive set. Regarding the second goal, the final algorithms for the mismatch expression and the IUPAC string were both able to find optimal patterns very quickly, even for very large values of m. There was little point in looking for ways to speed up the PWM algorithm, as its running time was already very short.
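To make the three pattern models concrete, the following minimal Python sketch shows one common way each of them can decide whether a candidate site matches. It is an illustration under stated assumptions, not the thesis's implementation: the function names, the pseudocount, the uniform background frequency of 0.25 and the log-odds scoring are all hypothetical choices, and the example sequences are made up. The PWM is built as a plain profile of the positive instances, in the spirit of the basic algorithm described above.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

# Standard IUPAC nucleotide codes mapped to the bases each code allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_matches(pattern, site):
    """IUPAC string: every position must be one of the bases its code allows."""
    if len(pattern) != len(site):
        return False
    return all(base in IUPAC[code] for code, base in zip(pattern, site))


def mismatch_matches(consensus, site, max_mismatches):
    """Mismatch expression: a consensus string plus an allowed number of mismatches."""
    if len(consensus) != len(site):
        return False
    mismatches = sum(c != b for c, b in zip(consensus, site))
    return mismatches <= max_mismatches


def build_pwm(instances, pseudocount=1.0, background=0.25):
    """PWM as a profile of the positive instances (log-odds per position).

    Assumptions (not from the source): equal-length instances, a uniform
    background and an additive pseudocount to avoid log(0).
    """
    total = len(instances) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(len(instances[0])):
        counts = Counter(seq[i] for seq in instances)
        column = {
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in ALPHABET
        }
        pwm.append(column)
    return pwm


def pwm_score(pwm, site):
    """Sum of the per-position log-odds weights for the given site."""
    return sum(column[base] for column, base in zip(pwm, site))


# Hypothetical positive set, used only for illustration.
positives = ["TATAAA", "TATATA", "TATAAG"]
pwm = build_pwm(positives)
print(iupac_matches("TATAWR", "TATAAA"))
print(mismatch_matches("TATAAA", "TATACA", 1))
print(pwm_score(pwm, "TATAAA") > pwm_score(pwm, "GCGCGC"))
```

The IUPAC string and the mismatch expression give a yes/no answer, whereas the PWM gives a graded score that needs a threshold before it can separate instances from background. That difference in expressiveness is one plausible reason a well-trained PWM could outperform the other two models.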