Improved benchmarks for computational motif discovery

Sandve, Geir Kjetil; Abul, Osman; Walseng, Vegard; Drabløs, Finn

dc.contributor.author	Sandve, Geir Kjetil
dc.contributor.author	Abul, Osman
dc.contributor.author	Walseng, Vegard
dc.contributor.author	Drabløs, Finn
dc.date.accessioned	2015-09-21T11:19:25Z
dc.date.accessioned	2016-06-09T11:32:13Z
dc.date.available	2015-09-21T11:19:25Z
dc.date.available	2016-06-09T11:32:13Z
dc.date.issued	2007-06-08
dc.identifier.citation	BMC Bioinformatics 2007, 8(193)	nb_NO
dc.identifier.issn	1471-2105
dc.identifier.uri	http://hdl.handle.net/11250/2392024
dc.description.abstract	Background: An important step in annotation of sequenced genomes is the identification of transcription factor binding sites. More than a hundred different computational methods have been proposed, and it is difficult to make an informed choice. Therefore, robust assessment of motif discovery methods becomes important, both for validation of existing tools and for identification of promising directions for future research. Results: We use a machine learning perspective to analyze collections of transcription factors with known binding sites. Algorithms are presented for finding position weight matrices (PWMs), IUPAC-type motifs and mismatch motifs with optimal discrimination of binding sites from remaining sequence. We show that for many data sets in a recently proposed benchmark suite for motif discovery, none of the common motif models can accurately discriminate the binding sites from remaining sequence. This may obscure the distinction between the potential performance of the motif discovery tool itself versus the intrinsic complexity of the problem we are trying to solve. Synthetic data sets may avoid this problem, but we show on some previously proposed benchmarks that there may be a strong bias towards a presupposed motif model. We also propose a new approach to benchmark data set construction. This approach is based on collections of binding site fragments that are ranked according to the optimal level of discrimination achieved with our algorithms. This allows us to select subsets with specific properties. We present one benchmark suite with data sets that allow good discrimination between positive and negative instances with the common motif models. These data sets are suitable for evaluating algorithms for motif discovery that rely on these models. We present another benchmark suite where PWM, IUPAC and mismatch motif models are not able to discriminate reliably between positive and negative instances. This suite could be used for evaluating more powerful motif models. Conclusion: Our improved benchmark suites have been designed to differentiate between the performance of motif discovery algorithms and the power of motif models. We provide a web server where users can download our benchmark suites, submit predictions and visualize scores on the benchmarks.	nb_NO
dc.language.iso	eng	nb_NO
dc.publisher	BioMed Central	nb_NO
dc.rights	Attribution 3.0 Norway	*
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/no/	*
dc.title	Improved benchmarks for computational motif discovery	nb_NO
dc.type	Journal article	nb_NO
dc.type	Peer reviewed	nb_NO
dc.date.updated	2015-09-21T11:19:25Z
dc.source.volume	8	nb_NO
dc.source.journal	BMC Bioinformatics	nb_NO
dc.source.issue	193	nb_NO
dc.identifier.doi	10.1186/1471-2105-8-193
dc.identifier.cristin	369853
dc.description.localcode	© 2007 Sandve et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.	nb_NO
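
The abstract refers to IUPAC-type and mismatch motifs that discriminate binding sites from remaining sequence. As a rough illustration of what such discrimination means, the sketch below matches an IUPAC motif against fixed-length windows, optionally allowing mismatches, and reports sensitivity and specificity. It is not the authors' algorithm or scoring function: the function names, the example motif, and the positive and negative windows are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: not the algorithm or scoring used in the paper.
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatches(motif, window):
    """Count positions where the window base is not allowed by the motif symbol."""
    return sum(base not in IUPAC[symbol] for symbol, base in zip(motif, window))


def matches(motif, window, max_mismatches=0):
    """True if the window has the motif's length and at most max_mismatches mismatches."""
    return len(window) == len(motif) and mismatches(motif, window) <= max_mismatches


def discrimination(motif, positives, negatives, max_mismatches=0):
    """Return (sensitivity, specificity) of the motif over labelled windows."""
    true_pos = sum(matches(motif, w, max_mismatches) for w in positives)
    true_neg = sum(not matches(motif, w, max_mismatches) for w in negatives)
    return true_pos / len(positives), true_neg / len(negatives)


# Hypothetical data: windows labelled as binding sites and as background.
positives = ["TGACTCA", "TGAGTCA", "TGACTAA"]
negatives = ["ACGTACG", "TTTTTTT", "TGACGGG"]

# Pure IUPAC motif (no mismatches allowed).
sens, spec = discrimination("TGASTCA", positives, negatives)
# Same motif used as a mismatch motif, tolerating one mismatch.
sens1, spec1 = discrimination("TGASTCA", positives, negatives, max_mismatches=1)

print(f"IUPAC motif:      sensitivity={sens:.2f}, specificity={spec:.2f}")
print(f"1-mismatch motif: sensitivity={sens1:.2f}, specificity={spec1:.2f}")
```

On this toy data the strict IUPAC motif misses one positive window, while allowing a single mismatch recovers it without matching any negative window. A benchmark in the spirit of the abstract would search over many candidate motifs for the one with the best discrimination, which this sketch does not attempt.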

Associated file(s)

Filename: 1471-2105-8-193.pdf
Size: 908.9Kb
Format: PDF

This item appears in the following collection(s)

Institutt for datateknologi og informatikk [6552]
Institutt for klinisk og molekylær medisin [3426]
Publikasjoner fra CRIStin - NTNU [37221]

Unless otherwise stated, this item is licensed under Attribution 3.0 Norway