RAFT - Real And False TFBSs
Abstract
Most methods for predicting potential DNA binding sites for a specific transcription factor (TF) use a model of the transcription factor binding site (TFBS) and compare each position of a DNA sequence (e.g. a genome) against this model. Any position with a significant score against the model may then be classified as a potential binding site. Common models include consensus sequences, hidden Markov models (HMMs) and position weight matrices (PWMs).

The main problem with this approach is that it generates a large number of false positive TFBS predictions. It has been estimated that in most cases the predictions will be completely dominated by false positives.

This project will try to develop a context-sensitive approach for identifying real binding sites for a given TF, independent of cell type. The basic assumption is that real TFBSs are found in a suitable genomic context, whereas random binding sites lack any common context. The idea is to use properties that can be associated with regulatory regions to develop a classifier for PWM-based TFBS predictions. Using a machine learning approach, the classifier will (hopefully) remove most false positive TFBS predictions.
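To make the PWM-based prediction step concrete, the following is a minimal sketch of scanning a sequence with a log-odds PWM and keeping positions that score above a threshold. It is an illustration only, not the project's implementation: the count matrix, the uniform background frequency, the pseudocount, the thresholds and the function names `build_pwm` and `scan` are all assumptions made for this example.

```python
import math

# Hypothetical count matrix for a 4 bp motif (one dict per motif position).
# These numbers are invented for illustration; the consensus is TGAC.
COUNTS = [
    {"A": 1, "C": 0, "G": 1, "T": 8},
    {"A": 1, "C": 1, "G": 8, "T": 0},
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 8, "G": 1, "T": 1},
]


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds scores against a uniform background (assumed)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


pwm = build_pwm(COUNTS)
sequence = "AATGACGGTGACTT"
strict_hits = scan(sequence, pwm, threshold=4.0)
loose_hits = scan(sequence, pwm, threshold=0.0)
for start, window, score in strict_hits:
    print(start, window, round(score, 2))
```

Lowering the threshold can only add hits, never remove them, which is how a permissive cutoff on a whole genome ends up dominated by false positives. In the approach described above, each hit would additionally be described by features of its genomic context and passed to a classifier; those features and the classifier are not specified here, so they are left out of the sketch.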