A Classifier for Microprocessor Processing Site Prediction in Human MicroRNAs
Abstract
MicroRNAs are non-coding RNA sequences, ~22 nt long, that play a central role in gene regulation. Because microRNAs are transient and not necessarily expressed when RNA from tissue samples is sequenced, bioinformatics is an important part of microRNA discovery. Most computational microRNA discovery approaches are based on conservation between human and other species. Recent results, however, estimate that around 350 microRNAs are unique to humans. There is therefore a need for methods that use characteristics of the primary microRNA transcript to predict microRNA candidates. The main problem with such methods is that many of these characteristics are correlated with the location where the Microprocessor complex cleaves the primary microRNA into the precursor, and this location is unknown until the candidate is experimentally verified. This work presents a method based on support vector machines (SVMs) for Microprocessor processing site prediction in human microRNAs. The SVM correctly predicts the processing site for 43% of the known human microRNAs and performs well at distinguishing microRNAs from random hairpins. The processing site SVM is useful for microRNA discovery in two ways. First, the predicted processing sites can be used to build an SVM with more distinct features and thus increase the accuracy of microRNA gene predictions. Second, it generates information that can be used to predict microRNA candidates directly, such as the score differences between the candidate's potential and predicted processing sites. Preliminary results show that an SVM that uses the predictions from the processing site SVM and is trained explicitly to separate microRNAs from random hairpins performs better than current prediction-based approaches. This illustrates the potential gain of using processing site predictions in microRNA gene prediction.
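As a rough illustration of the score-difference idea mentioned above, the following Python sketch takes a set of per-site scores for one hairpin candidate, picks the highest-scoring site as the predicted processing site, and computes how far each other potential site falls below it. The function names, the dictionary layout (position mapped to score), and the use of the smallest difference as a summary margin are assumptions made for illustration only; they are not taken from the method described in this work, and the scores are assumed to come from some already-trained classifier.

```python
from typing import Dict, Tuple


def predict_processing_site(site_scores: Dict[int, float]) -> Tuple[int, float]:
    """Return (position, score) of the highest-scoring potential site.

    site_scores maps a hypothetical candidate cleavage position to the
    score a classifier assigned to it (higher is assumed to be better).
    """
    if not site_scores:
        raise ValueError("at least one potential site is required")
    best_pos = max(site_scores, key=site_scores.get)
    return best_pos, site_scores[best_pos]


def score_differences(site_scores: Dict[int, float]) -> Dict[int, float]:
    """Difference between the predicted site's score and every other site's score."""
    best_pos, best_score = predict_processing_site(site_scores)
    return {
        pos: best_score - score
        for pos, score in site_scores.items()
        if pos != best_pos
    }


def margin(site_scores: Dict[int, float]) -> float:
    """Smallest score difference (assumed summary feature); 0.0 if only one site."""
    diffs = score_differences(site_scores)
    return min(diffs.values()) if diffs else 0.0


if __name__ == "__main__":
    # Made-up scores for three potential sites on one candidate hairpin.
    scores = {10: -0.5, 11: 1.25, 12: 0.75}
    print(predict_processing_site(scores))
    print(score_differences(scores))
    print(margin(scores))
```

One plausible reading is that a large margin means one site stands out clearly from the alternatives, whereas a small margin means several sites score similarly; whether and how strongly this separates microRNAs from random hairpins is an empirical question that the sketch does not address.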