Nucleotide Sequence Similarity Search Using Techniques from Content-Based Image Retrieval
Abstract
The amount of DNA data continues to increase exponentially as a result of high-throughput next-generation sequencing. Current state-of-the-art tools for nucleotide sequence similarity search are not equipped to deal with this growth, and new thinking is needed to tackle the rising scalability challenges.

This thesis investigates the experimental approach of translating DNA sequences into images and applying state-of-the-art techniques from the field of content-based image retrieval to index and search the resulting images. The challenges of translating DNA sequences into images are discussed, and two algorithms for image generation are proposed. We look into the different feature descriptors that are available and evaluate them in the context of the generated images. Lastly, the approach as a whole is evaluated with the mean average precision metric, using BLAST as the gold standard reference.

The results show that the proposed approach does not come close to BLAST in retrieval performance, but offers a significant reduction in index size and thus better performance and scalability on large DNA databases.