Approximate search with constraints on indels with application in SPAM filtering
Journal article, Peer reviewed
MetadataShow full item record
Original versionChitrakar, A. S., & Petrovic, S. (2015). Approximate search with constraints on indels with application in SPAM filtering. Norsk Informasjonssikkerhetskonferanse (NISK), 22-33. Retrieved from http://www.frisc.no/arrangementer/nisk-2012-4/
Finding distorted occurrences of search pattern(s) in the search string by applying constraints on elementary edit operations (indels (insertions/deletions) and substitutions) is a new category of the approximate string search problem that we introduce in this paper. The constraint on the total number of indels can improve the search efficiency when one knows the probabilities of these edit operations for the distorted pattern/text. Two approximate search algorithms with such a constraint, CRBP-Indels and Sankoff-Indels, are presented here and their performances are evaluated for different probabilities of insertions, deletions, and substitutions. The experimental results show that CRBP-Indels has better performance over Sankoff-Indels when the number of indels is greater than the number of substitutions. However, the Sankoff- Indels algorithm is better if the number of substitutions is greater than the number of indels. Possible application of these algorithms is in SPAM filtering for detection of deliberately distorted SPAM-words. In such a scenario, the number of indels applied on the original SPAM-words must be limited in order to maintain their intelligibility.