jcolibri.method.retrieve.NNretrieval.similarity.local.textual.compressionbased
Class CompressionBased

java.lang.Object
  extended by jcolibri.method.retrieve.NNretrieval.similarity.local.textual.compressionbased.CompressionBased
All Implemented Interfaces:
LocalSimilarityFunction

public class CompressionBased
extends java.lang.Object
implements LocalSimilarityFunction

This function returns the similarity of two strings using text compression. Compression Based Dissimilarity (CDM) is defined as follows CDM(x, y) = C(xy) / (C(x) + C(y)) where C(x) is the size of string x after compression (and C(y) similarly) and C(xy) is the size, after compression, of the string that comprises y concatenated to the end of x. See the following paper:

 inproceedings{Keogh04,
 author = {E. Keogh and S. Lonardi and C. Ratanamahatana},
 title ={Towards Parameter-Free Data Mining},
 booktitle = {Procs.\ of the 10th ACM SIGKDD, International Conference on Knowledge Discovery and Data Mining},
 year = {2004},
 pages = {206--215},
 address = {New York, NY, USA},
 }
Note that CDM is not a distance metric. It is NOT in general the case that CDM(x, y) = 0 iff x = y CDM(x, y) = CDM(y, x) CDM(x, y) + CDM(y, z) >= CDM(x, z) Do not use this measure in any application that requires metric properties. Values of CDM will lies in the range (0.5, 1.0]. We convert to a similarity measure by returning 1.0 - CDM(x, y). This similarity measure is effective in case-based spam filtering using GZip and PPM text compression algorithms. For the details, see:
 inproceedings{Delany-Bridge-2006,
 author = "S. J. Delany and D. Bridge",
 title = "Feature-Based and Feature-Free Textual {CBR}: A Comparison in Spam Filtering", 
 booktitle = "Procs.\ of the 17th Irish Conference on Artificial Intelligence and Cognitive Science", 
 pages = "244--253", 
 address = "Belfast, Northern Ireland",
 year = "2006"
 }
 
Its effectiveness outside of spam filtering (e.g. texts other than emails, short strings, images, etc.) has not been demonstrated. To be effective, the 'right' compression algorithm must be used.

Author:
Derek Bridge 18/05/07

Constructor Summary
CompressionBased(ICompressor compressor)
           
 
Method Summary
 double compare(java.lang.String x, java.lang.String y)
          Applies the similarity function.
 double compute(java.lang.Object o1, java.lang.Object o2)
          Applies the similarity function.
 boolean isApplicable(java.lang.Object o1, java.lang.Object o2)
          Indicates if the function is applicable to two objects
static void main(java.lang.String[] args)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CompressionBased

public CompressionBased(ICompressor compressor)
Parameters:
compressor - an object that encapsulates the compression algorithm to be used.
Method Detail

compare

public double compare(java.lang.String x,
                      java.lang.String y)
Applies the similarity function.

Parameters:
x -
y -
Returns:
the simlarity value.

compute

public double compute(java.lang.Object o1,
                      java.lang.Object o2)
               throws NoApplicableSimilarityFunctionException
Applies the similarity function.

Specified by:
compute in interface LocalSimilarityFunction
Parameters:
o1 - String
o2 - String
Returns:
result of apply the similarity function.
Throws:
NoApplicableSimilarityFunctionException

isApplicable

public boolean isApplicable(java.lang.Object o1,
                            java.lang.Object o2)
Description copied from interface: LocalSimilarityFunction
Indicates if the function is applicable to two objects

Specified by:
isApplicable in interface LocalSimilarityFunction

main

public static void main(java.lang.String[] args)

GAIA - Group for Artificial Intelligence Applications
http://gaia.fdi.ucm.es