jcolibri.method.retrieve.NNretrieval.similarity.local.textual.compressionbased
Class NormalisedCompression

java.lang.Object
  extended by jcolibri.method.retrieve.NNretrieval.similarity.local.textual.compressionbased.NormalisedCompression
All Implemented Interfaces:
LocalSimilarityFunction

public class NormalisedCompression
extends java.lang.Object
implements LocalSimilarityFunction

This function returns the similarity of two strings using text compression. Normalised Compression Distance (NCD) is defined as follows NCD(x, y) = (C(xy) - min(C(X), C(y)) / max(C(x), C(y)) where C(x) is the size of string x after compression (and C(y) similarly) and C(xy) is the size, after compression, of the string that comprises y concatenated to the end of x. See the following paper

 inproceedings{Li03, author = "M. Li and X. Chen and X. Li and B. Ma and P.
                      Vitanyi", title = "The Similarity Metric", booktitle =
                      {Procs.\ of the 14th Annual ACM-SIAM Symposium on
                      Discrete Algorithms}, pages = {863--872}, address =
                      {Baltimore, Maryland}, year = "2003", }
 
Note that NCD is not a distance metric. It is NOT in general the case that NCD(x, y) = 0 iff x = y NCD(x, y) = NCD(y, x) NCD(x, y) + NCD(y, z) >= NCD(x, z) Do not use this measure in any application that requires metric properties. Values of NCD will lies in the range [0.0, 1.0 + e] where e is some small value. We convert to a similarity measure by returning 1.0 - min(NCD(x, y), 1.0). This similarity measure is effective in case-based spam filtering using the GZip text compression algorithm. For the details, see:
 inproceedings{Delany-Bridge-2007, author = "S. J. Delany and D. Bridge",
                                    title = "Catching the Drift: Using
                                    Feature-Free Case-Based Reasoning for Spam
                                    Filtering", booktitle = "Procs.\ of the
                                    7th International Conference on Case Based
                                    Reasoning", address = "Belfast, Northern
                                    Ireland", year = "2007" }
 
Its effectiveness outside of spam filtering (e.g. texts other than emails, short strings, images, etc.) has not been demonstrated. To be effective, the 'right' compression algorithm must be used.

Author:
Derek Bridge 18/05/07

Constructor Summary
NormalisedCompression(ICompressor compressor)
           
 
Method Summary
 double compare(java.lang.String x, java.lang.String y)
          Applies the similarity function.
 double compute(java.lang.Object o1, java.lang.Object o2)
          Applies the similarity function.
 boolean isApplicable(java.lang.Object o1, java.lang.Object o2)
          Indicates if the function is applicable to two objects
static void main(java.lang.String[] args)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

NormalisedCompression

public NormalisedCompression(ICompressor compressor)
Parameters:
compressor - an object that encapsulates the compression algorithm to be used.
Method Detail

compare

public double compare(java.lang.String x,
                      java.lang.String y)
Applies the similarity function.

Parameters:
x -
y -
Returns:
the similarity value.

compute

public double compute(java.lang.Object o1,
                      java.lang.Object o2)
               throws NoApplicableSimilarityFunctionException
Applies the similarity function.

Specified by:
compute in interface LocalSimilarityFunction
Parameters:
o1 - String
o2 - String
Returns:
result of apply the similarity function.
Throws:
NoApplicableSimilarityFunctionException

isApplicable

public boolean isApplicable(java.lang.Object o1,
                            java.lang.Object o2)
Description copied from interface: LocalSimilarityFunction
Indicates if the function is applicable to two objects

Specified by:
isApplicable in interface LocalSimilarityFunction

main

public static void main(java.lang.String[] args)

GAIA - Group for Artificial Intelligence Applications
http://gaia.fdi.ucm.es