|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Objectjcolibri.method.retrieve.NNretrieval.similarity.local.textual.compressionbased.NormalisedCompression
public class NormalisedCompression
This function returns the similarity of two strings using text compression. Normalised Compression Distance (NCD) is defined as follows NCD(x, y) = (C(xy) - min(C(X), C(y)) / max(C(x), C(y)) where C(x) is the size of string x after compression (and C(y) similarly) and C(xy) is the size, after compression, of the string that comprises y concatenated to the end of x. See the following paper
inproceedings{Li03, author = "M. Li and X. Chen and X. Li and B. Ma and P. Vitanyi", title = "The Similarity Metric", booktitle = {Procs.\ of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms}, pages = {863--872}, address = {Baltimore, Maryland}, year = "2003", }Note that NCD is not a distance metric. It is NOT in general the case that NCD(x, y) = 0 iff x = y NCD(x, y) = NCD(y, x) NCD(x, y) + NCD(y, z) >= NCD(x, z) Do not use this measure in any application that requires metric properties. Values of NCD will lies in the range [0.0, 1.0 + e] where e is some small value. We convert to a similarity measure by returning 1.0 - min(NCD(x, y), 1.0). This similarity measure is effective in case-based spam filtering using the GZip text compression algorithm. For the details, see:
inproceedings{Delany-Bridge-2007, author = "S. J. Delany and D. Bridge", title = "Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering", booktitle = "Procs.\ of the 7th International Conference on Case Based Reasoning", address = "Belfast, Northern Ireland", year = "2007" }Its effectiveness outside of spam filtering (e.g. texts other than emails, short strings, images, etc.) has not been demonstrated. To be effective, the 'right' compression algorithm must be used.
Constructor Summary | |
---|---|
NormalisedCompression(ICompressor compressor)
|
Method Summary | |
---|---|
double |
compare(java.lang.String x,
java.lang.String y)
Applies the similarity function. |
double |
compute(java.lang.Object o1,
java.lang.Object o2)
Applies the similarity function. |
boolean |
isApplicable(java.lang.Object o1,
java.lang.Object o2)
Indicates if the function is applicable to two objects |
static void |
main(java.lang.String[] args)
|
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public NormalisedCompression(ICompressor compressor)
compressor
- an object that encapsulates the compression algorithm to be
used.Method Detail |
---|
public double compare(java.lang.String x, java.lang.String y)
x
- y
-
public double compute(java.lang.Object o1, java.lang.Object o2) throws NoApplicableSimilarityFunctionException
compute
in interface LocalSimilarityFunction
o1
- Stringo2
- String
NoApplicableSimilarityFunctionException
public boolean isApplicable(java.lang.Object o1, java.lang.Object o2)
LocalSimilarityFunction
isApplicable
in interface LocalSimilarityFunction
public static void main(java.lang.String[] args)
|
GAIA - Group for Artificial Intelligence Applications http://gaia.fdi.ucm.es |
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |