The Similarity Metric

Aus de_evolutionary_art_org
Wechseln zu: Navigation, Suche


Reference

Li., M., Chen, X., Li, X., Ma, B., Vitányi, P.: The Similarity Metric. In: Proceedings at the 14th ACM-SIAM Symposium on Discrete Algorithms (2003).

DOI

Abstract

A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To evidence generality and robustness we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatic computed whole mitochondrial phylogeny tree. Secondly, we fully automatically compute the language tree of 52 different languages.

Extended Abstract

Bibtex

Used References

[1] J. Adachi and M. Hasegawa, “MOLPHY version 2.3: Programs for molecular phylogenetics based on maximum likelihood,” Comput. Sci. Monogr., Inst. Stat. Math., vol. 28, pp. 1–150.

[2] D. Benedetto, E. Caglioti, and V. Loreto, “Language trees and zipping,” Phys. Rev. Lett., vol. 88, no. 4, p. 048702, 2002.

[3] P. Ball, “Algorithm makes tongue tree,” in Nature, Jan. 22, 2002.

[4] C. H. Bennett, P. Gács, M. Li, P. M. B. Vitányi, and W. Zurek, “Infor- mation distance,” IEEE Trans. Inform. Theory, vol. 44, pp. 1407–1423, July 1998.

[5] C. H. Bennett, M. Li, and B. Ma, “Chain letters and evolutionary histo- ries,” Scientific Amer., pp. 76–81, June 2003.

[6] UCMP Glossary: Phylogenetics [Online]. Available: http://www.ucmp.berkeley.edu/glossary/gloss1phylo.html.

[7] J. I. Boore and W. M. Brown, “Big trees from little genomes: Mitochon- drial gene order as a phylogenetic tool,” Curr. Opin. Genet. Dev., vol. 8, pp. 668–674, 1998.

[8] D. Bryant, V. Berry, P. Kearney, M. Li, T. Jiang, T. Wareham, and H. Zhang, “A practical algorithm for recovering the best supported edges of an evolutionary tree,” in Proc. 11th ACM-SIAM Symp. Discrete Algo- rithms, San Francisco, CA, Jan. 9–11, 2000, pp. 287–296.

[9] Y. Cao, A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Pääbo, and M. Hasegawa, “Conflict among individual mi- tochondrial proteins in resolving the phylogeny of Eutherian orders,” J. Mol. Evol., vol. 47, pp. 307–322, 1998.

[10] X. Chen, B. Francia, M. Li, B. Mckinnon, and A. Seker, “Shared infor- mation and program plagiarism detection,” IEEE Trans. Inform. Theory, vol. 50, pp. 1545–1551, July 2004.

[11] X. Chen, S. Kwong, and M. Li, “A compression algorithm for DNA sequences and its applications in genome comparison,” in Genome In- formatics (Proc. 10th Workshop on Genome Informatics), K. Asai, S. Myano, and T. Takagi, Eds. Tokyo, Japan: Universal Academy, 1999, pp. 51–61.

[12], “A compression algorithm for DNA sequences,” IEEE-EMB Spe- cial Issue on Bioinformatics, vol. 20, no. 4, pp. 61–66, 2001.

[13] R. Cilibrasi. The Complearn Toolkit. [Online]. Available: http://com- plearn.sourceforge.net/

[14] R. Cilibrasi, R. de Wolf, and P. Vitanyi, “Algorithmic clustering of music,” Computer Music J. Also available at http://arxiv.org/archive/ cs/0303025, to be published.

[15] R. Cilibrasi and P. Vitányi, “Clustering by compression,” IEEE Trans. Infom. Theory. Also available at http://arxiv.org/abs/cs.CV/0312044, submitted for publication.

[16] G. Cormode, M. Paterson, S. C. Sahinalp, and U. Vishkin, “Commu- nication complexity of document exchange,” in Proc. SODA 2000, pp. 197–206.

[17] J.-P. Delahaye, “Classer musiques, langues, images, textes et genomes,” Pour La Science, vol. 317, pp. 98–103, Mar. 2004. French edition of Scientific American.

[18] W. M. Fitch and E. Margoliash, “Construction of phylogenetic trees,” Science, vol. 155, pp. 279–284, 1967.

[19] S. T. Fitz-Gibbon and C. H. House, “Whole genome-based phylogenetic analysis of free-living macroorganizms,” Nucleic Acids Res., vol. 27, pp. 4218–4222, 1999.

[20] P. Gács, “On the symmetry of algorithmic information,” Sov. Math.–Dokl., vol. 15, pp. 1477–1480, 1974.

[21] P. Gács, J. Tromp, and P. Vitányi, “Algorithmic statistics,” IEEE Trans. Inform. Theory, vol. 47, pp. 2443–2463, Sept. 2001.

[22] W. Gasarch, personal communication, Aug. 12, 2001.

[23] United Nations General Assembly Resolution 217 A (III) of 10 De- cember 1948: Universal Declaration of Human Rights [Online]. Avail- able: http://www.un.org/Overview/rights.html

[24] S. Grumbach and F. Tahi, “A new challenge for compression algorithms: Genetic sequences,” J. Inform. Process. Manag., vol. 30, pp. 875–866, 1994.

[25] D. Hammer, A. E. Romashchenko, A. K. Shen’, and N. K. Verashchagin, “Inequalities for Shannon entropies and Kolmogorov complexities,” J. Comput. Syst. Sci., vol. 60, no. 2, pp. 442–464, 2000.

[26] S. Hannenhalli and P. Pevzner, “Transforming cabbage into turnip: Poly- nomial algorithm for sorting signed permutations by reversals,” in Proc. 27th ACM Symp. Theory of Computing, Las Vegas, NV, May 29–June 1, 1995, pp. 178–189.

[27] E. Keogh, S. Lonardi, and C. A. Rtanamahatana, “Toward param- eter-free data mining,” in Proc. 10th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Seattle, WA, Aug. 22–25, 2004, pp. 206–215.

[28] J. Kececioglu and D. Sankoff, “Exact and approximation algorithms for the inversion distance,” Algorithmica, vol. 13, pp. 180–210, 1995.

[29] A. N. Kolmogorov, “Three approaches to the quantitative definition of information,” Probl. Inform. Transm., vol. 1, no. 1, pp. 1–7, 1965.

[30] E. V. Koonin, “The emerging paradigm and open problems in compara- tive genomics,” Bioinformatics, vol. 15, pp. 265–266, 1999.

[31] M. Li, J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang, “An information-based sequence distance and its application to whole mitochondrial genome phylogeny,” Bioinformatics, vol. 17, no. 2, pp. 149–154, 2001.

[32] M. Li and P. M. B. Vitányi, “Reversibility and adiabatic computation: Trading time and space for energy,” in Proc. Roy. Soc. London, Ser. A, vol. 452, 1996, pp. 769–789.

[33] , “Reversibility and adiabatic computation: Trading time and space for energy,” in Proc. Roy. Soc. London, Ser. A, vol. 452, 1996, pp. 769–789.

[34], An Introduction to Kolmogorov Complexity and its Applications, 2nd ed. New York: Springer-Verlag, 1997.

[35], “Algorithmic complexity,” in International Encyclopedia of the Social & Behavioral Sciences, N.J. Smelser and P.B. Baltes, Eds. Oxford, U.K.: Pergamon, 2001, pp. 376–382.

[36] A. Londei, V. Loreto, and M. O. Belardinelli, “Music style and author- ship categorization by informative compressors,” in Proc. 5th Triannual Conf. Eur. Soc. for the Cognitive Sciences of Music, Hannover, Germany, Sept. 8–13, 2003, pp. 200–203.

[37] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and more sensitive homology search,” Bioinformatics, vol. 18, no. 3, pp. 440–445, 2002.

[38] H. Muir, “Software to unzip identity of unknown composers,” in New Scientist, Apr. 12, 2003.

[39] S. Muthukrishnan and S. C. Sahinalp, “Approximate nearest neighbors and sequence comparison with block operations,” in Proc. 32nd ACM Symp. Theory of Computing, Portland, OR, May 21–23, 2000, pp. 416–424.

[40] A. A. Muchnik and N. K. Vereshchagin, “Logical operations and Kol- mogorov complexity II,” in Proc. 16th IEEE Conf. Computational Com- plexity, Chicago, IL, June 18–21, 2001, pp. 256–265.

[41] J. H. Nadeau and D. Sankoff, “Counting on comparative maps,” Trends Genet., vol. 14, pp. 495–501, 1998.

[42] K. Patch, “Software sorts tunes,” in Technology Res. News, Apr. 23/30, 2003.

[43] C. Rajski, “A metric space of discrete probability distributions,” Inform. Contr., vol. 4, pp. 371–377, 1961.

[44] A. Romashchenko, A. Shen, and N. Vereshchagin, “Combinatorial in- terpretation of Kolmogorov complexity,” Theory Comput. Sci., vol. 271, no. 1–2, pp. 111–123, 2002.

[45] D. Sankoff, “Mechanisms of genome evolution: Models and inference,” Bull. Int. Statist. Inst., vol. 47, no. 3, pp. 461–475, 1999.

[46] N. Saitou and M. Nei, “The neighbor-joining method: A new method for reconstructing phylogenetic trees,” Mol. Biol. Evol., vol. 4, pp. 406–425, 1987.

[47] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genet., vol. 21, pp. 108–110, 1999.

[48] A. Rokas, B. L. Williams, N. King, and S. B. Carroll, “Genome-scale approaches to resolving incongruence in molecular phylogenies,” in Na- ture, vol. 425, Oct. 25, 2003, pp. 798–804.

[49] J-S. Varre, J-P. Delahaye, and É Rivals, “The transformation distance: A dissimilarity measure based on movements of segments,” Bioinfor- matics, vol. 15, no. 3, pp. 194–202, 1999.

[50] N. K. Vereshchagin and M. Vyugin, “Independent minimum length pro- grams to translate between given strings,” Theory Comput. Sci., vol. 271, no. 1–2, pp. 131–143, 1999.

[51] L.-S. Wang and T. Warnow, “Estimating true evolutionary distances be- tween genomes,” in Proc. 33rd ACM Symp. Theory Comput., Heraklion, Crete, Greece, July 6–8, 2001, pp. 637–646.

[52] J. C. Wooley, “Trends in computational biology: A summary based on a RECOMB plenary lecture,” J. Comput. Biol., vol. 6, pp. 459–474, 1999.

[53] P. N. Yianilos, “Normalized Forms for Two Common Metrics,” NEC Res. Inst., Rep. 91-082-9027-1, 1991. Revision 7/7/2002 at http://www.pnylab.com/pny/.


Links

Full Text

http://www.cwi.nl/~paulv/papers/similarity.pdf

intern file

Sonstige Links

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.3346