On the ‘dimensionality curse’ and the ‘self-similarity blessing

Aus de_evolutionary_art_org
Wechseln zu: Navigation, Suche


Referenz

F. Korn, B. Pagel, C. Faloutsos: On the ‘dimensionality curse’ and the ‘self-similarity blessing. IEEE Transactions on Knowledge and Data Engineering 13 (1) (2001) 96–111.

DOI

http://dx.doi.org/10.1109/69.908983

Abstract

Spatial queries in high-dimensional spaces have been studied extensively recently. Among them, nearest-neighbor queries are important in many settings, including spatial databases (Find the $k$ closest cities) and multimedia databases (Find the $k$ most similar images). Previous analyses have concluded that nearest-neighbor search is hopeless in high dimensions due to the notorious ¿curse of dimensionality.¿ Here, we show that this may be overpessimistic. We show that what determines the search performance (at least for R-tree-like structures) is the intrinsic dimensionality of the data set and not the dimensionality of the address space (referred to as the embedding dimensionality). The typical (and often implicit) assumption in many previous studies is that the data is uniformly distributed, with independence between attributes. However, real data sets overwhelmingly disobey these assumptions; rather, they typically are skewed and exhibit intrinsic (¿fractal¿) dimensionalities that are much lower than their embedding dimension, e.g., due to subtle dependencies between attributes. In this paper, we show how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict the I/O performance to within one standard deviation on multiple real and synthetic data sets. The practical contributions of this work are our accurate formulas, which can be used for query optimization in spatial and multimedia databases. The major theoretical contribution is the ¿deflation¿ of the dimensionality curse: Our formulas and our experiments show that previous worst-case analyses of nearest-neighbor search in high dimensions are overpessimistic to the point of being unrealistic. The performance depends critically on the intrinsic (¿fractal¿) dimensionality as opposed to the embedding dimension that the uniformity and independence assumptions incorrectly imply.

Extended Abstract

Bibtex

@article{Korn:2001:DCS:627332.628114,
author = {Korn, Flip and Pagel, Bernd-Uwe and Faloutsos, Christos},
title = {On the 'Dimensionality Curse' and the 'Self-Similarity Blessing'},
journal = {IEEE Trans. on Knowl. and Data Eng.},
issue_date = {January 2001},
volume = {13},
number = {1},
month = jan,
year = {2001},
issn = {1041-4347},
pages = {96--111},
numpages = {16},
url = {http://dx.doi.org/10.1109/69.908983 http://de.evo-art.org/index.php?title=On_the_‘dimensionality_curse’_and_the_‘self-similarity_blessing},
doi = {10.1109/69.908983},
acmid = {628114},
publisher = {IEEE Educational Activities Department},
address = {Piscataway, NJ, USA},
keywords = {Nearest-neighbor search, multimedia indexing, fractals.},
} 

Used References

1 Rakesh Agrawal , Christos Faloutsos , Arun N. Swami, Efficient Similarity Search In Sequence Databases, Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, p.69-84, October 13-15, 1993 http://dl.acm.org/citation.cfm?id=652239&CFID=558819604&CFTOKEN=68186175

2 Lars Arge , Vasilis Samoladas , Jeffrey Scott Vitter, On two-dimensional indexability and optimal range search indexing, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.346-357, May 31-June 03, 1999, Philadelphia, Pennsylvania, USA http://doi.acm.org/10.1145/303976.304010

3 Norbert Beckmann , Hans-Peter Kriegel , Ralf Schneider , Bernhard Seeger, The R*-tree: an efficient and robust access method for points and rectangles, Proceedings of the 1990 ACM SIGMOD international conference on Management of data, p.322-331, May 23-26, 1990, Atlantic City, New Jersey, USA http://doi.acm.org/10.1145/93597.98741

4 Alberto Belussi , Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension, Proceedings of the 21th International Conference on Very Large Data Bases, p.299-310, September 11-15, 1995 http://dl.acm.org/citation.cfm?id=673166&CFID=558819604&CFTOKEN=68186175

5 Stefan Berchtold , Christian Böhm , Bernhard Braunmüller , Daniel A. Keim , Hans-Peter Kriegel, Fast parallel similarity search in multimedia databases, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.1-12, May 11-15, 1997, Tucson, Arizona, USA http://doi.acm.org/10.1145/253260.253263

6 Stefan Berchtold , Christian Böhm , Hans-Peter Kriegal, The pyramid-technique: towards breaking the curse of dimensionality, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.142-153, June 01-04, 1998, Seattle, Washington, USA [http://doi.acm.org/10.1145/276304.276318

7 Independent Quantization: An Index Compression Technique for High-Dimensional Data Spaces, Proceedings of the 16th International Conference on Data Engineering, p.577, February 28-March 03, 2000 http://dl.acm.org/citation.cfm?id=847319&CFID=558819604&CFTOKEN=68186175

8 Stefan Berchtold , Christian Böhm , Daniel A. Keim , Hans-Peter Kriegel, A cost model for nearest neighbor search in high-dimensional data space, Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.78-86, May 11-15, 1997, Tucson, Arizona, USA http://doi.acm.org/10.1145/263661.263671

9 Stefan Berchtold , Bernhard Ertl , Daniel A. Keim , Hans-Peter Kriegel , Thomas Seidl, Fast Nearest Neighbor Search in High-Dimensional Space, Proceedings of the Fourteenth International Conference on Data Engineering, p.209-218, February 23-27, 1998 http://dl.acm.org/citation.cfm?id=653617&CFID=558819604&CFTOKEN=68186175

10 Stefan Berchtold , Daniel A. Keim , Hans-Peter Kriegel, The X-tree: An Index Structure for High-Dimensional Data, Proceedings of the 22th International Conference on Very Large Data Bases, p.28-39, September 03-06, 1996 http://dl.acm.org/citation.cfm?id=673502&CFID=558819604&CFTOKEN=68186175

11 K. Beyer J. Goldstein R. Ramakrishnan and U. Shaft, “When Is ‘Nearest Neighbor’ Meaningful?,” Proc. Int'l Conf. Database Theory (ICDT '99), pp. 217–235, Jan. 1999.

12 Allan Borodin , Rafail Ostrovsky , Yuval Rabani, Lower bounds for high dimensional nearest neighbor search and related problems, Proceedings of the thirty-first annual ACM symposium on Theory of computing, p.312-321, May 01-04, 1999, Atlanta, Georgia, USA http://doi.acm.org/10.1145/301250.301330

13 S. Christodoulakis, Implications of certain assumptions in database performance evauation, ACM Transactions on Database Systems (TODS), v.9 n.2, p.163-186, June 1984 http://doi.acm.org/10.1145/329.318578

14 Paolo Ciaccia , Marco Patella , Pavel Zezula, A cost model for similarity queries in metric spaces, Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.59-68, June 01-04, 1998, Seattle, Washington, USA http://doi.acm.org/10.1145/275487.275495

15 Christos Faloutsos , Ibrahim Kamel, Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension, Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.4-13, May 24-27, 1994, Minneapolis, Minnesota, USA http://doi.acm.org/10.1145/182591.182593

16 Christos Faloutsos , Bernhard Seeger , Agma Traina , Caetano Traina, Jr., Spatial join selectivity using power laws, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.177-188, May 15-18, 2000, Dallas, Texas, USA http://doi.acm.org/10.1145/342009.335412

17 Volker Gaede , Oliver Günther, Multidimensional access methods, ACM Computing Surveys (CSUR), v.30 n.2, p.170-231, June 1998 http://doi.acm.org/10.1145/280277.280279

18 Aristides Gionis , Piotr Indyk , Rajeev Motwani, Similarity Search in High Dimensions via Hashing, Proceedings of the 25th International Conference on Very Large Data Bases, p.518-529, September 07-10, 1999 http://dl.acm.org/citation.cfm?id=671516&CFID=558819604&CFTOKEN=68186175

19 Jonathan Goldstein , Raghu Ramakrishnan, Contrast Plots and P-Sphere Trees: Space vs. Time in Nearest Neighbour Searches, Proceedings of the 26th International Conference on Very Large Data Bases, p.429-440, September 10-14, 2000 http://dl.acm.org/citation.cfm?id=671849&CFID=558819604&CFTOKEN=68186175

20 Antonin Guttman, R-trees: a dynamic index structure for spatial searching, Proceedings of the 1984 ACM SIGMOD international conference on Management of data, June 18-21, 1984, Boston, Massachusetts http://doi.acm.org/10.1145/602259.602266

21 Joseph M. Hellerstein , Elias Koutsoupias , Christos H. Papadimitriou, On the analysis of indexing schemes, Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.249-256, May 11-15, 1997, Tucson, Arizona, USA http://doi.acm.org/10.1145/263661.263688

22 Joseph M. Hellerstein , Jeffrey F. Naughton , Avi Pfeffer, Generalized Search Trees for Database Systems, Proceedings of the 21th International Conference on Very Large Data Bases, p.562-573, September 11-15, 1995 http://dl.acm.org/citation.cfm?id=673145&CFID=558819604&CFTOKEN=68186175

23 Alexander Hinneburg , Charu C. Aggarwal , Daniel A. Keim, What Is the Nearest Neighbor in High Dimensional Spaces?, Proceedings of the 26th International Conference on Very Large Data Bases, p.506-515, September 10-14, 2000 http://dl.acm.org/citation.cfm?id=671675&CFID=558819604&CFTOKEN=68186175

24 Gisli R. Hjaltason , Hanan Samet, Ranking in Spatial Databases, Proceedings of the 4th International Symposium on Advances in Spatial Databases, p.83-95, August 06-09, 1995 http://dl.acm.org/citation.cfm?id=718930&CFID=558819604&CFTOKEN=68186175

25 Ibrahim Kamel , Christos Faloutsos, Hilbert R-tree: An Improved R-tree using Fractals, Proceedings of the 20th International Conference on Very Large Data Bases, p.500-509, September 12-15, 1994 http://dl.acm.org/citation.cfm?id=673001&CFID=558819604&CFTOKEN=68186175

26 Norio Katayama , Shin'ichi Satoh, The SR-tree: an index structure for high-dimensional nearest neighbor queries, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.369-380, May 11-15, 1997, Tucson, Arizona, USA http://doi.acm.org/10.1145/253260.253347

27 V. Kobla D. Doermann K.-I. Lin and C. Faloutsos, “Compressed Domain Video Indexing Techniques Using DCT and Motion Vector Information in MPEG Video,” Proc. Int'l Soc. Optical Eng., vol. 2,916, Nov. 1996.

28 Flip Korn , Nikolaos Sidiropoulos , Christos Faloutsos , Eliot Siegel , Zenon Protopapas, Fast Nearest Neighbor Search in Medical Image Databases, Proceedings of the 22th International Conference on Very Large Data Bases, p.215-226, September 03-06, 1996 http://dl.acm.org/citation.cfm?id=673493&CFID=558819604&CFTOKEN=68186175

29 Scott T. Leutenegger , J. M. Edgington , Mario A. Lopez, STR: A Simple and Efficient Algorithm for R-Tree Packing, Proceedings of the Thirteenth International Conference on Data Engineering, p.497-506, April 07-11, 1997 http://dl.acm.org/citation.cfm?id=653437&CFID=558819604&CFTOKEN=68186175

30 King Ip Lin , H. V. Jagadish , Christos Faloutsos, The TV-tree: an index structure for high-dimensional data, The VLDB Journal — The International Journal on Very Large Data Bases, v.3 n.4, October 1994 http://dl.acm.org/citation.cfm?id=615210&CFID=558819604&CFTOKEN=68186175

31 F. Le Lionnais, Le Nombres Remarquables. Paris: Hermann, 1983.

32 Bernd-Uwe Pagel , Hans-Werner Six , Heinrich Toben , Peter Widmayer, Towards an analysis of range query performance in spatial data structures, Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.214-221, May 25-28, 1993, Washington, D.C., USA http://doi.acm.org/10.1145/153850.153878

33 Bernd-Uwe Pagel , Hans-Werner Six, Are window queries representative for arbitrary range queries?, Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.150-160, June 04-06, 1996, Montreal, Quebec, Canada http://doi.acm.org/10.1145/237661.237704

34 Apostolos Papadopoulos , Yannis Manolopoulos, Performance of Nearest Neighbor Queries in R-Trees, Proceedings of the 6th International Conference on Database Theory, p.394-408, January 08-10, 1997 http://dl.acm.org/citation.cfm?id=656242&CFID=558819604&CFTOKEN=68186175

35 Yasushi Sakurai , Masatoshi Yoshikawa , Shunsuke Uemura , Haruhiko Kojima, The A-tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation, Proceedings of the 26th International Conference on Very Large Data Bases, p.516-526, September 10-14, 2000 http://dl.acm.org/citation.cfm?id=671676&CFID=558819604&CFTOKEN=68186175

36 Vasilis Samoladas , Daniel P. Miranker, A lower bound theorem for indexing schemes and its application to multidimensional range queries, Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.44-51, June 01-04, 1998, Seattle, Washington, USA http://doi.acm.org/10.1145/275487.275493

37 Timos K. Sellis , Nick Roussopoulos , Christos Faloutsos, The R+-Tree: A Dynamic Index for Multi-Dimensional Objects, Proceedings of the 13th International Conference on Very Large Data Bases, p.507-518, September 01-04, 1987 http://dl.acm.org/citation.cfm?id=671636&CFID=558819604&CFTOKEN=68186175

38 Dennis Shasha , Tsong-Li Wang, New techniques for best-match retrieval, ACM Transactions on Information Systems (TOIS), v.8 n.2, p.140-158, Apr. 1990 http://doi.acm.org/10.1145/96105.96111

39 Yannis Theodoridis , Timos Sellis, A model for the prediction of R-tree performance, Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.161-171, June 04-06, 1996, Montreal, Quebec, Canada http://doi.acm.org/10.1145/237661.237705

40 M. Turk and A. Pentland, “Eigenfaces for Recognition,” J. Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.

41 Howard D. Wactlar , Takeo Kanade , Michael A. Smith , Scott M. Stevens, Intelligent Access to Digital Video: Informedia Project, Computer, v.29 n.5, p.46-52, May 1996 http://dx.doi.org/10.1109/2.493456

42 Roger Weber , Hans-Jörg Schek , Stephen Blott, A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces, Proceedings of the 24rd International Conference on Very Large Data Bases, p.194-205, August 24-27, 1998 http://dl.acm.org/citation.cfm?id=671192&CFID=558819604&CFTOKEN=68186175

Links

Full Text

internal file


Sonstige Links