High-dimensional Indexing based on Dimensionality Reduction Students: Qing Chen Heng Tao Shen Sun Ji Chun Advisor: Professor Beng Chin Ooi
Outline • Introduction • Global Dimensionality Reduction • Local Dimensionality Reduction • Indexing Reduced-Dim Space • Effects of Dimensionality Reduction • Behaviors of Distance Metrics • Conclusion and Future Work
Introduction • High-Dim Applications: • Multimedia, time-series, scientific, market basket, etc. • Various Trees Proposed: • R-tree, R*, R+, X, Skd, SS, M, KDB, TV, Buddy, Grid File, Hybrid, iDistance, etc. • Dimensionality Curse • Efficiency drops quickly as dim increases.
Introduction • Dimensionality Reduction Techniques • GDR • LDR • High-Dim Indexing on RDS • Existing Indexing on a single RDS • Global Indexing on multiple RDS • Side Effects of DR • Different Behaviors of Distance Metrics • Conclusion and Future Work
GDR • Perform Reduction on the whole dataset.
GDR • Improve query accuracy by applying principal component analysis (PCA)
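To make the GDR step concrete, here is a minimal sketch (not from the slides) of reducing an entire dataset with one global PCA; NumPy is assumed, and the names gdr_pca and reduced_dim are illustrative only.

```python
# Minimal GDR sketch: a single PCA over the whole dataset (illustrative names/values).
import numpy as np

def gdr_pca(data, reduced_dim):
    """Project every point onto the top `reduced_dim` principal components."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Covariance of the full dataset: GDR treats all points as one global cloud.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:reduced_dim]]
    return centered @ top, mean, top                  # reduced points + transform

# Example: reduce 64-dim points to 8 dims.
points = np.random.rand(1000, 64)
reduced, mean, basis = gdr_pca(points, reduced_dim=8)
```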
GDR • Using Aggregate Data for Reduction in Dynamic Spaces [8].
GDR • Works for Globally Correlated data. • GDR may cause significant info loss in real data.
LDR [5] • Find locally correlated data clusters • Perform dimensionality reduction on the clusters individually
LDR - Definitions • Cluster and subspace • Reconstruction Distance
LDR - Constraints on clusters • Reconstruction distance bound, i.e., MaxReconDist • Dimensionality bound, i.e., MaxDim • Size bound, i.e., MinSize
LDR - Clustering Algo • Construct spatial clusters • Determine the maximum number of clusters: M • Determine the cluster range: ε • Choose a set of well-scattered points as the centroids (C) of the spatial clusters • Assign each data point P to its closest centroid, provided Distance(P, C_closest) ≤ ε • Update the centroids of the clusters
LDR - Clustering Algo (cont) • Compute principal components (PC) • Perform PCA on each cluster individually • Compute the mean of each cluster's points, i.e., E_i • Determine the subspace dimensionality • Progressively check each point against MaxReconDist and MaxDim • Decide the optimal dimensionality for each cluster
LDR - Clustering Algo (cont) • Recluster points • Insert each point into a suitable cluster, i.e., one with ReconDist(P, S) ≤ MaxReconDist, or into the outlier set O
LDR - Clustering Algo (cont) • Finally, apply the size bound to eliminate clusters with too small a population; redistribute their points to other clusters or to the outlier set O (see the sketch below).
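The sketch below is a hedged illustration of the per-cluster reduction step just described, assuming the spatial clusters have already been formed; the parameter names (max_recon_dist, max_dim) and the fraction-of-points acceptance rule are illustrative stand-ins for the constraints in [5], not the exact algorithm.

```python
# Illustrative LDR per-cluster step: local PCA + reconstruction-distance check.
# (Assumption: clusters already exist; `frac` is a made-up acceptance fraction.)
import numpy as np

def recon_dist(point, centroid, basis, kept_dims):
    """Reconstruction distance: energy left in the dimensions that are dropped."""
    local = (point - centroid) @ basis          # coordinates in the cluster's PC frame
    return np.linalg.norm(local[kept_dims:])

def reduce_cluster(cluster_points, max_recon_dist, max_dim, frac=0.9):
    centroid = cluster_points.mean(axis=0)
    cov = np.cov(cluster_points - centroid, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    basis = eigvecs[:, np.argsort(eigvals)[::-1]]   # principal components, largest first
    # Smallest subspace dimensionality (<= max_dim) for which enough points
    # satisfy the reconstruction-distance bound.
    for k in range(1, max_dim + 1):
        ok = sum(recon_dist(p, centroid, basis, k) <= max_recon_dist
                 for p in cluster_points)
        if ok >= frac * len(cluster_points):
            return centroid, basis[:, :k]
    return None   # no acceptable subspace: the points would go to the outlier set O
```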
LDR - Compared to GDR • LDR improves retrieval efficiency and effectiveness by capturing more detail of the local data. • However, it incurs a higher computational cost during the reduction step.
LDR • LDR cannot discover all the possible correlated clusters.
Indexing RDS • GDR • One RDS only • Applying existing multi-dim indexing structure, e.g. R-tree, M-Tree… • LDR • Several RDS in different axis systems • Global Indexing Structure
Global Indexing • Each RDS corresponds to one tree.
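A rough sketch of what a global index over several RDSs could look like, assuming one entry per cluster subspace plus an outlier set; a brute-force scan stands in for the per-RDS multidimensional index (e.g., an M-tree), and all class and method names are made up for illustration.

```python
# Illustrative global index over multiple reduced-dimensionality spaces (RDSs).
import numpy as np

class GlobalIndex:
    def __init__(self):
        self.clusters = []     # (centroid, basis, reduced points); one tree per RDS in practice
        self.outliers = []     # outlier set O, kept in the original space

    def add_cluster(self, centroid, basis, points):
        reduced = [(p - centroid) @ basis for p in points]
        self.clusters.append((centroid, basis, reduced))

    def query_nn(self, q):
        """Map the query into each cluster's subspace and search there (approximate)."""
        best = (np.inf, None)
        for centroid, basis, reduced in self.clusters:
            q_local = (q - centroid) @ basis
            for r in reduced:
                best = min(best, (float(np.linalg.norm(q_local - r)), r), key=lambda t: t[0])
        for p in self.outliers:
            best = min(best, (float(np.linalg.norm(q - p)), p), key=lambda t: t[0])
        return best
```

One design note: distances computed in a PCA subspace lower-bound the true distances, so a real implementation can use them to prune candidates before verifying the survivors in full dimensionality.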
Side Effects of DR • Information loss -> Lower precision • Possible Improvement? • Text Domain • DR -> qualitative improvement • Least information loss -> highest precision -> Highest qualitative improvement
Side Effects of DR • Latent Semantic Indexing (LSI) [9,10,11] • Its SVD factors U and V are used to measure similarity for documents and similarity/correlation for terms
Side Effects of DR • DR effectively improves the data representation by describing the data in terms of concepts rather than individual words. • Keeping the directions with the greatest variance preserves the semantic aspects of the data.
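As a hedged illustration of LSI, the sketch below factorizes a tiny, made-up term-document matrix with a truncated SVD; the matrix values and the rank k are arbitrary.

```python
# Illustrative LSI: truncated SVD of a small term-document matrix (made-up data).
import numpy as np

# Rows = terms, columns = documents (e.g., raw term frequencies).
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # number of latent "concepts" kept
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

term_vectors = U_k * s_k                # terms represented in concept space
doc_vectors = Vt_k.T * s_k              # documents represented in concept space

# Document-document similarity computed in the reduced concept space.
d0, d1 = doc_vectors[0], doc_vectors[1]
cos_sim = d0 @ d1 / (np.linalg.norm(d0) * np.linalg.norm(d1))
print(round(float(cos_sim), 3))
```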
Side Effects of DR • Dependency among attributes results in poor measurements when using L-norm metrics. • Dimensions with the largest eigenvalues have the highest quality [2]. • So what else do we have to consider? Inter-correlations.
Mahalanobis Distance • Normalized Mahalanobis Distance
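A small sketch of the two distances named on this slide: the Mahalanobis form is the standard definition, while the dimensionality-normalized variant in normalized_mahalanobis is one common convention and is an assumption here rather than necessarily the exact formula from the slide.

```python
# Mahalanobis distance and a dimensionality-normalized variant (the latter is an assumption).
import numpy as np

def mahalanobis(x, y, cov):
    """d(x, y) = sqrt((x - y)^T * cov^-1 * (x - y))."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def normalized_mahalanobis(x, y, cov):
    """One common normalization: divide the squared distance by the dimensionality."""
    diff = x - y
    return float(np.sqrt((diff @ np.linalg.inv(cov) @ diff) / len(x)))

# Example: distance of a point from a cluster centroid using that cluster's covariance,
# so the elliptical (correlation-aware) shape of the cluster is taken into account.
cluster = np.random.rand(200, 3)
cov = np.cov(cluster, rowvar=False)
print(mahalanobis(cluster[0], cluster.mean(axis=0), cov))
```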
Mahalanobis vs. L-norm • Takes the local shape into consideration by using variance and covariance. • Tends to group points into elliptical clusters, which define a multi-dim space whose boundaries determine the range of correlation suitable for dimensionality reduction. • Defines the standard-deviation boundary of the cluster.
Incremental Ellipse • Aims to discover all possible correlated clusters, of different sizes, densities and elongations.
Behaviors of Distance Metrics in High-dim Space • Is KNN meaningful in high-dim space? [1] • The ratio of the farthest-neighbor to the nearest-neighbor distance is almost 1 -> poor discrimination [4] • One criterion is the relative contrast: (Dmax - Dmin) / Dmin
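A short sketch of the relative-contrast criterion, assuming the form (Dmax - Dmin) / Dmin discussed around [1, 4]; the uniform random data and the specific dimensionalities are chosen only for illustration.

```python
# Relative contrast (Dmax - Dmin) / Dmin for L_k norms on uniform random data.
import numpy as np

def relative_contrast(data, query, k):
    dists = np.sum(np.abs(data - query) ** k, axis=1) ** (1.0 / k)   # L_k distances
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)
for dim in (2, 10, 100):
    data = rng.random((1000, dim))
    query = rng.random(dim)
    print(dim,
          round(relative_contrast(data, query, k=1.0), 2),   # L1
          round(relative_contrast(data, query, k=2.0), 2))   # L2
# The contrast shrinks as the dimensionality grows, and shrinks faster for the larger norm parameter.
```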
Behaviors of Distance Metrics in High-dim Space • Relative contrast on different dimensionalities for different metrics
Behaviors of Distance Metrics in High-dim Space • Relative contrast for L-norm metrics
Behaviors of Distance Metrics in High-dim Space • For higher dimensionalities, the relative contrast provided by a norm with a smaller parameter is more likely to dominate that of a norm with a larger parameter. • So L-norm metrics with smaller parameters are a better choice for KNN search in high-dim space.
Conclusion • Two Dimensionality Reduction Methods • GDR • LDR • Indexing Methods • Existing Structures • Global Indexing Structure • Side Effects of DR • Qualitative improvement • Consider both intra-variance and inter-correlations • Different behaviors for different metrics • A smaller norm parameter k achieves higher quality
Future Work • Propose a new tree for truly high-dimensional indexing without reduction, for datasets without correlations? • (Beneath iDistance, further prune the search sphere using the LB-tree)? • Reduce the dimensionality of data points that combine multiple features, e.g. images (shape, color, texture, etc.).
References • [1] Charu C. Aggarwal, Alexander Hinneburg, Daniel A. Keim: On the Surprising Behavior of Distance Metrics in High Dimensional Spaces. ICDT 2001: 420-434 • [2] Charu C. Aggarwal: On the Effects of Dimensionality Reduction on High Dimensional Similarity Search. PODS 2001 • [3] Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim: What Is the Nearest Neighbor in High Dimensional Spaces? VLDB 2000: 506-515 • [4] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft: When Is Nearest Neighbors Meaningful? ICDT 1999 • [5] K. Chakrabarti and S. Mehrotra: Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. VLDB 2000: 89-100 • [6] R. Weber, H. Schek, and S. Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205 • [7] C. Yu, B. C. Ooi, K.-L. Tan, and H. V. Jagadish: Indexing the Distance: An Efficient Method to KNN Processing. VLDB 2001
References • [8] K. V. R. Kanth, D. Agrawal, and A. K. Singh: Dimensionality Reduction for Similarity Searching in Dynamic Databases. SIGMOD 1998 • [9] Jon M. Kleinberg, Andrew Tomkins: Applications of Linear Algebra in Information Retrieval and Hypertext Analysis. PODS 1999: 185-193 • [10] Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala: Latent Semantic Indexing: A Probabilistic Analysis. PODS 1998: 159-168 • [11] Chris H. Q. Ding: A Similarity-Based Probability Model for Latent Semantic Indexing. SIGIR 1999: 59-65 • [12] Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim: What Is the Nearest Neighbor in High Dimensional Spaces? VLDB 2000: 506-515