
High-dimensional Indexing based on Dimensionality Reduction



Presentation Transcript


  1. High-dimensional Indexing based on Dimensionality Reduction Students: Qing Chen, Heng Tao Shen, Sun Ji Chun Advisor: Professor Beng Chin Ooi

  2. Outline • Introduction • Global Dimensionality Reduction • Local Dimensionality Reduction • Indexing Reduced-Dim Space • Effects of Dimensionality Reduction • Behaviors of Distance Metrics • Conclusion and Future Work

  3. Introduction • High-Dim Applications: • Multimedia, time-series, scientific, market basket, etc. • Various Trees Proposed: • R-tree, R*, R+, X, Skd, SS, M, KDB, TV, Buddy, Grid File, Hybrid, iDistance, etc. • Dimensionality Curse • Efficiency drops quickly as dim increases.

  4. Introduction • Dimensionality Reduction Techniques • GDR • LDR • High-Dim Indexing on RDS • Existing indexing on a single RDS • Global indexing on multiple RDSs • Side Effects of DR • Different Behaviors of Distance Metrics • Conclusion & Future Work

  5. GDR • Perform Reduction on the whole dataset.

  6. GDR • Improve query accuracy by performing principal component analysis (PCA)
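
A minimal sketch of PCA-based GDR in Python (NumPy only): project the whole dataset onto its top-k principal components. The function name `gdr_pca` and the choice of k are illustrative, not from the slides.

```python
import numpy as np

def gdr_pca(X, k):
    """Global Dimensionality Reduction via PCA: project every point
    onto the k directions of greatest variance of the whole dataset."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k]                                     # top-k principal components
    return Xc @ V.T, mean, V                       # reduced data + mapping info

# Example: reduce 64-dim points to 8 dims
X = np.random.rand(1000, 64)
X_red, mean, V = gdr_pca(X, k=8)
```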

  7. GDR • Using Aggregate Data for Reduction in Dynamic Spaces [8].

  8. GDR • Works well for globally correlated data. • May cause significant information loss on real data.

  9. LDR [5] • Find locally correlated data clusters • Perform dimensionality reduction on the clusters individually

  10. LDR - Definitions • Cluster and subspace • Reconstruction Distance

  11. LDR - Constraints on cluster • Reconstruction distance bound, i.e., MaxReconDist • Dimensionality bound, i.e., MaxDim • Size bound, i.e., MinSize

  12. LDR - Clustering Algo • Construct spatial clusters • Determine the max number of clusters: M • Determine the cluster range: ε • Choose a set of well-scattered points as the centroids (C) of each spatial cluster • Assign each data point P by the test Distance(P, C_closest) <= ε • Update the centroids of the clusters (see the sketch below)
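
A sketch of this spatial-clustering step, assuming a farthest-point heuristic for picking well-scattered centroids (the paper's exact procedure may differ):

```python
import numpy as np

def spatial_clusters(X, M, eps, seed=0):
    """LDR step 1 sketch: pick M well-scattered centroids, assign each
    point within eps of its closest centroid, update the centroids."""
    rng = np.random.default_rng(seed)
    cents = [X[rng.integers(len(X))]]
    for _ in range(M - 1):                         # farthest-point heuristic
        d = np.min([np.linalg.norm(X - c, axis=1) for c in cents], axis=0)
        cents.append(X[np.argmax(d)])
    cents = np.array(cents)

    d = np.linalg.norm(X[:, None] - cents[None], axis=2)
    closest = d.argmin(axis=1)                     # index of closest centroid
    near = d[np.arange(len(X)), closest] <= eps    # Distance(P, C_closest) <= eps
    clusters = [X[near & (closest == j)] for j in range(M)]

    # update each centroid to the mean of its assigned points
    cents = np.array([c.mean(axis=0) if len(c) else cents[j]
                      for j, c in enumerate(clusters)])
    return clusters, cents
```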

  13. LDR - Clustering Algo (cont) • Compute principal components (PCs) • Perform PCA individually on each cluster • Compute the mean of each cluster's points, i.e., E_i • Determine the subspace dimensionality • Progressively check each point against MaxReconDist and MaxDim • Decide the optimal dimensionality for each cluster (sketch below)
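
A minimal sketch of choosing one cluster's subspace dimensionality. The all-points test below is a simplification; the paper [5] allows a fraction of points to violate the bound.

```python
import numpy as np

def cluster_subspace(C, max_dim, max_recon_dist):
    """Per-cluster PCA, then the smallest k <= MaxDim whose
    reconstruction distance stays within MaxReconDist."""
    mean = C.mean(axis=0)                          # E_i, the cluster mean
    _, _, Vt = np.linalg.svd(C - mean, full_matrices=False)
    for k in range(1, max_dim + 1):
        V = Vt[:k]
        recon = C - mean - ((C - mean) @ V.T) @ V  # residual off the subspace
        if np.all(np.linalg.norm(recon, axis=1) <= max_recon_dist):
            return k, V                            # optimal dimensionality found
    return max_dim, Vt[:max_dim]
```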

  14. LDR - Clustering Algo (cont) • Recluster points • Insert each point into a suitable cluster, i.e., one with ReconDist(P, S) <= MaxReconDist, or into the outlier set O (sketch below)
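
A sketch of the reclustering step under the same assumptions; each cluster object carries the mean and subspace found in the previous step:

```python
import numpy as np

def recluster(points, clusters, max_recon_dist):
    """Each point joins the first cluster whose subspace represents it
    within MaxReconDist; everything else goes to the outlier set O."""
    O = []
    for p in points:
        for c in clusters:                         # c: {"mean", "V", "members"}
            r = p - c["mean"]
            resid = r - (r @ c["V"].T) @ c["V"]    # ReconDist(P, S)
            if np.linalg.norm(resid) <= max_recon_dist:
                c["members"].append(p)
                break
        else:
            O.append(p)                            # no cluster fits: outlier
    return O
```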

  15. LDR - Clustering Algo (cont) • Finally, apply the size bound to eliminate clusters whose population is too small (< MinSize), redistributing their points to other clusters or to the set O.

  16. LDR - Compare to GDR • LDR improves retrieval efficiency and effectiveness by capturing more detail in the local data. • But it incurs a higher computational cost during the reduction steps.

  17. LDR • LDR cannot discover all the possible correlated clusters.

  18. Indexing RDS • GDR • One RDS only • Apply an existing multi-dim indexing structure, e.g., R-tree, M-tree… • LDR • Several RDSs in different axis systems • Global indexing structure

  19. Global Indexing • Each RDS corresponds to one tree (see the sketch below).
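
A rough sketch of how a global index could answer a KNN query across multiple RDSs. The per-tree search is abstracted to a scan here, and all names are ours, not from the slides:

```python
import numpy as np

def global_knn(q, clusters, outliers, k):
    """One 'tree' per RDS: search every cluster in its own reduced
    space (abstracted to a scan here), then rerank candidates by
    exact high-dimensional distance; the outlier set O is scanned."""
    cand = []
    for c in clusters:                              # c: {"mean", "V", "points"}
        pts = np.asarray(c["points"])
        q_red = (q - c["mean"]) @ c["V"].T          # query mapped into this RDS
        d_red = np.linalg.norm((pts - c["mean"]) @ c["V"].T - q_red, axis=1)
        # the reduced distance lower-bounds the true distance; a real
        # search prunes with it, here we just keep the k reduced-best
        cand.extend(pts[i] for i in np.argsort(d_red)[:k])
    cand.extend(np.asarray(p) for p in outliers)    # outliers: no reduction
    cand.sort(key=lambda p: np.linalg.norm(q - p))  # exact rerank
    return cand[:k]
```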

  20. Side Effects of DR • Information loss -> lower precision • Possible improvement? • Text domain: DR -> qualitative improvement • Least information loss -> highest precision -> highest qualitative improvement

  21. Side Effects of DR • Latent Semantic Indexing (LSI) [9,10,11] • U: similarity for documents • V: similarity for terms & correlation

  22. Side Effects of DR • DR effectively improves the data representation by capturing the data in terms of concepts rather than words. • Keeping the directions of greatest variance draws out the semantic aspects of the data.

  23. Side Effects of DR • Dependency among attributes results in poor measurements when using L-norm metrics. • Dimensions with the largest eigenvalues = highest quality [2]. • So what else do we have to consider? Inter-correlations.

  24. Mahalanobis Distance • Normalized Mahalanobis Distance (formulas reconstructed below)
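
The slide's equations were images and did not survive the transcript. The standard Mahalanobis distance, presumably what was shown, is

$$d_M(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)},$$

where $\Sigma$ is the covariance matrix of the data. The normalized variant here is an assumption on our part; one common form scales by the dimensionality $d$:

$$d_{NM}(x, y) = \sqrt{\tfrac{1}{d}\,(x - y)^\top \Sigma^{-1} (x - y)}.$$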

  25. Mahalanobis vs. L-norm

  26. Mahalanobis vs. L-norm • Takes local shape into consideration by computing variance and covariance. • Tends to group points into elliptical clusters; these define a multi-dim space whose boundaries determine the range of correlation suitable for dim reduction. • Defines the standard-deviation boundary of the cluster.

  27. Incremental Ellipse • Aims to discover all the possible correlated clusters with different sizes, densities, and elongations.

  28. Behaviors of Distance Metrics in High-dim Space • Is KNN meaningful in high-dim space? [1] • The farthest-neighbor/nearest-neighbor distance ratio is almost 1 -> poor discrimination [4] • One criterion is the relative contrast (reconstructed below):
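
The formula itself was an image; the usual definition, following [1,4], is

$$\text{relative contrast} = \frac{D_{\max} - D_{\min}}{D_{\min}},$$

where $D_{\max}$ and $D_{\min}$ are the distances from a query point to its farthest and nearest data points.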

  29. Behaviors of Distance Metrics in High-dim Space • Relative contrast on different dimensionalities for different metrics

  30. Behaviors of Distance Metrics in High-dim Space • Relative Contrast on L-norm Metrics

  31. Behaviors of Distance Metrics in High-dim Space • For higher dimensionality, the relative contrast provided by a norm with a smaller parameter is more likely to dominate that of a norm with a larger parameter. • So an L-norm metric with a smaller parameter is the better choice for KNN search in high-dim space (see the experiment sketch below).
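
A quick empirical check of this claim, a sketch using uniform random data rather than the papers' exact experimental setup:

```python
import numpy as np

def relative_contrast(dim, p, n=5000, seed=0):
    """(Dmax - Dmin) / Dmin for n uniform random points around a
    random query, under the L_p norm."""
    rng = np.random.default_rng(seed)
    X, q = rng.random((n, dim)), rng.random(dim)
    d = np.sum(np.abs(X - q) ** p, axis=1) ** (1.0 / p)
    return (d.max() - d.min()) / d.min()

for dim in (2, 20, 200):
    print(dim, [round(relative_contrast(dim, p), 3) for p in (1, 2, 4)])
# Contrast shrinks as dim grows, and shrinks faster for larger p,
# matching the slide: smaller-parameter norms discriminate better.
```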

  32. Conclusion • Two Dimensionality Reduction Methods • GDR • LDR • Indexing Methods • Existing Structures • Global Indexing Structure • Side Effects of DR • Qualitative improvement • Both intra-variance and inter-correlation • Different behaviors for different metrics • A smaller norm parameter achieves higher quality

  33. Future work • Propose a new tree for truly high-dimensional indexing, without reduction, for datasets without correlations? • (Building on iDistance, further prune the search sphere using the LB-tree?) • Reduce the dimensionality of data points that combine multiple features, such as images (shape, color, texture, etc.).

  34. References • [1]: Charu C. Aggarwal, Alexander Hinneburg, Daniel A. Keim: On the Surprising Behavior of Distance Metrics in High Dimensional Spaces. ICDT 2001: 420-434 • [2]: Charu C. Aggarwal: On the Effects of Dimensionality Reduction on High Dimensional Similarity Search. PODS 2001 • [3]: Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim: What Is the Nearest Neighbor in High Dimensional Spaces? VLDB 2000: 506-515 • [4]: K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft: When Is Nearest Neighbor Meaningful? ICDT 1999 • [5]: K. Chakrabarti, S. Mehrotra: Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. VLDB 2000: 89-100 • [6]: R. Weber, H. Schek, S. Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205 • [7]: C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish: Indexing the Distance: An Efficient Method to KNN Processing. VLDB 2001

  35. References • [8]: K. V. R. Kanth, D. Agrawal, A. K. Singh: Dimensionality Reduction for Similarity Searching in Dynamic Databases. SIGMOD 1998 • [9]: Jon M. Kleinberg, Andrew Tomkins: Applications of Linear Algebra in Information Retrieval and Hypertext Analysis. PODS 1999: 185-193 • [10]: Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala: Latent Semantic Indexing: A Probabilistic Analysis. PODS 1998: 159-168 • [11]: Chris H. Q. Ding: A Similarity-Based Probability Model for Latent Semantic Indexing. SIGIR 1999: 59-65 • [12]: Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim: What Is the Nearest Neighbor in High Dimensional Spaces? VLDB 2000: 506-515
