
Finding Local Correlations in High Dimensional Data


Presentation Transcript


  1. Finding Local Correlations in High Dimensional Data USTC Seminar Xiang Zhang Case Western Reserve University

  2. Finding Latent Patterns in High Dimensional Data
  • An important research problem with wide applications
    • biology (gene expression analysis, genotype-phenotype association studies)
    • customer transaction analysis, and so on
  • Common approaches
    • feature selection
    • feature transformation
    • subspace clustering

  3. Existing Approaches
  • Feature selection
    • finds a single representative subset of features that is most relevant to the data mining task at hand
  • Feature transformation
    • finds a set of new (transformed) features that retain as much of the information in the original data as possible
    • Principal Component Analysis (PCA)
  • Correlation clustering
    • finds clusters of data points that may not exist in the axis-parallel subspaces but only in projected subspaces

  4. Motivation Example
  (figure: linearly correlated genes)
  Question: how can we find these local linear correlations using existing methods?

  5. Applying PCA — Correlated?
  • PCA is an effective way to determine whether a set of features is strongly correlated
    • a few eigenvectors describe most of the variance in the dataset
    • only a small amount of variance is represented by the remaining eigenvectors
    • small residual variance indicates strong correlation
  • PCA is a global transformation applied to the entire dataset
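
To make the residual-variance idea concrete, here is a minimal sketch (not from the slides; the helper name, thresholds, and example data are illustrative) that runs PCA on a feature set and reports how little variance is left in the trailing eigenvectors:

```python
# Check how much variance remains in the k smallest-eigenvalue directions.
# A small residual fraction suggests a strong linear correlation.
import numpy as np

def residual_variance_ratio(X, k):
    """Fraction of total variance captured by the k smallest eigenvalues
    of the covariance matrix of X (rows = points, columns = features)."""
    cov = np.cov(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)          # ascending order
    return eigvals[:k].sum() / eigvals.sum()

# Example: three features lying close to the plane x1 - x2 + x3 = 0.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 500))
X = np.column_stack([a, a + b, b]) + 0.01 * rng.normal(size=(500, 3))
print(residual_variance_ratio(X, k=1))        # close to 0 => strongly correlated
```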

  6. Applying PCA – Representation?
  • The linear correlation is represented by the hyperplane orthogonal to the eigenvectors with the minimum variances, e.g., [1, -1, 1]
  (figures: the embedded linear correlations; the linear correlations re-established by full-dimensional PCA)

  7. Applying Bi-clustering or Correlation Clustering Methods
  (figure: linearly correlated genes)
  • Correlation clustering
    • no obvious clustering structure
  • Bi-clustering
    • no strong pair-wise correlations

  8. Revisiting Existing Work
  • Feature selection
    • finds only one representative subset of features
  • Feature transformation
    • applies a single feature transformation to the entire dataset
    • does not really eliminate the impact of any original attributes
  • Correlation clustering
    • projected subspaces are usually found by applying a standard feature transformation method, such as PCA

  9. Local Linear Correlations - Formalization
  • Idea: formalize local linear correlations as strongly correlated feature subsets
  • Determining whether a feature subset is correlated
    • small residual variance
  • The correlation may not be supported by all data points (noise, domain knowledge, ...)
    • it should be supported by a large portion of the data points

  10. Problem Formalization
  • Let F (m × n) be a submatrix of the dataset D (M × N)
  • Let {λ_1, ..., λ_n} be the eigenvalues of the covariance matrix of F, arranged in ascending order
  • F is a strongly correlated feature subset if
    (1) (λ_1 + ... + λ_k) / (λ_1 + ... + λ_n) ≤ ε, i.e., the variance on the k eigenvectors having the smallest eigenvalues (the residual variance) is a small fraction of the total variance, and
    (2) m / M ≥ δ, i.e., the number of supporting data points is a large fraction of the total number of data points
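
A sketch of the check this slide formalizes, under the stated notation (the function name and parameter names are illustrative, not the authors' code):

```python
# F (m x n) is a strongly correlated feature subset if the k smallest
# eigenvalues of its covariance matrix account for at most an epsilon
# fraction of the total variance, and F is supported by at least a
# delta fraction of the M data points.
import numpy as np

def is_strongly_correlated(F, M, k, eps, delta):
    m, n = F.shape                                     # m supporting points, n features
    lam = np.linalg.eigvalsh(np.cov(F, rowvar=False))  # eigenvalues, ascending
    residual_ratio = lam[:k].sum() / lam.sum()         # condition (1)
    support_ratio = m / M                              # condition (2)
    return residual_ratio <= eps and support_ratio >= delta
```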

  11. Problem Formalization
  • Let F (m × n) be a submatrix of the dataset D (M × N)
  • k and ε together control the strength of the correlation
    • larger k, stronger correlation
    • smaller ε, stronger correlation
  (figure: eigenvalues plotted against eigenvalue id)

  12. Goal
  • Goal: find all strongly correlated feature subsets
  • Enumerate all sub-matrices?
    • not feasible (2^M × 2^N sub-matrices in total)
    • an efficient algorithm is needed
  • Any property we can use?
    • monotonicity of the objective function

  13. Monotonicity
  • Monotonic w.r.t. the feature subset
    • if a feature subset is strongly correlated, all its supersets are also strongly correlated
    • derived from the Interlacing Eigenvalue Theorem (illustrated below)
  • Allows us to focus on finding the smallest feature subsets that are strongly correlated
  • Enables an efficient algorithm: no exhaustive enumeration needed
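
A small numerical illustration (not a proof) of the Cauchy interlacing property that this monotonicity argument rests on: the eigenvalues of a principal submatrix of a symmetric matrix interlace those of the full matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
A = A @ A.T                          # symmetric positive semi-definite matrix
B = A[:5, :5]                        # principal submatrix (one feature removed)

lam_A = np.linalg.eigvalsh(A)        # ascending: lam_A[0] <= ... <= lam_A[5]
lam_B = np.linalg.eigvalsh(B)        # ascending: lam_B[0] <= ... <= lam_B[4]

# Interlacing: lam_A[i] <= lam_B[i] <= lam_A[i+1] for every i
print(all(lam_A[i] <= lam_B[i] <= lam_A[i + 1] for i in range(5)))  # True
```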

  14. The CARE Algorithm
  • Selecting the feature subsets (sketched below)
    • enumerate feature subsets from smaller size to larger size (DFS or BFS)
    • if a feature subset is strongly correlated, its supersets are pruned (monotonicity of the objective function)
    • further pruning is possible
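
A minimal sketch of the bottom-up enumeration with monotonicity pruning; the correlation test is passed in as a black-box `check` function, and all names are illustrative rather than the authors' implementation:

```python
from itertools import combinations

def smallest_strongly_correlated_subsets(D, k, check):
    """check(D, features) -> bool; returns the minimal qualifying feature subsets."""
    n_features = D.shape[1]
    results = []
    for size in range(k + 1, n_features + 1):       # smaller sizes first (BFS order)
        for feats in combinations(range(n_features), size):
            # prune: skip supersets of an already reported subset (monotonicity)
            if any(set(r).issubset(feats) for r in results):
                continue
            if check(D, list(feats)):
                results.append(feats)
    return results
```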

  15. Monotonicity
  • Non-monotonic w.r.t. the point subset
    • adding (or deleting) a point can increase or decrease the correlation among the features
  • Exhaustive enumeration is infeasible; an effective heuristic is needed

  16. The CARE Algorithm
  • Selecting the point subsets
    • a feature subset may only be correlated on a subset of the data points
    • if a feature subset is not strongly correlated on all data points, how do we choose the proper point subset?

  17. The CARE Algorithm
  • Successive point deletion heuristic (sketched below)
    • greedy algorithm: in each iteration, delete the point whose removal yields the maximum increase in the correlation among the subset of features
    • inefficient: the objective function must be evaluated for every remaining data point in each iteration
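
A sketch of the greedy heuristic, reusing the residual_variance_ratio helper sketched earlier (assumed to be in scope); this illustrates the idea and is not the authors' code:

```python
import numpy as np

def successive_deletion(F, k, eps, delta):
    """Greedily delete points until F restricted to the kept points is
    strongly correlated, or too few supporting points would remain."""
    M = F.shape[0]
    keep = np.arange(M)
    while residual_variance_ratio(F[keep], k) > eps:
        if len(keep) - 1 < delta * M:
            return None          # cannot delete more points and stay supported
        # evaluate the objective with each remaining point left out
        scores = [residual_variance_ratio(F[np.delete(keep, i)], k)
                  for i in range(len(keep))]
        keep = np.delete(keep, int(np.argmin(scores)))   # best single deletion
    return keep                  # indices of the supporting points
```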

  18. The CARE Algorithm
  • Distance-based point deletion heuristic (sketched below)
    • let S1 be the subspace spanned by the k eigenvectors with the smallest eigenvalues
    • let S2 be the subspace spanned by the remaining n-k eigenvectors
    • intuition: try to reduce the variance in S1 as much as possible while retaining the variance in S2
    • directly delete the (1-δ)M points having large variance in S1 and small variance in S2 (refer to the paper for details)
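
A sketch of the distance-based heuristic with one plausible per-point score (the exact criterion is in the paper); S1 and S2 follow the definitions above, and the names are illustrative:

```python
import numpy as np

def distance_based_deletion(F, k, delta):
    """Keep the delta*M points contributing least to the variance in S1
    relative to S2; drop the rest in a single pass."""
    M, n = F.shape
    X = F - F.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))   # ascending eigenvalues
    S1, S2 = eigvecs[:, :k], eigvecs[:, k:]                # weak / strong directions
    var_in_S1 = np.sum((X @ S1) ** 2, axis=1)
    var_in_S2 = np.sum((X @ S2) ** 2, axis=1)
    score = var_in_S1 / (var_in_S2 + 1e-12)                # large => candidate to drop
    keep_count = int(np.ceil(delta * M))
    return np.argsort(score)[:keep_count]                  # indices of kept points
```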

  19. The CARE Algorithm
  (figure: a comparison between the two point deletion heuristics, successive vs. distance-based)

  20. Experimental Results (Synthetic)
  (figures: the embedded linear correlation vs. the linear correlation re-established by full-dimensional PCA and by CARE)

  21. Experimental Results (Synthetic)
  (figures: the embedded linear correlation (hyperplane representation); the pair-wise correlations)

  22. Experimental Results (Synthetic) Scalability evaluation

  23. Experimental Results (Wage)
  (figures: a comparison between a correlation clustering method and CARE; correlations found by both methods vs. found by CARE only)
  Dataset (534 × 11): http://lib.stat.cmu.edu/datasets/CPS_85_Wages

  24. Experimental Results
  Linearly correlated genes found by CARE (hyperplane representations; 220 genes for 42 mouse strains):
  • Hspb2: cellular physiological process
  • 2810453L12Rik: cellular physiological process
  • 1010001D01Rik: cellular physiological process
  • P213651: N/A
  • Nrg4: cell part
  • Myh7: cell part; intracellular part
  • Hist1h2bk: cell part; intracellular part
  • Arntl: cell part; intracellular part
  • Nrg4: integral to membrane
  • Olfr281: integral to membrane
  • Slco1a1: integral to membrane
  • P196867: N/A
  • Oazin: catalytic activity
  • Ctse: catalytic activity
  • Mgst3: catalytic activity
  • Mgst3: catalytic activity; intracellular part
  • Nr1d2: intracellular part; metal ion binding
  • Ctse: catalytic activity
  • Pgm3: metal ion binding
  • Ldb3: intracellular part
  • Sec61g: intracellular part
  • Exosc4: intracellular part
  • BC048403: N/A
  • Ptk6: membrane
  • Gucy2g: integral to membrane
  • Clec2g: integral to membrane
  • H2-Q2: integral to membrane
  • Hspb2: cellular metabolism
  • Sec61b: cellular metabolism
  • Gucy2g: cellular metabolism
  • Sdh1: cellular metabolism

  25. An example

  26. An example
  (figures: result of applying PCA; result of applying ISOMAP)

  27. Finding local correlations
  • Dimension reduction
    • performs a single feature transformation for the entire dataset
  • To find local correlations
    • first: identify the correlated feature subspaces
    • then: apply dimension reduction methods to uncover the low-dimensional structure
  • Dimension reduction addresses the second step; our focus is the first

  28. Finding local correlations
  • Challenges
    • modeling subspace correlations: measurements of pair-wise correlations may not suffice
    • searching algorithm: exhaustive enumeration is too time-consuming

  29. Modeling correlated subspaces
  • Intrinsic dimensionality (ID)
    • the minimum number of free variables required to define the data without any significant information loss
  • Correlation dimension as the ID estimator (sketched below)
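
A rough sketch of a correlation-dimension estimate (illustrative, not necessarily the exact estimator used in the paper): count the fraction of point pairs within radius r and fit the slope of log C(r) against log r.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, radii):
    d = pdist(X)                                   # all pairwise distances
    log_c = [np.log(np.mean(d < r)) for r in radii]
    slope, _ = np.polyfit(np.log(radii), log_c, 1)
    return slope                                   # estimated intrinsic dimension

# Example: points on a 1-D curve embedded in 3-D should give a slope near 1.
t = np.random.default_rng(2).uniform(0, 1, 2000)
X = np.column_stack([t, np.sin(4 * t), np.cos(4 * t)])
print(correlation_dimension(X, radii=np.logspace(-2, -0.5, 10)))
```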

  30. Modeling correlated subspaces
  • Strong correlation
    • subspace V and feature f_a have a strong correlation if ...
  • Redundancy
    • feature f_vi in subspace V is redundant if ...

  31. Modeling correlated subspaces
  • Reducible Subspace and Core Space
    • subspace Y is reducible if there exists a subspace V of Y such that (1) all features in Y are strongly correlated with V, and (2) V is non-redundant
    • V is the core space of Y, and Y is reducible to V
    • the core space is the smallest non-redundant subspace of Y with which all other features in Y are strongly correlated

  32. Modeling correlated subspaces
  • Maximum reducible subspace
    • Y is a reducible subspace and V is its core space
    • Y is maximum if it includes all features that are strongly correlated with the core space V
  • Goal
    • find all maximum reducible subspaces in the full-dimensional space

  33. Finding reducible subspaces
  • General idea
    • first find the overall reducible subspace (OR), which is the union of all maximum reducible subspaces
    • then identify the individual maximum reducible subspaces (IR) from OR

  34. Finding OR
  • Property
    • suppose Y is a maximum reducible subspace with core space V; then any subspace U of Y with |U| = |V| is also a core space of Y
  • Let RF_{f_a} be the remaining features in the dataset after deleting f_a; then we have ...
  • A linear scan over all features in the dataset can find OR (sketched below)
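
A sketch of the linear scan; the strong-correlation test is left as a black-box parameter, and the reading that f_a belongs to OR exactly when it is strongly correlated with RF_{f_a} is an interpretation of this slide rather than a quote of the elided formula:

```python
def find_overall_reducible_subspace(features, strongly_correlated):
    """features: list of feature ids.
    strongly_correlated(subspace, f) -> bool is assumed to be, e.g., an
    intrinsic-dimensionality-based check as on the earlier slides."""
    OR = []
    for fa in features:                              # one pass over all features
        rest = [f for f in features if f != fa]      # RF_{f_a}
        if strongly_correlated(rest, fa):
            OR.append(fa)
    return OR
```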

  35. Finding Individual RS
  • Assumption
    • the maximum reducible subspaces are disjoint
  • Method (sketched below)
    • enumerate candidate core spaces from size 1 to |OR|
    • a candidate core space is a subset of OR
    • find the features that are strongly correlated with the candidate core space and remove them from OR
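
A sketch of the candidate-core-space enumeration under the disjointness assumption, again with the strong-correlation test as a black box; how candidates are validated in detail follows the paper, and all names here are illustrative:

```python
from itertools import combinations

def find_individual_reducible_subspaces(OR, strongly_correlated):
    """OR: overall reducible subspace (list of feature ids).
    strongly_correlated(core, f) -> bool is the assumed black-box test."""
    remaining, result = list(OR), []
    size = 1
    while remaining and size <= len(remaining):
        for core in combinations(remaining, size):
            members = [f for f in remaining
                       if f not in core and strongly_correlated(list(core), f)]
            if members:                          # a reducible subspace found
                subspace = list(core) + members
                result.append(subspace)
                remaining = [f for f in remaining if f not in subspace]
                break                            # rescan the reduced OR
        else:
            size += 1                            # no core of this size worked
    return result
```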

  36. Finding Individual RS
  • Determining whether a feature is strongly correlated with a candidate core space
    • ID-based method: quadratic in the number of data points
    • sampling-based method: sample some data points and count how many data points are distributed around them (see the paper for details)

  37. Experimental result A synthetic dataset consisting of 50 features with 3 RS

  38. Experimental result Efficiency evaluation on finding OR

  39. Experimental result Sampling vs. ID-based method for finding Individual RS

  40. Experimental result Reducible subspaces in the NBA dataset (from the ESPN website; 28 features for 200 players)

  41. Thank You!
