
Object Orie’d Data Analysis, Last Time


Presentation Transcript


  1. Object Orie’d Data Analysis, Last Time Distance Weighted Discrimination: • Revisit microarray data • Face Data • Outcomes Data • Simulation Comparison

  2. Twiddle ratios of subtypes

  3. Why not adjust by means? DWD robust against non-proportional subtypes… Mathematical Statistical Question: Is there mathematics behind this? (will answer next time…)

  4. Distance Weighted Discrim’n Maximal Data Piling

  5. HDLSS Discrim’n Simulations Main idea: Comparison of • SVM (Support Vector Machine) • DWD (Distance Weighted Discrimination) • MD (Mean Difference, a.k.a. Centroid) Linear versions, across dimensions

  6. HDLSS Discrim’n Simulations Conclusions: • Everything (sensible) is best sometimes • DWD often very near best • MD weak beyond Gaussian Caution about simulations (and examples): • Very easy to cherry pick best ones • Good practice in Machine Learning • “Ignore method proposed, but read paper for useful comparison of others”

  7. HDLSS Discrim’n Simulations Can we say more about: All methods come together in very high dimensions??? Mathematical Statistical Question: Mathematics behind this??? (will answer now)

  8. HDLSS Asymptotics Modern Mathematical Statistics: Based on asymptotic analysis I.e. uses limiting operations Almost always Occasional misconceptions: Indicates behavior for large samples Thus only makes sense for “large” samples Models phenomenon of “increasing data” So other flavors are useless???

  9. HDLSS Asymptotics Modern Mathematical Statistics: Based on asymptotic analysis Real Reasons: Approximation provides insights Can find simple underlying structure In complex situations Thus various flavors are fine: Even desirable! (find additional insights)

  10. HDLSS Asymptotics: Simple Paradoxes For $d$-dim’al Standard Normal dist’n $Z \sim N_d(0, I_d)$: Euclidean Distance to Origin (as $d \to \infty$): $\|Z\| = \sqrt{d} + O_p(1)$

  11. HDLSS Asymptotics: Simple Paradoxes As $d \to \infty$, data lie roughly on surface of sphere, with radius $\sqrt{d}$ - Yet origin is point of highest density??? - Paradox resolved by: density w. r. t. Lebesgue Measure (the volume of shells near radius $\sqrt{d}$ grows so fast that it overwhelms the density peak at 0)
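
A quick numerical check of this concentration (a minimal sketch; the dimensions and replication counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# For Z ~ N_d(0, I), the distance to the origin concentrates at sqrt(d):
# the mean grows like sqrt(d) while the spread stays near 1/sqrt(2).
for d in [2, 20, 200, 20000]:
    Z = rng.standard_normal((1000, d))    # 1000 draws from N_d(0, I)
    norms = np.linalg.norm(Z, axis=1)     # Euclidean distances to the origin
    print(f"d={d:6d}  sqrt(d)={np.sqrt(d):7.2f}  "
          f"mean={norms.mean():7.2f}  sd={norms.std():.3f}")
```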

  12. HDLSS Asymptotics: Simple Paradoxes For $d$-dim’al Standard Normal dist’n: $Z_1$ indep. of $Z_2$, Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$): $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$ Distance tends to non-random constant: $\sqrt{2d}$

  13. HDLSS Asymptotics: Simple Paradoxes Distance tends to non-random constant: $\sqrt{2d}$ Factor $\sqrt{2}$, since $\mathrm{Var}(Z_{1,j} - Z_{2,j}) = 2\,\mathrm{Var}(Z_{1,j})$ Can extend to $Z_1, \dots, Z_n$: all pairwise distances $\approx \sqrt{2d}$ Where do they all go??? (we can only perceive 3 dim’ns)
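
The same check for several points at once shows the near-regular-simplex geometry directly (a sketch; $n = 10$ and $d = 20000$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# n independent N_d(0, I) points: every pairwise distance is close to
# sqrt(2d), so the point cloud is nearly a regular simplex.
n, d = 10, 20000
Z = rng.standard_normal((n, d))
D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
pairs = D[np.triu_indices(n, k=1)]        # the n(n-1)/2 distinct pairs
print(f"sqrt(2d) = {np.sqrt(2 * d):.2f}")
print(f"pairwise distances in [{pairs.min():.2f}, {pairs.max():.2f}]")
```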

  14. HDLSS Asymptotics: Simple Paradoxes For $d$-dim’al Standard Normal dist’n: $Z_1$ indep. of $Z_2$, High dim’al Angles (as $d \to \infty$): $\mathrm{angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$ - Everything is orthogonal??? - Where do they all go??? (again our perceptual limitations) - Again 1st order structure is non-random
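
Numerically (a sketch along the same lines as above):

```python
import numpy as np

rng = np.random.default_rng(2)

# The angle between two independent N_d(0, I) vectors concentrates at
# 90 degrees, with fluctuations of order 1/sqrt(d).
for d in [2, 20, 200, 20000]:
    Z1 = rng.standard_normal((1000, d))
    Z2 = rng.standard_normal((1000, d))
    cos = (Z1 * Z2).sum(axis=1) / (
        np.linalg.norm(Z1, axis=1) * np.linalg.norm(Z2, axis=1))
    angles = np.degrees(np.arccos(cos))
    print(f"d={d:6d}  mean angle={angles.mean():6.2f}  sd={angles.std():6.3f}")
```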

  15. HDLSS Asy’s: Geometrical Represent’n Assume $n$ fixed, let $d \to \infty$ Study Subspace Generated by Data Hyperplane through 0, of dimension $n$ Points are “nearly equidistant to 0”, & dist $\approx \sqrt{d}$ Within plane, can “rotate towards $\sqrt{d}\,\times$ Unit Simplex” All Gaussian data sets are: “near Unit Simplex Vertices”!!! “Randomness” appears only in rotation of simplex Hall, Marron & Neeman (2005)

  16. HDLSS Asy’s: Geometrical Represent’n Assume $n$ fixed, let $d \to \infty$ Study Hyperplane Generated by Data $(n-1)$-dimensional hyperplane Points are pairwise equidistant, dist $\approx \sqrt{2d}$ Points lie at vertices of: “regular $n$-hedron” Again “randomness in data” is only in rotation Surprisingly rigid structure in data?

  17. HDLSS Asy’s: Geometrical Represent’n Simulation View: study “rigidity after rotation” Simple 3 point data sets In dimensions d = 2, 20, 200, 20000 Generate hyperplane of dimension 2 Rotate that to plane of screen Rotate within plane, to make “comparable” Repeat 10 times, use different colors

  18. HDLSS Asy’s: Geometrical Represent’n Simulation View: shows “rigidity after rotation” (figure)
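
The recipe of slide 17 can be reproduced along these lines (a sketch, assuming NumPy and matplotlib; normalizing each triangle by $\sqrt{d}$ is my choice for making the panels comparable):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

def plane_coords(X):
    """In-plane 2-d coordinates of the 3 points in X (a 3 x d array)."""
    C = X - X.mean(axis=0)                    # center the triangle
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    P = C @ Vt[:2].T                          # rotate its plane to the screen
    theta = np.arctan2(P[0, 1], P[0, 0])      # then rotate within the plane
    R = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    return P @ R.T                            # first vertex now on the x-axis

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, d in zip(axes, [2, 20, 200, 20000]):
    for _ in range(10):                       # 10 replications, new color each
        P = plane_coords(rng.standard_normal((3, d))) / np.sqrt(d)
        ax.plot(*np.vstack([P, P[:1]]).T)     # draw the closed triangle
    ax.set_title(f"d = {d}")
    ax.set_aspect("equal")
plt.show()
```

As $d$ grows, the ten triangles in a panel collapse onto one rigid shape, a regular triangle: the $n = 3$ case of the simplex rigidity above.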

  19. HDLSS Asy’s: Geometrical Represent’n Explanation of Observed (Simulation) Behavior: “everything similar for very high d ” 2 popn’s are 2 simplices (i.e. regular n-hedrons) All are same distance from the other class i.e. everything is a support vector i.e. all sensible directions show “data piling” so “sensible methods are all nearly the same” Including 1-NN

  20. HDLSS Asy’s: Geometrical Represent’n Straightforward Generalizations: non-Gaussian data: only need moments non-independent: use “mixing conditions” Mild Eigenvalue condition on Theoretical Cov. (Ahn, Marron, Muller & Chi, 2007) All based on simple “Laws of Large Numbers”

  21. 2nd Paper on HDLSS Asymptotics Ahn, Marron, Muller & Chi (2007) • Assume 2nd Moments • Assume no eigenvalues too large, in the sense: For eigenvalues $\lambda_1 \ge \cdots \ge \lambda_d \ge 0$, assume $\frac{\sum_{j=1}^d \lambda_j^2}{\big(\sum_{j=1}^d \lambda_j\big)^2} \to 0$, i.e. $\varepsilon$ well above $\tfrac{1}{d}$ (min possible) (much weaker than previous mixing conditions…)

  22. 2nd Paper on HDLSS Asymptotics Background: In classical multivariate analysis, the statistic $\varepsilon = \frac{\big(\sum_{j=1}^d \lambda_j\big)^2}{d \sum_{j=1}^d \lambda_j^2}$ is called the “epsilon statistic” And is used to test “sphericity” of dist’n, i.e. “are all cov’nce eigenvalues the same?”

  23. 2nd Paper on HDLSS Asymptotics Can show: epsilon statistic: Satisfies $\tfrac{1}{d} \le \varepsilon \le 1$: • For spherical Normal, $\varepsilon = 1$ • Single extreme eigenvalue gives $\varepsilon \approx \tfrac{1}{d}$ • So assumption ($\varepsilon$ well above $\tfrac{1}{d}$) is very mild • Much weaker than mixing conditions
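
In code, using the classical sphericity form of the statistic from the previous slide (a sketch; the spike value $d^2$ is an arbitrary illustrative choice):

```python
import numpy as np

def epsilon_stat(eigvals):
    """Sphericity epsilon: (sum lam)^2 / (d * sum lam^2), always in [1/d, 1]."""
    lam = np.asarray(eigvals, dtype=float)
    return lam.sum() ** 2 / (lam.size * (lam ** 2).sum())

d = 1000
print(epsilon_stat(np.ones(d)))              # spherical: epsilon = 1
spike = np.r_[d ** 2, np.ones(d - 1)]        # one dominant eigenvalue
print(epsilon_stat(spike), 1 / d)            # approaches the minimum 1/d
```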

  24. 2nd Paper on HDLSS Asymptotics Ahn, Marron, Muller & Chi (2007) • Assume 2nd Moments • Assume no eigenvalues too large, $\frac{\sum_j \lambda_j^2}{(\sum_j \lambda_j)^2} \to 0$: Then the geometric representation still holds Not so strong as before: convergence in a weaker (in probability) sense

  25. 2nd Paper on HDLSS Asymptotics Can we improve on the condition $\frac{\sum_j \lambda_j^2}{(\sum_j \lambda_j)^2} \to 0$? John Kent example: Normal scale mixture Won’t get: $\frac{\|X\|}{\sqrt{d}} \to$ a single non-random constant (the limit is a mixture)
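
A sketch of how a scale mixture breaks the representation (the 50/50 mix of scales 1 and 10 is an illustrative choice, not necessarily Kent’s original):

```python
import numpy as np

rng = np.random.default_rng(4)

# Normal scale mixture: each data vector is s * Z with Z ~ N_d(0, I),
# where s = 1 or 10 with probability 1/2 each.
d, reps = 2000, 500
s = np.where(rng.random(reps) < 0.5, 1.0, 10.0)
X = s[:, None] * rng.standard_normal((reps, d))
ratios = np.linalg.norm(X, axis=1) / np.sqrt(d)
# ||X|| / sqrt(d) clusters at the two scales instead of one constant,
# so the distance to the origin has a two-point mixture limit:
print(np.round(np.quantile(ratios, [0.05, 0.45, 0.55, 0.95]), 2))
```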

  26. 2nd Paper on HDLSS Asymptotics Notes on Kent’s Normal Scale Mixture • Data Vectors are indep’dent of each other • But entries of each have strong depend’ce • However, can show entries have cov = 0! • Recall statistical folklore: Covariance = 0 ⇏ Independence

  27. 0 Covariance is not independence Simple Example: Random Variables $X$ and $Y$ Make both Gaussian With strong dependence Yet 0 covariance Given $X \sim N(0, 1)$ and a threshold $c > 0$, define $Y = \begin{cases} X, & |X| \le c \\ -X, & |X| > c \end{cases}$

  28. 0 Covariance is not independence Simple Example: (figure)

  29. 0 Covariance is not independence Simple Example: (figure)

  30. 0 Covariance is not independence Simple Example: choose $c$ to make $\mathrm{cov}(X, Y) = 0$

  31. 0 Covariance is not independence Simple Example: Distribution is degenerate Supported on diagonal lines $y = \pm x$ Not abs. cont. w.r.t. 2-d Lebesgue meas. For small $c$, have $\mathrm{cov}(X, Y) < 0$ For large $c$, have $\mathrm{cov}(X, Y) > 0$ By continuity, $\exists\, c$ with $\mathrm{cov}(X, Y) = 0$

  32. 0 Covariance is not independence Result: Joint distribution of $X$ and $Y$: Has Gaussian marginals Has $\mathrm{cov}(X, Y) = 0$ Yet strong dependence of $X$ and $Y$ Thus not multivariate Gaussian Shows Multivariate Gaussian means more than Gaussian Marginals
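
A numerical check of the whole construction (a sketch, assuming SciPy for the root-finding; the closed form for the covariance follows from integration by parts):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

# The slides' construction: X ~ N(0,1) and Y = X if |X| <= c, else -X.
# Then cov(X, Y) = E[X^2; |X| <= c] - E[X^2; |X| > c], and integration by
# parts gives E[X^2; |X| <= c] = 1 - 2*(sf(c) + c*pdf(c)).
def cov_xy(c):
    inner = 1 - 2 * (norm.sf(c) + c * norm.pdf(c))   # E[X^2; |X| <= c]
    return 2 * inner - 1                             # inner - (1 - inner)

c0 = brentq(cov_xy, 0.1, 3.0)      # cov < 0 for small c, > 0 for large c
print(f"c = {c0:.4f}")

rng = np.random.default_rng(5)
x = rng.standard_normal(10 ** 6)
y = np.where(np.abs(x) <= c0, x, -x)
print(f"sample cov(X, Y) = {np.cov(x, y)[0, 1]:+.4f}")                # ~ 0
print(f"corr(|X|, |Y|)   = {np.corrcoef(abs(x), abs(y))[0, 1]:.1f}")  # = 1
```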

  33. HDLSS Asy’s: Geometrical Represent’n Further Consequences of Geometric Represent’n 1. Inefficiency of DWD for uneven sample size (motivates weighted version, Xingye Qiao) 2. DWD more stable than SVM (based on deeper limiting distributions) (reflects intuitive idea of how each feels sampling variation) (something like mean vs. median) 3. 1-NN rule inefficiency is quantified.

  34. HDLSS Math. Stat. of PCA, I Consistency & Strong Inconsistency: Spike Covariance Model, Paul (2007) For Eigenvalues: $\lambda_1 = d^\alpha$, $\lambda_2 = \cdots = \lambda_d = 1$ 1st Eigenvector: $u_1$ How good are empirical versions $\hat\lambda_1$, $\hat u_1$, as estimates?

  35. HDLSS Math. Stat. of PCA, II Consistency (big enough spike): For $\alpha > 1$, $\mathrm{angle}(\hat u_1, u_1) \to 0$ Strong Inconsistency (spike not big enough): For $\alpha < 1$, $\mathrm{angle}(\hat u_1, u_1) \to 90^\circ$
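
A small simulation sketch of both regimes (the fixed $n = 20$ and the particular dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)

# Spike model sketch: lambda_1 = d^alpha, all other eigenvalues 1, n fixed.
# The empirical PC1 angle to the true eigenvector e_1 heads toward 0 when
# the spike is big (alpha > 1) and toward 90 degrees when it is small.
n = 20
for alpha in [0.5, 1.5]:
    for d in [100, 1000, 10000]:
        X = rng.standard_normal((n, d))
        X[:, 0] *= np.sqrt(d ** alpha)       # inject the spike along e_1
        _, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
        angle = np.degrees(np.arccos(abs(Vt[0, 0])))
        print(f"alpha={alpha}  d={d:6d}  angle(PC1, e_1) = {angle:5.1f} deg")
```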

  36. HDLSS Math. Stat. of PCA, III Consistency of eigenvalues? • Eigenvalues Inconsistent • But known distribution • Unless $n \to \infty$ as well

  37. HDLSS Work in Progress, I Batch Adjustment: Xuxin Liu Recall Intuition from above: • Key is sizes of biological subtypes • Differing ratio trips up mean • But DWD more robust Mathematics behind this?

  38. Liu: Twiddle ratios of subtypes

  39. HDLSS Data Combo Mathematics Xuxin Liu Dissertation Results: • Simple Unbalanced Cluster Model • Growing at rate $d^\alpha$ as $d \to \infty$ • Answers depend on $\alpha$ Visualization of setting….

  40. HDLSS Data Combo Mathematics (figure)

  41. HDLSS Data Combo Mathematics (figure)

  42. HDLSS Data Combo Mathematics Asymptotic Results (as $d \to \infty$): • For $\alpha$ below the critical rate, DWD Consistent: Angle(DWD, Truth) $\to 0$ • For $\alpha$ above it, DWD Strongly Inconsistent: Angle(DWD, Truth) $\to 90^\circ$

  43. HDLSS Data Combo Mathematics Asymptotic Results (as $d \to \infty$): • For $\alpha$ below the critical rate, PAM Inconsistent: Angle(PAM, Truth) $\to$ a constant strictly between $0$ and $90^\circ$ • For $\alpha$ above it, PAM Strongly Inconsistent: Angle(PAM, Truth) $\to 90^\circ$

  44. HDLSS Data Combo Mathematics Value of the limiting angle, for given sample size ratio: • Angle $= 0$ only when subtype proportions match • Otherwise PAM Inconsistent • Verifies intuitive idea (differing ratios trip up the mean) in strong way

  45. The Future of Geometrical Represent’n? HDLSS version of “optimality” results? “Contiguity” approach? Params depend on d? Rates of Convergence? Improvements of DWD? (e.g. other functions of distance than inverse) It is still early days …

  46. State of HDLSS Research? Development Of Methods Mathematical Assessment … (thanks to: defiant.corban.edu/gtipton/net-fun/iceberg.html)
