Isaac Newton Institute - Cambridge

Object Oriented Data Analysis. J. S. Marron, Dept. of Statistics and Operations Research, University of North Carolina. September 1, 2014.



  1. Isaac Newton Institute - Cambridge Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina September 1, 2014

  2. Personal Opinions on Mathematical Statistics What is Mathematical Statistics? • Validation of existing methods • Asymptotics (n → ∞) & Taylor expansion • Comparison of existing methods (requires hard math, but really “accounting”???)

  3. Personal Opinions on Mathematical Statistics What could Mathematical Statistics be? • Basis for invention of new methods • Complicated data → mathematical ideas • Do we value creativity? • Since we don’t do this, others do… (where are the ₤₤₤s???)

  4. Personal Opinions on Mathematical Statistics • Since we don’t do this, others do… • Pattern Recognition • Artificial Intelligence • Neural Nets • Data Mining • Machine Learning • ???

  5. Personal Opinions on Mathematical Statistics Possible Litmus Test: Creative Statistics • Clinical Trials Viewpoint: Worst Imaginable Idea • Mathematical Statistics Viewpoint: ???

  6. Object Oriented Data Analysis, I What is the “atom” of a statistical analysis? • 1st Course: Numbers • Multivariate Analysis Course : Vectors • Functional Data Analysis: Curves • More generally: Data Objects

  7. Object Oriented Data Analysis, II Examples: • Medical Image Analysis • Images as Data Objects? • Shape Representations as Objects • Micro-arrays • Just multivariate analysis?

  8. Object Oriented Data Analysis, III Typical Goals: • Understanding population variation • Visualization • Principal Component Analysis + • Discrimination (a.k.a. Classification) • Time Series of Data Objects

  9. Object Oriented Data Analysis, IV Major Statistical Challenge, I: High Dimension Low Sample Size (HDLSS) • Dimension d >> sample size n • “Multivariate Analysis” nearly useless • Can’t “normalize the data” • Land of Opportunity for Statisticians • Need for “creative statisticians”
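A small numerical sketch of why classical multivariate tools struggle when d >> n. The sizes and the Gaussian toy data are illustrative assumptions, and reading “normalize the data” as sphering by the inverse covariance is an interpretation of the slide, not something it states. With n cases, the centered data matrix has rank at most n - 1, so the d x d sample covariance is singular and cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                     # illustrative sizes with d >> n
X = rng.standard_normal((n, d))    # rows are cases, columns are features

# Centering removes one degree of freedom, so rank(Xc) <= n - 1.
Xc = X - X.mean(axis=0)
rank = np.linalg.matrix_rank(Xc)

# The sample covariance Xc.T @ Xc / (n - 1) is d x d but has the same rank as Xc,
# so it has at least d - (n - 1) zero eigenvalues and no inverse.
print(f"dimension d = {d}, covariance rank = {rank}")
```

Any method that needs the inverse covariance (Fisher Linear Discrimination, Gaussian likelihood ratios, sphering) therefore needs a workaround such as a pseudo-inverse or regularization in this setting.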

  10. Object Oriented Data Analysis, V Major Statistical Challenge, II: • Data may live in non-Euclidean space • Lie Group / Symmetric Spaces (manifold data) • Trees/Graphs as data objects • Interesting Issues: • What is “the mean” (population center)? • How do we quantify “population variation”?
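To make the question “what is the mean?” concrete, here is a minimal sketch on the simplest curved space, the circle of angles. It assumes the intrinsic (Fréchet) mean, meaning the point that minimizes the sum of squared geodesic distances, as the notion of population center; the slide does not commit to a particular definition. The function names and the toy angles are hypothetical.

```python
import math

def wrap(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi

def frechet_mean_circle(angles, iters=100, tol=1e-12):
    """Iteratively move the estimate by the average signed arc to the data."""
    mu = angles[0]
    for _ in range(iters):
        step = sum(wrap(a - mu) for a in angles) / len(angles)
        mu = wrap(mu + step)
        if abs(step) < tol:
            break
    return mu

# Three directions clustered around 0 degrees, straddling the 0/360 cut.
angles = [math.radians(t) for t in (350.0, 10.0, 20.0)]

naive = sum(angles) / len(angles)     # treats angles as plain numbers
mu = frechet_mean_circle(angles)      # respects the circular geometry

print(f"arithmetic mean: {math.degrees(naive):.2f} degrees")
print(f"intrinsic mean:  {math.degrees(mu):.2f} degrees")
```

The arithmetic mean lands far from all three data points, while the intrinsic mean sits inside the cluster. The same issue arises for rotations and other manifold-valued data objects.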

  11. Statistics in Image Analysis, I First Generation Problems: • Denoising • Segmentation • Registration (all about single images)

  12. Statistics in Image Analysis, II Second Generation Problems: • Populations of Images • Understanding Population Variation • Discrimination (a.k.a. Classification) • Complex Data Structures (& Spaces) • HDLSS Statistics

  13. HDLSS Statistics in Imaging Why HDLSS (High Dim, Low Sample Size)? • Complex 3-d Objects Hard to Represent • Often need d = 100’s of parameters • Complex 3-d Objects Costly to Segment • Often have n = 10’s cases

  14. Medical Imaging – A Challenging Example • Male Pelvis • Bladder – Prostate – Rectum • How do they move over time (days)? • Critical to Radiation Treatment (cancer) • Work with 3-d CT • Very Challenging to Segment • Find boundary of each object? • Represent each Object?

  15. Male Pelvis – Raw Data One CT Slice (in 3d image) Coccyx (Tail Bone) Rectum Prostate

  16. Male Pelvis – Raw Data Prostate: manual segmentation Slice by slice Reassembled

  17. Male Pelvis – Raw Data Prostate: Slices: Reassembled in 3d How to represent? Thanks: Ja-Yeon Jeong

  18. Object Representation • Landmarks (hard to find) • Boundary Rep’ns (no correspondence) • Medial representations • Find “skeleton” • Discretize as “atoms” called M-reps

  19. 3-d m-reps • Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong) • Medial Atoms provide “skeleton” • Implied Boundary from “spokes” → “surface”

  20. 3-d m-reps • M-rep model fitting • Easy, when starting from binary (blue) • But very expensive (30 – 40 minutes of technician’s time) • Want automatic approach • Challenging, because of poor contrast, noise, … • Need to borrow information across training sample • Use Bayes approach: prior & likelihood → posterior • ~Conjugate Gaussians, but there are issues: • Major HDLSS challenges • Manifold aspect of data

  21. PCA for m-reps, I Major issue: m-reps live in (locations, radius and angles) E.g. “average” of: = ??? Natural Data Structure is: Lie Groups ~ Symmetric spaces (smooth, curved manifolds)

  22. PCA for m-reps, II PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces) T. Fletcher: Principal Geodesic Analysis Idea: replace “linear summary of data” With “geodesic summary of data”…
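The idea of replacing a linear summary with a geodesic one can be sketched on the unit sphere. This is a commonly used tangent-space approximation to Principal Geodesic Analysis, not necessarily the exact procedure used for the m-rep analysis: compute an intrinsic mean, map the data to the tangent plane at that mean with the log map, and run ordinary PCA there. The helper names and the toy data are hypothetical.

```python
import numpy as np

def log_map(p, q):
    """Tangent vector at p pointing toward q, with length = geodesic distance."""
    c = np.clip(np.dot(p, q), -1.0, 1.0)
    v = q - c * p                       # component of q orthogonal to p
    nv = np.linalg.norm(v)
    return np.zeros_like(p) if nv < 1e-12 else np.arccos(c) * v / nv

def exp_map(p, v):
    """Walk from p along the geodesic with initial velocity v."""
    nv = np.linalg.norm(v)
    return p if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * v / nv

def frechet_mean(points, iters=50):
    """Fixed-point iteration: step along the average log-mapped direction."""
    mu = points[0]
    for _ in range(iters):
        g = np.mean([log_map(mu, q) for q in points], axis=0)
        mu = exp_map(mu, g)
        mu = mu / np.linalg.norm(mu)    # guard against round-off drift
    return mu

def tangent_pga(points):
    mu = frechet_mean(points)
    V = np.array([log_map(mu, q) for q in points])   # n x 3 tangent vectors
    # Tangent vectors at the intrinsic mean average to ~0, so no extra centering.
    _, s, Vt = np.linalg.svd(V, full_matrices=False)
    return mu, Vt, s**2 / len(points)

# Toy data: points spread along the equator around (1, 0, 0).
t = np.linspace(-0.5, 0.5, 11)
points = np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)

mu, directions, variances = tangent_pga(points)
print("mean:", np.round(mu, 6))
print("first principal direction:", np.round(directions[0], 6))
```

The first principal geodesic is then the curve `exp_map(mu, s * directions[0])` for scalar `s`, which here traces the equator through the data.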

  23. PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)

  24. PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)

  25. PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)

  26. HDLSS Classification (i.e. Discrimination) Background: Two Class (Binary) version: Using “training data” from Class +1, and from Class -1 Develop a “rule” for assigning new data to a Class Canonical Example: Disease Diagnosis • New Patients are “Healthy” or “Ill” • Determined based on measurements

  27. HDLSS Classification (Cont.) • Ineffective Methods: • Fisher Linear Discrimination • Gaussian Likelihood Ratio • Less Useful Methods: • Nearest Neighbors • Neural Nets (“black boxes”, no “directions” or intuition)

  28. HDLSS Classification (Cont.) • Currently Fashionable Methods: • Support Vector Machines • Tree-Based Approaches • New High Tech Method • Distance Weighted Discrimination (DWD) • Specially designed for HDLSS data • Avoids “data piling” problem of SVM • Solves more suitable optimization problem

  29. HDLSS Classification (Cont.) • Currently Fashionable Methods: • Tree-Based Approaches • Support Vector Machines:

  30. Distance Weighted Discrimination Maximal Data Piling
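Data piling, as the term is generally used in the HDLSS literature, means that many training points project to exactly the same value along a discriminant direction. The sketch below shows that when d exceeds n such directions always exist for generic data: one can solve for a direction on which every point of a class lands on a single value. The toy data and variable names are assumptions, and the direction found here shows complete piling but is not claimed to be the maximal data piling direction of the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_class, d = 10, 100            # illustrative HDLSS sizes (d > n)

X = np.vstack([
    rng.standard_normal((n_per_class, d)) + 0.5,   # class +1
    rng.standard_normal((n_per_class, d)) - 0.5,   # class -1
])
y = np.r_[np.ones(n_per_class), -np.ones(n_per_class)]

# Solve X w + b = y. With more unknowns (d + 1) than equations (n),
# lstsq returns an exact minimum-norm solution for generic data.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]

proj = X @ w + b
print(np.round(proj, 6))            # every point sits at +1 or -1
```

Perfect separation of the training data along such a direction reflects the geometry of the sample rather than the population, which is why piling is treated as a warning sign for generalization.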

  31. Distance Weighted Discrimination Based on Optimization Problem: More precisely, work in an appropriate penalty for violations Optimization Method (Michael Todd): Second Order Cone Programming • Still a convex generalization of quadratic programming • Fast greedy solution • Can use existing software
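For reference, the DWD problem is usually stated along the following lines. The notation is an assumption rather than taken from the slide: training pairs $(x_i, y_i)$ with $y_i \in \{+1, -1\}$, direction $w$, offset $b$, slack variables $\xi_i$ that penalize violations, and a penalty constant $C > 0$.

```latex
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \sum_{i=1}^{n} \frac{1}{r_i} \;+\; C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & r_i = y_i \left( x_i^{\top} w + b \right) + \xi_i, \qquad r_i > 0, \\
& \xi_i \ge 0, \qquad \lVert w \rVert \le 1 .
\end{aligned}
```

Because each point contributes $1/r_i$, every training case influences the direction, with nearer points weighted more heavily. The reciprocal terms and the norm bound can be written as second-order cone constraints, which is what makes a Second Order Cone Programming solver applicable.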

  32. DWD Bias Adjustment for Microarrays Microarray data: • Simultaneous measurements of “gene expression” • Intrinsically HDLSS • Dimension d ~ 1,000s – 10,000s • Sample Sizes n ~ 10s – 100s My view: Each array is “point in cloud”

  33. DWD Batch and Source Adjustment • For Perou’s Stanford Breast Cancer Data • Analysis in Benito, et al (2004) Bioinformatics https://genome.unc.edu/pubsup/dwd/ • Adjust for Source Effects • Different sources of mRNA • Adjust for Batch Effects • Arrays fabricated at different times
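A rough sketch of how an adjustment along a separating direction can work: project each group onto the direction and shift the group so that its mean projection becomes zero. This is a simplified stand-in under stated assumptions. The direction here is the plain difference of group means rather than a fitted DWD direction, and the function name and toy data are hypothetical.

```python
import numpy as np

def adjust_along_direction(X, labels, w):
    """Shift each group along unit vector w so its mean projection is zero."""
    w = w / np.linalg.norm(w)
    X_adj = X.astype(float).copy()
    for g in np.unique(labels):
        mask = labels == g
        shift = (X[mask] @ w).mean()    # group mean along the direction
        X_adj[mask] -= shift * w        # move the whole group rigidly
    return X_adj

rng = np.random.default_rng(2)
d = 50
offset = np.zeros(d)
offset[0] = 3.0                          # artificial "source" effect

X = np.vstack([
    rng.standard_normal((8, d)) + offset,    # arrays from source 0
    rng.standard_normal((8, d)) - offset,    # arrays from source 1
])
labels = np.array([0] * 8 + [1] * 8)

# Stand-in for the DWD direction: difference of the two group means.
w = X[labels == 0].mean(axis=0) - X[labels == 1].mean(axis=0)

X_adj = adjust_along_direction(X, labels, w)
gap = np.linalg.norm(X_adj[labels == 0].mean(axis=0) - X_adj[labels == 1].mean(axis=0))
print(f"distance between group means after adjustment: {gap:.2e}")
```

Each group is moved rigidly, so within-group variation is untouched. With the mean-difference direction the two group means coincide exactly afterward; with a DWD direction only the component along that direction is removed.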

  34. DWD Adj: Raw Breast Cancer data

  35. DWD Adj: Source Colors

  36. DWD Adj: Batch Colors

  37. DWD Adj: Biological Class Colors

  38. DWD Adj: Biological Class Colors & Symbols

  39. DWD Adj: Biological Class Symbols

  40. DWD Adj: Source Colors

  41. DWD Adj: PC 1-2 & DWD direction

  42. DWD Adj: DWD Source Adjustment

  43. DWD Adj: Source Adj’d, PCA view

  44. DWD Adj: Source Adj’d, Class Colored

  45. DWD Adj: Source Adj’d, Batch Colored

  46. DWD Adj: Source Adj’d, 5 PCs

  47. DWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWD

  48. DWD Adj: S. & B1,2 vs. 3 Adjusted

  49. DWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs

  50. DWD Adj: S. & B Adj’d, B1 vs. 2 DWD
