# Isaac Newton Institute - Cambridge

Isaac Newton Institute - Cambridge. Object Oriented Data Analysis. J. S. Marron, Dept. of Statistics and Operations Research, University of North Carolina. September 1, 2014.

### Presentation Transcript

1. Isaac Newton Institute - Cambridge Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina September 1, 2014

2. Personal Opinions on Mathematical Statistics What is Mathematical Statistics? • Validation of existing methods • Asymptotics (n → ∞) & Taylor expansion • Comparison of existing methods (requires hard math, but really “accounting”???)

3. Personal Opinions on Mathematical Statistics What could Mathematical Statistics be? • Basis for invention of new methods • Complicated data → mathematical ideas • Do we value creativity? • Since we don’t do this, others do… (where are the ₤₤₤s???)

4. Personal Opinions on Mathematical Statistics • Since we don’t do this, others do… • Pattern Recognition • Artificial Intelligence • Neural Nets • Data Mining • Machine Learning • ???

5. Personal Opinions on Mathematical Statistics Possible Litmus Test: Creative Statistics • Clinical Trials Viewpoint: Worst Imaginable Idea • Mathematical Statistics Viewpoint: ???

6. Object Oriented Data Analysis, I What is the “atom” of a statistical analysis? • 1st Course: Numbers • Multivariate Analysis Course : Vectors • Functional Data Analysis: Curves • More generally: Data Objects

7. Object Oriented Data Analysis, II Examples: • Medical Image Analysis • Images as Data Objects? • Shape Representations as Objects • Micro-arrays • Just multivariate analysis?

8. Object Oriented Data Analysis, III Typical Goals: • Understanding population variation • Visualization • Principal Component Analysis + • Discrimination (a.k.a. Classification) • Time Series of Data Objects

9. Object Oriented Data Analysis, IV Major Statistical Challenge, I: High Dimension Low Sample Size (HDLSS) • Dimension d >> sample size n • “Multivariate Analysis” nearly useless • Can’t “normalize the data” • Land of Opportunity for Statisticians • Need for “creative statisticians”
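One way to see why classical multivariate tools struggle when d >> n: the sample covariance matrix has rank at most n − 1, so it cannot be inverted to sphere ("normalize") the data. The following is a minimal sketch with made-up numbers (n = 3, d = 6); the helper names `sample_covariance` and `matrix_rank` are hypothetical and not from the talk.

```python
def sample_covariance(rows):
    """d x d sample covariance of n observations (each row is one observation)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(r, means)] for r in rows]
    d = len(means)
    return [[sum(r[j] * r[k] for r in centered) / (n - 1) for k in range(d)]
            for j in range(d)]


def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in matrix]
    rank = 0
    for col in range(len(m[0])):
        if rank == len(m):
            break
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rank, len(m)), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, len(m)):
            f = m[i][col] / m[rank][col]
            m[i] = [a - f * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank


# Illustrative HDLSS data: n = 3 cases, d = 6 measurements each (made up).
data = [
    [1.0, 2.0, 0.0, 1.0, 3.0, 0.0],
    [0.0, 1.0, 4.0, 2.0, 1.0, 1.0],
    [2.0, 0.0, 1.0, 0.0, 2.0, 5.0],
]
cov = sample_covariance(data)
rank_cov = matrix_rank(cov)
print(len(cov), rank_cov)  # dimension d versus rank of the covariance
```

Because the 6 × 6 covariance here has rank below 6, any method that needs its inverse (Fisher linear discrimination, Gaussian likelihood ratios) needs some modification before it can be applied.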

10. Object Oriented Data Analysis, V Major Statistical Challenge, II: • Data may live in non-Euclidean space • Lie Group / Symmet’c Spaces (manifold data) • Trees/Graphs as data objects • Interesting Issues: • What is “the mean” (pop’n center)? • How do we quantify “pop’n variation”?
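The question of what "the mean" should be can be illustrated on the simplest curved space, the circle. The sketch below is an illustration only (function names are hypothetical): it compares the ordinary arithmetic mean of two angles with the extrinsic circular mean, which is one common choice of population center for directional data, not necessarily the one used in the talk.

```python
import math


def arithmetic_mean(angles_deg):
    """Treats angles as plain numbers, ignoring that 0 and 360 coincide."""
    return sum(angles_deg) / len(angles_deg)


def circular_mean(angles_deg):
    """Extrinsic mean: average the unit vectors, then read off the angle.

    Undefined when the averaged vector is (numerically) zero, e.g. for
    two exactly opposite angles.
    """
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0


angles = [350.0, 10.0]          # two directions close to 0 degrees
naive = arithmetic_mean(angles)
circ = circular_mean(angles)
print(naive, circ)
```

The arithmetic mean lands on the opposite side of the circle from both data points, while the circular mean sits between them (up to floating-point rounding around 0/360).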

11. Statistics in Image Analysis, I First Generation Problems: • Denoising • Segmentation • Registration (all about single images)

12. Statistics in Image Analysis, II Second Generation Problems: • Populations of Images • Understanding Population Variation • Discrimination (a.k.a. Classification) • Complex Data Structures (& Spaces) • HDLSS Statistics

13. HDLSS Statistics in Imaging Why HDLSS (High Dim, Low Sample Size)? • Complex 3-d Objects Hard to Represent • Often need d = 100’s of parameters • Complex 3-d Objects Costly to Segment • Often have n = 10’s cases

14. Medical Imaging – A Challenging Example • Male Pelvis • Bladder – Prostate – Rectum • How do they move over time (days)? • Critical to Radiation Treatment (cancer) • Work with 3-d CT • Very Challenging to Segment • Find boundary of each object? • Represent each Object?

15. Male Pelvis – Raw Data One CT Slice (in 3d image) Coccyx (Tail Bone) Rectum Prostate

16. Male Pelvis – Raw Data Prostate: manual segmentation Slice by slice Reassembled

17. Male Pelvis – Raw Data Prostate: Slices: Reassembled in 3d How to represent? Thanks: Ja-Yeon Jeong

18. Object Representation • Landmarks (hard to find) • Boundary Rep’ns (no correspondence) • Medial representations • Find “skeleton” • Discretize as “atoms” called M-reps

19. 3-d m-reps • Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong) • Medial Atoms provide “skeleton” • Implied Boundary from “spokes” → “surface”

20. 3-d m-reps • M-rep model fitting • Easy, when starting from binary (blue) • But very expensive (30 – 40 minutes technician’s time) • Want automatic approach • Challenging, because of poor contrast, noise, … • Need to borrow information across training sample • Use Bayes approach: prior & likelihood → posterior • ~Conjugate Gaussians, but there are issues: • Major HDLSS challenges • Manifold aspect of data

21. PCA for m-reps, I Major issue: m-reps live in (locations, radius and angles) E.g. “average” of: = ??? Natural Data Structure is: Lie Groups ~ Symmetric spaces (smooth, curved manifolds)

22. PCA for m-reps, II PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces) T. Fletcher: Principal Geodesic Analysis Idea: replace “linear summary of data” With “geodesic summary of data”…
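The idea of replacing a linear summary by a geodesic one can be sketched on the unit sphere, used here as a stand-in for the m-rep space (an assumption for illustration). The code below computes an intrinsic (Fréchet) mean by iterating log and exp maps, then finds the leading direction of variation in the tangent space at that mean. This is the commonly used linearized approximation to Principal Geodesic Analysis, not necessarily the exact procedure behind the slides, and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def log_map(p, q):
    """Tangent vector at p pointing toward q; its length is the geodesic distance.

    Returns the zero vector when q coincides with p (and also for antipodal
    points, where the log map is not uniquely defined).
    """
    c = max(-1.0, min(1.0, dot(p, q)))
    v = [qi - c * pi for pi, qi in zip(p, q)]  # part of q orthogonal to p
    nv = norm(v)
    if nv < 1e-12:
        return [0.0] * len(p)
    theta = math.acos(c)
    return [theta * vi / nv for vi in v]


def exp_map(p, v):
    """Walk from p along the geodesic with initial velocity v for unit time."""
    t = norm(v)
    if t < 1e-12:
        return list(p)
    return [math.cos(t) * pi + math.sin(t) * vi / t for pi, vi in zip(p, v)]


def frechet_mean(points, iters=50):
    """Fixed-point iteration: move to the average of the log-mapped data."""
    mu = list(points[0])
    for _ in range(iters):
        logs = [log_map(mu, q) for q in points]
        step = [sum(col) / len(points) for col in zip(*logs)]
        mu = exp_map(mu, step)
        n = norm(mu)
        mu = [m / n for m in mu]  # guard against drift off the sphere
    return mu


def first_principal_direction(points, iters=100):
    """Leading eigenvector of the tangent-space scatter, by power iteration."""
    mu = frechet_mean(points)
    logs = [log_map(mu, q) for q in points]
    w = max(logs, key=norm)
    scale = norm(w)
    if scale == 0.0:
        return mu, w  # no variation around the mean
    w = [x / scale for x in w]
    for _ in range(iters):
        cw = [sum(dot(v, w) * v[k] for v in logs) for k in range(len(w))]
        n = norm(cw)
        w = [x / n for x in cw]
    return mu, w


# Toy data: five points spread along the equator of the sphere.
angles = [-0.4, -0.2, 0.0, 0.2, 0.4]
points = [[math.cos(a), math.sin(a), 0.0] for a in angles]
mu, pg1 = first_principal_direction(points)
# Points on the first principal geodesic: exp_map(mu, [t * x for x in pg1]).
print(mu, pg1)
```

For this toy data the mean sits at the middle of the arc and the first principal geodesic runs along the equator, which is the curved analogue of a first principal component line.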

23. PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)

24. PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)

25. PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)

26. HDLSS Classification (i.e. Discrimination) Background: Two Class (Binary) version: Using “training data” from Class +1, and from Class -1 Develop a “rule” for assigning new data to a Class Canonical Example: Disease Diagnosis • New Patients are “Healthy” or “Ill” • Determined based on measurements
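As a concrete instance of "develop a rule from training data", here is a minimal sketch of the mean-difference (nearest centroid) rule, one of the simplest linear classifiers. It is shown only to fix ideas and is not one of the methods recommended in the talk; the measurements and function names are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_mean_difference(class_pos, class_neg):
    """Return (w, b) so that sign(w.x + b) assigns x to the nearer class mean."""
    mu_pos, mu_neg = centroid(class_pos), centroid(class_neg)
    w = [p - n for p, n in zip(mu_pos, mu_neg)]          # normal direction
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    b = -dot(w, midpoint)                                  # boundary passes through midpoint
    return w, b


def classify(x, w, b):
    return 1 if dot(w, x) + b >= 0 else -1


# Made-up training data: Class +1 = "Ill", Class -1 = "Healthy".
ill = [[5.0, 4.0], [6.0, 5.0]]
healthy = [[1.0, 1.0], [2.0, 1.0]]
w, b = fit_mean_difference(ill, healthy)

label_a = classify([5.5, 4.0], w, b)   # new patient near the "Ill" cloud
label_b = classify([1.0, 2.0], w, b)   # new patient near the "Healthy" cloud
print(label_a, label_b)
```

Every linear method discussed on the following slides (Fisher, SVM, DWD) has this same form, a direction `w` and an offset `b`; they differ only in how the direction is chosen.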

27. HDLSS Classification (Cont.) • Ineffective Methods: • Fisher Linear Discrimination • Gaussian Likelihood Ratio • Less Useful Methods: • Nearest Neighbors • Neural Nets (“black boxes”, no “directions” or intuition)

28. HDLSS Classification (Cont.) • Currently Fashionable Methods: • Support Vector Machines • Tree-Based Approaches • New High Tech Method • Distance Weighted Discrimination (DWD) • Specially designed for HDLSS data • Avoids “data piling” problem of SVM • Solves more suitable optimization problem

29. HDLSS Classification (Cont.) • Currently Fashionable Methods: • Tree-Based Approaches • Support Vector Machines:

30. Distance Weighted Discrimination Maximal Data Piling

31. Distance Weighted Discrimination Based on Optimization Problem: More precisely, work in an appropriate penalty for violations Optimization Method (Michael Todd): Second Order Cone Programming • Still convex generalization of quadratic programming • Fast greedy solution • Can use existing software
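The optimization problem referred to here is commonly written in the DWD literature as follows. The notation is an assumption, since no formula appears in the transcript: $x_i$ are the training vectors, $y_i \in \{+1, -1\}$ their class labels, $r_i$ the signed distances to the separating hyperplane after slack, $\xi_i$ the slack (violation) variables, and $C > 0$ the penalty weight.

$$
\min_{w,\,b,\,\xi}\ \sum_{i=1}^{n} \frac{1}{r_i} + C \sum_{i=1}^{n} \xi_i
\quad\text{subject to}\quad
r_i = y_i\,(x_i^{\top} w + b) + \xi_i,\quad r_i \ge 0,\quad \xi_i \ge 0,\quad \|w\|^2 \le 1 .
$$

The sum of reciprocals lets every data point influence the direction $w$ (not only the support vectors, as in SVM), and the second sum is the penalty for violations. The norm constraint and the reciprocal terms can both be expressed through second-order cone constraints, which is why Second Order Cone Programming applies.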

32. DWD Bias Adjustment for Microarrays Microarray data: • Simult. Measur’ts of “gene expression” • Intrinsically HDLSS • Dimension d ~ 1,000s – 10,000s • Sample Sizes n ~ 10s – 100s My view: Each array is “point in cloud”

33. DWD Batch and Source Adjustment • For Perou’s Stanford Breast Cancer Data • Analysis in Benito et al. (2004), Bioinformatics https://genome.unc.edu/pubsup/dwd/ • Adjust for Source Effects • Different sources of mRNA • Adjust for Batch Effects • Arrays fabricated at different times
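The adjustment step can be sketched as follows: given a direction that separates two batches, project each batch onto it and shift the batch along that direction so its projected mean becomes zero. In the sketch below the separating direction is the difference of batch means, used purely as a simple stand-in for the DWD direction (computing the actual DWD direction requires an SOCP solver). The data and function names are made up.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def shift_batch(batch, w):
    """Move a batch along unit direction w so its mean projection onto w is zero."""
    m = sum(dot(x, w) for x in batch) / len(batch)
    return [[xi - m * wi for xi, wi in zip(x, w)] for x in batch]


# Two made-up batches of arrays ("points in a cloud"), offset along one axis.
batch_a = [[1.0, 0.0], [1.0, 2.0]]
batch_b = [[4.0, 0.5], [4.0, 1.5]]

# Stand-in for the DWD direction: normalized difference of the batch means.
direction = unit([b - a for a, b in zip(mean_vector(batch_a), mean_vector(batch_b))])

adj_a = shift_batch(batch_a, direction)
adj_b = shift_batch(batch_b, direction)
print(adj_a, adj_b)
```

Only the component along the separating direction is changed, so variation within each batch in all other directions is left untouched.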

34. DWD Adj: Raw Breast Cancer data

37. DWD Adj: Biological Class Colors

38. DWD Adj: Biological Class Colors & Symbols

39. DWD Adj: Biological Class Symbols

41. DWD Adj: PC 1-2 & DWD direction