
Isaac Newton Institute - Cambridge. Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina September 1, 2014. Personal Opinions on Mathematical Statistics. What is Mathematical Statistics? Validation of existing methods

## Isaac Newton Institute - Cambridge


Object Oriented Data Analysis
J. S. Marron
Dept. of Statistics and Operations Research, University of North Carolina
September 1, 2014

**Personal Opinions on Mathematical Statistics**
What is Mathematical Statistics?
• Validation of existing methods
• Asymptotics (n → ∞) & Taylor expansion
• Comparison of existing methods
(requires hard math, but really “accounting”???)

**Personal Opinions on Mathematical Statistics**
What could Mathematical Statistics be?
• Basis for invention of new methods
• Complicated data → mathematical ideas
• Do we value creativity?
• Since we don’t do this, others do…
(where are the ₤₤₤s???)

**Personal Opinions on Mathematical Statistics**
• Since we don’t do this, others do…
  • Pattern Recognition
  • Artificial Intelligence
  • Neural Nets
  • Data Mining
  • Machine Learning
  • ???

**Personal Opinions on Mathematical Statistics**
Possible Litmus Test: Creative Statistics
• Clinical Trials Viewpoint: Worst Imaginable Idea
• Mathematical Statistics Viewpoint: ???

**Object Oriented Data Analysis, I**
What is the “atom” of a statistical analysis?
• 1st Course: Numbers
• Multivariate Analysis Course: Vectors
• Functional Data Analysis: Curves
• More generally: Data Objects

**Object Oriented Data Analysis, II**
Examples:
• Medical Image Analysis
  • Images as Data Objects?
  • Shape Representations as Objects
• Micro-arrays
  • Just multivariate analysis?

**Object Oriented Data Analysis, III**
Typical Goals:
• Understanding population variation
  • Visualization
  • Principal Component Analysis +
• Discrimination (a.k.a. Classification)
• Time Series of Data Objects

**Object Oriented Data Analysis, IV**
Major Statistical Challenge, I: High Dimension Low Sample Size (HDLSS)
• Dimension d >> sample size n
• “Multivariate Analysis” nearly useless
• Can’t “normalize the data”
• Land of Opportunity for Statisticians
• Need for “creative statisticians”

**Object Oriented Data Analysis, V**
Major Statistical Challenge, II:
• Data may live in non-Euclidean space
  • Lie Group / Symmet’c Spaces (manifold data)
  • Trees/Graphs as data objects
• Interesting Issues:
  • What is “the mean” (pop’n center)?
  • How do we quantify “pop’n variation”?

**Statistics in Image Analysis, I**
First Generation Problems:
• Denoising
• Segmentation
• Registration
(all about single images)

**Statistics in Image Analysis, II**
Second Generation Problems:
• Populations of Images
• Understanding Population Variation
• Discrimination (a.k.a. Classification)
• Complex Data Structures (& Spaces)
• HDLSS Statistics

**HDLSS Statistics in Imaging**
Why HDLSS (High Dim, Low Sample Size)?
• Complex 3-d Objects Hard to Represent
  • Often need d = 100’s of parameters
• Complex 3-d Objects Costly to Segment
  • Often have n = 10’s cases

**Medical Imaging – A Challenging Example**
• Male Pelvis
  • Bladder – Prostate – Rectum
  • How do they move over time (days)?
  • Critical to Radiation Treatment (cancer)
• Work with 3-d CT
• Very Challenging to Segment
  • Find boundary of each object?
  • Represent each Object?

**Male Pelvis – Raw Data**
One CT Slice (in 3d image)
Labels: Coccyx (Tail Bone), Rectum, Prostate

**Male Pelvis – Raw Data**
Prostate: manual segmentation
Slice by slice
Reassembled

**Male Pelvis – Raw Data**
Prostate: Slices: Reassembled in 3d
How to represent?
Thanks: Ja-Yeon Jeong

**Object Representation**
• Landmarks (hard to find)
• Boundary Rep’ns (no correspondence)
• Medial representations
  • Find “skeleton”
  • Discretize as “atoms” called M-reps

**3-d m-reps**
• Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong)
• Medial Atoms provide “skeleton”
• Implied Boundary from “spokes” → “surface”

**3-d m-reps**
• M-rep model fitting
  • Easy, when starting from binary (blue)
  • But very expensive (30 – 40 minutes technician’s time)
• Want automatic approach
  • Challenging, because of poor contrast, noise, …
  • Need to borrow information across training sample
• Use Bayes approach: prior & likelihood → posterior
• ~Conjugate Gaussians, but there are issues:
  • Major HDLSS challenges
  • Manifold aspect of data

**PCA for m-reps, I**
Major issue: m-reps live in (locations, radius and angles)
E.g. “average” of: = ???
Natural Data Structure is: Lie Groups ~ Symmetric spaces (smooth, curved manifolds)

**PCA for m-reps, II**
PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces)
T. Fletcher: Principal Geodesic Analysis
Idea: replace “linear summary of data” with “geodesic summary of data”…

**PGA for m-reps, Bladder-Prostate-Rectum**
Bladder – Prostate – Rectum, 1 person, 17 days
PG 1, PG 2, PG 3
(analysis by Ja Yeon Jeong)

**HDLSS Classification (i.e. Discrimination)**
Background: Two Class (Binary) version:
Using “training data” from Class +1 and from Class -1, develop a “rule” for assigning new data to a Class
Canonical Example: Disease Diagnosis
• New Patients are “Healthy” or “Ill”
• Determined based on measurements

**HDLSS Classification (Cont.)**
• Ineffective Methods:
  • Fisher Linear Discrimination
  • Gaussian Likelihood Ratio
• Less Useful Methods:
  • Nearest Neighbors
  • Neural Nets (“black boxes”, no “directions” or intuition)

**HDLSS Classification (Cont.)**
• Currently Fashionable Methods:
  • Support Vector Machines
  • Tree-Based Approaches
• New High Tech Method
  • Distance Weighted Discrimination (DWD)
  • Specially designed for HDLSS data
  • Avoids “data piling” problem of SVM
  • Solves more suitable optimization problem

**HDLSS Classification (Cont.)**
• Currently Fashionable Methods:
  • Tree-Based Approaches
  • Support Vector Machines:

**Distance Weighted Discrimination**
Maximal Data Piling

**Distance Weighted Discrimination**
Based on Optimization Problem:
More precisely: work in appropriate penalty for violations
Optimization Method (Michael Todd): Second Order Cone Programming
• Still convex generalization of quadratic programming
• Fast greedy solution
• Can use existing software

**DWD Bias Adjustment for Microarrays**
Microarray data:
• Simult. Measur’ts of “gene expression”
• Intrinsically HDLSS
  • Dimension d ~ 1,000s – 10,000s
  • Sample Sizes n ~ 10s – 100s
My view: Each array is “point in cloud”

**DWD Batch and Source Adjustment**
• For Perou’s Stanford Breast Cancer Data
• Analysis in Benito et al. (2004), Bioinformatics
https://genome.unc.edu/pubsup/dwd/
• Adjust for Source Effects
  • Different sources of mRNA
• Adjust for Batch Effects
  • Arrays fabricated at different times
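
The “data piling” problem named in the HDLSS Classification slides can be illustrated with a small simulation. The sketch below is not DWD and not an SVM. It only assumes simulated Gaussian noise with arbitrary ±1 labels (so there is no real class difference) and builds one direction `w` on which every training point projects onto its label. The helper names `solve` and `dot`, the sizes `n = 10` and `d = 200`, and the seed are all illustrative choices, not taken from the slides.

```python
import random

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

random.seed(0)
n, d = 10, 200  # HDLSS setting: dimension d much larger than sample size n

# Pure noise: both "classes" come from the same distribution (assumption).
X = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [1.0] * (n // 2) + [-1.0] * (n // 2)

# With d >= n the n x n Gram matrix X X^T is almost surely invertible,
# so w = X^T alpha with (X X^T) alpha = y satisfies X w = y.
G = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]
alpha = solve(G, y)
w = [sum(alpha[i] * X[i][k] for i in range(n)) for k in range(d)]

# Training projections pile up on the two label values.
proj = [dot(x, w) for x in X]
max_dev = max(abs(p - t) for p, t in zip(proj, y))

# A fresh point from the same noise is not forced onto either pile.
x_new = [random.gauss(0.0, 1.0) for _ in range(d)]
proj_new = dot(x_new, w)

print("largest distance of a training projection from its label:", max_dev)
print("projection of a new point:", proj_new)
```

Because the labels here carry no signal, the perfect separation of the training projections is purely an artifact of d being larger than n. A new point typically lands somewhere between the two piles rather than on either one, which is the kind of behavior the slides cite as motivation for a method designed for HDLSS data.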
