A Statistician's View of Upcoming Grand Challenges

Alanna Connors
Imputed by Xiao-Li Meng
Joint work with Alex Blocker, Paul Baines, Vinay Kashyap, Pavlos Protopapas, and Andreas Zezas (all members of CBAS, a.k.a. CHASC)

I. Assessing Uncertainty When We Have No Idea What We Are Doing!
• OK, maybe we know a little bit, or a little piece of it
• Genuine replications are NOT possible
• Create pseudo-replications
• Bootstrap (the Green Book by Efron and Tibshirani, 1994)
• Posterior Predictive Replications (Rubin, 1984, Annals of Statistics; Gelman, Meng and Stern, Statistica Sinica, 1996)
• Data Perturbation: taking the derivative with respect to the data. “On Measuring and Correcting the Effects of Data Mining and Model Selection” (J. Ye, 1998, 120-131, J. of the American Statistical Association)
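As a concrete illustration of pseudo-replication, here is a minimal sketch of the nonparametric bootstrap using only the Python standard library. The function name `bootstrap_se` and the toy data are assumptions made for this example; they do not come from the talk.

```python
import random
import statistics

def bootstrap_se(data, statistic, n_boot=2000, seed=0):
    """Estimate the standard error of `statistic` by resampling `data` with replacement."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # One pseudo-replication: n draws with replacement from the observed data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    # The spread of the replicated statistic stands in for its sampling variability.
    return statistics.stdev(replicates), replicates

# Toy data (made up for illustration): 50 draws from a standard normal.
_rng = random.Random(1)
toy = [_rng.gauss(0.0, 1.0) for _ in range(50)]
se_median, reps = bootstrap_se(toy, statistics.median)
```

The same loop works for any statistic that takes a list, which is the appeal when no analytic standard error is available.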

II. “Black Box” Inference and Computation
• The likelihood is given as a “black box” (either as a computer routine or a look-up table);
• The prior is given the same way, or we can simulate from the prior;
• And we want samples from the Bayesian posterior.
• Easy, right? Use Metropolis-Hastings with the prior as the proposal, so that the M-H acceptance ratio reduces to the likelihood ratio …
• Useless, since the posterior will typically be quite different (we at least hope!) from the prior, so the Markov chain won’t converge/mix, especially for high-dimensional problems …
• So what do we do?
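The “easy” recipe and its failure mode can be sketched in a few lines. The Gaussian `log_lik` below is only a hypothetical stand-in for the black box, and the prior scale, data summary, and dimensions are made-up values; the point is that when the proposal is the prior, the prior cancels and only the likelihood ratio remains.

```python
import math
import random

def log_lik(theta, ybar=2.0, n=20):
    """Hypothetical black box: n unit-variance Gaussian observations with mean ybar per coordinate."""
    return -0.5 * n * sum((ybar - t) ** 2 for t in theta)

def prior_draw(rng, dim, scale=3.0):
    """Simulate from the prior: independent N(0, scale^2) coordinates."""
    return [rng.gauss(0.0, scale) for _ in range(dim)]

def independence_mh(dim, n_iter=5000, seed=0):
    """Metropolis-Hastings with the prior as proposal; returns the acceptance rate."""
    rng = random.Random(seed)
    theta = prior_draw(rng, dim)
    ll = log_lik(theta)
    accepted = 0
    for _ in range(n_iter):
        prop = prior_draw(rng, dim)
        ll_prop = log_lik(prop)
        # Prior and proposal cancel: accept with probability min(1, L(prop) / L(theta)).
        if math.log(1.0 - rng.random()) < ll_prop - ll:
            theta, ll = prop, ll_prop
            accepted += 1
    return accepted / n_iter

rate_1d = independence_mh(dim=1)
rate_10d = independence_mh(dim=10)
```

In this toy setup the acceptance rate is already low in one dimension and collapses in ten, which is the “useless” behavior the slide describes.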

We need to adaptively blend many advanced methods
• Parallel Tempering (Geyer, 1991, Proc. 23rd Symposium of CS & Stat Interface)
• Equi-Energy Sampling (Kou, Zhou & Wong, 2006, with discussions, Annals of Statistics)
• Ancillarity-Sufficiency Interweaving Strategy (ASIS) (Yu and Meng, 2010, with discussions, J. Computational and Graphical Statistics)
• AND, we need to know how to “cut corners” …
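To make the first item concrete, here is a minimal sketch of parallel tempering on a toy bimodal target. The target, temperature ladder, and step size are all assumptions chosen for illustration; this is not the adaptive blend used in the work described here.

```python
import math
import random

def log_target(x):
    """Toy bimodal log-density (unnormalized): equal mixture of N(-4, 0.5^2) and N(4, 0.5^2)."""
    a = -0.5 * ((x + 4.0) / 0.5) ** 2
    b = -0.5 * ((x - 4.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def parallel_tempering(n_iter=20000, temps=(1.0, 3.0, 9.0, 27.0), step=1.0, seed=0):
    """Run one chain per temperature; return the samples of the cold (T = 1) chain."""
    rng = random.Random(seed)
    xs = [-4.0] * len(temps)          # every chain starts in the left mode
    cold = []
    for _ in range(n_iter):
        # Within-chain move: random-walk Metropolis on target^(1/T).
        for k, T in enumerate(temps):
            prop = xs[k] + rng.gauss(0.0, step * math.sqrt(T))
            log_ratio = (log_target(prop) - log_target(xs[k])) / T
            if math.log(1.0 - rng.random()) < log_ratio:
                xs[k] = prop
        # Swap move: propose exchanging the states of a random adjacent pair.
        k = rng.randrange(len(temps) - 1)
        log_swap = (1.0 / temps[k] - 1.0 / temps[k + 1]) * (
            log_target(xs[k + 1]) - log_target(xs[k]))
        if math.log(1.0 - rng.random()) < log_swap:
            xs[k], xs[k + 1] = xs[k + 1], xs[k]
        cold.append(xs[0])
    return cold

cold_samples = parallel_tempering()
frac_right = sum(x > 0 for x in cold_samples) / len(cold_samples)
```

The hot chains cross between the modes freely, and the swap moves pass those crossings down to the cold chain, which a plain random-walk sampler at T = 1 would rarely achieve on its own.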

Example: Color-Magnitude Diagrams (Baines, Zezas, Kashyap)
• Goal: Estimate the mass, age (and possibly metallicity) of a cluster of stars
• Parameters: Mass, Age, Metallicity
• Data: Photometric data
• Theory/Likelihood: Isochrones (Tables)
• The isochrones connect the scientifically interesting parameters to the observed data via a complicated mapping

A Colorful But Ugly Likelihood

We want an Equi-Energy (EE) Sampler …
• Jump between points of equal density/probability (or “energy”)

Approximate EE by “Equi-Expectation”
• Implementing the Equi-Energy Sampler in high dimensions is impractical
• Idea: Use the structure of the problem to construct a low-dimensional and efficient approximation to EE
• For Gaussian-like data, “Equi-Expectation” clusters approximate “Equi-Energy” clusters
• “Equi-Expectation” clusters are data (e.g., star) independent, and hence require only a one-time pre-MCMC step

The original parameter space (e.g., magnitude & color)

The “rocking boat” represents the “Expectation Space”

Clustering on the “Expectation Space”

Creating approximate “equi-energy” clusters on the original space
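The clustering idea on the preceding slides can be sketched roughly as follows. Everything here is an assumption made for illustration: `expected_observables` is a made-up formula standing in for an isochrone table lookup, and simple binning replaces whatever clustering method was actually used. The sketch only shows that parameter points far apart in the original space can share (nearly) the same expected observables, and that the grouping needs no data.

```python
import math
from collections import defaultdict

def expected_observables(mass, age):
    """Hypothetical stand-in for an isochrone lookup: (mass, age) -> (magnitude, color)."""
    magnitude = -2.5 * math.log10(mass) + 0.5 * age
    color = (age - 1.0) ** 2          # deliberately non-injective in age
    return magnitude, color

def equi_expectation_clusters(param_grid, bin_width=0.1):
    """Group parameter points whose expected observables fall in the same bin."""
    clusters = defaultdict(list)
    for mass, age in param_grid:
        mag, col = expected_observables(mass, age)
        key = (math.floor(mag / bin_width), math.floor(col / bin_width))
        clusters[key].append((mass, age))
    return clusters

# One-time, data-independent step on a made-up (mass, age) grid.
grid = [(0.5 + 0.05 * i, 0.05 * j) for i in range(31) for j in range(41)]
clusters = equi_expectation_clusters(grid)

def age_range(points):
    ages = [a for _, a in points]
    return max(ages) - min(ages)

# The cluster whose members are most spread out in age.
widest = max(clusters.values(), key=age_range)
age_spread = age_range(widest)
```

A sampler could then propose jumps between members of the same cluster, which is the role the equi-energy rings play in the full EE sampler.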

III. Many Frustrations!!!
• Outliers, really extreme ones!
• Large, long-tailed measurement errors
• Strong dependence
• Non-linear trends (or whatever you want to call them)
• Confounding signals (e.g., quasi-periodic)
• High dimensions
• Too much data
• Too many variables (large p, small n)
• Too little data (there is always ONE observable universe and ONE entire history!)
• Too little funding, too little time …

Example: Event Detection in Time Series (Alex Blocker and Pavlos Protopapas)

Use all your tools, but in the right order!
• Do some pre-processing (e.g., scan statistics) to reduce the computational burden, but with GREAT CAUTION
• Be aware of the artifacts that innocent-looking methods may introduce (e.g., spurious correlations); always try them on test data first!
• Let more rigorous statistical models take care of complications first, whenever the computation is feasible
• Take advantage of more ad hoc methods when the signal is relatively strong and the computational gain is great
• Don’t forget to do model checking and uncertainty assessment via pseudo-replications!

A two-stage approach for event detection
• We fit a statistical model to separate low-frequency trends L, medium-frequency “candidates” M (events or quasi-periodic signals), and white noise N; we use a t-model with small df (e.g., 3) to deal with outliers:
• $Y(t) = \sum_i a_i M_i(t) + \sum_j b_j L_j(t) + N(t)$
• Once the data are reduced to a cleaner (e.g., outliers and non-linear trends removed) and lower-dimensional feature vector, $\{a_i,\ i = 1, \ldots, I\}$, we can use a classifier to separate isolated events from quasi-periodic signals by training on previously identified light curves from each category.
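The first stage can be illustrated with a minimal sketch of a regression with Student-t errors (fixed df = 3), fitted by the standard EM / iteratively reweighted least squares updates. The single bump-shaped M(t), the single linear L(t), the simulated light curve, and the function name `fit_t_regression` are all assumptions for this example, not the basis or the fitting code used in the actual work.

```python
import math
import random

def fit_t_regression(X, y, df=3.0, n_iter=50):
    """EM/IRLS for y = a*M + b*L + t_df noise, with X a list of (M, L) pairs."""
    n = len(y)
    w = [1.0] * n                      # first pass is ordinary least squares
    beta, sigma2 = [0.0, 0.0], 1.0
    for _ in range(n_iter):
        # M-step: weighted least squares via the 2x2 normal equations.
        s11 = sum(wi * m * m for wi, (m, l) in zip(w, X))
        s12 = sum(wi * m * l for wi, (m, l) in zip(w, X))
        s22 = sum(wi * l * l for wi, (m, l) in zip(w, X))
        t1 = sum(wi * m * yi for wi, (m, l), yi in zip(w, X, y))
        t2 = sum(wi * l * yi for wi, (m, l), yi in zip(w, X, y))
        det = s11 * s22 - s12 * s12
        beta = [(s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det]
        resid = [yi - beta[0] * m - beta[1] * l for (m, l), yi in zip(X, y)]
        sigma2 = sum(wi * r * r for wi, r in zip(w, resid)) / n
        # E-step: large residuals get small weights, so outliers lose influence.
        w = [(df + 1.0) / (df + r * r / sigma2) for r in resid]
    return beta, sigma2

# Simulated light curve (made up): event bump + linear trend + noise + gross outliers.
rng = random.Random(0)
times = range(200)
X = [(math.exp(-0.5 * ((t - 100) / 5.0) ** 2), t / 200.0) for t in times]
y = [5.0 * m + 2.0 * l + rng.gauss(0.0, 0.3) + (25.0 if t % 20 == 10 else 0.0)
     for t, (m, l) in zip(times, X)]

beta_ols, _ = fit_t_regression(X, y, n_iter=1)   # a single pass equals least squares
beta_t, _ = fit_t_regression(X, y)
```

In this toy setting the least-squares trend coefficient is pulled far from its true value by the outliers, while the t-model fit stays close; the fitted a-coefficients are what the second-stage classifier would consume.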

Cutting Corners: even the simple Haar wavelets might do the job …
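For readers unfamiliar with Haar wavelets, here is a minimal sketch of the orthonormal Haar decomposition in plain Python. The toy signal and the function names are assumptions for illustration; the talk does not specify how the wavelets were applied.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform: scaled pairwise sums and differences."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_transform(signal):
    """Full decomposition of a length-2^k signal; detail levels are returned finest first."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx, details = list(signal), []
    while len(approx) > 1:
        approx, d = haar_step(approx)
        details.append(d)
    return approx[0], details

# Toy light curve (made up): flat baseline with a short event of height 4.
signal = [0.0] * 16
signal[6] = signal[7] = 4.0
coarse, details = haar_transform(signal)

# The transform is orthonormal, so the total energy is preserved.
energy_in = sum(v * v for v in signal)
energy_out = coarse ** 2 + sum(c * c for level in details for c in level)
```

A short, isolated event shows up as a handful of large detail coefficients, so thresholding them is a cheap way to flag candidates before any heavier modeling.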

The Grandest Challenge of All …
• We need many more future talents who are passionate about quantitative sciences
• And who will stay away from Wall Street regardless of the economy!
• So what do we do?
• Better teaching and training!
• “Desired and Feared – What Do We Do Now and Over the Next 50 Years?” Am. Stat., 2009, Aug.
• “Real-life statistics: Your Chance for Happiness (or Misery)” Amstat News, 2009, Sept.

Happy Team

They are intoxicated by …
