
Why Big Data Is Not All It’s Cracked Up To Be


Presentation Transcript


  1. Why Big Data Is Not All It’s Cracked Up To Be

    Peter H. Westfall Paul Whitfield Horn Professor of Statistics, Texas Tech University
  2. What Big Data is Cracked Up to Be "The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all." - Chris Anderson, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete," Wired (2008)
  3. A Response That I Agree With "… crucially important caveats are needed when using such datasets: caveats that, worryingly, seem to be frequently overlooked." - Mark Graham, "Big data and the end of theory?", The Guardian Datablog (2012)
  4. Competing Views of Data Statistician paradigm (probabilistic view): a model p(y | x, θ) generates random DATA → observed data → decisions under uncertainty. "Data Scientist" paradigm (deterministic view): data → crunched data → crunched data → decisions (uncertainty?).
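To make the contrast concrete, here is a minimal sketch (the data, numbers, and variable names are invented for illustration, not taken from the talk): both paradigms see the same observed values, but only the probabilistic view treats them as draws from a model and attaches uncertainty to the resulting decision.

```python
# Minimal sketch of the two paradigms applied to the same data (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=10.0, scale=2.0, size=500)   # the observed data

# "Data scientist" paradigm: crunch the data, report a number.
print(f"crunched estimate: {y.mean():.2f}")

# Statistician paradigm: the data are random draws from p(y | theta),
# so report the estimate together with its uncertainty.
se = y.std(ddof=1) / np.sqrt(len(y))
lo, hi = stats.norm.interval(0.95, loc=y.mean(), scale=se)
print(f"estimate {y.mean():.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```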
  5. Resistance to Probabilistic Views of Big Data "Data scientists" have limited training in probability and resist it. Other sources of resistance: the "population data" view (all is known from the data, so nothing is random), and the complaint that everything is statistically significant, yet often meaningless, because N is so large.
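The "everything is significant" complaint is easy to reproduce. A minimal simulation (setup and numbers are my own, not the speaker's): with N in the millions, a difference of 0.005 standard deviations, which is practically meaningless, still earns a very small p-value.

```python
# With very large N, a negligible effect is still "statistically significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 1_000_000
a = rng.normal(loc=0.000, scale=1.0, size=N)   # group A
b = rng.normal(loc=0.005, scale=1.0, size=N)   # group B: true difference of 0.005 SD

t, p = stats.ttest_ind(a, b)
print(f"p-value = {p:.1e}")                                 # typically far below 0.05
print(f"estimated difference = {b.mean() - a.mean():.4f}")  # about 0.005: meaningless in practice
```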
  6. Why Probabilistic Modeling is Needed with Big Data 1. Processes define big data, not vice versa (there is no fixed "population"); 2. Big data becomes small data when sliced; 3. The data you really need are not there; 4. Really, n = 1 even with big data.
  7. 1. Processes Define Big Data, Not Vice Versa The oldest living earthling is (say) 124.5 years old, but that number is biologically irrelevant. Big data are probabilistic; the "population model" fails, and it fails spectacularly with sliced data.
  8. 2. Slicing BIG DATA Gives small data. Less risk?
  9. 3. "Big Data" does not mean "Right Data" Example: credit scoring. The DATA are outcomes of accepted applicants only. Y = repay/not; X's = personal financial measures. Logistic regression!
  10. What You Have and What You Want [The slide contrasted what you have (outcomes for accepted applicants only) with what you want (a model that applies to all applicants).]
  11. Probabilistic Methods Needed For Selection Bias Imputation (reject inference); bivariate probit (Heckman's selection model); … Standard statistical methods, big data or not.
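A hedged sketch of the selection problem in slides 9-11, using simulated data (the acceptance rule, coefficients, and the unobserved creditworthiness variable are illustrative assumptions, not the speaker's example): the model fit only to accepted applicants differs from the model you would fit if outcomes for all applicants were observed, because past acceptance depended on information related to repayment.

```python
# Simulated credit-scoring data: fitting on accepted applicants only is not
# the same as fitting on the applicant population you need to score.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)                                    # observed financial measure
u = rng.normal(size=n)                                    # unobserved creditworthiness
p_repay = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x + 1.0 * u)))
y = rng.binomial(1, p_repay)                              # Y = repay (1) / not (0)
accepted = (x + u + rng.normal(size=n)) > 0               # past acceptance also used u

X = sm.add_constant(x)
naive = sm.Logit(y[accepted], X[accepted]).fit(disp=0)    # what the lender can fit
oracle = sm.Logit(y, X).fit(disp=0)                       # what the lender would want

print("accepted-only fit :", naive.params.round(2))
print("all-applicants fit:", oracle.params.round(2))
# Reject inference would impute y for the rejected applicants, or a selection
# model (e.g. bivariate probit / Heckman) would model acceptance and repayment jointly.
```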
  12. 4. Really, n = 1 Even With Big DATA [Figure: a grid of "Space", s against Time, t (past, present, future); the DATA fill just one cell, and every other cell is a question mark.]
  13. Probabilistic Models for DATA Production [Figure: the same "Space", s by Time, t grid, with each spatio-temporal instance generated by its own distribution: p(y|x,t1,s1), p(y|x,t2,s1), p(y|x,t3,s2), p(y|x,t4,s3).]
  14. The Big Data Estimate of p(y|x,t,s)
  15. And even after we estimate p(y|x,t,s) … An inconvenient truth: p(y|x,t1,s1) ≠ p(y|x,t2,s2). An even more inconvenient truth: even with n = ∞ (really BIG data!), the sample size is just 1 for generalizing to other spatio-temporal instances.
  16. Quantifying Generalizability The extent to which instance "A" generalizes to instance "B" is measured by the distance from A to B. Example: 30% of employees in company "A" need laptops. Generalizability error for company "B" = |30% - (B's percentage)|.
  17. Quantifying Generalizability Fundamental notion: there is variation between instances (σb²) and there is variation within instances (σ²). Generalizability is a function of both variances. Big data can reduce generalizability error within instances, not between.
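A minimal simulation of that decomposition (the grand mean, variances, and instance structure below are assumptions for illustration): piling up observations within one instance drives the within-instance estimation error toward zero, but the gap between that instance and a new one stays on the order of σb no matter how large n gets.

```python
# Within-instance error shrinks with n; between-instance variation does not.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma_b, sigma = 50.0, 2.0, 5.0      # grand mean, between-SD, within-SD (assumed)

def estimate_one_instance(n):
    """Draw one instance, then estimate its mean from n observations within it."""
    instance_mean = mu + rng.normal(scale=sigma_b)
    data = instance_mean + rng.normal(scale=sigma, size=n)
    return instance_mean, data.mean()

for n in (100, 1_000_000):               # "small data" vs "big data" within an instance
    errs = [abs(est - true) for true, est in (estimate_one_instance(n) for _ in range(200))]
    print(f"n = {n:>9}: average within-instance estimation error = {np.mean(errs):.3f}")

# But a fresh instance B differs from instance A by roughly sigma_b regardless of n:
gap = np.abs(rng.normal(scale=sigma_b, size=200) - rng.normal(scale=sigma_b, size=200))
print(f"typical |A - B| gap between instances: {gap.mean():.3f}")
```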
  18. When is σb² "Big"? When generalizing from mouse to man; when generalizing from before to after the housing crisis. σb² is small when generalizing from human biology today to human biology tomorrow.
  19. Conclusions Probability predicts data that are not there. Big data: not all it's cracked up to be, because the data you need are typically not there. Probabilistic modeling is needed. "There is no greater increase in sample size than the increase from one to two." - John Tukey