
Big Data ---a statistician’s perspective



Presentation Transcript


  1. Big Data ---a statistician’s perspective Ming Ji, PhD College of Nursing USF

  2. Disclaimer • I am not an expert in big data and cannot cover all the developments in big data

  3. Big Data is Here • What is Big Data? • Data that are too big to handle • Data that challenge existing technology and methods to store, process and analyze.

  4. Examples of Big Data • Science Data (CERN) • National Survey Data (NHANES, NHIS, ACS, CPS, NHGIS) • Genomic Data (microarray, DNA sequencing, GWAS, microbiome) • Clinical Data (EHR) • Sensor Data (mHealth) • Social Media (Facebook, Twitter, LinkedIn, websites, blogs) • Climate Data (NOAA) • Financial Data (stock trading, banking, insurance, mortgage, credit cards)

  5. Characteristics of Big Data • Volume • Velocity • Variety • Veracity

  6. Generation of Big Data • Employee generated • User generated • Machine generated

  7. Volume --- Big Data is Big • 2.7 zettabytes of data exist in the digital universe today. • Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data. • In 2008, Google was processing 20,000 terabytes (20 petabytes) of data a day. • 100 terabytes of data are uploaded to Facebook daily. • Data production will be 44 times greater in 2020 than it was in 2009. • In the last 5 years, more scientific data were generated than in all of previous human history.
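The unit arithmetic behind these figures is easy to check. A minimal sketch, using the decimal (SI) storage prefixes the slide implies, confirms that Google's 20,000 terabytes a day is indeed 20 petabytes:

```python
# Decimal (SI) storage prefixes: 1 PB = 1000 TB, 1 ZB = 10**9 TB.
TB = 10**12  # bytes
PB = 10**15
ZB = 10**21

# Google's 2008 figure: 20,000 TB/day equals 20 PB/day.
print(20_000 * TB == 20 * PB)  # True

# For scale: 2.7 ZB is 2.7 billion terabytes.
print(2.7 * ZB / TB)           # 2.7e9
```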

  8. Velocity • High-speed streaming data that must often be processed as they arrive.
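High-velocity streams often cannot be stored and revisited, so summaries must be computed in a single pass. As a minimal sketch (not from the slides), Welford's online algorithm maintains a running mean and variance one observation at a time, in constant memory:

```python
class RunningStats:
    """Single-pass (streaming) mean and variance via Welford's algorithm."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 6.0, 8.0]:  # stand-in for an unbounded data stream
    stats.update(x)
print(stats.mean)        # 5.0
print(stats.variance())  # 20/3 ≈ 6.667
```

The same pattern (update state per observation, never store the raw stream) underlies most real streaming analytics.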

  9. Variety • Besides traditional structured data, such as numerical data sets stored in relational databases, big data come in many different formats, some of them unstructured. • Numerical data; audio; video; text messages; websites; blogs; imaging data; genomic data; environmental data; climate data; clinical data; handwriting, etc.

  10. Veracity • Bias • Uncertainty • Abnormality

  11. Challenges of Big Data • Require new data systems to transfer, store and process big data (Hadoop, Storm, SAP, BigQuery, Amazon EC2) • Require new data analysis methods for big data (big data analytics using data mining and statistical data analysis) • Challenge traditional statistical theory (Law of Large Numbers, Central Limit Theorem, the n << p regime) • Challenge the traditional scientific research method (prediction-based vs mechanism-based research: can big data replace traditional scientific research?)

  12. Personal View: Big Data is Still Data • Big data must follow the same principles of data management • Data collection (streaming data, sensors, GigaScience) • Data storage (Oracle, SAP, IBM, EMC, Hadoop, Storm, BigQuery, Amazon EC2 and EMR) • Data format conversion (voice2txt, txt2voice, natural language processing from unstructured to structured) • Data integration (data linkage, metadata) • Data privacy (privacy-preserving data mining, computer security)
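Unstructured-to-structured conversion can be as simple as pattern extraction. As a sketch, the note format below is entirely invented for illustration; the point is only that free-text lines become typed, queryable records:

```python
import re

# Hypothetical free-text note lines; this format is invented purely to
# illustrate unstructured-to-structured conversion.
notes = [
    "Patient 1041 BP 120/80 recorded 2013-05-01",
    "Patient 1042 BP 135/90 recorded 2013-05-02",
]

pattern = re.compile(
    r"Patient (?P<id>\d+) BP (?P<sys>\d+)/(?P<dia>\d+) recorded (?P<date>[\d-]+)"
)

# Each free-text line becomes a structured record with typed fields.
records = []
for line in notes:
    m = pattern.search(line)
    if m:
        records.append({
            "id": int(m.group("id")),
            "systolic": int(m.group("sys")),
            "diastolic": int(m.group("dia")),
            "date": m.group("date"),
        })

print(records[0])
# {'id': 1041, 'systolic': 120, 'diastolic': 80, 'date': '2013-05-01'}
```

Real clinical NLP is far harder than a regular expression, but the target is the same: structured fields a database or statistical model can use.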

  13. Personal View: Big Data May Not Be Big Enough • GWAS have not identified genetic variants that predict disease well. • Whole-genome sequencing fails to predict the risk of most common diseases (BMJ, 2012).

  14. Personal View: Big Data Cannot Escape Statistical Principles • Collecting and analyzing data from any real-world process must follow the same principles of statistical study design and data analysis. • A big sample size does not remove sampling bias. • Big data may not be big enough (predictive models built from genomic data alone fail because of unmeasured confounders and underspecified models). • Not all big data are useful; often only a small subset is of interest: finding a needle in a haystack (dimension reduction, Google's MapReduce, real-time data analysis).
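The point that sample size does not cure bias is easy to demonstrate by simulation. In this minimal sketch (my own illustration, not from the slides), draws from a population with true mean 0 are kept with a probability that grows with their value, a simple model of selection bias; the huge biased sample stays biased while a modest simple random sample does not:

```python
import math
import random

random.seed(1)
# True population: standard normal, mean 0.

def biased_sample(n):
    """Keep each draw with probability rising in its value: selection bias."""
    out = []
    while len(out) < n:
        x = random.gauss(0.0, 1.0)
        if random.random() < 1.0 / (1.0 + math.exp(-2.0 * x)):
            out.append(x)
    return out

def mean(xs):
    return sum(xs) / len(xs)

small_random = [random.gauss(0.0, 1.0) for _ in range(1_000)]
big_biased = biased_sample(100_000)

print(round(mean(small_random), 2))  # near the true mean of 0
print(round(mean(big_biased), 2))    # stuck well above 0, despite 100x the data
```

Collecting more data from the same biased mechanism only estimates the wrong quantity more precisely; only the sampling design can fix it.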

  15. Personal View: Big Data and Cybernetics • Big data will advance the further merging of humans and machines, as predicted by Norbert Wiener in his work on automation and human society (wearable technology, machine intelligence, hybrid decision making). • Systems science and information theory may be good theoretical frameworks to guide us in building big data systems for various applications (feedback, control, adaptation, information).

  16. Personal View: Big Data Will Boost Computational Sciences • Big data calls for new hardware and software for computation (GPUs, cloud computing, DNA computing, quantum computing) • Big data calls for the next generation of artificial intelligence to produce “smarter algorithms” to handle big data, because we humans cannot directly process it. (Super Turing Machine)

  17. The Future of Big Data: Hope or Hype? • We are at a crossroads: the true effect of big data on human society is yet to be seen. • And we cannot use predictive analytics to predict the future of big data.

  18. How Do We Use Big Data in Our Research? • Think Big: Can you use historically collected and archived big data? (genomic data, large national surveys, NOAA climate data, etc.) • Think Measurement: Do you have measurement devices that can generate big data? (sensors, images, videos, genomics, climate, etc.) • Think Multidisciplinary: Do you have experts from other disciplines (informatics, computer science, engineering, biology, mathematics, statistics, etc.) to work on big data?

  19. Case Studies of Big Data: IBM Watson • Sloan Kettering Cancer Center doctors are training IBM Watson to be an expert in cancer diagnosis and treatment by learning from: • Over 600,000 diagnostic reports • Two million pages of medical journal articles • One and a half million patient records • 14,700 hours of hands-on training

  20. Case Study: Quantified Self, Led by Larry Smarr • Quantified Self participants use different devices to collect physical activity, sleep, diet and gut microbiome data to monitor their own health, and use the analysis results to work with their doctors on interventions. • Larry Smarr considers this the future of disease prevention.

  21. Case Study: Using Big Data to Fight Fraud in Medicare and Medicaid • CMS estimated that $65 billion was lost to fraud in Medicare and Medicaid in 2011. • Fraud detection algorithms are implemented in large claims data systems to flag suspicious cases (real-time fraud detection, fraud detection using social network data). • The Health Care Fraud and Abuse Control Program reported recovering $4.2 billion.
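At its core, claims-based fraud detection flags observations that are extreme relative to a baseline. A toy sketch of the idea (the amounts are invented; real CMS systems use far richer features, including social network data and real-time scoring):

```python
import statistics

# Toy claim amounts in dollars; the last one is an injected outlier.
claims = [120.0, 95.0, 130.0, 110.0, 105.0, 125.0, 98.0, 9_500.0]

mean = statistics.mean(claims)
sd = statistics.stdev(claims)

# Flag any claim more than 2 standard deviations above the mean
# for manual review.
flagged = [c for c in claims if c > mean + 2 * sd]
print(flagged)  # [9500.0]
```

Production systems would compute such baselines per provider and procedure code, and combine many signals rather than one threshold.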
