A Statistical Viewpoint on Data Science, Data Mining and Big Data

A Statistical Viewpoint on Data Science, Data Mining and Big Data. Alec Stephenson. DATA ANALYTICS, DIGITAL PRODUCTIVITY AND SERVICES. Introduction. Statistics Vs Data Science Statistician Vs Data Scientist Data Science in Predictive Analytics Data Science in Consulting


Presentation Transcript


  1. A Statistical Viewpoint on Data Science, Data Mining and Big Data Alec Stephenson DATA ANALYTICS, DIGITAL PRODUCTIVITY AND SERVICES

  2. Introduction • Statistics Vs Data Science • Statistician Vs Data Scientist • Data Science in Predictive Analytics • Data Science in Consulting • Big Data: Are Statisticians Relevant?

  3. Data Science Venn Diagram (Drew Conway)

  4. Statistician Vs Data Scientist

  5. I am a Data Scientist • On LinkedIn • On my email signature • To market myself to internal and external clients I am a Statistician • At academic conferences • Providing expertise for journal articles • Any role as a technical expert

  6. Is There A Greater Demand For Data Scientists? • Experfy www.experfy.com • Melbourne Data Science Meet-Up www.meetup.com/Data-Science-Melbourne/ BUT: • Kaggle Connect no longer exists (it ran March-December 2013)

  7. Data Science Skills Essential: • Statistical Modelling: e.g. R, Matlab, Python • Data Munging: e.g. Perl, Python, Ruby Additional: • Fast Computation: C, C++, Java • Data Storage: SQL, NoSQL • Big Data: MapReduce, Mahout, Hive, Pig

  8. Data Mining Competitions www.kaggle.com Good For Building Essential Skills In Predictive Analytics Only Three Steps To Winning: • Data Munging • Machine Learning / Statistical Modelling • Ensembling

  9. Data Mining Competitions www.kaggle.com General Advice: • Just because you have data does not mean that you have to use it • There is no such thing as a single best model • Different models can capture different features • Visualize the data
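
As one possible illustration of the "visualize the data" advice, here is a minimal base R sketch. The choice of the built-in iris data and of these particular plots is an assumption made here for illustration, not something taken from the slides.

```r
# Illustrative only: iris and the choice of plots are assumptions, not from the slides.
num <- iris[, 1:4]

# Pairwise scatterplots, coloured by species, show which variables move together.
pairs(num, col = as.integer(iris$Species), pch = 19)

# Pairwise correlations give a numeric summary of the same picture.
cors <- round(cor(num), 2)
print(cors)
```

A quick look like this often suggests which transformations or interactions are worth trying before any model is fitted.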

  10. Data Mining Competitions www.kaggle.com General Advice: • If something takes more than one minute to run, do you really need to run it? • Spend more time on trying different data transformations and models, and less on parameter specification • Just have a go. • How much time can you afford?
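
The "more than one minute" advice can be checked cheaply by timing a fit on a random subsample first. The sketch below uses simulated data and a plain lm() fit; the data, the 5% subsample size, and the model are all assumptions for illustration.

```r
# Illustrative only: the simulated data, subsample size, and lm() model are assumptions.
set.seed(1)
n <- 200000
big <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
big$y <- 1 + 2 * big$x1 - big$x2 + rnorm(n)

# Time the fit on a 5% random subsample and on the full data.
sub <- big[sample(n, n / 20), ]
t_sub <- system.time(fit_sub <- lm(y ~ x1 + x2, data = sub))["elapsed"]
t_full <- system.time(fit_full <- lm(y ~ x1 + x2, data = big))["elapsed"]
print(c(subsample = unname(t_sub), full = unname(t_full)))

# Compare the coefficient estimates from the two fits.
print(rbind(subsample = coef(fit_sub), full = coef(fit_full)))
```

If the subsample gives nearly the same answer, it may be enough for quick iteration over transformations and models.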

  11. Data Mining Competitions www.kaggle.com Usually Good Methods: • Gradient boosting machine (gbm / mboost) • Random forest (randomForest) • Elastic net (glmnet) • Support Vector Machine (kernlab / e1071) • Neural networks (nnet)

  12. Data Mining Competitions www.kaggle.com Usually Not-So-Good Methods: • Recursive Partitioning (rpart / tree) • Nearest neighbour (class) • Multivariate Adaptive Regression Splines (earth) • Naive Bayes (e1071)

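  13. Data Mining Example I

          library(randomForest)
          library(gbm)
          library(glmnet)

          # Numeric iris measurements only (drop the Species column)
          data <- as.matrix(iris[,-5])

          # Hold out 15 of the 150 rows as a test set
          set.seed(100)
          ind <- sample(150, 15)
          train <- data[-ind,]
          test <- data[ind,]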

  14. Data Mining Example II

          # Random forest: predict column 1 (Sepal.Length) from the other three columns
          set.seed(100)
          m1 <- randomForest(train[,2:4], train[,1], ntree = 1000, mtry = 2)
          pm1 <- predict(m1, test[,-1])
          mean((pm1 - test[,1])^2)   # test-set mean squared error

          # Gradient boosting machine
          set.seed(100)
          m2 <- gbm.fit(train[,2:4], train[,1], distribution = "gaussian",
                        n.trees = 10000, shrinkage = 0.001, interaction.depth = 2)
          pm2 <- predict(m2, test[,-1], n.trees = 10000)
          mean((pm2 - test[,1])^2)

          # Elastic net
          set.seed(100)
          m3 <- glmnet(train[,2:4], train[,1], family = "gaussian", alpha = 0.5)
          pm3 <- predict(m3, test[,-1])
          pm3 <- pm3[,ncol(pm3)]     # keep the fit at the last (smallest) lambda on the path
          mean((pm3 - test[,1])^2)
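
Example II keeps the elastic net fit at the last value on the lambda path, which is the smallest penalty. A common alternative is to choose lambda by cross-validation with cv.glmnet. The sketch below assumes the glmnet package is installed and rebuilds the same train/test split; the names cv3, pm3_cv and mse3_cv are introduced here for illustration only.

```r
library(glmnet)

# Same split as in the slides.
data <- as.matrix(iris[, -5])
set.seed(100)
ind <- sample(150, 15)
train <- data[-ind, ]
test <- data[ind, ]

# Choose lambda by cross-validation on the training rows only (default 10 folds).
set.seed(100)
cv3 <- cv.glmnet(train[, 2:4], train[, 1], family = "gaussian", alpha = 0.5)

# s = "lambda.min" selects the lambda with the lowest cross-validated error.
pm3_cv <- predict(cv3, newx = test[, -1], s = "lambda.min")[, 1]
mse3_cv <- mean((pm3_cv - test[, 1])^2)
print(mse3_cv)
```

Whether this beats the fixed choice on a 15-row test set is not guaranteed; the point is only that the penalty is chosen without looking at the test rows.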

  15. Data Mining Example III

          # Ensembling: test-set MSE of simple averages of the model predictions
          mean(((pm1 + pm2)/2 - test[,1])^2)
          mean(((pm1 + pm3)/2 - test[,1])^2)
          mean(((pm2 + pm3)/2 - test[,1])^2)
          mean(((pm1 + pm2 + pm3)/3 - test[,1])^2)
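
A single 15-row hold-out gives a noisy comparison between models and their averages. One way to steady it is k-fold cross-validation. The sketch below is an assumption-laden stand-in: two simple lm() fits replace the slide's random forest, gbm, and glmnet models so that it runs with base R alone, and k = 5 is an arbitrary choice.

```r
# Illustrative only: two simple lm() fits stand in for the slide's models
# so that the sketch needs no extra packages.
df <- iris[, 1:4]
k <- 5
set.seed(100)
fold <- sample(rep(1:k, length.out = nrow(df)))

mse <- function(pred, obs) mean((pred - obs)^2)

res <- matrix(NA_real_, nrow = k, ncol = 3,
              dimnames = list(NULL, c("linear", "interactions", "average")))
for (i in 1:k) {
  tr <- df[fold != i, ]
  te <- df[fold == i, ]
  p1 <- predict(lm(Sepal.Length ~ ., data = tr), te)     # main effects only
  p2 <- predict(lm(Sepal.Length ~ .^2, data = tr), te)   # plus pairwise interactions
  res[i, ] <- c(mse(p1, te$Sepal.Length),
                mse(p2, te$Sepal.Length),
                mse((p1 + p2) / 2, te$Sepal.Length))
}

# Mean test MSE per model across the k folds.
print(colMeans(res))
```

Within each fold, the averaged prediction can never have a higher MSE than the worse of the two models, which is part of why simple averaging is a low-risk ensembling step.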

  16. Prediction: Competitions Vs Clients • Predictive analytics is a black box • Simplicity vs Predictive Accuracy • Communication with client • Reporting: methods or conclusions • Variable Importance • Client Implementation
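
Variable importance is one way to make a black-box prediction easier to report to a client. The sketch below shows a generic permutation approach in base R; the lm() model, the iris data, the 30-row test set, and the 50 shuffles are all assumptions for illustration, not the author's method. The randomForest package also offers its own importance() function for fitted forests.

```r
# Illustrative only: lm() on iris and this permutation scheme are assumptions.
df <- iris[, 1:4]
set.seed(100)
ind <- sample(nrow(df), 30)
train <- df[-ind, ]
test <- df[ind, ]

fit <- lm(Sepal.Length ~ ., data = train)
mse <- function(pred, obs) mean((pred - obs)^2)
base_mse <- mse(predict(fit, test), test$Sepal.Length)

# Shuffle one predictor at a time and record the average rise in test MSE.
perm_importance <- sapply(setdiff(names(df), "Sepal.Length"), function(v) {
  increases <- replicate(50, {
    shuffled <- test
    shuffled[[v]] <- sample(shuffled[[v]])
    mse(predict(fit, shuffled), test$Sepal.Length) - base_mse
  })
  mean(increases)
})

print(sort(perm_importance, decreasing = TRUE))
```

A larger increase in error after shuffling suggests the model leans more heavily on that variable, which is usually easier to explain than the model internals.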

  17. Big Data • Means different things to different people • SKA (Square Kilometre Array): 10 petabytes per hour by 2025 • Statisticians typically consider a few gigabytes to be a huge dataset • Do statisticians have a role to play?

  18. Big Data 3V’s: Volume, Velocity, Variety • Volume: MB, GB, TB, PB, ... • Velocity: Real-Time, Hourly, Weekly, Batch • Variety: Structured, Unstructured • Veracity: How accurate? • Value: How valuable?

  19. Gartner Hype Cycle 2013

  20. Big Data: A typical statistician… • Will say that they are heavily involved in big data • Will use big data for marketing purposes • Will never have programmed a MapReduce job • Will have never used datasets of 0.5TB+ • Will not know about big data technologies • Why is this?

  21. Statisticians may have a role in • Deciding what data is relevant to the question • Subsetting and sampling big data • Modelling these subsets Statisticians may not have a role • If you need to touch all of the data (0.5TB+) • Restriction to linear (or linearithmic) algorithms • Sums / Averages / Graph Search / Sorting
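
The contrast on this slide can be sketched in a few lines of R. A simulated vector stands in for data too large to model in full; the sample size, chunk size, and distribution are assumptions chosen only for illustration.

```r
# Illustrative only: a simulated vector stands in for data too large to model in full.
set.seed(1)
x <- rexp(1e6, rate = 0.2)   # true mean is 5

# Sampling route: estimate from a small random sample, with a standard error.
s <- sample(x, 10000)
est <- mean(s)
se <- sd(s) / sqrt(length(s))

# "Touch all the data" route: one linear pass in chunks, keeping only running totals.
chunk_size <- 100000
total <- 0
count <- 0
for (start in seq(1, length(x), by = chunk_size)) {
  chunk <- x[start:min(start + chunk_size - 1, length(x))]
  total <- total + sum(chunk)
  count <- count + length(chunk)
}
full_mean <- total / count

print(c(sample_estimate = est, std_error = se, full_pass = full_mean))
```

The sampling route reads 1% of the values and attaches an uncertainty to the answer; the full pass is exact but is an engineering problem (a sum over chunks) rather than a modelling one.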

  22. Big Data??? Robust Statistics and Extremes, 8 – 11 September 2014, Australian National University. Statistics today is faced with many challenges, especially relating to such topical issues as the analysis of "big data" through to understanding the complexities of climate change - and many others. Floods, fires, variations in temperature on local through to global scales, etc., have provided impetus for recent vigorous redevelopments of extreme value analysis. Extremely large data sets and high dimensional data now becoming available in genetics, finance, physics, astronomy, and many other areas, have spurred exponential advances in statistical theory and practice, with special emphasis on robustness issues, in recent years. The need to analyse large, linked data sets in health, crime, agriculture, surveys, and industry, just to name a few, has revolutionised our profession. It's an exciting time to be a statistician. The aim of the Robust Statistics and Extremes (RS&E) conference is to provide an opportunity for researchers to present up-to-date accounts of the present state of the art in the topics of Robust Statistics and Extremes. A number of distinguished speakers, both international and Australian, will give keynote addresses in their areas of interest. Special provision will be made for student participation.