
About me



Presentation Transcript


  1. About me • Educational background – Applied Econometrics • 4 years statistical modelling experience • R experience – 2 years • Currently Senior Analyst at Deloitte • Hobby – rock climbing, data mining competitions • Why? - Early retirement • Current interest – Text analytics

  2. Topic: The benefits of R from a data mining competitor’s point of view and from the point of view of an employee at Deloitte • Work: professional and pragmatic • Home: the playful scientist

  3. Agenda • Quick introduction to R • What I use R for • R at work • Introduction to Deloitte • Frequently used tools • Some of the work we do using R • Examples • Challenges: Data Storage • Challenges: Standardisation • How Deloitte is addressing this issue • R at home: • Some of the work I do using R, at home • Flexibility and convenience • Examples • Prototyping and experimenting • Examples • Questions • Essential R packages for everyday use

  4. Quick introduction to R • “A statistical software created by statisticians, for statisticians” • Personally, I use R for data analysis and statistical modelling • Unique features worth noting: • Open source – free, easy to find help in the active community • Understands mathematical computations and matrix operations naturally • Thousands of packages, implementations of almost any algorithm
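To illustrate the point about matrix operations, here is a minimal sketch (not from the original slides) that solves an ordinary least squares problem directly with matrix algebra and cross-checks it against `lm()`. The data values and the variable names `x`, `y`, `X` and `beta` are made up for illustration.

```r
# Hypothetical data: five observations of one predictor and one response.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Design matrix with an intercept column (5 x 2).
X <- cbind(1, x)

# Ordinary least squares via the normal equations: (X'X) beta = X'y.
# t() transposes, %*% is matrix multiplication, solve() solves the linear system.
beta <- solve(t(X) %*% X, t(X) %*% y)

# Cross-check against R's built-in linear model fit.
fit <- lm(y ~ x)
print(cbind(matrix_algebra = as.numeric(beta), lm = as.numeric(coef(fit))))
```

The matrix expression reads almost exactly like the textbook formula, which is the sense in which R handles this kind of computation naturally.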

  5. Introduction to R: Thousands of packages, implementations of almost any algorithm • Packages: ggplot2, EBImage, randomForest, etc. • N = 500+

  6. R at work

  7. Introduction to Deloitte • We help clients capture, manage and analyse data to solve important business problems and make informed decisions • A holistic process of data mining

  8. Introduction to Deloitte: Typical activity involved in a project at Deloitte • But not everything is R • 20% - 40% of time spent on modelling • [Chart: level of activity over the project timeline; labels shown: initiating processes, planning processes, data loading, data preparation, modelling, closing processes]

  9. Frequently used tools • Geospatial analytics – Tactician • Segmentation – self-organising maps • SQL Server • Modelling • Visualisation

  10. Some of the work we do using R • In Deloitte • Statistical Analysis and Predictive modelling • Time series analysis • Social Network Analysis • Data visualisation • Text analytics (NEW!)

  11. Examples: Time Series • [Chart: actual vs. estimated (fitted) series; y-axis: retail activity?; x-axis: time (days)] • R package: forecast
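The slide's chart was produced with the forecast package and is not reproduced here. As a rough sketch of the same idea (fitting a model to a daily series, then comparing actual and fitted values and projecting ahead), the example below uses base R's `arima()` as a stand-in so it runs without extra packages. The series is simulated, and the model order and 14-day horizon are assumptions, not the presenter's choices.

```r
set.seed(1)

# Simulated daily series standing in for the retail activity data (hypothetical).
n <- 200
y <- ts(100 + cumsum(arima.sim(list(ar = 0.5), n = n)))

# Fit an ARIMA(1,1,0) model using base R (stats package).
fit <- arima(y, order = c(1, 1, 0))

# Fitted values are the actual series minus the model residuals.
fitted_values <- y - residuals(fit)

# Forecast 14 days ahead; predict() returns point forecasts and standard errors.
fc <- predict(fit, n.ahead = 14)

# Compare the last few actual and fitted values side by side.
print(tail(cbind(actual = y, fitted = fitted_values), 5))
print(fc$pred)
```

With the forecast package installed, functions such as `auto.arima()` and `forecast()` can take over model selection and plotting.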

  12. Challenges: Data Storage • We have a dedicated tool to store and clean data – SQL • R cannot handle large data sets (it holds data in memory) • Error: cannot allocate vector of size 2097151 Kb
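To put the error message in perspective, the sketch below (not from the original slides) converts the reported size into a number of double-precision values, then shows one common workaround: processing a file in chunks so that only a small piece is in memory at a time. The temporary file, the chunk size of 100 lines and the variable names are assumptions for illustration.

```r
# Size of the failed allocation reported in the error message.
kb <- 2097151
bytes <- kb * 1024
n_doubles <- bytes / 8   # each double-precision number takes 8 bytes
print(n_doubles)

# Chunked processing: compute a mean without loading the whole file at once.
# A small temporary CSV stands in for a large data extract (hypothetical).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:1000), tmp, row.names = FALSE)

con <- file(tmp, open = "r")
header <- readLines(con, n = 1)   # read and discard the header line
total <- 0
rows <- 0
repeat {
  lines <- readLines(con, n = 100)   # next chunk of up to 100 lines
  if (length(lines) == 0) break      # end of file reached
  chunk <- as.numeric(lines)
  total <- total + sum(chunk)
  rows <- rows + length(chunk)
}
close(con)
unlink(tmp)

chunked_mean <- total / rows
print(chunked_mean)
```

The other approach, and the one the slides point to, is to leave the heavy lifting to the database and pull only filtered or aggregated results into R.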

  13. Challenges: Standardisation • “You’re not the only one using it”: one of the reasons why other commercial tools are preferred over R • Transferable skills across the team • Reliability of packages • Standardised functions and procedures

  14. How Deloitte is addressing this issue • Creating a standardised process • R package: RODBC

  15. How Deloitte is addressing this issue
  • Creating standardised functions:
      library(ggplot2)
      # Density plot for one column of a dataset, selected by column number
      DensityPlot <- function(dataset, col) {
        ds <- data.frame(dataset)
        ds$c <- ds[, col]   # copy the chosen column under a fixed name
        ggplot(data = ds, aes(x = c)) + geom_density(kernel = "biweight")
      }
      DensityPlot(dataset, column_number)   # column_number is a placeholder
  • Retrieving data from the database (RODBC):
      library(RODBC)
      conn <- odbcDriverConnect("driver=SQL Server; database=DataBaseName; server=servername;")
      query <- "SELECT * FROM TableName"
      df <- sqlQuery(conn, query)
      odbcClose(conn)   # release the connection when done
  • R packages: ggplot2, RODBC

  16. R at home

  17. Some of the work I do using R, at home • At home (data mining competitions) • Statistical analysis and predictive modelling • Time series analysis • Social network analysis • Data visualisation • Text analytics • Image analysis • (I mainly use R) • In Deloitte • Statistical analysis and predictive modelling • Time series analysis • Social network analysis • Data visualisation • Text analytics (NEW!) • (we don’t just use R)

  18. Flexibility and convenience • One of the easier programming languages to pick up • Lets you dive into the analysis quickly

  19. Examples • Image analysis • R package: EBImage

  20. Examples • Image analysis • R package: EBImage

  21. Prototyping and experimenting • Access to the latest, most innovative techniques • Great for prototyping new algorithms

  22. Examples: Text analytics • R package: twitteR

  23. Examples: Word cloud of Twitter feeds • R package: wordcloud
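A word cloud is a picture of word frequencies, so the main preparation step is counting words. The sketch below (not from the original slides) does that counting in base R on a few made-up sentences standing in for downloaded tweets. The sample texts and the tiny stop-word list are hypothetical.

```r
# Hypothetical sample texts standing in for downloaded tweets.
tweets <- c(
  "R makes text analytics easy",
  "Text analytics with R: a word cloud of tweets",
  "A word cloud shows frequent words"
)

# Lower-case the text and split on anything that is not a letter.
words <- unlist(strsplit(tolower(tweets), "[^a-z]+"))
words <- words[nchar(words) > 0]          # drop empty strings from the split

# Remove a tiny, illustrative list of stop words.
stopwords <- c("a", "of", "with", "the")
words <- words[!words %in% stopwords]

# Count occurrences and sort from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# With the wordcloud package installed, the counts can be drawn with:
# wordcloud::wordcloud(names(freq), as.numeric(freq), min.freq = 1)
```

For real tweets, the tm package from the deck's "nice to have" list offers fuller cleaning such as stop-word removal and stemming.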

  24. Examples: Text analytics • What are the common themes being tweeted by Time magazine?

  25. Top words associated with the classification • [Chart: top words for each tweet class; panels A, B, C, D] • R package: ggplot2

  26. Classification results

  27. Questions?

  28. Essential R packages for everyday use • Essential • ggplot2 • reshape • RODBC • randomForest • rpart • Nice to have • caret • forecast • tm
