
Exploratory Data Analysis

Hal Varian, 20 March 2006


Presentation Transcript


  1. Exploratory Data Analysis Hal Varian 20 March 2006

  2. What is EDA?
     • Goals
       • Examine and summarize data
       • Look for patterns and suggest hypotheses
       • Provide guidance for more systematic analysis
     • Methods of analysis
       • Primarily graphics and tables
     • Online reference
       • http://www.itl.nist.gov/div898/handbook/eda/eda.htm
       • http://www.math.yorku.ca/SCS/Courses/eda/

  3. Tools for EDA
     • We will use R = open source S
       • Very widely used by statisticians
       • Libraries for all sorts of things are available
     • Download from
       • cran.stat.ucla.edu
       • http://www.r-project.org/
     • Recommend ESS (= Emacs Speaks Statistics) for interactive use
     • Windows interface is not bad

  4. Interactive R session
     > library("foreign")
     > dat <- read.spss("GSS93 subset.sav")
     > attach(dat)
     > summary(AGE)
        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        18.0    33.0    43.0    46.4    59.0    99.0
     > hist(AGE)

  5. Histogram of age

  6. Recode missing data
     AGE[AGE>90] <- NA
     plot(density(AGE,na.rm=T))
     # plot both together
     hist(AGE,freq=F)
     lines(density(AGE,na.rm=T))
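The recoding step above can be tried without the GSS file. This is a minimal sketch on simulated ages; the vector name `age` and the use of 99 as a missing-value code are assumptions for illustration, not something taken from the data set.

```r
set.seed(1)
# Simulated ages; 99 stands in for a "missing" code (an assumption for this sketch)
age <- c(sample(18:89, 200, replace = TRUE), rep(99, 5))

age[age > 90] <- NA                 # recode the sentinel values as missing
d <- density(age, na.rm = TRUE)     # kernel density estimate that ignores the NAs

if (interactive()) {
  hist(age, freq = FALSE)           # histogram on the density scale (hist drops NAs)
  lines(d)                          # overlay the density curve
}
```

Without `na.rm = TRUE`, `density()` stops with an error as soon as the vector contains an `NA`, which is why the slide passes it explicitly.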

  7. Density and density + hist

  8. Boxplot (figure labels: outlier; 1.5 × interquartile range; 3rd quartile; median; 1st quartile; smallest value)

  9. Boxplot enhancements
     • notch=T: notches show an approximate confidence interval for the median
     • varwidth=T: width of box is proportional to sqrt(n)
     • Useful for comparisons
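The numbers behind a boxplot can be inspected directly with `boxplot.stats()`. The sketch below uses simulated data (the value 120 is planted as an obvious outlier); note that R draws the box at the hinges from `fivenum()`, which are close to, but not always identical to, the quartiles.

```r
set.seed(2)
x <- c(rnorm(100, mean = 45, sd = 12), 120)   # simulated data plus one planted extreme value

s <- boxplot.stats(x)   # the numbers boxplot() draws
s$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out                   # points plotted individually as outliers
s$conf                  # notch limits: median -/+ 1.58 * IQR / sqrt(n)

# Whiskers extend to the most extreme data point within 1.5 * IQR of the box
iqr <- s$stats[4] - s$stats[2]
upper_fence <- s$stats[4] + 1.5 * iqr

if (interactive()) boxplot(x, notch = TRUE)
```

Two groups whose notches do not overlap are, roughly, groups whose medians differ; that is what makes the notched version useful for side-by-side comparisons.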

  10. Comparing distributions
      • boxplot(AGE~RACE)
      • boxplot(AGE~RACE,notch=T,varwidth=T)
      • There doesn't seem to be a big difference in the age distributions

  11. EDUC v RACE
      boxplot(EDUC[EDUC<90]~RACE[EDUC<90], notch=T,varwidth=T)

  12. Violin plot
      • Combines density plot and boxplot
      • Good for oddly shaped distributions
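The slide does not say which function drew the violin plot, so the following is only a base-R sketch of the idea: mirror a kernel density estimate around a vertical axis and mark the quartiles on top. The data are simulated and the scaling constant 0.4 is an arbitrary choice.

```r
set.seed(3)
# Simulated bimodal sample: the kind of shape a boxplot alone hides
x <- c(rnorm(150, mean = 0), rnorm(150, mean = 5))

d <- density(x)                          # kernel density estimate
half_width <- d$y / max(d$y) * 0.4       # rescale density to a half-width of at most 0.4
q <- quantile(x, c(0.25, 0.5, 0.75))     # the boxplot part: quartiles and median

if (interactive()) {
  plot(1, type = "n", xlim = c(0.5, 1.5), ylim = range(d$x),
       xaxt = "n", xlab = "", ylab = "x", main = "Violin plot (sketch)")
  # Mirror the density around the vertical line at 1
  polygon(c(1 - half_width, rev(1 + half_width)), c(d$x, rev(d$x)), col = "grey85")
  segments(1, q[1], 1, q[3], lwd = 4)        # interquartile range
  points(1, q[2], pch = 21, bg = "white")    # median
}
```

For this sample the median falls between the two modes, where there is little data, so a plain boxplot would look unremarkable while the violin shows both humps.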

  13. Back to Back Histogram
      library("Hmisc")
      histbackback(EDUC[RACE=="black"],EDUC[RACE=="white"],probability=T)

  14. Two-way table
      GT12 <- EDUC>12
      temp <- table(GT12,RACE)

      GT12    white black other
        FALSE   614   100    37
        TRUE    640    67    38

      prop.table(temp,2)

      GT12        white     black     other
        FALSE 0.4896332 0.5988024 0.4933333
        TRUE  0.5103668 0.4011976 0.5066667
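The column proportions can be reproduced from the counts alone, without the GSS file. In this sketch the counts are copied from the table above; the object names `counts` and `col_props` are illustrative.

```r
# Counts copied from the slide: rows are EDUC > 12 (FALSE/TRUE), columns are RACE
counts <- matrix(c(614, 640, 100, 67, 37, 38), nrow = 2,
                 dimnames = list(GT12 = c("FALSE", "TRUE"),
                                 RACE = c("white", "black", "other")))

col_props <- prop.table(counts, margin = 2)   # divide each column by its total
round(col_props, 3)
colSums(col_props)                            # every column sums to 1
```

`margin = 2` conditions on the column variable, so each entry is the share of a racial group with more (or not more) than 12 years of education; `margin = 1` would instead give the racial composition of each education group.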

  15. Comparing distributions
      • qqplot = quantile-quantile plot
        • For each fraction p, plot the value below which a fraction p of x lies
        • against the value below which a fraction p of y lies
      • Shapes
        • Points on the line y = x: same distribution
        • Vertical intercepts differ: different means
        • Slopes differ: different variances
      • Reference distribution can be a theoretical distribution
        • qqnorm: compare to standard normal
        • Skewed to right: both tails below straight line
        • Heavy tails: lower tail above, upper tail below line
        • These directions assume the sample is on the horizontal axis; R's qqnorm puts the sample on the vertical axis by default, which reverses them
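The intercept and slope readings above can be checked on simulated samples. This is a sketch under assumed parameters (a standard normal `x`, and `y` with mean 2 and standard deviation 2); fitting a line through matched quantiles is just a numeric stand-in for eyeballing the plot.

```r
set.seed(4)
x <- rnorm(2000)                      # reference sample: mean 0, sd 1
y <- rnorm(2000, mean = 2, sd = 2)    # shifted and rescaled sample

p  <- seq(0.05, 0.95, by = 0.05)      # fractions at which to compare quantiles
qx <- quantile(x, p)
qy <- quantile(y, p)
fit <- coef(lm(qy ~ qx))              # intercept reflects the mean shift, slope the sd ratio

if (interactive()) {
  qqplot(x, y); abline(fit)           # two-sample QQ plot with the fitted line
  qqnorm(y); qqline(y)                # sample vs normal, sample on the vertical axis
  qqnorm(y, datax = TRUE)             # same comparison with the sample on the horizontal axis
}
```

Both the intercept and the slope should come out close to 2 here, since `y` was built as roughly `2 + 2 * x` in distribution.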

  16. qqplot(x,y) examples (figure panels: identical; Mean1=0, Mean2=2; s1=1, s2=2; sample v N(0,1), with ref line)

  17. More qqnorm examples (figure panels: skewed to right; heavy tails). Source: www.maths.murdoch.edu.au/units/statsnotes/samplestats/qqplot.html

  18. Pairs of variables
      • Is one variable related to another?
      • Scatterplot
        • Basic: plot(x,y)
        • Enhanced from library("car"): scatterplot(x,y)
      • Scatterplot matrix
        • Basic: pairs(data.frame(x,y,z))
        • Enhanced: scatterplot.matrix(data.frame(x,y,z)) (named scatterplotMatrix in later versions of car)
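A base-R sketch of the basic versions, on simulated variables (`x`, `y`, `z` and their relationships are made up for illustration). The smoother added by hand with `lowess()` is one of the things the enhanced `car` plot adds automatically; the correlation matrix is a numeric companion to the scatterplot matrix.

```r
set.seed(7)
x <- rnorm(100)
y <- 2 * x + rnorm(100)     # related to x by construction
z <- rnorm(100)             # unrelated to both
df <- data.frame(x, y, z)

cors <- cor(df)             # pairwise correlations
round(cors, 2)

if (interactive()) {
  plot(x, y)                            # basic scatterplot
  lines(lowess(x, y), col = "red")      # hand-added smoother, base R only
  pairs(df)                             # scatterplot matrix
}
```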

  19. Basic and enhanced scatterplot

  20. Scatterplot matrix

  21. Labeling points in scatterplots
      • identify(x,y,labels="foo")
      • Color is also useful

  22. Cigarettes and taxes
      • Discussant on paper by Austan Goolsbee, “Playing with Fire”
      • Question: did Internet purchases of cigarettes affect state tobacco tax revenues?

  23. Cigarette Prices in 1990s

  24. Internet usage

  25. Price elasticity of use/sales
      • Across all states and years
        • Taxable sales elasticity: -0.802
        • Use elasticity: -0.440
      • Sales are much more responsive to price than usage, suggesting that there is some cross-border trade (aka “buttlegging”)
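The slide does not show how the elasticities were estimated. One common approach, assumed here, is a log-log regression, where the slope on log price is the elasticity. The data below are simulated with "true" elasticities of -0.8 and -0.4 chosen only to resemble the slide's figures; they are not the paper's data.

```r
set.seed(5)
n <- 500
log_price <- rnorm(n, mean = log(3), sd = 0.2)          # simulated log prices
log_sales <- 4 - 0.8 * log_price + rnorm(n, sd = 0.1)   # simulated; true elasticity -0.8
log_use   <- 4 - 0.4 * log_price + rnorm(n, sd = 0.1)   # simulated; true elasticity -0.4

# In a log-log regression the slope is the price elasticity
e_sales <- coef(lm(log_sales ~ log_price))[["log_price"]]
e_use   <- coef(lm(log_use   ~ log_price))[["log_price"]]
c(sales = e_sales, use = e_use)
```

A sales elasticity further from zero than the use elasticity is the pattern the slide reads as evidence of cross-border purchases: taxed sales in a state fall by more than smoking in that state does.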

  26. Use vs Sales in 2000

  27. Reduced form
      • dp = log(p2001) - log(p1995)
      • dq = log(q2001) - log(q1995)
      • Regress dq/dp on Internet penetration in 2000
      • See next slide for result
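The reduced-form steps can be sketched on made-up state-level data. Every number below is hypothetical, including the assumed relationship between Internet penetration and price responsiveness; the point is only to show the mechanics of building dp, dq and the regression.

```r
set.seed(6)
n_states <- 50
internet <- runif(n_states, 0.3, 0.6)             # hypothetical penetration shares in 2000
p1995 <- runif(n_states, 1.5, 2.5)                # hypothetical prices
p2001 <- p1995 * runif(n_states, 1.3, 2.0)
q1995 <- runif(n_states, 50, 150)                 # hypothetical quantities
# Assumed for the sketch: sales respond more to price where penetration is higher
true_elasticity <- -0.3 - 1.0 * internet
q2001 <- q1995 * exp(true_elasticity * (log(p2001) - log(p1995)) +
                     rnorm(n_states, sd = 0.02))

dp <- log(p2001) - log(p1995)                     # log price change
dq <- log(q2001) - log(q1995)                     # log quantity change
ratio <- dq / dp                                  # state-level implied elasticity
fit <- lm(ratio ~ internet)                       # regress on Internet penetration
coef(fit)
```

A negative coefficient on `internet` would mean that taxable sales fell more per unit of price increase in states with higher penetration.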

  28. What is Internet providing?
      • It was always a good deal for some to buy cigarettes out-of-state (in high tax states)
      • Mail order has been around for a long time and is certainly cost-effective
      • Internet makes it easier to find merchants: just type into a search engine
      • Internet is great at matching buyers and sellers

  29. Price of a match
      • Google doesn’t accept cigarette advertisements, but Overture does
      • Price for top listing: $1.20 per click
      • Avg price for click on Overture is 40 cents
      • Conversion rates might be 5%, so the top-listed advertiser is paying $1.20 / 0.05 = $24 per introduction
      • But think of lifetime value…
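The arithmetic behind the $24 figure, written out in R. The three inputs are the slide's numbers; the variable names are illustrative, and the second calculation (at the average click price) is an extension of the same formula rather than a figure from the slide.

```r
cost_per_click_top <- 1.20   # price for top listing, dollars per click (from the slide)
cost_per_click_avg <- 0.40   # average Overture click price (from the slide)
conversion_rate    <- 0.05   # the slide's "might be 5%"

# Clicks needed per customer is 1 / conversion_rate, so cost per customer is:
cost_per_customer_top <- cost_per_click_top / conversion_rate   # 24 dollars
cost_per_customer_avg <- cost_per_click_avg / conversion_rate   # 8 dollars
```

Whether $24 is expensive depends on the comparison the slide points to: the lifetime value of a repeat customer, not the margin on a first order.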

  30. Value of a match
      • Google doesn’t accept cigarette advertisements, but Overture does
      • Price for top listing: $1.20 per click
      • Avg price for click on Overture is 40 cents
      • Conversion rates might be 5%, so advertiser is paying $24 for introduction
      • But think of lifetime value…

  31. Straightening out and scaling data
      • Find a transform so that data looks linear, or normal, or fits on the same scale
        • Log10 (easier to interpret than natural log)
        • Square root
        • Reciprocal
        • Box-Cox transform (x^r - 1)/r, which combines many of the above; the limit r → 0 is log(x)
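A small sketch of the Box-Cox family, showing how particular values of r recover the transforms listed above. The function name `box_cox` and the cutoff used to switch to the log are choices made for this example.

```r
box_cox <- function(x, r) {
  stopifnot(all(x > 0))                          # the transform needs positive data
  if (abs(r) < 1e-8) log(x) else (x^r - 1) / r   # use the r -> 0 limit near zero
}

x <- c(0.5, 1, 10, 1000)
box_cox(x, 1)      # x - 1: linear, shape unchanged
box_cox(x, 0.5)    # 2 * (sqrt(x) - 1): square-root family
box_cox(x, -1)     # 1 - 1/x: reciprocal family
box_cox(x, 0)      # log(x), the r -> 0 limit

log10(x)           # base-10 log: each unit is a factor of 10
```

Subtracting 1 and dividing by r changes only location and scale, so for plotting purposes `x^r` and the Box-Cox form look the same; the normalized form is what makes the family continuous in r at zero.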

  32. City sizes: regular & log10
