
Univariate EDA



Presentation Transcript


  1. Univariate EDA (Exploratory Data Analysis)

  2. EDA • John Tukey (1970s) • data • two components: • smooth + rough • patterned behaviour + random variation • resistant measures/displays • little influenced by changes in a small proportion of the total number of cases • resistant to the effects of outliers • emphasizes smooth over rough components • concepts apply to statistics and to graphical methods

  3. Tree Ring dates (AD) 1255 1239 1162 1239 1240 1243 1241 1241 1271 • 9 dendrochronology dates • what do they mean???? • usually helps to sort the data…

  4. 1162 1239 1239 1240 1241 1241 1243 1255 1271 Stem-and-Leaf Diagram 11|62 12|39,39,40,41,41,43,55,71 • original values preserved • no rounding, no loss of information…
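The century-based display above can be rebuilt by hand in R. This is a minimal sketch rather than the presenter's code; the variable names (`dates`, `stems`, `leaves`, `display`) are illustrative, and the dates are copied from slide 3.

```r
# Tree-ring dates (AD) from slide 3
dates <- c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271)

sorted <- sort(dates)
stems  <- sorted %/% 100   # century part of each date
leaves <- sorted %% 100    # remaining two digits, kept without rounding

# Gather the leaves that share a stem into one comma-separated string
display <- sapply(split(leaves, stems), paste, collapse = ",")

# Print one "stem|leaves" row per century
for (s in names(display)) {
  cat(s, "|", display[[s]], "\n", sep = "")
}
```

Every leaf here happens to have two digits; a batch with leaves below 10 would need zero-padding (for example with `sprintf("%02d", ...)`) to stay readable.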

  5. can simplify in various ways… 11|6 12|44444467 • ‘leaves’ rounded to nearest decade • ‘stem’ based on centuries

  6. 1162 1239 1239 1240 1241 1241 1243 1255 1271
       116|2
       117|
       118|
       119|
       120|
       121|
       122|
       123|99
       124|0113
       125|5
       126|
       127|1
     ‘stem’ based on decades…

  7. 1162 1239 1239 1240 1241 1241 1243 1255 1271
       116|2
       117|
       118|
       119|
       120|
       121|
       122|
       123|99
       124|0113
       125|5
       126|
       127|1
     highlights existence of gaps in the distribution of dates, groups of dates…
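The gaps that the decade-based stems reveal can also be listed numerically. The sketch below is an illustration only: it counts the dates per decade and keeps the empty decades by declaring every stem from 116 to 127 as a factor level. The names `decades`, `counts` and `empty` are assumptions made for this example.

```r
# Tree-ring dates (AD) from slide 3
dates <- c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271)

# Decade stem of each date; listing all levels keeps empty decades in the table
decades <- factor(dates %/% 10, levels = 116:127)
counts  <- table(decades)
print(counts)

# Decades with no dates at all, i.e. the gaps in the distribution
empty <- names(counts)[as.vector(counts) == 0]
print(empty)
```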

  8. R • stem() • vu <- round(runif(25, 0, 50), 0); stem(vu) • vn <- round(rnorm(25, 25, 10), 0); stem(vn) • stem(vn, scale=2)

  9. Back-to-back stem-and-leaf plot • rim diameter data (cm)

  10. percentiles • useful for constructing various kinds of EDA graphics • don’t confuse percentile with percent or proportion Note: • frequency = count • relative frequency = percent or proportion

  11. percentiles “the pth percentile of a distribution: number such that approximately p percent of the values in the distribution are equal or less than that number…” • can be calculated for numbers that actually exist in the distribution, and interpolated for numbers that don’t…

  12. percentiles • sort the data so that x1 is the smallest value, and xn is the largest (where n = total number of cases) • xi is the pi-th percentile of a dataset of n members, where: pi = 100(i - 0.5) / n

  13. p1 = 100(1 - 0.5) / 7 = 7.1 • p2 = 100(2 - 0.5) / 7 = 21.4 • p3 = 100(3 - 0.5) / 7 = 35.7 • p4 = 100(4 - 0.5) / 7 = 50 • etc… [1]
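The remaining values hidden behind the "etc…" follow the same pattern. A short sketch, assuming n = 7 cases as in the slide (the names `n`, `i` and `p` are chosen here for illustration):

```r
n <- 7                     # number of cases, as in the slide
i <- seq_len(n)            # rank of each sorted value: 1, 2, ..., n
p <- 100 * (i - 0.5) / n   # percentile assigned to the i-th sorted value

round(p, 1)
```

Rounded to one decimal place this gives 7.1, 21.4, 35.7, 50, 64.3, 78.6 and 92.9.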

  14. 50th percentile: i = (7*50)/100 + 0.5 = 4, so xi = 7 • 25th percentile: i = (7*25)/100 + 0.5 = 2.25, so 3 < xi < 5

  15. 25th percentile: i = (7*25)/100 + 0.5 = 2.25, so 3 < xi < 5 • if i is not an integer, then: k = integer part of i; f = fractional part of i; xint = interpolated value of x • xint = (1 - f)*x[k] + f*x[k+1] • xint = (1 - 0.25)*3 + 0.25*5 • xint = 3.5

  16. use R!! • test<-c(1,3,5,7,9,9,14) • quantile(test, .25, type=5)
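The hand calculation on slides 14 and 15 can be checked against `quantile()`. The function below is a sketch of that procedure under stated assumptions: `percentile_by_hand` is a hypothetical name, and clamping i to the range 1..n is an addition made here, since the slides do not say what to do at the extremes.

```r
# Percentile by the slides' rule: i = n*p/100 + 0.5, then interpolate
percentile_by_hand <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  i <- n * p / 100 + 0.5
  i <- min(max(i, 1), n)   # assumption: clamp to valid ranks at the extremes
  k <- floor(i)            # integer part of i
  f <- i - k               # fractional part of i
  if (f == 0) {
    return(x[k])           # i is an integer: the value exists in the data
  }
  (1 - f) * x[k] + f * x[k + 1]   # otherwise interpolate between neighbours
}

test <- c(1, 3, 5, 7, 9, 9, 14)

q25_hand <- percentile_by_hand(test, 25)
q50_hand <- percentile_by_hand(test, 50)
q25_r    <- quantile(test, 0.25, type = 5)

print(c(q25_hand, q50_hand, unname(q25_r)))
```

For this batch the hand-rolled 25th percentile is 3.5 and the 50th is 7, and `type = 5` in `quantile()` returns the same 25th percentile.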

  17. “boxplot” [figure labels: inner fences, 1.5 x midspread beyond each hinge; lower hinge (25th percentile); median (50th percentile); upper hinge (75th percentile); interquartile range (midspread) between the hinges]
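The quantities labelled in the boxplot figure can be computed directly. The sketch below applies them to the tree-ring dates from slide 3. It uses `fivenum()`, which returns Tukey's hinges; these can differ slightly from interpolated 25th and 75th percentiles. The names `hinges`, `midspread`, `fences` and `outside` are illustrative.

```r
# Tree-ring dates (AD) from slide 3
dates <- c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271)

five      <- fivenum(dates)   # min, lower hinge, median, upper hinge, max
hinges    <- five[c(2, 4)]
midspread <- diff(hinges)     # distance between the two hinges

# Inner fences sit 1.5 midspreads beyond each hinge
fences <- c(hinges[1] - 1.5 * midspread, hinges[2] + 1.5 * midspread)

# Cases beyond the inner fences are drawn as separate points
outside <- sort(dates[dates < fences[1] | dates > fences[2]])

print(list(hinges = hinges, midspread = midspread,
           fences = fences, outside = outside))

boxplot(dates, horizontal = TRUE, xlab = "Tree-ring date (AD)")
```

With these nine dates the box is very narrow (hinges at 1239 and 1243), so 1162, 1255 and 1271 all fall outside the fences; the median and midspread are barely affected by them, which is the sense in which these measures are resistant.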

  18. Figure 6.25: Internal diversity of neighbourhoods used to define N-clusters, measured by the 'evenness' statistic H/Hmax on the basis of counts of various A-clusters, and broken down by N-cluster and phase. [Boxes encompass the midspread; lines inside boxes indicate the median, while whiskers show the range of cases that fall within 1.5-times the midspread, above or below the limits of the box.]

  19. Cleveland, W. S. (1985) The Elements of Graphing Data.

  20. Histograms • divide a continuous variable into intervals called ‘bins’ • count the number of cases within each bin • use bars to reflect counts • intervals on the horizontal axis • counts on the vertical axis
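As an illustration of the binning steps, the sketch below counts the tree-ring dates from slide 3 in 50-year bins. The bin edges are an arbitrary choice made for this example, not something the slides specify.

```r
# Tree-ring dates (AD) from slide 3
dates <- c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271)

# Divide the range into 50-year bins and count the cases in each,
# without drawing anything yet
h <- hist(dates, breaks = seq(1150, 1300, by = 50), plot = FALSE)

print(h$breaks)   # bin edges (horizontal axis)
print(h$counts)   # cases per bin (vertical axis)

# Draw the bars: intervals on the horizontal axis, counts on the vertical
plot(h, main = "Tree-ring dates", xlab = "Date (AD)")
```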

  21. Histogram [figure: “bins” along the horizontal axis; counts or percent on the vertical axis]

  22. Histograms • useful for illustrating the shape of the distribution of a batch of numbers • may be helpful for identifying modes and modal behaviour

  23. mode mode? mode! • the distribution is clearly bimodal • may be multimodal…

  24. important variables in histogram construction: • bin width • bin starting point
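A small sketch of how much these two choices matter, using the tree-ring dates from slide 3. The three sets of breaks are arbitrary values picked for illustration.

```r
# Tree-ring dates (AD) from slide 3
dates <- c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271)

# Helper (hypothetical name): bin counts for a given start, end and width
bin_counts <- function(x, from, to, width) {
  hist(x, breaks = seq(from, to, by = width), plot = FALSE)$counts
}

base    <- bin_counts(dates, 1150, 1300, 50)   # 50-year bins from 1150
narrow  <- bin_counts(dates, 1150, 1275, 25)   # same start, 25-year bins
shifted <- bin_counts(dates, 1140, 1290, 50)   # 50-year bins from 1140

print(base)
print(narrow)
print(shifted)
```

Moving the starting point by only ten years splits the tight group of dates around 1240 across two bins, so the same batch looks quite different.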

  25. smoothing histograms • may want to accentuate the ‘smooth’ in a data distribution… • calculate “running averages” on bin counts • level of smoothing is arbitrary…
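One way to compute running averages on bin counts is a centred moving average. This is a sketch under assumptions: the counts are made up for illustration and the window of three bins is an arbitrary smoothing level.

```r
# Illustrative bin counts (not from the slides)
counts <- c(1, 0, 0, 6, 2)

# Centred running average over three neighbouring bins;
# the end bins have no full window and come back as NA
smoothed <- stats::filter(counts, rep(1 / 3, 3), sides = 2)

print(as.numeric(smoothed))
```

A wider window gives a smoother profile at the cost of blurring narrow peaks.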

  26. histogram / barchart variations • 3d • stacked • dual • frequency polygon • kernel density methods

  27. dual barchart

  28. Site 1 Site 2

  29. ‘mirror’ barchart

  30. stacked barchart

  31. 3d barchart

  32. frequency polygon

  33. kernel density model

  34. controlling kernel density plots… • hd <- density(XX) • hh <- hist(XX, plot=F) • maxD <- max(hd$y) • maxH <- max(hh$density) • Y <- c(0, max(c(maxD, maxH))) • hist(XX, freq=F, ylim=Y) • lines(density(XX))
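The steps on this slide can be run end to end once `XX` exists. In the sketch below `XX` is a made-up bimodal sample, since the slides do not give the data, and the `adjust = 2` curve is an extra added here to show the effect of a wider bandwidth.

```r
set.seed(1)
# Assumption: a made-up bimodal batch standing in for the slide's XX
XX <- c(rnorm(60, mean = 20, sd = 3), rnorm(40, mean = 35, sd = 4))

hd        <- density(XX)               # default bandwidth
hd_smooth <- density(XX, adjust = 2)   # doubled bandwidth: smoother curve
hh        <- hist(XX, plot = FALSE)    # histogram object, not drawn yet

# Make the vertical axis tall enough for both the bars and the curve
Y <- c(0, max(hd$y, hh$density))

hist(XX, freq = FALSE, ylim = Y)   # density-scaled histogram
lines(hd)                          # kernel density estimate on top
lines(hd_smooth, lty = 2)          # smoother alternative, dashed
```

Setting `ylim` from both maxima keeps the density curve from being clipped when its peak is higher than the tallest bar.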

  35. Dot Plot [R: dotchart()]

  36. Dot Histogram [R: stripchart()] method = “stack” [figure: three panels, horizontal axis VAR00003 with ticks 1 to 10]
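A minimal sketch of both displays, using the tree-ring dates from slide 3 as stand-in data (the slides' own VAR00003 values are not given):

```r
# Tree-ring dates (AD) from slide 3
dates <- c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271)

# Dot plot: one row per case, in sorted order
dotchart(sort(dates), xlab = "Tree-ring date (AD)")

# Dot histogram: identical values are stacked instead of overplotted
stripchart(dates, method = "stack", pch = 16, xlab = "Tree-ring date (AD)")

# Values that occur more than once and will therefore form stacks
stacked <- sort(unique(dates[duplicated(dates)]))
print(stacked)
```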

  37. line plot cooking/service service ritual

  38. cooking/service service ritual

  39. pie chart [figure: slices labelled 20%, 19%, 22%, 18%, 21%]
