Univariate EDA

(Exploratory Data Analysis)

EDA
  • John Tukey (1970s)
  • data
    • two components:
      • smooth + rough
      • patterned behaviour + random variation
  • resistant measures/displays
    • little influenced by changes in a small proportion of the total number of cases
    • resistant to the effects of outliers
    • emphasizes smooth over rough components
  • concepts apply to statistics and to graphical methods
Tree Ring dates (AD)

1255 1239 1162 1239 1240 1243 1241 1241 1271

  • 9 dendrochronology dates
  • what do they mean????
  • usually helps to sort the data…
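As a rough illustration of why sorting and resistant measures help, the sketch below sorts the nine dates and compares the mean with the median, with and without the unusually early date 1162. The variable names are illustrative and do not come from the slides.

```r
# The nine tree-ring dates (AD) listed above
dates <- c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271)

# Sorting makes the batch much easier to read
sorted <- sort(dates)
print(sorted)

# Compare a non-resistant measure (mean) with a resistant one (median),
# with and without the early date 1162
trimmed <- dates[dates != 1162]
mean_all    <- mean(dates)
mean_trim   <- mean(trimmed)
median_all  <- median(dates)
median_trim <- median(trimmed)

cat("mean:  ", round(mean_all, 1), "->", round(mean_trim, 1), "\n")
cat("median:", median_all, "->", median_trim, "\n")
```

The median is 1241 in both cases, while the mean shifts by roughly nine years when the single early date is dropped, which is the sense of "resistant" used here.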

1162 1239 1239 1240 1241 1241 1243 1255 1271

Stem-and-Leaf Diagram

11|62

12|39,39,40,41,41,43,55,71

  • original values preserved
  • no rounding, no loss of information…
can simplify in various ways…

11|6

12|44444467

  • ‘leaves’ rounded to nearest decade
  • ‘stem’ based on centuries
1162 1239 1239 1240 1241 1241 1243 1255 1271

116|2
117|
118|
119|
120|
121|
122|
123|99
124|0113
125|5
126|
127|1

‘stem’ based on decades…


highlights existence of gaps in the distribution of dates, groups of dates…
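The stem/leaf split can be sketched directly with integer division and remainder. This is only an illustration of the idea, not what R's stem() does internally; the object names are assumptions made for the example.

```r
# Sorted tree-ring dates (AD) from the slides
dates <- sort(c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271))

# Century stems: stem = first two digits, leaf = last two digits
century <- split(dates %% 100, dates %/% 100)
for (s in names(century)) {
  cat(s, "|", paste(century[[s]], collapse = ","), "\n", sep = "")
}

# Decade stems: stem = first three digits, leaf = last digit.
# Declaring every stem from 116 to 127 as a factor level keeps the
# empty stems, so gaps in the distribution stay visible.
stems <- factor(dates %/% 10, levels = 116:127)
decade <- split(dates %% 10, stems)
for (s in names(decade)) {
  cat(s, "|", paste(decade[[s]], collapse = ""), "\n", sep = "")
}

# Stems with no leaves mark the gaps between groups of dates
empty_stems <- names(decade)[lengths(decade) == 0]
print(empty_stems)
```

Keeping the empty stems is the design choice that matters: dropping them would hide the gap between 1162 and the cluster of dates around 1240.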

R
  • stem()
  • vuround(runif(25, 0, 50),0); stem(vu)
  • vnround(rnorm(25, 25, 10),0); stem(vn)
  • stem(vn, scale=2)
Back-to-back stem-and-leaf plot

rim diameter data (cm)

percentiles
  • useful for constructing various kinds of EDA graphics
  • don’t confuse percentile with percent or proportion

Note:

  • frequency = count
  • relative frequency = percent or proportion
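A small sketch of the distinction, using the seven-value batch that appears later in these slides (the object names are illustrative): frequencies and relative frequencies describe how often values occur, whereas a percentile is itself a value on the data scale.

```r
test <- c(1, 3, 5, 7, 9, 9, 14)

# frequency = count of each distinct value
freq <- table(test)
print(freq)

# relative frequency = proportion (or percent) of cases
rel_freq <- prop.table(freq)
print(round(100 * rel_freq, 1))

# a percentile is a data value, not a percent:
# the 50th percentile (median) of this batch is one of the observations
p50 <- median(test)
print(p50)
```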
percentiles

“the pth percentile of a distribution:

 number such that approximately p percent of the values in the distribution are equal or less than that number…”

  • can be calculated for numbers that actually exist in the distribution, and interpolated for numbers that don’t…
percentiles
  • sort the data so that x1 is the smallest value, and xn is the largest (where n=total number of cases)
  • xi is the pi-th percentile of a dataset of n members, where pi = 100(i - 0.5) / n; for example, with n = 7:

p1 = 100(1 - 0.5) / 7 = 7.1

p2 = 100(2 - 0.5) / 7 = 21.4

p3 = 100(3 - 0.5) / 7 = 35.7

p4 = 100(4 - 0.5) / 7 = 50

etc…
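The same arithmetic for all seven ranks can be sketched in R. The variable names are illustrative; the formula is the one used in the worked values above.

```r
n <- 7          # number of cases, as in the worked example
i <- 1:n        # rank of each sorted value

# percentile attached to the i-th sorted value
p <- 100 * (i - 0.5) / n

print(round(p, 1))
```

The first four entries match the values worked out by hand, and the remaining three continue the sequence for ranks 5 to 7.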

To find the pth percentile, invert the formula to get the position i = (n*p)/100 + .5:

50th percentile: i = (7*50)/100 + .5 = 4, so xi = 7

25th percentile: i = (7*25)/100 + .5 = 2.25, so 3 < xi < 5

if i is not an integer, then interpolate:

  • k = integer part of i
  • f = fractional part of i
  • xint = interpolated value of x

xint = (1 - f)*xk + f*x(k+1)

for i = 2.25 (k = 2, f = .25, x2 = 3, x3 = 5):

xint = (1 - .25)*3 + .25*5

xint = 3.5

use R!!
  • test<-c(1,3,5,7,9,9,14)
  • quantile(test, .25, type=5)
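The hand calculation can be checked against quantile() with a small helper. This is a minimal sketch: percentile_by_hand is a hypothetical function written for this example, and clamping i to the range 1..n at the extremes is an assumption, since the slides do not cover that case.

```r
# Hypothetical helper following the interpolation steps in the slides
percentile_by_hand <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  i <- n * p / 100 + 0.5     # position of the p-th percentile
  i <- min(max(i, 1), n)     # assumption: clamp to the observed range
  k <- floor(i)              # integer part of i
  f <- i - k                 # fractional part of i
  if (f == 0) return(x[k])   # lands exactly on an observed value
  (1 - f) * x[k] + f * x[k + 1]
}

test <- c(1, 3, 5, 7, 9, 9, 14)

by_hand_25 <- percentile_by_hand(test, 25)
by_hand_50 <- percentile_by_hand(test, 50)
builtin_25 <- unname(quantile(test, 0.25, type = 5))

print(c(by_hand_25 = by_hand_25, by_hand_50 = by_hand_50, builtin_25 = builtin_25))
```

quantile() supports several definitions through its type argument, and the default (type = 7) can give a different answer on small batches, so type = 5 is specified to match the formula used here.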
“boxplot”

  • lower hinge = 25th percentile
  • 50th percentile (median)
  • upper hinge = 75th percentile
  • interquartile range (midspread) = distance between the hinges
  • inner fences = 1.5 x midspread beyond each hinge


Figure 6.25: Internal diversity of neighbourhoods used to define N-clusters, measured by the 'evenness' statistic H/Hmax on the basis of counts of various A-clusters, and broken down by N-cluster and phase. [Boxes encompass the midspread; lines inside boxes indicate the median, while whiskers show the range of cases that fall within 1.5-times the midspread, above or below the limits of the box.]
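The data behind Figure 6.25 are not available in this transcript, so the sketch below computes the same boxplot ingredients for the tree-ring dates used earlier. It uses fivenum(), which returns Tukey's hinges; the object names are illustrative.

```r
# Tree-ring dates (AD) from the earlier slides
dates <- c(1162, 1239, 1239, 1240, 1241, 1241, 1243, 1255, 1271)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(dates)
lower_hinge <- five[2]
med         <- five[3]
upper_hinge <- five[4]

# midspread and inner fences (1.5 x midspread beyond each hinge)
midspread    <- upper_hinge - lower_hinge
inner_fences <- c(lower_hinge - 1.5 * midspread,
                  upper_hinge + 1.5 * midspread)

# cases outside the inner fences are drawn as individual points
outside <- dates[dates < inner_fences[1] | dates > inner_fences[2]]

print(c(lower_hinge = lower_hinge, median = med, upper_hinge = upper_hinge))
print(inner_fences)
print(outside)

boxplot(dates, horizontal = TRUE, xlab = "tree-ring date (AD)")
```

In R's boxplot() the whiskers stop at the most extreme cases that still lie inside the fences, not at the fences themselves.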

Histograms
  • divide a continuous variable into intervals called ‘bins’
  • count the number of cases within each bin
  • use bars to reflect counts
  • intervals on the horizontal axis
  • counts on the vertical axis
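The binning and counting steps can be inspected without drawing anything by calling hist() with plot = FALSE. The data here are simulated purely for illustration; the seed and sample size are arbitrary assumptions.

```r
set.seed(42)                          # arbitrary seed, for repeatability only
x <- rnorm(100, mean = 25, sd = 10)   # simulated batch, not real data

# let hist() choose the bins, but return the numbers instead of a plot
h <- hist(x, plot = FALSE)

# one row per bin: interval limits and the count of cases inside
bins <- data.frame(from  = head(h$breaks, -1),
                   to    = tail(h$breaks, -1),
                   count = h$counts)
print(bins)

# draw the histogram: intervals on the horizontal axis, counts on the vertical
plot(h, main = "Histogram", xlab = "value", ylab = "count")
```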
Histogram

(figure: bars over “bins” on the horizontal axis; vertical axes show counts and percent)
Histograms
  • useful for illustrating the shape of the distribution of a batch of numbers
  • may be helpful for identifying modes and modal behaviour
(figure: histogram with peaks marked “mode”, “mode?” and “mode!”)

  • the distribution is clearly bimodal
  • may be multimodal…
smoothing histograms
  • may want to accentuate the ‘smooth’ in a data distribution…
  • calculate “running averages” on bin counts
  • level of smoothing is arbitrary…
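One way to sketch a running average on bin counts is with stats::filter(). The counts below are hypothetical, and the window width of 3 is exactly the kind of arbitrary choice the last bullet refers to.

```r
# hypothetical bin counts, invented for illustration
counts <- c(1, 4, 2, 6, 3, 1)

# centred running average over 3 adjacent bins
window <- 3
smoothed <- stats::filter(counts, rep(1 / window, window), sides = 2)
smoothed <- as.numeric(smoothed)   # drop the time-series class

# the first and last bins have no neighbour on one side, so they are NA
print(round(smoothed, 2))

# raw counts as bars, smoothed counts as a line on top
mids <- barplot(counts, names.arg = seq_along(counts),
                xlab = "bin", ylab = "count")
lines(mids, smoothed, type = "b")
```

A wider window gives a smoother curve but also flattens genuine modes, so it is worth trying more than one width.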
histogram / barchart variations
  • 3d
  • stacked
  • dual
  • frequency polygon
  • kernel density methods
(figure: panels for Site 1 and Site 2)

controlling kernel density plots…
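  • hd <- density(XX)   # kernel density estimate of the data vector XX
  • hh <- hist(XX, plot=FALSE)   # histogram values only, no plot yet
  • maxD <- max(hd$y)   # tallest point of the density curve
  • maxH <- max(hh$density)   # tallest histogram bar on the density scale
  • Y <- c(0, max(maxD, maxH))   # y-axis limits that fit both
  • hist(XX, freq=FALSE, ylim=Y)   # histogram drawn as densities
  • lines(hd)   # overlay the density curve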
Dot Histogram [R: stripchart()]

method = “stack”

(figure: dot histograms of VAR00003, axis ticks 1 to 10)
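A minimal sketch of a dot histogram with stripchart(), using the seven-value batch from the percentile slides. The pch and offset settings are just presentation choices made for this example.

```r
test <- c(1, 3, 5, 7, 9, 9, 14)

# method = "stack" piles repeated values on top of each other,
# so the height of each pile is the count for that value
stripchart(test, method = "stack", pch = 16, offset = 0.5,
           xlab = "value")

# the same counts as a table, for comparison with the plot
stack_heights <- table(test)
print(stack_heights)
```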

line plot

(figure: categories cooking/service, service, ritual)

pie chart

(figure: slices of 20%, 19%, 22%, 18%, 21%)

Cumulative Percent Graph

(figure: vertical axes show percent and cumulative percent, ticks 10 to 100)

  • good for comparing data sets
  • some useful statistical measures
  • can be misleading when used with nominal data (use with ordinal or ratio scale data)
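Cumulative percentages can be sketched as a running total of counts over the sorted values. This uses the seven-value batch from the percentile slides, with illustrative object names. The running total only makes sense because the values have a natural order, which is why nominal categories are a poor fit.

```r
test <- c(1, 3, 5, 7, 9, 9, 14)

# distinct values in increasing order, and how often each occurs
values <- sort(unique(test))
counts <- tabulate(match(test, values))

# percent of cases at each value, then the running total
pct     <- 100 * counts / length(test)
cum_pct <- cumsum(pct)

print(data.frame(value = values, percent = round(pct, 1),
                 cumulative = round(cum_pct, 1)))

# step plot of cumulative percent against value
plot(values, cum_pct, type = "s", ylim = c(0, 100),
     xlab = "value", ylab = "cumulative percent")
```

To compare data sets, the curve for a second batch can be added to the same plot with lines(..., type = "s").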