Those who don’t know statistics are condemned to reinvent it… David Freedman


All you ever wanted to know about the histogram and more.

Presentation Transcript
Distribution of No. of Graphics on web pages (N = 1873)

[Histogram. x-axis: Graphic Count, 0 to 95; y-axis: frequency, 0 to 400]

Mean = 17.93
Median = 16.00
Std. Dev = 17.92
N = 1873

Distribution of Redundant Link % on web pages (N = 1861)

[Histogram. x-axis: 0 to 480; y-axis: frequency, 0 to 1000]

Mean = 22.1
Median = 14
Std. Dev = 37.33
N = 1861

Frequency Table

Convention: include the left endpoint in the class interval
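
The left-endpoint convention can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table`, the class width of 10, and the sample values are assumptions, not data from the slides.

```python
from collections import Counter


def frequency_table(values, width):
    """Count values per class interval [lo, lo + width).

    The left endpoint is included and the right endpoint is excluded,
    so a value of exactly 10 falls in 10 to <20, not in 0 to <10.
    """
    counts = Counter((v // width) * width for v in values)
    return {(lo, lo + width): counts[lo] for lo in sorted(counts)}


# Hypothetical graphic counts, chosen to land on interval boundaries.
graphic_counts = [0, 3, 9, 10, 10, 14, 19, 20, 27]
table = frequency_table(graphic_counts, 10)

for (lo, hi), n in table.items():
    print(f"{lo:>3} to <{hi:<3} {n}")
```

Floor division by the class width assigns each boundary value to the interval that starts at it, which is what the convention asks for.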

No. of fonts used on a web page

[Histogram. x-axis: no. of fonts used on a web page, 1 to 15; y-axis: frequency (0 to 1000) / probability (0 to .5)]

No. of fonts (axis label)   Frequency   Probability
 1                            110         .06
 3                            430         .22
 5                            860         .45
 7                            280         .15
 9                            180         .09
11                             40         .02
13                             20         .01
15                             10         .01
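
The probabilities on this slide appear to be relative frequencies, that is, each frequency divided by the total count. A short Python check using the eight frequencies from the slide (the variable names are mine):

```python
# Frequencies as listed on the slide, in the order shown.
font_frequencies = [110, 430, 860, 280, 180, 40, 20, 10]

total = sum(font_frequencies)

# Relative frequency = frequency / total, rounded to two decimals
# to match the precision used on the slide.
probabilities = [round(f / total, 2) for f in font_frequencies]

print(total)
print(probabilities)
```

Because each value is rounded separately, the rounded probabilities need not add up to exactly 1.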

Distribution of word count (N = 1903)

[Histogram. x-axis: word count, 0 to 20000; y-axis: frequency, 0 to 1600]

Mean = 393.2
Median = 223
Std. Dev = 725.24
Minimum = 0
Maximum = 20,357

Distribution of word count (N = 1897), top six removed

[Histogram. x-axis: word count, 0 to 4000; y-axis: frequency, 0 to 800]

Mean = 368.0
Median = 223
Std. Dev = 474.04
Minimum = 0
Maximum = 4132

Distribution of word count (N = 1873)

[Histogram. x-axis: WORDCNT2, 0 to 2400; y-axis: frequency, 0 to 500]

Mean = 333.4
Median = 220
Std. Dev = 360.30
Minimum = 0
Maximum = 4132

Distribution of link count on good & bad web pages

[Two histograms, labeled Good Sites and Bad Sites. x-axis: link count, 0 to 280; y-axis: frequency, 0 to 300]

Making inferences from histograms: Incidence of riots and temperature

[Histogram. x-axis: temperature, 30 to 110]

Mean and Median

The mean is the arithmetic average; the median is the 50% point, with half the values below it and half above.

The mean is the point where the histogram balances.

  • The mean shifts when a few extreme values are added or removed
  • The median does not shift much; it is more stable
  • Computing the median (sort the list first):
  • For odd N
    • take the middle number
  • For even N
    • interpolate between the middle two,
    • e.g. if they are 7 and 9, then 8 is the median
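
A minimal Python sketch of these points, using the standard `statistics` module. The two short lists are illustrative; only the 7 and 9 example comes from the slide.

```python
from statistics import mean, median

base = [1, 2, 2, 3]
with_outlier = [1, 2, 2, 7]  # same list with the largest value pushed out

# The mean moves toward the extreme value; the median stays put.
print(mean(base), median(base))
print(mean(with_outlier), median(with_outlier))

# Even N: the median is halfway between the two middle numbers.
print(median([7, 9]))

# Odd N: the median is the middle number of the sorted list.
print(median([9, 3, 7]))
```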
The SD says how far away numbers on a list are from their average. Most entries on the list will be somewhere around one SD away from the average. Very few will be more than two or three SDs away.

Understanding the standard deviation

Let's start with a list: 1, 2, 2, 3

[Histogram of the list. y-axis: 0% to 50%]

The histogram is symmetric about 2; 2 is the mean, with 50% of the area to the left of 2 and 50% to the right.

[Three histograms, one per list. y-axis: 0% to 50%]

List          Average   SD
1, 2, 2, 3    2         .8
1, 2, 2, 5    2.5       1.73
1, 2, 2, 7    3         2.71
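
The three lists can be checked with Python's `statistics.stdev`, which divides by N-1. That this matches the slide's method is an inference from the values shown; the variable names are mine.

```python
from statistics import mean, stdev

# The three lists from the slide; only the largest value changes.
lists = [
    [1, 2, 2, 3],
    [1, 2, 2, 5],
    [1, 2, 2, 7],
]

# stdev() is the sample standard deviation (divides by N - 1).
summary = [(mean(xs), round(stdev(xs), 2)) for xs in lists]

for xs, (avg, sd) in zip(lists, summary):
    print(xs, "average =", avg, "SD =", sd)
```

Moving a single value from 3 to 7 raises the average by 1 but more than triples the SD, which is why one far-out entry can dominate the spread.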

Computing the standard deviation

List: 20, 10, 15, 15

Average = 15

Find the deviations from the average: 5, -5, 0, 0

Square the deviations and add them up: (5)^2 + (-5)^2 + (0)^2 + (0)^2 = 50

Divide the sum by N - 1: 50/3 = 16.67

Take the square root: sqrt(16.67) = 4.08
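
The same steps written out in Python, one line per step, so that each intermediate value can be compared with the slide. The variable names are illustrative.

```python
from math import sqrt

values = [20, 10, 15, 15]

# Step 1: the average of the list.
average = sum(values) / len(values)

# Step 2: deviations from the average.
deviations = [v - average for v in values]

# Step 3: square the deviations and add them up.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 4: divide by N - 1 (sample variance).
variance = sum_of_squares / (len(values) - 1)

# Step 5: take the square root.
sd = sqrt(variance)

print(average, deviations, sum_of_squares)
print(round(variance, 2), round(sd, 2))
```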

Properties of the standard deviation
  • The standard deviation is in the same units as the mean
  • The sample standard deviation depends on sample size: dividing by N instead of N-1 underestimates the spread, and the bias is larger in small samples
  • In normally distributed data about 68% of the sample lies within 1 SD of the mean
Properties of the Normal Probability Curve
  • The graph is symmetric about the mean (the part to the right is a mirror image of the part to the left)
  • The total area under the curve equals 100%
  • The curve is always above the horizontal axis
  • It appears to stop after a certain point (the curve gets really low)
[Figure: normal curve with bands marked 1 SD = 68%, 2 SD = 95%, 3 SD = 99.7%]

  • Mean ± 1 SD covers about 68% of the area
  • Mean ± 2 SD covers about 95%
  • Mean ± 3 SD covers about 99.7%
  • You can disregard the rest of the curve
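
The 68 / 95 / 99.7 figures can be reproduced from the normal curve itself. A small sketch using the error function from Python's `math` module; the helper name `area_within` is mine.

```python
from math import erf, sqrt


def area_within(k):
    """Share of the area under a normal curve within k SDs of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"mean +- {k} SD: {area_within(k):.1%}")
```

What is left beyond 3 SDs is about 0.3% of the area, which is why the rest of the curve can usually be ignored.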
Distribution of judges' ratings for the Webby Awards

[Histogram. x-axis: rating, 1 to 10; y-axis: frequency, 0 to 500]

Mean = 6.3
Median = 6.3
Std. Dev = 1.98
N = 1867
Skewness = -.43
Kurtosis = -.201

It is a remarkable fact that many histograms in real life tend to follow the normal curve.

For such histograms, the mean and SD are good summary statistics.

The average pins down the center, while the SD gives the spread.

For histograms that do not follow the normal curve, the mean and SD are not good summary statistics.

What happens when the histogram is not normal?

Distribution of word count on web pages

[Histogram. x-axis: word count, 0 to 2800; y-axis: frequency, 0 to 500]

Mean = 348.3
Std. Dev = 384.83

3 SD = 384 * 3 = 1152

Mean - 3 SD = 348.3 - 1152 = about -804. A normal curve with this mean and SD would place a large share of the sample (about 30% on the slide) below zero, and a page cannot have a negative word count.
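
To see what a normal model with this mean and SD would imply, the share of the curve below zero can be computed with `statistics.NormalDist` (Python 3.8+). This is a sketch of the model's implication only, not a statement about the actual sample; the figure it gives depends on the normal assumption and need not match the slide's rough estimate.

```python
from statistics import NormalDist

# Mean and SD as reported on the word count slide.
model = NormalDist(mu=348.3, sigma=384.83)

# Lower end of the mean +- 3 SD range under the normal model.
lower_3sd = model.mean - 3 * model.stdev

# Share of the normal curve that falls below a word count of zero.
share_below_zero = model.cdf(0)

print(round(lower_3sd, 1))
print(f"{share_below_zero:.1%}")
```

Any sizable share below zero is a sign that the normal curve is a poor description of a count that cannot be negative.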

When the SD is influenced by outliers, use the interquartile range:

75th percentile - 25th percentile

Note: a percentile is the score below which a given percentage of the sample falls.
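
A short Python sketch of why the interquartile range resists outliers, using `statistics.quantiles` (Python 3.8+). The data are made up, and the `inclusive` interpolation method is one of several conventions for percentiles, so other tools may give slightly different quartiles.

```python
from statistics import quantiles, stdev


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical lists: identical except for one extreme value.
base = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]

print("IQR:", iqr(base), iqr(with_outlier))
print("SD: ", round(stdev(base), 2), round(stdev(with_outlier), 2))
```

The single extreme value leaves the quartiles where they were, while the SD grows by two orders of magnitude.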

Measures of Normality
  • Visual examination
  • Skewness: measure of symmetry

[Figure: three curves labeled Symmetric, Positively Skewed, Negatively Skewed]
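
One common way to put a number on skewness is the third standardized moment. The sketch below uses the plain moment formula as an assumption; statistical packages often apply a small-sample correction, so their values can differ slightly. The example lists are illustrative.

```python
def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    avg = sum(values) / n
    m2 = sum((v - avg) ** 2 for v in values) / n
    m3 = sum((v - avg) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


print(skewness([1, 2, 2, 3]))  # symmetric list
print(skewness([1, 2, 2, 7]))  # long tail to the right
print(skewness([1, 6, 6, 7]))  # long tail to the left
```

A value of zero indicates symmetry, a positive value a longer right tail, and a negative value a longer left tail.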

Kurtosis: Does it cluster in the middle?
  • Kurtosis is based on a distribution's tails.
    • Distributions with a large tail: leptokurtic
    • Distributions with a small tail: platykurtic
    • Distributions with a normal tail: mesokurtic

[Figure: three curves labeled Large tail, Small tail, Normal tail]
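
Kurtosis is usually reported as excess kurtosis, where a normal curve scores 0, heavier tails score above 0, and lighter tails score below 0. A sketch using the plain fourth-moment formula, which is an assumption: packages that apply a small-sample correction will report somewhat different numbers. The lists are made up.

```python
def excess_kurtosis(values):
    """Fourth central moment / (second central moment ** 2), minus 3."""
    n = len(values)
    avg = sum(values) / n
    m2 = sum((v - avg) ** 2 for v in values) / n
    m4 = sum((v - avg) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5]                       # evenly spread, small tails
heavy_tail = [1, 1, 1, 1, 1, 1, 1, 1, 1, 11]  # one value far from the rest

print(round(excess_kurtosis(flat), 2))
print(round(excess_kurtosis(heavy_tail), 2))
```

The evenly spread list comes out platykurtic (negative) and the list with one far-out value comes out leptokurtic (positive).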

Positively Skewed and Leptokurtic: Word Count

[Histogram. x-axis: word count, 0 to 20000; y-axis: frequency, 0 to 1600]

Mean = 393.2
Median = 223
Std. Dev = 725.24
Skewness = 13.62
Kurtosis = 321.84
N = 1903

Distribution of word count (N = 1897), top six removed

[Histogram. x-axis: word count, 0 to 4000; y-axis: frequency, 0 to 800]

Mean = 368.0
Median = 223
Std. Dev = 474.04
Skewness = 3.49
Kurtosis = 16.40
N = 1897

Degrees of Freedom
  • The number of independent pieces of information remaining after estimating one or more parameters
  • Example: List = 1, 2, 3, 4; Average = 2.5
  • For the average to remain the same, three of the numbers can be anything you want; the fourth is then fixed
  • New List = 1, 5, 2.5, __; Average = 2.5
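
The idea that the last number is forced once the average is fixed can be sketched in Python. The helper name `forced_last_value` is hypothetical; it simply solves for the one value that keeps the average unchanged.

```python
def forced_last_value(free_values, target_average):
    """Return the one remaining value that makes the full list hit target_average."""
    n = len(free_values) + 1          # the full list has one more entry
    required_total = target_average * n
    return required_total - sum(free_values)


# Original list from the slide: choose 1, 2, 3 freely and the fourth is forced.
print(forced_last_value([1, 2, 3], 2.5))

# New list from the slide: three free choices, one forced value.
print(forced_last_value([1, 5, 2.5], 2.5))
```

With four numbers and one estimated parameter (the average), only three values are free, so the list has three degrees of freedom.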