Statistical Data Analysis (Statistická analýza dat)

Daniel Svozil, Laboratory of Informatics and Chemistry, FCHT

Vojtěch Spiwok, ÚBM, FPBT

Information

• Lectures: basics, refresher

• Exercises: R

• http://ich.vscht.cz/~svozil/teaching.html

• Further reading: D. J. Rumsey, Statistics for Dummies, 2011


Valuing houses

How much should you expect to pay for a 1 300 ft² house?

$104 000

The same question, now for a 1 800 ft² house?

$144 000

Valuing houses

How much should you expect to pay for a 2 100 ft² house?

$168 000

2 100 is exactly halfway between 1 800 and 2 400, so the price lies halfway between the prices of those two houses.
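The halfway argument is linear interpolation. A minimal sketch in Python, assuming the table lists a 2 400 ft² house at $192 000 (that entry is an assumption, because the table is not reproduced in this text; the 1 800 ft² price is the answer given earlier). The function name `interpolate_price` is hypothetical.

```python
def interpolate_price(size, size_lo, price_lo, size_hi, price_hi):
    """Linearly interpolate a price for `size` between two known houses."""
    # Fraction of the way from the smaller house to the larger one.
    fraction = (size - size_lo) / (size_hi - size_lo)
    return price_lo + fraction * (price_hi - price_lo)


# 1 800 ft² -> $144 000 is the answer given earlier in the slides.
# 2 400 ft² -> $192 000 is an ASSUMED table entry, used only for illustration.
price = interpolate_price(2100, 1800, 144_000, 2400, 192_000)
print(price)
```

Under that assumption the result matches the $168 000 answer above.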

What does a statistician do?

• Look at data

• Program computers

• Run statistics software

• Drink beer

Linear relationship

Is there a fixed amount per square foot?

No

What if I change 1 400 to 1 300? What is the answer now?

Yes

Scatter plots (Czech: bodový graf)

• Please take a pen and paper and draw a scatter plot of these data.

[Figure: scatter plot axes labelled PRICE and SIZE]

Scatter plots

Is there a fixed price per square foot?

No

Scatter plots

What do you think, is the data linear?

Let’s make a scatter plot.

Surprisingly, the data is linear, even though there is no fixed price per square foot!

PRICE = AA × SIZE + BB

PRICE = 30 × SIZE + 2 000
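A short Python check of this point: with a non-zero constant term, the points lie on a line, yet the price per square foot changes with size. The house sizes below are hypothetical; only the formula PRICE = 30 × SIZE + 2 000 comes from the slide.

```python
def price(size):
    """Price model from the slide: PRICE = 30 * SIZE + 2000."""
    return 30 * size + 2000


sizes = [1000, 1500, 2000, 2500]  # hypothetical house sizes in ft²
prices = [price(s) for s in sizes]

# Price per square foot differs from house to house ...
per_sqft = [p / s for p, s in zip(prices, sizes)]

# ... but the slope between neighbouring points is always the same.
slopes = [
    (prices[i + 1] - prices[i]) / (sizes[i + 1] - sizes[i])
    for i in range(len(sizes) - 1)
]

print("price per ft²:", [round(v, 2) for v in per_sqft])
print("slopes:", slopes)
```

The constant slope is what makes the data linear; the offset of 2 000 is what prevents a fixed price per square foot.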

Scatter plots

Draw a scatter plot and tell me whether these data are linear (i.e., do they lie on a line?).

outliers

Bar chart

Warm up. Are these data linear?

No

How much should you pay for a 2 200 ft² house? Simply interpolate.

105 000

Do you trust this number?

Bar chart

• Take your data and pool them into groups.

Bar chart

• Much finer representation of the data

• A bar chart lets you see global trends.

• A statistician uses cumulative tools (such as a bar chart) to gain an understanding of the underlying data.
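One way to read "pooling" is to group houses into size ranges and plot one bar per range, with the bar height equal to the mean price in that range. The sketch below assumes that reading; the house data and the helper name `mean_price_by_size_bin` are hypothetical.

```python
from collections import defaultdict


def mean_price_by_size_bin(houses, bin_width):
    """Group (size, price) pairs into size bins and average the prices per bin."""
    groups = defaultdict(list)
    for size, price in houses:
        bin_start = (size // bin_width) * bin_width  # lower edge of the bin
        groups[bin_start].append(price)
    return {start: sum(p) / len(p) for start, p in sorted(groups.items())}


# Hypothetical (size in ft², price in $) pairs, not taken from the slides.
houses = [
    (1250, 88_000), (1400, 112_000),
    (1700, 139_000), (1900, 125_000),
    (2100, 160_000), (2300, 150_000),
]

bars = mean_price_by_size_bin(houses, 500)
for start, mean in bars.items():
    # One text "bar" per bin: one '#' per $10 000 of mean price.
    print(f"{start}-{start + 499} ft²: {'#' * round(mean / 10_000)} {mean:.0f}")
```

Individual houses scatter up and down, but the three bars rise steadily, which is the global trend the bar chart is meant to reveal.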

Histograms

• Special case of bar chart.

• A bar chart looks at 2D data, a histogram at 1D data. That is the main difference.

Age distribution

• Draw a histogram on paper with bins of 10 years (i.e. 0-10, 11-20, …)

Ages: 29, 27, 14, 21, 12, 9, 17, 14, 32, 39, 3, 9, 4, 33, 38, 29, 21, 31, 8, 15
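The paper exercise can be checked with a few lines of Python. This sketch uses the 20 ages listed above and the bins named in the exercise (0-10, 11-20, and so on); the helper `bin_index` is a hypothetical name.

```python
ages = [29, 27, 14, 21, 12, 9, 17, 14, 32, 39,
        3, 9, 4, 33, 38, 29, 21, 31, 8, 15]


def bin_index(age):
    """Bin 0 covers 0-10, bin 1 covers 11-20, bin 2 covers 21-30, ..."""
    return 0 if age <= 10 else (age - 1) // 10


counts = {}
for age in ages:
    i = bin_index(age)
    counts[i] = counts.get(i, 0) + 1

for i in sorted(counts):
    low = 0 if i == 0 else 10 * i + 1
    high = 10 * (i + 1)
    print(f"{low:2d}-{high:2d}: {'#' * counts[i]} ({counts[i]})")
```

Each of the four bins ends up with five people, so the histogram for this data set is flat.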

Population pyramid (Czech: věková pyramida)

population pyramid (age pyramid, "tree of life")

a graphical representation of the age structure of a population

source: http://cs.wikipedia.org/wiki/V%C4%9Bkov%C3%A1_pyramida

Histogram

• Now I will collect the heights of everyone in this room.

• Use the Interactive Histogram Applet: http://www.shodor.org/interactivate/activities/Histogram/

• interval, bin

Histogram – Body fat

• In the Interactive Histogram Applet, choose the "Body fat % in 252 men" dataset.

• Find a reasonable bin size.

• Answer the following question: regardless of bin size, which of these statements are always true?

• Most scores fall around 20%.

• The shape is roughly symmetrical.

• Most scores fall in the middle of distribution.

• There are more scores between 15 and 25 than between 35 and 50.

• There are more scores between 0 and 10 than between 18 and 24.

• Relatively more men have a body fat above 35% or below 5%.

Histogram – Income distribution

• United States Census Bureau – http://www.census.gov

Histogram – Income distribution

• This is an example of a (positively) skewed distribution (Czech: zprava zešikmené rozdělení).

• This distribution is not symmetrical.

• Most incomes fall to the left of the distribution.
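A rough numerical illustration of positive skew, using made-up incomes (the values below are hypothetical and are not Census Bureau data). In a positively skewed sample a few large values pull the mean above the median, and the skewness coefficient (third central moment divided by the cubed standard deviation) is positive.

```python
import statistics

# Hypothetical yearly incomes in thousands of dollars: most are low, one is very high.
incomes = [20, 22, 25, 27, 30, 32, 35, 40, 60, 150]

mean = statistics.mean(incomes)
median = statistics.median(incomes)
sd = statistics.pstdev(incomes)  # population standard deviation

# Skewness coefficient: positive when the long tail is on the right.
skewness = sum((x - mean) ** 3 for x in incomes) / (len(incomes) * sd ** 3)

print(f"mean = {mean:.1f}, median = {median:.1f}, skewness = {skewness:.2f}")
```

Most of the values sit below the mean, which matches the statement that most incomes fall to the left of the distribution.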

Pie charts

• pie chart (Czech: koláčový graf)

• elections

• Party A – 50%

• Party B – 50%

• Party A – 724 000 votes

• Party B – 181 000 votes
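To draw a pie chart from vote counts, the counts first have to be turned into shares of the total. A minimal sketch for the two counts above; the function name `vote_shares` is hypothetical.

```python
def vote_shares(votes):
    """Convert a {party: vote count} mapping into percentage shares."""
    total = sum(votes.values())
    return {party: 100 * n / total for party, n in votes.items()}


shares = vote_shares({"A": 724_000, "B": 181_000})
for party, pct in shares.items():
    print(f"Party {party}: {pct:.0f}%")
```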

[Figure: answer options A, B, C, D, E]

Pie charts

• Party A – 175 000

• Party B – 50 000

• Party C – 25 000

• Party D – 50 000

• Please, draw the pie chart

A: 7/12

B: 2/12

C: 1/12

D: 2/12
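The wedge sizes can be checked by computing each party's share of the 300 000 total votes as an exact fraction. The sketch below uses Python's standard `fractions` module and also converts each share to a wedge angle; the variable names are illustrative.

```python
from fractions import Fraction

votes = {"A": 175_000, "B": 50_000, "C": 25_000, "D": 50_000}
total = sum(votes.values())

twelfths = {}
angles = {}
for party, n in votes.items():
    share = Fraction(n, total)      # exact share of all votes
    twelfths[party] = share * 12    # wedge size in twelfths of the pie
    angles[party] = share * 360     # wedge angle in degrees
    print(f"{party}: {twelfths[party]}/12 of the pie, {angles[party]} degrees")
```

The twelfths add up to 12 and the angles to 360 degrees, a quick sanity check for any pie chart drawn by hand.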

Bar chart and scatter plot

• Which scatter plot corresponds to this bar chart?

Pie chart to histogram

• Which histogram looks like it comes from the same data?