
Chapter 2


radwan

Presentation Transcript


  1. Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

  2. Review: types of data
     • Numerical data:
       --- continuous
       --- discrete
     • Categorical data:
       --- nominal
       --- ordinal

  3. Review: statistics
     Statistics is a branch of applied mathematics concerned with the collection and interpretation of data, and with the evaluation of the reliability of conclusions based on the data.

  4. Types of statistical analysis
     • Descriptive analysis:
       --- Data collection
       --- Data interpretation
     • Inferential analysis:
       --- Evaluating the reliability of the conclusions

  5. Contents • Frequency distribution ★ • Central tendency★ • Dispersion (measures of variability) ★ • Tables and graphs

  6. New words • Frequency 频数 • Proportion 比例 • Percentage 百分数 • Histogram 直方图 • Polygon 折线图 • Distribution 分布 • Frequency distribution 频数分布

  7. Cumulative frequency 累积频数 • Cumulative proportion 累积比例 • Central tendency 集中趋势 • Dispersion 离散程度 • Mean 均数 • Arithmetic mean 算术均数 • Geometric mean 几何均数

  8. Median 中位数 • Mode 众数 • Skewness 偏度 • Kurtosis 峰度 • Descriptive analysis 描述分析 • Inferential analysis 推断分析

  9. 1. Frequency distribution
     Frequency (频数): For a given variable, the number of times a value occurs is called its frequency.
     Data:
     sex  age
     m      6
     m      8
     f     13
     m     16
     f     16
     f     15
     f     23
     m     19
     f     25
     f     21
     m     13
     f     19
     f      9
     f     10
     f     14
     Frequency table of sex
     Sex  Label   Frequency
     m    Male        5
     f    Female     10

  10. Proportion or percentage (比例或百分数): The ratio of a frequency to the total frequency.
      Frequency table of sex
      Sex    Label   Frequency  Percentage (%)
      --------------------------------------------------
      m      Male        5          33.33
      f      Female     10          66.67
      --------------------------------------------------
      Total  m+f        15         100.00

  11. Frequency distribution: A table or graph that lists all the distinct values of a variable together with the frequency and proportion with which each value occurs.
      Freq distribution of sex
      Sex  Frequency  Percentage (%)
      m        5         33.33
      f       10         66.67
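
A minimal sketch of how the frequency table of sex could be computed, assuming Python and its standard library (neither appears in the slides). The list `sex` is typed in from the 15-subject data set, and the variable names are illustrative only.

```python
from collections import Counter

# Sex of the 15 subjects, in the order listed in the data set.
sex = ["m", "m", "f", "m", "f", "f", "f", "m", "f", "f", "m", "f", "f", "f", "f"]

freq = Counter(sex)            # frequency of each distinct value
total = sum(freq.values())     # total frequency (n = 15)

# Percentage = frequency / total frequency * 100
percent = {value: round(100 * count / total, 2) for value, count in freq.items()}

for value in ("m", "f"):
    print(value, freq[value], percent[value])
```

The same pattern applies to any nominal variable: count each distinct value, then divide by the total.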

  12. Method of displaying frequency distribution of categorical data • Nominal data • Ordinal data

  13. Freq distribution of nominal data
      Freq distribution of sex
      Sex  Frequency  Percentage (%)
      m        5         33.33
      f       10         66.67
      Data:
      sex  eyesight  age
      m       1        6
      m       2        8
      f       3       13
      m       3       16
      f       4       16
      f       4       15
      f       5       23
      m       6       19
      f       6       25
      f       6       21
      m       7       13
      f       7       19
      f       8        9
      f       9       10
      f       9       14

  14. Freq distribution of ordinal data
      Freq distribution of eyesight
      Eyesight  Frequency  Percentage (%)
      1-3           4         26.67
      4-6           6         40.00
      7-9           5         33.33
      Data:
      sex  eyesight  age
      m       1        6
      m       2        8
      f       3       13
      m       3       16
      f       4       16
      f       4       15
      f       5       23
      m       6       19
      f       6       25
      f       6       21
      m       7       13
      f       7       19
      f       8        9
      f       9       10
      f       9       14

  15. Method of displaying the frequency distribution of numerical data
      • First, divide the whole range into several non-overlapping subintervals.
      • Count how many observations lie in each subinterval to make a frequency table.
      • Take the midpoint of each subinterval as the x-axis label and draw a histogram (直方图) or a polygon (折线图).

  16. Freq distribution of numerical data
      Subintervals: [0-10) [10-20) [20-30]
      Freq distribution of age
      Age    Midpoint  Frequency
      0~         5         3
      10~       15         9
      20~30     25         3
      Data:
      sex  eyesight  age
      m       1        6
      m       2        8
      f       3       13
      m       3       16
      f       4       16
      f       4       15
      f       5       23
      m       6       19
      f       6       25
      f       6       21
      m       7       13
      f       7       19
      f       8        9
      f       9       10
      f       9       14
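
The binning steps for numerical data can be sketched as follows, assuming Python (not used in the slides). The helper `interval_index` is a hypothetical name introduced only for this illustration. It treats every subinterval as closed on the left and open on the right, except the last, which is closed on both sides, matching [0-10) [10-20) [20-30].

```python
# Ages of the 15 subjects, in the order listed in the data set.
ages = [6, 8, 13, 16, 16, 15, 23, 19, 25, 21, 13, 19, 9, 10, 14]

# Boundaries of the subintervals [0,10), [10,20), [20,30].
edges = [0, 10, 20, 30]


def interval_index(x, edges):
    """Return the index of the subinterval containing x (hypothetical helper)."""
    last = len(edges) - 2
    for i in range(last + 1):
        lo, hi = edges[i], edges[i + 1]
        # The last subinterval also includes its right endpoint.
        if lo <= x < hi or (i == last and x == hi):
            return i
    raise ValueError(f"{x} lies outside the subintervals")


counts = [0] * (len(edges) - 1)
for age in ages:
    counts[interval_index(age, edges)] += 1

# Midpoints serve as the x-axis labels of the histogram or polygon.
midpoints = [(edges[i] + edges[i + 1]) / 2 for i in range(len(counts))]

for mid, count in zip(midpoints, counts):
    print(mid, count)
```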

  17. Histogram and polygon [Figure: a histogram and a frequency polygon]

  18. [Figure panels: nominal data, ordinal data, numerical data]

  19. Cumulative frequency and cumulative proportion
      Cumulative frequency (累计频数): the sum of the frequencies from the lowest category up to and including a given category.
      Cumulative proportion (累计比例): the sum of the proportions from the lowest category up to and including a given category.
      Frequency table of age
      Age    Midpoint  Frequency  Proportion (%)  Cumulative frequency  Cumulative proportion (%)
      0-10       5         3          20.0                 3                    20.0
      10-20     15         9          60.0                12                    80.0
      20-30     25         3          20.0                15                   100.0
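
A short sketch of the cumulative columns, assuming Python's standard `itertools.accumulate` (an illustration, not part of the slides). It starts from the age frequencies 3, 9, 3 and builds the running sums.

```python
from itertools import accumulate

# Frequencies of the age classes 0-10, 10-20, 20-30.
freq = [3, 9, 3]
n = sum(freq)

# Running sum of the frequencies from the lowest class upward.
cum_freq = list(accumulate(freq))

# Proportions and cumulative proportions, expressed in percent.
proportion = [100 * f / n for f in freq]
cum_proportion = [100 * c / n for c in cum_freq]

for row in zip(freq, proportion, cum_freq, cum_proportion):
    print(row)
```

The last cumulative frequency always equals n, and the last cumulative proportion always equals 100%.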

  20. The plot of cumulative frequency and cumulative proportion

  21. The major measures of the characteristics of observations for a numerical variable • Central tendency (集中趋势) • Dispersion (离散程度)

  22. 2. Central tendency
      Central tendency (集中趋势): the tendency of the values of a variable to concentrate near the middle of their range. The major measures of central tendency are the mean, the median, and the mode.

  23. The mean
      The mean (均数): a measure of the average level of all observations of a variable, also called the arithmetic mean (算术均数). It is defined as follows:
      sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
      population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$

  24. Eg1a: Estimate the mean
      The data listed below are haemoglobin (血色素) contents (g/L). Estimate the mean.
      Data:
      id    x    id    x
       1  121     7  116
       2  118     8  124
       3  130     9  127
       4  120    10  129
       5  122    11  125
       6  118    12  132
      Solution: n = 12, $\bar{x}$ = (121+118+…+125+132)/12 = 123.5
      So the estimated mean haemoglobin content is 123.5 g/L.
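
The arithmetic in Eg1a can be checked with a few lines of Python (an illustrative sketch, with variable names chosen here rather than taken from the slides):

```python
# Haemoglobin contents (g/L) of the 12 subjects, in id order.
hb = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]

n = len(hb)
mean_hb = sum(hb) / n   # sample mean: sum of the observations divided by n

print(n, mean_hb)
```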

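  25. Another formula for the mean
      If x has k different values, and f_i is the frequency with which the i-th value x_i occurs in the sample, then the sample mean can be estimated as follows:
      Data:
      x      freq
      x1     f1
      x2     f2
      ……     ……
      xk     fk
      total  n
      Formula: $\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_k x_k}{n} = \frac{1}{n}\sum_{i=1}^{k} f_i x_i$, where $n = \sum_{i=1}^{k} f_i$.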

  26. Eg1b: Estimate the mean
      The following data are serum cholesterol (血清胆固醇) measurements (mmol/L) from 101 men aged 30-49. Estimate the mean.
      Data:
      Serum cholesterol  Midpoint  Freq.
      2.5 ~                 3.0       9
      3.5 ~                 4.0      32
      4.5 ~                 5.0      42
      5.5 ~                 6.0      15
      6.5 ~                 7.0       3
      Total                         101
      Solution: n = 101, $\bar{x}$ = (3×9+4×32+5×42+6×15+7×3)/101 = 4.71 (mmol/L)
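
For grouped data such as Eg1b, the mean weights each class midpoint by its frequency. A minimal Python sketch, with variable names that are assumptions made for illustration:

```python
# Class midpoints (mmol/L) and frequencies from the serum cholesterol table.
midpoints = [3.0, 4.0, 5.0, 6.0, 7.0]
freq = [9, 32, 42, 15, 3]

n = sum(freq)  # total number of observations

# Weighted sum of midpoints divided by n.
weighted_sum = sum(f * x for f, x in zip(freq, midpoints))
grouped_mean = weighted_sum / n

print(n, round(grouped_mean, 2))
```

Because each observation is replaced by its class midpoint, the result is an approximation of the mean of the raw data.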

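  27. The median
      The median (中位数): the middle value among the ordered values of all observations of a variable. It is defined as below:
      population median: $M = X_{((N+1)/2)}$
      sample median: $m = x_{((n+1)/2)}$
      in which $X_{(1)} \le X_{(2)} \le \dots \le X_{(N)}$ are the ordered values in the population and $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ are the ordered values in the sample.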

  28. The method of estimating the median:
      • Order all observed values of the variable from smallest to largest.
      • If n is odd, find the middle observation; its value is the required median.
      • If n is even, find the two middle observations; the average of these two values is the required median.
      E.g., if n=9, then m = x((9+1)/2) = x(5) = x5;
      if n=10, then m = x((10+1)/2) = x(5.5) = (x5+x6)/2.

  29. Eg2a: Estimate the median
      The data listed below are haemoglobin contents (g/L). Estimate the median.
      Data:
      id    x    id    x
       1  121     7  116
       2  118     8  124
       3  130     9  127
       4  120    10  129
       5  122    11  125
       6  118    12  132
      Solution: The ordered values are: 116, 118, 118, 120, 121, 122, 124, 125, 127, 129, 130, 132.
      n = 12 is even, therefore med = (122+124)/2 = 123.
      So the median haemoglobin content is 123 g/L.
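
The odd/even rule for the median can be written as a small function. This is a sketch in Python, and the function name `sample_median` is a hypothetical choice for this illustration.

```python
def sample_median(values):
    """Median by the odd/even rule: middle value, or mean of the two middle values."""
    xs = sorted(values)          # order from smallest to largest
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:               # n odd: the single middle observation
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2   # n even: average of the two middle ones


# Haemoglobin contents (g/L) of the 12 subjects, in id order.
hb = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]
med_hb = sample_median(hb)
print(med_hb)
```

Python's standard `statistics.median` applies the same rule and can be used instead.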

  30. Eg2b: Estimate the median
      The following data are serum cholesterol measurements (mmol/L) from 101 men aged 30-49. Estimate the median.
      Data:
      Serum cholesterol  Midpoint  Freq.
      2.5 ~                 3.0       9
      3.5 ~                 4.0      32
      4.5 ~                 5.0      42
      5.5 ~                 6.0      15
      6.5 ~                 7.0       3
      Solution: Since n = 101 is odd, the median is the middle value, i.e., the 51st ordered value. The first two classes contain 9 + 32 = 41 values, so the 51st value is the 10th of the 42 values in the class 4.5 ~, whose midpoint is 5.0; i.e., the median M = 5.0.
      A more accurate value of M, by interpolation within that class, is 4.5 + (5.5-4.5)/42×10 = 4.74.
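
The interpolation in Eg2b can be sketched in Python as below. It follows the slide's arithmetic, taking the median position as (n+1)/2 = 51. Some textbooks use n/2 instead, which gives a slightly different value, so this convention should be treated as an assumption. Variable names are illustrative.

```python
# Lower class limits (mmol/L), common class width, and class frequencies.
lower = [2.5, 3.5, 4.5, 5.5, 6.5]
width = 1.0
freq = [9, 32, 42, 15, 3]

n = sum(freq)
position = (n + 1) / 2      # rank of the median among the ordered values

cum_before = 0              # number of observations below the current class
median_est = None
for lo, f in zip(lower, freq):
    if cum_before + f >= position:
        # Interpolate linearly inside the class that contains the median.
        median_est = lo + width * (position - cum_before) / f
        break
    cum_before += f

print(cum_before, round(median_est, 2))
```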

  31. Frequency distribution about mean and median [Figure: frequency distribution with Mean = 4.71 and Median = 5.0 marked]

  32. Skewed distribution [Figure: two panels, positive (right) skewed and negative (left) skewed, each marking the median and the mean]

  33. Comparing mean and median
                          mean                            median
      information         more (actual values)            less (ranks)
      data available      not available for ordinal data  available for any data
      size in magnitude   symmetric: mean = median
                          + skewed: mean > median
                          - skewed: mean < median

  34. The definition of median
      • The median is a value for which no more than half the data are smaller than it and no more than half the data are larger than it.
      • E.g., for 12, 14, 14, 15, 16, 16, 16, 17, 18: M = 16, with four values < M and two values > M.

  35. The geometric mean
      When the distribution of a variable is not symmetric, or the data have no upper or lower bound, the geometric mean can be a better measure of central tendency.

  36. Eg3. The following data are 10 patients’ white blood cell counts(×1000): 11, 9, 35, 5, 9, 8, 3, 10, 12, 8. Estimate the arithmetic mean and geometric mean.
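
A sketch of the Eg3 calculation in Python. It assumes the standard definition of the geometric mean, G = (x1·x2·…·xn)^(1/n), which the transcript does not show explicitly, and computes it through logarithms to avoid a very large product.

```python
import math

# White blood cell counts (×1000) of the 10 patients.
wbc = [11, 9, 35, 5, 9, 8, 3, 10, 12, 8]
n = len(wbc)

arithmetic_mean = sum(wbc) / n

# Geometric mean: exponential of the mean of the logarithms
# (equivalent to the n-th root of the product for positive values).
geometric_mean = math.exp(sum(math.log(x) for x in wbc) / n)

print(arithmetic_mean, round(geometric_mean, 2))
```

The single large count (35) pulls the arithmetic mean upward more than it pulls the geometric mean, which is why the two values differ for skewed data.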

  37. The mode
      The mode (众数): the most frequently occurring value (or values) in a set of data.
      • It marks a point of relatively great concentration.
      • If a data set consists of the values 6, 7, 7, 8, 8, 8, 8, 9, 10, 11, 11, 12, 12, 12, 12, 13,
      • then the modes are 8 and 12.
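
A minimal Python sketch for finding the mode or modes, using `collections.Counter` from the standard library (an illustration, not part of the slides):

```python
from collections import Counter

values = [6, 7, 7, 8, 8, 8, 8, 9, 10, 11, 11, 12, 12, 12, 12, 13]

counts = Counter(values)
max_count = max(counts.values())   # highest frequency in the data

# Every value that reaches the highest frequency is a mode.
modes = sorted(v for v, c in counts.items() if c == max_count)

print(max_count, modes)
```

Returning a list rather than a single number keeps the bimodal case in this example visible.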

  38. Summary • Frequency distribution • Histogram & polygon • Measures of central tendency • Measures of dispersion

  39. Note: When the widths of the subintervals are not equal, or the data have no upper or lower bound, a polygon is more suitable than a histogram. [Figure: Frequency distribution of birthweight]

  40. New words • Dispersion 离散程度 • Range 全距 • Deviation 离均差 • Variance 方差 • Standard deviation 标准差 • Coefficient of variation 变异系数

  41. New words • Quartile 四分位数 • Percentile 百分位数 • Inter-quartile interval 四分位间距

  42. §3. Dispersion
      Dispersion (离散程度): an indication of the spread of measurements around the center of a variable's distribution.
      The major measures of dispersion are: range, variance, standard deviation, inter-quartile range, coefficient of variation, etc.

  43. The range
      The range (全距): It measures the length over which the data are distributed.
      Population range: Range = max - min
      Sample range: Range = max - min
      * It is a simple measure and has the same unit as the original data.
      # It uses little information (only max and min).
      # The sample range underestimates the population range: it is biased and inefficient.
      # It conveys no information about the middle of the distribution.

  44. The quartiles
      The first quartile (第一四分位数) Q1: a value for which no more than 25% of the observed values are less than it, and no more than 75% of the observed values are greater than it.
      [Figure: number line from X1 to Xn with Q1 marked; ≤ 25% below, ≤ 75% above]

  45. The second quartile (第二四分位数) Q2 = M: a value for which no more than 50% of the observed values are less than it, and no more than 50% of the observed values are greater than it.
      [Figure: number line from X1 to Xn with M marked; ≤ 50% below, ≤ 50% above]

  46. The third quartile (第三四分位数) Q3: a value for which no more than 75% of the observed values are less than it, and no more than 25% of the observed values are greater than it.
      [Figure: number line from X1 to Xn with Q3 marked; ≤ 75% below, ≤ 25% above]

  47. Location of quartiles
      [Figure: number line from X1 to Xn with Q1, Q2 = M and Q3 marked; each of the four segments holds ≤ 25% of the values, and each side of M holds ≤ 50%]

  48. The method of estimating the quartiles
      If the subscript is not an integer or half-integer, then it is rounded up to the nearest integer or half-integer.

  49. Eg1: Estimate the quartiles • A B • 34 34 • 36 36 • 37 37 • 39 39 • 40 40 • 41 41 • 42 42 • 43 43 • 44 • 45 • -------------- • n=9 n=10

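  50. The inter-quartile range (四分位数间距): the difference between Q3 and Q1, i.e., Q3 - Q1. It spans the middle 50% of the observations.
      [Figure: number line from X1 to Xn with Q1, M and Q3 marked; the segment from Q1 to Q3 is the middle 50%]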
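
A hedged sketch of the range, quartiles and inter-quartile range in Python, using the haemoglobin data from Eg1a. It relies on `statistics.quantiles` with `method="exclusive"`, which places the quartiles at positions i(n+1)/4 and interpolates linearly between neighbouring values. This is a swapped-in convention and may differ from the rounding rule described on the slides, so the quartile values should be read as one possible answer.

```python
import statistics

# Haemoglobin contents (g/L) of the 12 subjects, in id order.
hb = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]

# Range: largest value minus smallest value.
data_range = max(hb) - min(hb)

# Quartiles by the "exclusive" convention (positions i*(n+1)/4, interpolated).
# Assumption: this may not match the slides' own rounding rule.
q1, q2, q3 = statistics.quantiles(hb, n=4, method="exclusive")

# Inter-quartile range: spread of the middle 50% of the observations.
iqr = q3 - q1

print(data_range, q1, q2, q3, iqr)
```

Unlike the range, the inter-quartile range is not affected by a single extreme value at either end of the data.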
