CTRC Core Curriculum Seminar Series

CTRC Core Curriculum Seminar Series. Descriptive Statistics: Data Types and Measures, Central Tendency, Variability Chang-Xing Ma, PhD Associate Professor Department of Biostatistics, UB January 4, 2012. Disclosure Statement. Chang-Xing Ma, PhD Nothing to disclose. Goals and Objectives.

Presentation Transcript


  1. CTRC Core Curriculum Seminar Series Descriptive Statistics: Data Types and Measures, Central Tendency, Variability Chang-Xing Ma, PhD Associate Professor Department of Biostatistics, UB January 4, 2012

  2. Disclosure Statement • Chang-Xing Ma, PhD • Nothing to disclose

  3. Goals and Objectives • Goals: Gain the knowledge of basic statistics and how to describe the data • Objectives: • Describe the data type • Summarize data • Understand Measure of Central Tendency • Understand Measure of Dispersion

  4. Outline • Basic concepts of biostatistics • Data type • Summarize data • Measure of Central Tendency • Measure of Dispersion

  5. Some terminology • Statistics is the study of how to collect, organize, analyze, and interpret numerical information from data • Biostatistics—the theory and techniques for collecting, describing, analyzing, and interpreting health data.

  6. Some terminology • Population refers to all measurements or observations of interest • Sample is simply a part of the population. But the sample MUST represent the population. • A random sample is such a representative sample • The sample must be large enough • The sample should be selected randomly

  7. Some terminology • Parameter is some numerical or nominal characteristic of a population • A parameter is constant, e.g. mean of a population • Usually unknown • Statistic is some numerical or nominal characteristic of a sample. • We use a statistic as an estimate of a parameter of the population • It tends to differ from one sample to another • We also use statistics to test hypotheses

  8. Population: all U.S. persons • Height ~ Normal (µh, σh²), Weight ~ Normal (µw, σw²) • Parameters: (µh, σh²), (µw, σw²) • A random sample of a given sample size, with variables Gender, Height, Weight • Sample statistics: mean height, std height, mean weight, std weight, % of male (=1)
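
To make the parameter/statistic distinction on slides 7 and 8 concrete, here is a minimal Python sketch. The population values for height (mean 170, SD 10) are made up for illustration and do not come from the slides, and `draw_sample_stats` is a hypothetical helper name. The sketch draws three random samples from the same population and shows that the sample statistics change from sample to sample while the parameters stay fixed.

```python
import random
import statistics

# Hypothetical population parameters (assumed for illustration only).
POP_MEAN_HEIGHT = 170.0   # plays the role of µh
POP_SD_HEIGHT = 10.0      # plays the role of σh


def draw_sample_stats(n, rng):
    """Draw a random sample of size n and return (sample mean, sample SD)."""
    sample = [rng.gauss(POP_MEAN_HEIGHT, POP_SD_HEIGHT) for _ in range(n)]
    return statistics.mean(sample), statistics.stdev(sample)


rng = random.Random(2012)  # fixed seed so the run is repeatable
results = [draw_sample_stats(100, rng) for _ in range(3)]
for mean_h, sd_h in results:
    # Each sample gives a slightly different estimate of the same parameters.
    print(f"sample mean = {mean_h:.2f}, sample SD = {sd_h:.2f}")
```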

  9. Sources of data • Records • Surveys: Comprehensive, Sample • Experiments

  10. Types of variables • Quantitative variables: continuous, discrete • Qualitative variables: nominal, ordinal

  11. Data Types • Numerical (Quantitative) • numerical measurement • Height • Weight • Categorical (Qualitative) • with no natural sense of ordering • Gender • Hair color • Blood type

  12. Numerical Variable • Continuous • Range of values • Height in inches • Discrete • Limited possible values • # of cigarettes smoked per day • # of children in a family • Age

  13. Determining Data Types • Ordinal (Categorical) vs. Discrete (Numerical) • Ordinal – Cancer Stage I, II, III, IV – Stage II ≠ 2 times Stage I – Categories could also be A, B, C, D • Discrete – # of children: 0, 1, 2, … – 4 children = 2 times 2 children

  14. Descriptive Statistics – reducing a complex mass of data to a manageable set of information • Descriptive Statistics: the summary and presentation of data to: • simplify the data • enable meaningful interpretation • support decision making • Numerical descriptive measures (few numbers) • Graphical presentations

  15. Inferential statistics From a sample • to estimate population parameters • to test hypotheses • to build the model to reflect the population • …

  16. The student test score (FCAT) • Variables: Student ID, Race, Sex, Reading, Math, Poverty • Code: Race: W – White, B – Black, H – Hispanic, A – Asian; Sex: F – Female, M – Male; Poverty: 0 – not poor, 1 – poor • Problem 1 • Among the 6 variables, which ones are qualitative and which ones are quantitative? • Is Race nominal or ordinal?

  17. Descriptive Statistics • Categorical variables: • Frequency distribution • Bar chart, pie chart • Contingency tables • Continuous variables: • Grouped frequency table • Central Tendency • Variability

  18. Simple Frequency Distribution: an ordered arrangement that shows the frequency of each level of a variable.

      race   Frequency   Percent
      -----------------------------
      A              7      4.07
      B             42     24.42
      H              8      4.65
      W            115     66.86

      sex    Frequency   Percent
      ----------------------------
      F             86     50.00
      M             86     50.00
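
The race table on slide 18 can be reproduced with a few lines of Python. This is a sketch rather than the tool used for the slides: the raw data set is not in the transcript, so the race column is rebuilt here from the published counts, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Rebuild the race column from the counts on the slide (172 students in total).
race = ["A"] * 7 + ["B"] * 42 + ["H"] * 8 + ["W"] * 115


def frequency_table(values):
    """Return {level: (frequency, percent)} with levels in sorted order."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        level: (counts[level], round(100 * counts[level] / total, 2))
        for level in sorted(counts)
    }


table = frequency_table(race)
for level, (freq, pct) in table.items():
    print(f"{level:<4}{freq:>6}{pct:>8.2f}")
```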

  19. Simple Frequency Distribution • It is useful for categorical variables • For a continuous variable, it allows you to pick up at a glance some valuable information, such as the highest and lowest values • ascertain the general shape or form of the distribution • make an informed guess about central tendency values

  20. Bar Chart • summarizing a set of categorical data - nominal or ordinal data • It displays the data using a number of rectangles, each of which represents a particular category. The length of each rectangle is proportional to the number of cases in the category it represents • can be displayed horizontally or vertically • they are usually drawn with a gap between the bars • Bars for multiple (usually two) variables can be drawn together to see the relationship

  21. Pie Chart • summarizing a set of categorical data - nominal or ordinal data • It is a circle which is divided into segments. • Each segment represents a particular category. • The area of each segment is proportional to the number of cases in that category.

  22. Complex frequency distribution Table Distribution of 20 lung cancer patients at the chest department of Alexandria hospital and 40 controls in May 2008 according to smoking

  23. How about continuous variables? • How are the data distributed? • Measure of Central Tendency • Measure of Variability

  24. Grouped Frequency Distribution – for continuous variable [Figure: frequency table with fields for data, interval size, N, µ, σ]

  25. Grouped Frequency Distribution • The problem with a simple frequency distribution is that so much information is presented that it is difficult to discern what the data are really like, or to "cognitively digest" the data. • The simple frequency distribution usually needs to be condensed even more. • It is possible to lose information (precision) about the data to gain understanding about distributions. • This is the function of grouping data into equal-sized intervals called class intervals. • The grouped frequency distribution is further presented as Frequency Polygons, Histograms, Bar Charts, Pie Charts.
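
As an illustration of grouping into class intervals, the sketch below bins the patient ages listed on slide 41. The interval width of 5 years is an assumption chosen for the example (the slides do not give one), and `grouped_frequency` is a hypothetical helper name. Intervals with no observations are simply omitted.

```python
from collections import Counter

# Ages of patients (years), as listed on the range slide.
ages = [6, 13, 7, 14, 10, 14, 15, 9, 7, 2, 7, 13,
        16, 9, 8, 3, 3, 17, 8, 5, 4, 9, 9, 6]


def grouped_frequency(values, width):
    """Count values in equal-sized class intervals [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    return {(start, start + width): counts[start] for start in sorted(counts)}


groups = grouped_frequency(ages, width=5)  # width of 5 years is an assumption
for (low, high), freq in groups.items():
    # A row of asterisks gives a rough text histogram of each interval.
    print(f"{low:>2}-{high - 1:<2} {freq:>3} {'*' * freq}")
```

Grouping loses the individual ages but makes the shape visible at a glance: most patients fall in the 5-9 interval.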

  26. Describing Distributions • Bell-Shaped Distribution • Normal distribution N (µ=0, σ²=1) • t-distribution µ, σ²

  27. Describing Distributions • Skewed Distribution – positively skewed distribution µ, σ²

  28. Describing Distributions • Skewed Distribution – negatively skewed distribution µ, σ²

  29. Describing Distributions • Other Shapes Rectangular Bimodal µ, σ²

  30. Describing Distributions • Other Shapes J-curve µ, σ²

  31. Probability density function - Normal z-transform green curve is standard normal distribution
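
The formulas on this slide did not survive in the transcript. For reference, the standard textbook forms of the normal density and of the z-transform are given below; they are supplied here and not copied from the slide.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
z = \frac{x-\mu}{\sigma}
\;\Rightarrow\;
z \sim N(\mu = 0,\ \sigma^2 = 1)
```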

  32. Measure of Central Tendency: Mean, Median, Mode • The Mean • average value • not robust to outlying values • Length of hospital stays: 6, 4, 5, 9, 10, 7, 1, 4, 3, 4 • Mean = (6+4+5+9+10+7+1+4+3+4)/10 = 5.3

  33. Measure of Central Tendency: Mean, Median, Mode • The Median • is the point that divides a distribution of data into two equal parts (it splits the data) • robust to outlying values • Length of hospital stays, sorted: 1 3 4 4 4 5 6 7 9 10 • Median = (4+5)/2 = 4.5

  34. Measure of Central Tendency: Mean, Median, Mode • The Mode • is the most frequently occurring value (for grouped data, the midpoint of the interval that has the highest frequency) • robust to outlying values, but sometimes misleading • Length of hospital stays, sorted: 1 3 4 4 4 5 6 7 9 10 • Mode = 4, which occurred 3 times.
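
The three measures on slides 32 to 34 can be checked with Python's standard `statistics` module. This is a minimal sketch using the hospital-stay data from the slides; the extra value of 60 days is a made-up outlier added only to show why the mean is called "not robust".

```python
import statistics

stays = [6, 4, 5, 9, 10, 7, 1, 4, 3, 4]  # length of hospital stays from the slides

mean_stay = statistics.mean(stays)      # 53 / 10
median_stay = statistics.median(stays)  # average of the two middle values, 4 and 5
mode_stay = statistics.mode(stays)      # most frequent value

print(mean_stay, median_stay, mode_stay)  # 5.3 4.5 4

# One extreme value moves the mean a lot but the median only slightly.
with_outlier = stays + [60]  # 60 is a made-up outlier for illustration
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)
print(mean_out, median_out)
```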

  35. Comparison between mean and median Mean Median

  36. Comparison between mean and median Median Mean

  37. Comparison between mean and median Mean Median

  38. Frequency distribution Histogram, Polygon graph Bar Chart, Pie Chart Describing Distributions Mean, Median, Mode Summary DATASET: http://128.205.94.145/STA2008/FL_School0022.xls

  39. Problem 2 • In a study, we collected a medical measurement X for 4 patients • Data of X: 2, 3, 5, 6 • Mean of X? • Median of X? • Mode of X?

  40. Descriptive Statistics: Variability • The sample range • Interquartile range • The sample standard deviation (SD), variance • Standard error of mean (SEM)

  41. Measures of Dispersion - Range • Range – the difference between the lowest and highest values • For example, Age of Patients (years): 6 13 7 14 10 14 15 9 7 2 7 13 16 9 8 3 3 17 8 5 4 9 9 6 • lowest 2, highest 17 • Range = 17 − 2 = 15 years (from 2 to 17 years) • When sample size increases, the range tends to increase as well. (not robust)
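
A short Python sketch of the range calculation on this slide, using the same patient ages. The running-range list is an added illustration of the remark that the range tends to grow with sample size: adding observations can never shrink it.

```python
# Ages of patients (years), in the order given on the slide.
ages = [6, 13, 7, 14, 10, 14, 15, 9, 7, 2, 7, 13,
        16, 9, 8, 3, 3, 17, 8, 5, 4, 9, 9, 6]

lowest, highest = min(ages), max(ages)
sample_range = highest - lowest
print(lowest, highest, sample_range)  # 2 17 15

# Range of the first k observations, for k = 1 .. 24.
running = [max(ages[:k]) - min(ages[:k]) for k in range(1, len(ages) + 1)]
print(running)
```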

  42. Measures of Dispersion - Range • All of curves have the same range • Mean? • Median?

  43. Measures of Dispersion: Percentiles, Deciles, Quartiles • Percentiles: based on dividing a sample or population into 100 equal parts. • Deciles divide the distribution into 10 equal parts • Quartiles divide the distribution into 4 equal parts. • 1st quartile includes the lowest 25% of the values (Q1) • 2nd quartile includes the values from the 26th percentile through the 50th percentile (Q2) - median • 3rd quartile includes the values from the 51st percentile through the 75th percentile (Q3)

  44. Measures of Dispersion: Interquartile Range • Interquartile Range – from the 25th percentile (1st quartile) to the 75th percentile (3rd quartile) • Age of Patients (years): 2 3 3 4 5 6 6 7 7 7 8 8 9 9 9 9 10 13 13 14 14 15 16 17 • 1st quartile 6, 2nd quartile 8.5, 3rd quartile 13 • Interquartile Range = 6 to 13 years (13 − 6 = 7 years) • Interquartile Range is a robust estimate of data variability
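
The quartiles quoted on this slide can be reproduced with the sketch below. Several quartile conventions exist and the slide does not say which one was used; the median-of-halves rule in the hypothetical `quartiles` helper is an assumption that happens to match the slide's values for these data.

```python
import statistics

ages = sorted([6, 13, 7, 14, 10, 14, 15, 9, 7, 2, 7, 13,
               16, 9, 8, 3, 3, 17, 8, 5, 4, 9, 9, 6])


def quartiles(sorted_values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    n = len(sorted_values)
    half = n // 2
    lower = sorted_values[:half]
    upper = sorted_values[half + n % 2:]  # skip the middle value when n is odd
    return (statistics.median(lower),
            statistics.median(sorted_values),
            statistics.median(upper))


q1, q2, q3 = quartiles(ages)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 6.0 8.5 13.0 7.0
```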

  45. Measures of Dispersion: Interquartile Range • Robust estimate, less efficient

  46. Deviations from the mean: Variance and Standard Deviation • deviation: observation - mean • "sum" of deviations, BUT the deviations from the mean always sum to zero, so the plain sum cannot measure spread

  47. Deviations from the mean: Variance and Standard Deviation • Measure of how different the values in a set of numbers are from each other • Sample variance: $s^2 = \sum (X - \bar{X})^2 / (n - 1)$ • Sample standard deviation: $s = \sqrt{s^2}$

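  48. Deviations from the mean: Variance and Standard Deviation • Data set: 2, 3, 5, 6 (mean X̄ = 4) • Calculation:

      Value of X   (X − X̄)   (X − X̄)²
      ---------------------------------
           2          -2          4
           3          -1          1
           5           1          1
           6           2          4
      ---------------------------------
           ∑           0         10

      Sample variance: s² = 10/(4 − 1) ≈ 3.33 • Sample standard deviation: s = √3.33 ≈ 1.83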
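
The worked calculation on this slide can be followed step by step in Python. This sketch assumes the sample formulas with an n − 1 denominator, which the transcript does not state explicitly, and cross-checks the result against the standard `statistics` module.

```python
import math
import statistics

x = [2, 3, 5, 6]
n = len(x)
mean_x = sum(x) / n                        # 4.0
deviations = [xi - mean_x for xi in x]     # [-2.0, -1.0, 1.0, 2.0]
sum_sq = sum(d ** 2 for d in deviations)   # 10.0

variance = sum_sq / (n - 1)                # sample variance (n - 1 is assumed)
sd = math.sqrt(variance)

print(sum(deviations), sum_sq)             # 0.0 10.0
print(round(variance, 2), round(sd, 2))    # 3.33 1.83

# Cross-check against the library implementations of the sample formulas.
assert math.isclose(variance, statistics.variance(x))
assert math.isclose(sd, statistics.stdev(x))
```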

  49. Three normal distributions with the same central tendency (mean = 0): s² = 1, s² = 2, s² = 0.5 • Leptokurtic: homogeneous, narrow scatter • Mesokurtic • Platykurtic: heterogeneous, wide scatter

  50. Example 2: FEV1 (litres) of 57 male medical students

      Table: FEV1 (litres) of 57 male medical students
      2.85  3.19  3.50  3.69  3.90  4.14  4.32  4.50  4.80  5.20
      2.85  3.20  3.54  3.70  3.96  4.16  4.44  4.56  4.80  5.30
      2.98  3.30  3.54  3.70  4.05  4.20  4.47  4.68  4.90  5.43
      3.04  3.39  3.57  3.75  4.08  4.20  4.47  4.70  5.00
      3.10  3.42  3.60  3.78  4.10  4.30  4.47  4.71  5.10
      3.10  3.48  3.60  3.83  4.14  4.30  4.50  4.78  5.10
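
The measures listed on slide 40 can be applied to the FEV1 table with the sketch below. The slide's own results are not in the transcript, so no expected values are quoted here. The sketch assumes the sample SD (n − 1 denominator) and the usual definition SEM = SD / √n.

```python
import math
import statistics

# FEV1 (litres) of 57 male medical students, read row by row from the table.
fev1 = [
    2.85, 3.19, 3.50, 3.69, 3.90, 4.14, 4.32, 4.50, 4.80, 5.20,
    2.85, 3.20, 3.54, 3.70, 3.96, 4.16, 4.44, 4.56, 4.80, 5.30,
    2.98, 3.30, 3.54, 3.70, 4.05, 4.20, 4.47, 4.68, 4.90, 5.43,
    3.04, 3.39, 3.57, 3.75, 4.08, 4.20, 4.47, 4.70, 5.00,
    3.10, 3.42, 3.60, 3.78, 4.10, 4.30, 4.47, 4.71, 5.10,
    3.10, 3.48, 3.60, 3.83, 4.14, 4.30, 4.50, 4.78, 5.10,
]

n = len(fev1)
mean_fev1 = statistics.mean(fev1)
median_fev1 = statistics.median(fev1)
fev1_range = max(fev1) - min(fev1)
sd_fev1 = statistics.stdev(fev1)    # sample SD, n - 1 denominator
sem_fev1 = sd_fev1 / math.sqrt(n)   # standard error of the mean

print(f"n = {n}")
print(f"mean = {mean_fev1:.2f}, median = {median_fev1:.2f}")
print(f"range = {fev1_range:.2f}, SD = {sd_fev1:.2f}, SEM = {sem_fev1:.3f}")
```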
