
Statistics Micro Mini Statistics Review

Statistics Micro Mini Statistics Review. January 5-9, 2009 Beth Ayers. About Beth. Ph.D. student in statistics Member of the Program for Interdisciplinary Educational Research (PIER; www.cmu.edu/pier ) I do all educational research and my research focuses on three main areas


Presentation Transcript


  1. Statistics Micro Mini Statistics Review January 5-9, 2009 Beth Ayers

  2. About Beth • Ph.D. student in statistics • Member of the Program for Interdisciplinary Educational Research (PIER; www.cmu.edu/pier) • I do all educational research and my research focuses on three main areas • Predicting student performance on end-of-year exams based on tutor performance throughout the year • Estimating student skill knowledge • Refining skill codings

  3. Goals of the Course • Review introductory statistics • Quick overview of multiple regression and multi-factor ANOVA • Discuss validity • Discuss power calculations • Learn how to read and critique papers

  4. Assessment • Complete two paper critiques • You will not have to run or interpret analysis for assessment

  5. Outline of Course • Monday • Review of introductory statistics • Tuesday • Dummy variables • Multiple Regression • Wednesday • Multi-factor ANOVA • Thursday • Review • Paper critiques • Friday • Assessment

  6. Monday 9am-12pm Session • Variable types and roles • Analysis of a single variable • Numerical measures • Graphical displays • Describing a distribution • The Normal distribution • Other common distributions • Types of data collection • Hypothesis testing

  7. Variable Types • A categorical or qualitative variable places individuals into one of several groups or categories • Nominal – names are assigned to objects as labels • Example: gender, college major, computer brand • Ordinal – there is an inherent order to the categories • Example: grade – A, B, C, D; year in school – freshman, sophomore, junior, senior

  8. Variable Types • A quantitative variable takes numerical values for which arithmetic operations make sense • Discrete – takes countably many values • Example: number of students in a class • Continuous – any value is possible within the limits of the variable range • Example: weight, temperature, number of words per minute typed

  9. Variable Roles • A response or dependent variable measures an outcome of a study • An explanatory or independent variable explains or causes change in the response variable • If we want to test the effect of the type of keyboard on typing speed, typing speed is the response variable and type of keyboard is the explanatory variable

  10. Data • Collected data is usually denoted as • X1, X2, . . . , XN • Response variables are usually denoted as • Y1, Y2, . . . , YN • The N indicates the total number of observations • The type of data will dictate both which numerical measures and graphical displays to use in informal exploratory data analysis (EDA) and which tests to use in formal statistical analysis

  11. Numerical Summaries: Categorical Data • Counts • Number of students using each different type of computer interface • Frequency or percents • Percent of total number of students

  12. Numerical Summaries: Categorical Data • Which is better, counts or percents? • Depends on the situation and type of data. • Suppose I tell you that 10 students who used Tutor I received an A on the final and 10 students who used Tutor II received an A on the final. However, there were 20 students who used Tutor I and 100 students who used Tutor II. • Knowing 50% of students who used Tutor I received an A compared to 10% of students who used Tutor II is more informative. Here we’d want to use the percent.
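
The Tutor I / Tutor II comparison can be reproduced in a few lines of Python. This is only a sketch; the variable names are illustrative and not part of the course material.

```python
# Counts from the slide: students earning an A, and students using each tutor.
a_counts = {"Tutor I": 10, "Tutor II": 10}
group_sizes = {"Tutor I": 20, "Tutor II": 100}

# Convert each count to a percent of its own group.
percent_a = {
    tutor: 100 * a_counts[tutor] / group_sizes[tutor] for tutor in a_counts
}

for tutor, pct in percent_a.items():
    print(f"{tutor}: {a_counts[tutor]} of {group_sizes[tutor]} students = {pct:.0f}%")
```

The counts are identical, but the percents differ because the group sizes differ.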

  13. Graphical Summaries: Categorical data • Bar chart • Pie chart

  14. Graphical Summaries: Categorical data • Bar chart • Pie chart

  15. Graphical Summaries: Categorical data • Bar charts show how things are split into categories • Pie charts are useful when we have categories that add up to the whole • i.e., showing the percent of the whole in each group • So, which should you use? Depends on the data and what you’d like to show.

  16. Numerical Summaries: Quantitative data • Measures of Center • Mean (arithmetic) • Median • Number such that 50% of the data is below it and 50% is above it • Usually denoted with M

  17. Examples of finding the median • Arrange the data from lowest to highest • If there is an odd number of data points, the median is simply the number in the middle • 15 18 21 24 26 29 30 • The median is 24 • If there is an even number of data points, the median is the average of the two data points in the middle • 15 18 21 24 26 29 • The median is 22.5 • 18 21 24 26 29 30 • The median is 25
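
The procedure on this slide can be sketched as a small Python function. The function name `median` is a choice made here for illustration; Python's standard library also offers `statistics.median`.

```python
def median(values):
    """Return the median by sorting and taking the middle value(s)."""
    ordered = sorted(values)          # arrange from lowest to highest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: the single middle value
        return ordered[mid]
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([15, 18, 21, 24, 26, 29, 30]))  # 24
print(median([15, 18, 21, 24, 26, 29]))      # 22.5
print(median([18, 21, 24, 26, 29, 30]))      # 25.0
```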

  18. Percentiles & Quartiles • The pth percentile is the value such that p% of the data is below it • The median is the 50th percentile • The quartiles break the data into quarters and are the 25th, 50th, 75th, and 100th percentiles • Q1 = 25th percentile • Q2 = 50th percentile = median • Q3 = 75th percentile • Q4 = 100th percentile = maximum • 5-number summary: Q0 = minimum, Q1, Q2 = M, Q3, Q4 = maximum
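
A 5-number summary can be computed with Python's standard `statistics` module. This is a sketch under one assumption: the quartiles use the "inclusive" interpolation method, which is only one of several conventions, so other software may report slightly different values for Q1 and Q3.

```python
from statistics import quantiles

data = sorted([15, 18, 21, 24, 26, 29, 30])

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# minimum, Q1, median, Q3, maximum
five_number_summary = (data[0], q1, q2, q3, data[-1])
print(five_number_summary)  # (15, 19.5, 24.0, 27.5, 30)
```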

  19. Numerical Summaries: Quantitative data • Measures of Spread • Variance • N vs. (N-1) • Standard deviation is square root of variance and is denoted by σ, often used because it is measured in the same units as the data • Inter-quartile range (IQR) • middle 50% of the data • IQR = Q3 – Q1
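
The "N vs. (N-1)" point refers to the divisor in the variance formula: dividing by N treats the data as the whole population, while dividing by N-1 is the usual choice for a sample. A minimal sketch with Python's `statistics` module, reusing the example data from the median slide:

```python
from statistics import pvariance, variance, stdev

data = [15, 18, 21, 24, 26, 29, 30]
n = len(data)

pop_var = pvariance(data)    # divides the sum of squared deviations by N
sample_var = variance(data)  # divides by N - 1
sample_sd = stdev(data)      # square root of the sample variance

print(f"variance with N:     {pop_var:.2f}")
print(f"variance with N - 1: {sample_var:.2f}")
print(f"standard deviation:  {sample_sd:.2f}")
```

The two variances differ only by the factor N/(N-1), so the gap shrinks as the sample grows.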

  20. A note on notation • Parameters refer to true population values • Examples • μ, σ² • Statistics refer to estimates derived from a particular sample. These values vary from sample to sample • Examples

  21. Graphical Summaries: Quantitative Data • Histogram • Box-plot

  22. Graphical Summaries: Quantitative Data • Box-plots are a graphical display of the 5-number summary • Histogram vs. box-plot • Histograms and box-plots give you different information • Histograms show the shape • Box-plots give you the quartiles • With practice and experience you can estimate quartiles from histograms and generalize shape from box-plots • Use numerical summaries to augment either!

  23. Histograms • A note of caution, the appearance of a histogram is susceptible to the number of bins used

  24. Examining the Distribution • Overall pattern • Shape, center, spread • Outliers

  25. Examining the Distribution • Overall pattern • Number of modes / number of bumps

  26. Examining the Distribution • Overall Pattern • Symmetric – values above and below the midpoint are mirror images of each other • Skewed – one tail is longer than the other; skewed in the direction of the longer tail

  27. Examining the Distribution • Measure of Center • If symmetric use mean • If skewed use median • Measure of Spread • If symmetric use variance/standard deviation • If skewed use IQR • Symmetric distribution • Mean and variance/standard deviation • Skewed distribution • Median and IQR

  28. Mean vs. Median • The relationship between the mean and median depends on the shape of the distribution • Symmetric – mean = median • Skewed-left – mean < median (the long left tail pulls the mean down) • Skewed-right – mean > median (the long right tail pulls the mean up)

  29. Mean vs. Median • The median is robust to outliers. The mean is not. • Suppose our data is • 15 18 21 24 26 29 30 • Mean = 23.3 • Median = 24 • Suppose the last data point accidentally gets entered as 300 instead • 15 18 21 24 26 29 300 • Mean = 61.9 • Median = 24
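
The robustness of the median can be checked directly with Python's `statistics` module. The variable names below are illustrative only.

```python
from statistics import mean, median

original = [15, 18, 21, 24, 26, 29, 30]
with_typo = [15, 18, 21, 24, 26, 29, 300]  # last value mis-entered as 300

# The mean moves a lot; the median does not move at all.
print(f"original:  mean = {mean(original):.1f}, median = {median(original)}")
print(f"with typo: mean = {mean(with_typo):.1f}, median = {median(with_typo)}")
```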

  30. Outliers • Outliers are observations that are numerically distant from the rest of the data • Typically a data point x is called an outlier if • x < Q1 – 1.5*IQR Or • x > Q3 + 1.5*IQR • Different software will use slightly different methods to calculate cutoff values for outliers
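
The 1.5*IQR rule can be sketched as follows. The helper name `iqr_outliers` is hypothetical, and the quartiles use the "inclusive" method of `statistics.quantiles`, which is one convention among several; as the slide notes, other software may compute slightly different cutoffs.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


data = [15, 18, 21, 24, 26, 29, 300]
outliers = iqr_outliers(data)
print(outliers)  # [300]
```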

  31. Outliers • If the data was entered by hand, first make sure there was no entry error • Also check to make sure there was no measurement error • If no error, then an outlier shows a flaw in the assumed theory and requires more thought and investigation by the researcher

  32. Outliers • Graphically, outliers appear as follows

  33. EDA for a single variable • See handout for flow chart

  34. Normal or Gaussian Distribution • A uni-modal, bell-shaped distribution • Characterized by mean and standard deviation or variance • Notation X ~ N(μ, σ) or X ~ N(μ, σ²) • Where ~ is read as “is distributed as” • Approximately fits many datasets • Scores on tests taken by many people (SAT, ACT, GRE) • Biological characteristics (height)

  35. Why is the Normal Distribution so important? • Models many datasets • Many statistical tests are based on the assumption of normality • The Normal distribution arises as the limiting distribution of many continuous and discrete distributions • Many other distributions can be approximated by the Normal distribution

  36. Examples of Normal Distributions

  37. 68-95-99.7% Rule • 68% of the data is within ±1σ of μ • 95% of the data is within ±2σ of μ • 99.7% of the data is within ±3σ of μ

  38. Graphical Representation of 68-95-99.7% Rule

  39. 68-95-99.7% Rule Examples • If μ = 0, σ = 1 • 68% within -1 to 1 • 95% within -2 to 2 • 99.7% within -3 to 3 • If μ = 1, σ = 1 • 68% within 0 to 2 • 95% within -1 to 3 • 99.7% within -2 to 4 • If μ = -1, σ = 1.5 • 68% within -2.5 to 0.5 • 95% within -4 to 2 • 99.7% within -5.5 to 3.5
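
The intervals in these examples are simply the mean plus or minus k standard deviations, for k = 1, 2, 3. The sketch below computes them and also checks the three percentages against the standard normal distribution using `statistics.NormalDist` (Python 3.8+). The helper names `interval` and `coverage` are illustrative.

```python
from statistics import NormalDist


def interval(mean, sd, k):
    """Interval covering k standard deviations on each side of the mean."""
    return (mean - k * sd, mean + k * sd)


def coverage(k):
    """Probability that a normal variable falls within k SDs of its mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(k, interval(-1, 1.5, k), f"{100 * coverage(k):.1f}%")
```

The exact coverages are about 68.3%, 95.4%, and 99.7%, which the rule rounds to 68, 95, and 99.7.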

  40. Standard Normal • A standard normal variable has mean 0 and standard deviation 1 • Notation Z ~ N(0,1) • If X ~ N(μ, σ), then (X-μ)/σ ~ N(0,1) • If Z ~ N(0,1) then Zσ + μ ~ N(μ, σ) • Easier to do calculations in terms of standard normal

  41. Standard Normal • If X is an observation from a N(μ, σ) distribution, the standardized value is • Z = (X-μ)/σ • Referred to as a z-score • The z-score tells us how many standard deviations the observation falls from the mean and on which side • Areas under the standard normal curve are tabled and this saves tedious calculations • For problems with a normal variable, convert to standard normal, solve, and convert back

  42. Normal Examples • Suppose X ~ N(266, 16), where 16 is the standard deviation • What is the z-score of 298? • Z = (X-μ)/σ = (298-266)/16 = 32/16 = 2 • Given a z-score of -1.5, what X value does this translate to? • X = Z·σ + μ = -1.5·16 + 266 = 242
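
Both conversions in this example are one-line formulas. A minimal sketch, with function names chosen here for illustration:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def from_z(z, mean, sd):
    """Convert a z-score back to the original scale."""
    return z * sd + mean


print(z_score(298, 266, 16))   # 2.0
print(from_z(-1.5, 266, 16))   # 242.0
```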

  43. More Normal Examples • If X~N(100,10), what is the probability of observing a value less than 90? • Calculate z-score: Z = (90-100) / 10 = -1 • Look up -1 on table • Probability is 0.1587 • If X~N(50,10), what is the probability of observing a value less than 70? • Calculate z-score: Z = (70-50) / 10 = 2 • Look up 2 on table • Probability is 0.9772
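
Instead of a printed table, the same lookups can be done with `statistics.NormalDist` from Python's standard library (Python 3.8+). This sketch assumes, as the z-score calculations on the slide do, that the second argument of N(·,·) is the standard deviation.

```python
from statistics import NormalDist

# P(X < 90) when X has mean 100 and standard deviation 10
p_below_90 = NormalDist(mu=100, sigma=10).cdf(90)

# P(X < 70) when X has mean 50 and standard deviation 10
p_below_70 = NormalDist(mu=50, sigma=10).cdf(70)

print(f"{p_below_90:.4f}")  # 0.1587
print(f"{p_below_70:.4f}")  # 0.9772
```

Converting to a z-score first and calling `NormalDist().cdf(z)` gives the same answers, which is the table-lookup route described on the slide.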

  44. Other Distributions • The following distributions often appear when doing hypothesis testing. The important thing is to understand how the distributions are characterized. • The t-distribution • Similar to the normal distribution but has more mass in the tails • Characterized by its degrees of freedom • The F-distribution • Characterized by two degrees-of-freedom values (numerator and denominator) • The χ²-distribution • Characterized by its degrees of freedom

  45. Methods of Data Collection • Anecdotal evidence • Stories that you’ve heard • Available data • Data online or collected by a colleague • Voluntary response sample • Online surveys • Observational study • Observe something happening • Experiment • You control one or more of the variables

  46. Types of Studies • Descriptive • Designed to document what is going on • Relational • Studies the relationship between two or more variables • Causal • Designed to determine if one or more explanatory variables causes one or more response variables

  47. Experiments • The only way to draw causal conclusions! • Control group and treatment group(s) • Example: when testing a new computer interface, the new interface is the treatment and the old (or standard) interface is the control • If we’re testing an old interface versus two new interfaces, then we have two treatment groups

  48. Hypothesis testing • A statistical method to test a theory • A formal procedure which compares observed data to a hypothesis whose truth we want to assess • Have a null (H0) and an alternative (H1) hypothesis • The hypotheses are statements about parameters in a population or model

  49. Hypothesis Testing • The null hypothesis typically states no change or difference • The alternative hypothesis states that there is a change or difference • In most cases you want the alternative hypothesis to be true

  50. Hypothesis Testing • Suppose we design a new interface for an online tutor and would like to test if it leads to better test scores. • The null would be that the old interface leads to better scores or that there is no difference and the alternative is that the new interface leads to better scores. • If μ_old and μ_new are the mean test scores for students using the old and new interfaces, then • H0: μ_old ≥ μ_new • H1: μ_old < μ_new
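
To make the one-sided setup concrete, here is a rough sketch of how such a comparison could be carried out. It uses a permutation test, which is a technique chosen here for illustration and is not named on the slides, and the scores are hypothetical numbers made up for the example. The test evaluates the boundary case of the null hypothesis (no difference between interfaces) against the alternative that the new interface gives higher mean scores.

```python
import random

# Hypothetical test scores, invented purely for illustration.
old_scores = [70, 72, 68, 75, 71, 69, 74, 73]
new_scores = [78, 74, 80, 77, 73, 79, 76, 81]


def mean(values):
    return sum(values) / len(values)


# Observed difference in means (new minus old).
observed = mean(new_scores) - mean(old_scores)

# If the interface made no difference, the group labels are arbitrary,
# so reshuffle them many times and see how often a difference at least
# as large as the observed one appears by chance.
rng = random.Random(0)  # fixed seed so the run is repeatable
pooled = old_scores + new_scores
n_new = len(new_scores)
n_perm = 10_000
at_least_as_large = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = mean(pooled[:n_new]) - mean(pooled[n_new:])
    if diff >= observed:
        at_least_as_large += 1

p_value = at_least_as_large / n_perm
print(f"observed difference = {observed}, one-sided p-value = {p_value}")
```

A small p-value counts as evidence against H0 in favor of H1; a large one means the data are compatible with no improvement.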
