Workshop on Quantitative research methods in institutional research Robert Toutkoushian Professor, Institute of Higher Education University of Georgia
My Statistics Background • PhD in economics, focus on econometrics • Worked in IR for 14 years (7 at the U) • Taught statistics classes at various levels (intro through doctoral) since 1985 • Taught webinars and workshops for AIR on statistical methods • Published studies using a wide range of statistical methods applied to problems in higher education
Organization of Workshop • Part One (morning): Univariate and Bivariate Statistics • Descriptive statistics • Hypothesis tests (two-sample tests) • Bivariate: Correlations, ANOVA, Chi-Square • Linear regression (two variables) • Part Two (afternoon): Multivariate Statistics • Linear regression: (more than two variables) • Extensions of linear regression
Features of Workshop • When to use different procedures and how to interpret results • Less focus on math and details, more on intuition • How to apply statistical methods to issues in institutional research • How to conduct analyses in SPSS • Stop me at any time with questions!!
Three Datasets for Workshop Will use three datasets to illustrate key concepts: • Dataset #1: Graduation Rate Example • Data on 679 4-year public institutions to determine how selectivity and price are related to 6-year graduation rates • Dataset #2: Faculty Salary Example • Data on 432 faculty at an institution to determine whether there are unexplained salary differences by gender • Dataset #3: Student Outcome Example • Data on 1,000 middle school students to determine whether participation in a college readiness program increases their chance of going to and graduating from college
Dataset #1: Graduation Rate Example (Note: Ideally, the timing of the variables should align with the assumed direction of causality; e.g., the admit rate in 2004 affects the graduation rate six years later.)
Objectives of Descriptive Statistics • Always begin your study by carefully studying descriptive statistics for your variables • Helps you understand the data • Helps identify possible problems with data • Goal is to provide a brief description of the data that you will eventually use in more complex analyses • Others cannot see raw data, so you must provide them with an overview of what the data look like to gain credibility
Measures of Central Tendency Goal of central tendency: Try to find the best representative value for a set of numbers. Three options: • Mean = arithmetic average of data points • Median = value in the middle of ordered data points • Mode = value that occurs the most frequently
Measures of Dispersion • Goal of dispersion: Find a measure of how “spread apart” the numbers are within a dataset • Three main measures of dispersion: range, variance, and standard deviation • As the values of these measures increase, the data points are said to be more dispersed or spread apart. • Standard deviation and variance go together • Standard deviation is the most meaningful (same units of measure as the raw data)
Example An IR analyst asks a sample of thirteen students how many courses they have withdrawn from this past year. Summary of the results: • Mean = 1.62 withdrawals • Sum of deviations = 0 • Sum of squared deviations = 19.08 • Variance = 19.08/(13-1) = 1.59 squared withdrawals • Standard deviation = square root of 1.59 = 1.26 withdrawals
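The arithmetic in this example can be reproduced with a short Python sketch. The thirteen values below are hypothetical: the raw responses are not available here, so these were chosen only because they give the same summary statistics as the slide.

```python
import math

# Hypothetical responses (NOT the workshop data), chosen to match the
# summary statistics reported in the example.
withdrawals = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4]

n = len(withdrawals)
mean = sum(withdrawals) / n

# Deviations from the mean always sum to zero (up to rounding error).
deviations = [x - mean for x in withdrawals]
sum_sq_dev = sum(d ** 2 for d in deviations)

# Sample variance divides by n - 1; the standard deviation is its square root.
variance = sum_sq_dev / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.2f}")
print(f"sum of squared deviations = {sum_sq_dev:.2f}")
print(f"variance = {variance:.2f}")
print(f"standard deviation = {std_dev:.2f}")
```

Dividing by n - 1 rather than n is what makes this the sample variance, which matches the 19.08/(13-1) step in the example.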
Exercise: Datasets 1-3 • Obtain descriptive statistics for the variables in each of the three datasets provided for the workshop • In SPSS, use “Analyze > Descriptive Statistics > Descriptives” from the menu options • Think about how you would interpret the results from the descriptive statistics
Descriptive Statistics for Dataset #1 (Graduation Rate Study) • Observations: • Average admit rate is 66.45% (range = 16% to 100%) • Average yield rate is 41.36% (range = 7% to 100%) • Average price of attendance is $20,331 (range = $7,740 to $33,015) • Average graduation rate is 44.48% (range = 4% to 100%) • How many missing values are there in the dataset?
Descriptive Statistics for Dataset #2 (Faculty Salary Study) • How do you interpret the means for variables Full, Asso, Asst, and Rank?
Descriptive Statistics for Dataset #3 (Student Outcomes Study) • What percentage of students took part in the college prep program? • What percentage of students were first generation? Lived with both parents? Enrolled in any college? Graduated from any college? • Which variables have missing values? How will this affect your subsequent analyses?
Objectives of Hypothesis Testing • Hypothesis testing is the cornerstone of quantitative analysis at all levels • Goal is to be able to use data from a sample to draw conclusions about what would be found for a larger population • Need to take into account the possibility that the result you obtain was due to chance and not a true association between variables
Logic of Hypothesis Testing • Make an assertion about a population and assume it is true • Draw a sample from the population and calculate the statistic that can be used as an estimator of the population value • Measure the distance between what you assumed you would find and what you actually found • If the distance is “too large,” then evidence is strong enough to reject your assumption
General Steps for Any Hypothesis Test • Specify the null and alternative hypotheses • Select the appropriate test statistic for the random variable in the problem • Formulate a decision rule for rejecting the null hypothesis. • Calculate the value of the test statistic and the corresponding p-value. • Compare the test statistic to the critical value and reach a conclusion (reject or fail to reject the null hypothesis).
Hypothesis Test (two samples) • Goal is to determine if there is a difference between the means for two different populations • In SPSS, use “Analyze > Compare Means > Independent Samples T Test” • Must choose the “test variable” (dependent variable) and the “grouping variable” (defines the two groups) • Results will show test of equality of variances, t-ratio for difference in sample means, and significance level
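For readers who want to see what sits behind the t-ratio, here is a minimal Python sketch of the unequal-variance (Welch) version of the two-sample test. The function name `two_sample_t` and the salary figures are hypothetical and are not taken from the workshop datasets; SPSS remains the tool used in the workshop.

```python
import math
from statistics import mean, variance

def two_sample_t(sample_a, sample_b):
    """Return (t_ratio, df) for the unequal-variance (Welch) two-sample test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample mean
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_ratio = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_ratio, df

# Hypothetical salaries in $1,000s (not from Dataset #2)
group_1 = [52.0, 58.5, 61.0, 49.5, 66.0, 55.0]
group_2 = [44.0, 47.5, 50.0, 41.5, 48.0, 45.0]

t_ratio, df = two_sample_t(group_1, group_2)
print(f"t = {t_ratio:.2f}, df = {df:.1f}")
```

The t-ratio is simply the difference in sample means divided by its standard error, so a larger gap or a smaller spread within each group both push it away from zero.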
Example from Dataset #2 • Question: Are female faculty, on average, paid the same as male faculty? • Average salary for males = $56,834 • Average salary for females = $45,841 • Gender gap = $56,834 - $45,841 = $10,993
Dataset #2 Example (cont’d) • Test for variance equality: if significant, use results from second row • Calculated t-ratio for difference in sample means • Significance level (“p-value”) • Conclusion: There is a difference in the mean salaries for all male and female faculty
Example from Dataset #3 • Question: Are students who take part in the college prep program more likely than others to enroll in college? • Enrollment rate (in program) = 47% • Enrollment rate (not in program) = 55% • Gap in enrollment rate = -8 percentage points
Dataset #3 Example (cont’d) • Can use results from first row (no difference in population variances) • Calculated t-ratio = 1.56, p-value = 0.12 • Cannot reject the null hypothesis; the evidence is not strong enough to conclude that college enrollment rates differ between the two groups
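The link between the reported t-ratio and its p-value can be checked with a short Python sketch. It uses the standard normal distribution as a large-sample stand-in for the t distribution, which is an approximation and not what SPSS computes exactly; the 0.05 significance level is also an assumption, since the slides do not state one.

```python
from statistics import NormalDist

def two_sided_p_normal(t_ratio):
    """Two-sided p-value, using the standard normal as a large-sample
    approximation to the t distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t_ratio)))

p_value = two_sided_p_normal(1.56)   # t-ratio reported in the example
alpha = 0.05                         # assumed significance level

print(f"p = {p_value:.2f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With roughly 1,000 students the normal approximation is close, which is why the sketch lands on the same p-value as the SPSS output to two decimals.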
Objectives of ANOVA and Chi-Square • Used to determine if a dependent factor differs across groups, including when there are more than two groups of interest (an extension of two-sample hypothesis testing) • Chi-square: Used when the dependent factor of interest is categorical (e.g., rank); compares distributions across groups rather than means • ANOVA: Used when the dependent factor of interest is continuous (e.g., faculty salary); compares means across groups
Chi-Square and ANOVA in SPSS • For Chi-Square test, use “Analyze > Descriptive Statistics > Crosstabs” • Need to specify the “row variable” (dependent) and “column variable” (group variable) • Need to choose option for Chi-square test on “Statistics” button • May also want to show expected values on “Cells” button • For one-way ANOVA, use “Analyze > Compare Means > One Way ANOVA” • Need to specify “dependent list” (dependent factor), and “factor” (grouping variable) • May want to choose descriptives from “Options” button
Example from Dataset #2 • Question: Is faculty rank independent of gender? (Use Chi-square test of independence) • Output shows the calculated χ² value, the p-value, and a breakdown of observed and expected combinations of rank and gender • Conclusion: Rank and gender are not independent (there is an association between rank and gender)
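The observed-versus-expected comparison behind the Chi-square test can be sketched in a few lines of Python. The function name `chi_square_independence` and the counts are hypothetical; they are not the Dataset #2 values.

```python
def chi_square_independence(observed):
    """Return (chi_square, df, expected) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / N
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi_square = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(observed, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi_square, df, expected

# Hypothetical counts: rows = rank (Full, Associate, Assistant),
# columns = gender (male, female). Not the Dataset #2 values.
counts = [[60, 15], [45, 30], [35, 40]]

chi_square, df, expected = chi_square_independence(counts)
print(f"chi-square = {chi_square:.2f}, df = {df}")
```

The statistic grows as the observed counts move away from what independence would predict, which is why showing expected values in the SPSS “Cells” options helps with interpretation.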
Example from Dataset #2 • Question: Do the average citation rates of faculty vary by rank? (Use ANOVA) • Average citations are highest for Full Professors, then Associate, then Assistant • But are these differences large enough to conclude that there is a difference for the entire population?
Example from Dataset #2 (cont’d) • Output shows the calculated F-ratio and its p-value • Between groups SS = Variation due to differences in sample means • Within groups SS = Variation due to differences around each sample mean • F-ratio = MS(between) / MS(within) • If all sample means are equal, between group SS = 0 and F-ratio = 0 • Because calculated F-ratio is significant, conclude that there is a difference in the mean citation rates by rank
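The sums of squares and the F-ratio described above can be computed by hand, as in this minimal Python sketch. The function name `one_way_anova` and the citation counts are hypothetical and are not drawn from Dataset #2.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (ss_between, ss_within, f_ratio) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between groups: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: how far each observation sits from its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_ratio

# Hypothetical citation counts by rank (not from Dataset #2)
full = [12, 15, 9, 14, 11]
associate = [8, 10, 7, 9, 6]
assistant = [4, 6, 5, 3, 7]

ss_between, ss_within, f_ratio = one_way_anova([full, associate, assistant])
print(f"SS between = {ss_between:.1f}, SS within = {ss_within:.1f}")
print(f"F = {f_ratio:.2f}")
```

Passing groups with identical means gives a between-groups SS of zero and therefore an F-ratio of zero, matching the point made on the slide.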
Objectives of Correlations • Goal is to determine if two variables are associated with each other • Does not specify the direction of causation • Forms the basis for regression analysis • Positive correlation: Both variables move in the same direction • Negative correlation: The variables move in opposite directions • Can obtain correlations in SPSS by “Analyze > Correlate > Bivariate”
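As an illustration of what the correlation coefficient measures, here is a minimal Python sketch of the Pearson correlation. The function name `pearson_r` and the five institutions' values are hypothetical and are not taken from Dataset #1.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Co-movement of the two variables around their means
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical admit rates and graduation rates (percent) for five institutions
admit_rate = [30, 45, 60, 75, 90]
grad_rate = [70, 62, 55, 48, 40]

r = pearson_r(admit_rate, grad_rate)
print(f"r = {r:.3f}")
```

The sign of r gives the direction of the association and its distance from zero gives the strength; swapping the two variables leaves r unchanged, which is one way to see that it says nothing about the direction of causation.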
Example Using Dataset #1 • What are the associations between the four variables in the study of graduation rates? • Grad rates positively associated with price, negatively associated with admit and yield rates. Do these make sense? • Which correlations are the strongest?
Example Using Dataset #3 • What are the correlations between enrolling in college, being in college prep program, living with both parents, being 1st generation, and GPA? What factors are most highly correlated with college enrollment?
Objectives of Linear Regression • Estimate the linear relationship between a dependent variable (Y) and an independent variable (X) of interest (must specify causation) • Use regression model to predict Y given X • Identify magnitude of effect of X on Y • Conduct hypothesis test to determine if X has an effect on Y (“significance test”) • Control for effects of other factors on Y (through multiple regression)
Linear Regression (or OLS) • Goal: Specify the linear relationship between two or more factors • Written as Y = a + bX • Need to know the Y-intercept (a) and the slope (b) to identify the line • Shows how changes in one variable (X) affect another variable (Y) • Example: If Y = GPA and X = SAT score, then Y = 2.00 + 0.02X shows that for each one-point increase in SAT score, a student’s predicted GPA rises by 0.02 points.
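To show where the intercept (a) and slope (b) come from, here is a minimal Python sketch of the least-squares formulas for the two-variable case. The function name `fit_line` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line Y = a + bX."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: co-movement of X and Y divided by the variation in X
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: forces the line through the point of means
    a = mean_y - b * mean_x
    return a, b

# Hypothetical (X, Y) pairs, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

a, b = fit_line(xs, ys)
print(f"Y = {a:.2f} + {b:.2f}X")
```

The slope is read the same way as in the GPA example: it is the change in predicted Y for a one-unit increase in X.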
Linear Regression in SPSS • To estimate a linear regression equation in SPSS, use “Analyze > Regression > Linear” • Need to specify the dependent and independent variables for the equation • Must decide beforehand which variable is the “dependent” and which is the “independent” (X affects Y and not Y affects X) • Use the “enter” method (default option) • Output is in three parts: • Model Summary (look for R2) • ANOVA (look for F-ratio) • Coefficients (look for slope and intercept, t-ratios, p-values)
Example for Dataset #1 • Question: Does a college’s admit rate have an effect on its graduation rate? • IR Analyst believes that the admit rate (X) affects the graduation rate (Y) and not vice-versa (why?) • Need to find the slope and the intercept for the equation: GradRate = α + β*AdmitRate + ε, where Y = GradRate and X = AdmitRate
Example for Dataset #1 (cont’d) • R2 = 0.018 indicates that 1.8% of variation in graduation rates is “explained” by admit rates • F-ratio of 9.208 is significant, indicating that the regression model explains a “significant” amount of variation in graduation rates [this is a very weak test] • Can think of “Regression SS” = “Between SS” and “Residual SS” = “Within SS” from ANOVA
Example for Dataset #1 (cont’d) • Output shows the slope (-0.127) and intercept (56.502), their standard errors, the calculated t-ratios (B / Std. error), and the p-values • Equation: GradRate = 56.502 - 0.127*AdmitRate • Slope: Each one percentage point increase in the admit rate is associated with a 0.127 percentage point decrease in the predicted graduation rate.
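The fitted equation can be used to predict graduation rates, as in this short Python sketch. The coefficients are the ones reported in the example; the function name and the admit rates of 50% and 60% are arbitrary illustrative choices.

```python
INTERCEPT = 56.502   # intercept reported in the example
SLOPE = -0.127       # slope reported in the example

def predicted_grad_rate(admit_rate):
    """Predicted graduation rate (percent) for a given admit rate (percent)."""
    return INTERCEPT + SLOPE * admit_rate

# Illustrative admit rates, not specific institutions
at_50 = predicted_grad_rate(50)
at_60 = predicted_grad_rate(60)

print(f"admit rate 50%: predicted graduation rate = {at_50:.2f}%")
print(f"admit rate 60%: predicted graduation rate = {at_60:.2f}%")
print(f"difference for a 10-point higher admit rate = {at_60 - at_50:.2f} points")
```

A ten-point difference in the admit rate changes the prediction by ten times the slope, which is a convenient way to express small slopes in more intuitive units.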
Example for Dataset #2 • How does years of experience affect annual salary? • Years of experience account for 8.3% of the variation in annual faculty salaries • Effect of years of experience is statistically significant (t = 6.23) • Equation: Annsal = 45071.959 + 499.419*Yrsexp • For each additional year of experience, predicted annual salary rises by $499.42
Goals of Multiple Regression • Enables the researcher to examine the effects of multiple independent variables on a dependent variable of interest • Models are thus more realistic • Can account for more variation in Y • Can “control” for the effects of certain X’s while focusing on another X • Useful because most X’s are correlated to some degree
Multiple Regression in SPSS • Conducted in the same general way as simple (two-variable) regression, except you specify a list of independent variables • R2 = percent of variation in Y explained by all X’s • Considerations: • Need enough df to estimate model (df = N - #X’s - 1) • Interpretation of slope coefficients changes to “partial effects” • If X’s are highly correlated, it is hard to identify “pure” effect • Cannot have perfect collinearity among X’s (arises often with dummy variables and intercept)
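The dummy-variable case of perfect collinearity can be illustrated with a short Python sketch. The variable names Full, Asso, and Asst follow Dataset #2, but the list of faculty ranks is hypothetical, and choosing Asst as the omitted reference group is only one possible choice.

```python
RANKS = ["Full", "Asso", "Asst"]

def rank_dummies(rank):
    """One 0/1 indicator per rank, in the order given by RANKS."""
    return [1 if rank == r else 0 for r in RANKS]

# Hypothetical faculty ranks (not the Dataset #2 records)
faculty = ["Full", "Asst", "Asso", "Full", "Asst", "Asst"]
rows = [rank_dummies(r) for r in faculty]

# The three dummies sum to 1 in every row, which duplicates the intercept
# column; including all three plus an intercept is perfectly collinear.
collinear_with_intercept = all(sum(row) == 1 for row in rows)

# Usual remedy: drop one category (here Asst) to serve as the reference group.
rows_for_model = [row[:-1] for row in rows]

# The mean of a dummy variable is the share of cases in that category.
share_full = sum(row[0] for row in rows) / len(rows)

print(collinear_with_intercept)
print(f"share of Full professors = {share_full:.2f}")
```

With one category dropped, each remaining dummy coefficient is read as the difference from the reference group, holding the other X’s constant.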
Example for Dataset #1 • Question: Does price affect an institution’s graduation rate, even after taking into account selectivity? • Issue: There is likely to be a positive association between price and selectivity (i.e., negative correlation between price and PctAdmit) • Use multiple regression: Y = GradRate, X1 = Price1000, X2 = PctAdmit
Statistics in Multiple Regression • Equation: GradRate = β0 + β1*Price1000 + β2*PctAdmit • β1 = effect of a $1000 increase in price on graduation rate, holding constant PctAdmit • Account for units of measure when comparing X’s • “Betas” show effect sizes (# std. dev. change in Y due to a one std. dev. change in X) • Can also compare magnitudes of t-ratios • Do not use R2 as sole criterion for evaluating model • R2 never falls as X’s are added to the model • Adjusted R2 penalizes for the number of X’s, so it can fall when an added X contributes little (can also look at significance of new X’s)
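The adjustment to R2 can be made concrete with a short Python sketch of the standard adjusted R2 formula. The R2 of 0.307 is the value reported for the Dataset #1 model; the sample size of 500 is a hypothetical placeholder, since the number of observations used in that model is not stated.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    df_residual = n - k - 1          # df = N - #X's - 1
    if df_residual <= 0:
        raise ValueError("not enough degrees of freedom to estimate the model")
    return 1 - (1 - r2) * (n - 1) / df_residual

R2_REPORTED = 0.307   # R2 reported for the two-variable Dataset #1 model
N_ASSUMED = 500       # hypothetical sample size, NOT taken from the slides

two_x = adjusted_r2(R2_REPORTED, N_ASSUMED, 2)
# Same R2 with a third X that adds nothing: adjusted R2 goes down.
three_x = adjusted_r2(R2_REPORTED, N_ASSUMED, 3)

print(f"adjusted R2 with 2 X's = {two_x:.4f}")
print(f"adjusted R2 with 3 X's = {three_x:.4f}")
```

Because each extra X uses up a degree of freedom, adjusted R2 rises only when the gain in R2 is large enough to offset that cost.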
Results: Dataset #1 • Price and selectivity account for 30.7% of variation in graduation rates • Price is significant, PctAdmit is not significant • Effect of Price is nearly 19 times greater than that of PctAdmit • Equation: GradRate = 1.715 + 2.371*Price1000 - 0.027*PctAdmit