Inferential Statistics

Inferential Statistics

Inferential Statistics With inferential statistics you can do the following: • Determine probability of characteristics of population based on the characteristics of your sample. You know what the relationship “looks like” in the sample. You have the actual numbers. Inferential statistics tells you the probability that the same relationship exists in the population. • Assess the strength of the relationship between independent and dependent variables. Inferential statistics allows you to determine if the relationship is statistically significant and helps you decide if it is substantially significant (e.g., is the relationship strong enough to matter).

Inferential Statistics Inferential statistics are used to test hypotheses • Research hypotheses state there is a relationship between two or more variables in your sample. • Initially you do NOT assume there is a relationship in the population (i.e., null hypothesis claims there is NO relationship). • You compute a statistic that indicates the probability that we can reject the null hypothesis and support the research hypothesis (e.g., the same relationship we see in our sample also exists in the population). If it is 95% probable that the sample relationship exists in the population, we say it is significant at the .05 level (5% chance of making an error, or we say there is also a relationship in the population from which this sample was drawn).

How Do We Apply What We Learn From Inferential Statistics? • BEFORE you use any Intervention, you should determine if there is evidence that it works. • For instance, what is the probability that fertilizer will increase crop yield for farmers? • BEFORE you work with any group, you may want to know the characteristics of that group. • For instance, what proportion of abused women are eventually killed by their partner? What is the probability that “risk” increases after they leave the partner? • BEFOREyou make recommendations, you want to understand the probabilities of success. • What is the probability that those who participate in 4-H will graduate from college? Are they significantly more likely to graduate than those who do not participate?

Inferential Statistics Can Answer the Following Questions • Is the Intervention I am currently using worth my time? • Does it work with 5% or 95% of program participants? Are participants more likely than general population to reach goals? • What factors are most important when attempting to increase effectiveness of intervention? • If I use a second intervention, does it increase success by 10%, 15%, or 60%? • Are characteristics of program participants important? • Policy Implications - Is it worth the tax payer’s dollars? • Is this a Spurious Relationship? • Is there a difference between groups receiving intervention and those not receiving it • Is it large enough to warrant use of limited resources? (substantial significance) • Is it large enough to argue that this intervention “works” in different settings/situations?

Example of Applying Information Learned From Inferential Statistics You implement an “on-line” program to improve communication amongst young married couples. • Using measurement of long term objectives, is it successful? • What percent of my respondents stayed married? How much lower is the divorce rate for those who participated than for general population? • What factors are most important? • Does required payment increase success? • Do older/younger couples respond more effectively to this counseling? • Policy Implications - Is it worth Tax Payer’s Dollars? • Is this relationship spurious (i.e., proactive individuals are more likely to seek intervention and also more likely to have good marriages)? • Is the intervention “cost effective”? Is difference big enough to matter? • Is there evidence that this program could work in other settings/situations or with other couples?

Correlation and Regression

Stating Hypothesis for Regression • Null Hypothesis • There is not relationship between any of the independent variables and the dependent variable • Technically, all of the slopes are zero • Research Hypothesis • There is a relationship between at least one of the independent variables and the dependent variable • Technically, at least one of the slopes are zero • This relationship could be positive or negative

Inferential Statistics • First consider bi-variate statistics • Pearson Correlation • When is it used? • When you have a continuous independent variable and a continuous dependent variable. • How do you interpret it? • When the probability associated with the ___ statistics is .05 or less then you can assume there is a relationship between the dependent and independent variable • For instance you may want to know if the number of hours participants spend in your program is positively related to their scores on school exams * NOTE The Pearson Correlation and Bi-variateRegression are very similar

Inferential Statistics • First consider bi-variate statistics • Bi-variate Regression • When is it used? • When you have a continuous independent variable and a continuous dependent (outcome) variable • For instance, you may want to know if the number of hours participants spend in your program is positively related to their scores on school exams • How do you interpret it? • When the probability associated with the F-statistic is .05 or less then you can assume there is a relationship between the dependent and the independent variable • NOTE The Pearson Correlation and Bi-variateRegression are very similar

Pearson Correlation • Consists of a continuous independent and a continuous dependent variable (i.e., X and Y) • A Pearson correlation coefficient is used to estimate the strength of the relationship between X and Y in the population • A Pearson correlation coefficient ranges from -1 to +1 • The closer to -1 or +1 it is, the stronger the relationship between X and Y, and the lower the probability that we would make a mistake if we claimed there is a relationship between X and Y in the population • A scatter plot can give a visual representation of the relationship between X and Y • A scatter plot shows all of the data points/plots and their relationship, using an X and Y axis • On the following slide, respondents’ scores on BETA and SAT were plotted so that there is one data point for someone who scored 1000 on the SAT and 12 on the BETA.

Bi-variate scatterplot showing a strong positive relationship If all of the data points were on the regression line, then the correlation coefficient would be 1. This would indicate that if we know a person’s score on the SAT we can predict their score on the BETA 100% of the time.

Bi-variate scatterplot showing a strong negative/inverse relationship . The regression line or slope indicates where the data points would be if you could predict Y after knowing X 100% of the time. It is the “predicted” Y.

Correlation Matrix • The following slide contains a computer generated correlation matrix. A correlation matrix can provide the following information: • Strength of the relationship between any two of the variables • The probability that you would make a mistake if you claimed any two variables are related in the population • At the top of the correlation matrix, the following information is reported: • The mean of each continuous variable • The sample size • The standard deviation of each continuous variable • The range of scores for each continuous variable • The standard deviation of each continuous variable

Pearson Correlation CoefficientWhat does it tell us about the strength of the relationship between X and Y? Strength of Relationship r value R2 values Perfect 1.0 1.0 Strong .8 .64 Moderate .5 .25 Weak .2 .04 No Relationship 0 0 Weak - .2 - .04 Moderate - .5 - .25 Strong - .8 - .64 Perfect -1.0 -1.0 Strength of relationship (r or R2) • The closer to 1 or -1 the R and R2 are, the stronger the relationship. . Significance • The stronger the relationship the more likely it is significant.

Correlation Matrix SR90 = number of men per every 100 women TPOV90 = % of people living in poverty FHH = % of female headed households EMPMAL = % of males employed EMPFEM90 = % of females employer

The first number in the matrix (marked by the maroon textbox) is the correlation coefficient. It indicates the strength of the relationship. The second number is the probability. It must be .05 or less if you are to generalize to the population. There may be a third number in the matrix. It would indicate the sample size.

Bi-Variate Regression This is a bi-variate regression printout. It focuses specifically on the relationship between two of the variables (e.g. FHH90 and TPOV90) reported in the matrix on the previous slide. Note that the standardized estimate is the same as the correlation coefficient. If you square this number (.51215714) you would get .2623 (the R square). If you square the standardized estimate you always get the R-square. It is the percent increase in you ability to predict Y if you know X. In this example, your ability to predict the poverty rate (TPOV90) in a city increases by 26% if you know the percent of female headed households in that city.. The Prob>F is .0001 This indicates this relationship is significant. We are more than 99% sure that this relationship exists in the population.

This is a SAS printout of a Pearson Correlation Matrix. This matrix reports the relationship between 3 continuous variables (i.e., GPR, grade in school and number of times the student has skipped class).

Types of Regression • Bi-Variate Regression • Continuous dependent variable • Continuous independent variable • Relationship between the two can be negative or inverse, positive, linear or curvilinear. • Multiple Regression • Continuous dependent variable • You use multiple independent variables to predict a continuous dependent variable • For instance, you could use number of hours participating in the program, score on attitude index and age to predict success in school (i.e., GPR) • A variation – you use one or more continuous independent variables and one categorical variable to predict a dependent variable—The categorical variable can have only two categories (i.e., male or female) • For instance, you would use gender, number of hours participating n the program, score on attitude index and age to predict success in school (i.e., GPR)

Regression – continued • Logit Regression • Categorical dependent variable (two categories) • You use continuous independent variables to predict the probability of falling into one category or another • For instance, how does number of hours studying for exams, age, and number of classes skipped, influence the probability that a student will graduate from high school? • Graduation is measured simply as did or did not graduate

Regression Printout This is a copy of a multiple regression printout and includes a brief explanation of the numbers reported.

Interpreting a regression printout Pr > F is <.000l indicating we can reject the null hypothesis and at least one independent variable is significantly related. R-Square is .5187 indicating that our ability to predict posself (self-esteem score) increases by 51% if we know the value of all of the independent variables. Pr . is less than .0001 for grades (abc) and how much like they other students (likestu) indicating that these are the independent variables that are related to posself (self-esteem score).

Interpreting Parameter Estimates • Parameter estimates are how much Y changes for every one unit change in X. From printout on previous slide, we see that for every grade increase (i.e., from C to B) then Posself (self esteem score) increases by .69 or 2/3 of a point. • If the independent variable is a dummy variable then we interpret it slightly differently. It is how much Y changes when we go from one category to the other. From printout on previous slide we see that as we go from the category of male (coded 0) to female (coded 1) then Posselfdecreases by .21688.

Interpreting Parameter Estimates • Caution – When you interpret a parameter estimate, you must consider how you measured the X variables. If your parameter estimate is 1 and you measured money in $10,000 increments, then for every $10,000 you spend on your child, their SAT score would increase by 1. If your parameter estimate is 1 and you measure money in dollars, then for every dollar, their SAT score would increase by 1. $10,000 would result in an increase of 10,000 on their SAT.

Predicting Y • Why and when do we predict Y? • Explain results to a nonacademic audience • Explain regression in an interesting way • How do we predict? • First generate regression printout • THEN use the prediction formula – Y=a+b1x1+b2x2+b3x3+…….Where: • Y=the intercept or constant • b=slope or parameter estimate which tells you the change in Y for every one unit change in X • X=value of each independent variable that you select using the codebook • Y=the predicted value of your dependent variable as predicted by the combination of X variables

Now You can do it

Contact Information Dr. Carol Albrecht USU Extension Assessment Specialist carol.albrecht@usu.edu 979-777-2421

Inferential Statistics