Investigating the relationship between two variables • Generally a statistical relationship exists if the values of the observations for one variable are associated with the values of the observations for another variable • Knowing that two variables are related allows us to make predictions. • If we know the value of one, we can predict the value of the other.
Determining how the values of one variable are related to the values of another is one of the foundations of empirical science. • In making such determinations we must consider the following features of the relationship.
1.) The level of measurement of the variables. Different types of variables necessitate different procedures. • 2.) The form of the relationship. We can ask if changes in X move in lockstep with changes in Y or if a more sophisticated relationship exists. • 3.) The strength of the relationship. Is it possible that some levels of X will always be associated with certain levels of Y?
4.) Numerical summaries of the relationship. Social scientists strive to boil down the different aspects of a relationship to a single number that reveals the type and strength of the association. • 5.) Conditional relationships. The variables X and Y may seem to be related in some fashion, but appearances can be deceiving; a spurious relationship is one example. So we need to know if the introduction of any other variables into the analysis changes the relationship.
Types of Association • 1.) General Association – the variables are simply associated in some way. • 2.) Positive Monotonic Correlation – when the variables have order (ordinal or continuous), high values of one variable are associated with high values of the other. The converse is also true: low values go with low values. • 3.) Negative Monotonic Correlation – low values of one variable are associated with high values of the other.
Types of Association Cont. • 4.) Positive Linear Association – a particular type of positive monotonic relationship where the plotted values of X-Y fall on a straight line that slopes upward. • 5.) Negative Linear Relationship – a straight line that slopes downward.
Strength of Relationships • Virtually no relationships between variables in Social Science (and largely in natural science as well) have a perfect form. • As a result it makes sense to talk about the strength of relationships.
Strength Cont. • The strength of a relationship between variables can be found by simply looking at a graph of the data. • If the values of X and Y are tied together tightly then the relationship is strong. • If the X-Y points are spread out then the relationship is weak.
Direction of Relationship • We can also infer direction from a graph by simply observing how the values for our variables move across the graph. • This is only true, however, when our variables are ordinal or continuous.
Types of Bivariate Relationships and Associated Statistics • Nominal/Ordinal (including dichotomous): Crosstabulation (Lambda, Chi-Square, Gamma, etc.) • Interval and Dichotomous: Difference of means test • Interval and Nominal/Ordinal: Analysis of Variance • Interval and Ratio: Regression and correlation
Assessing Relationships between Variables • 1. Calculate appropriate statistic to measure the magnitude of the relationship in the sample • 2. Calculate additional statistics to determine if the relationship holds for the population of interest (statistical significance) • Substantive significance vs. Statistical significance
What is a Crosstabulation? • Crosstabulations are appropriate for examining relationships between variables that are nominal, ordinal, or dichotomous. • Crosstabs show values for one variable categorized by another variable. • They display the joint distribution of values of the variables by listing the categories of one variable across the columns (the x-axis) and those of the other down the rows (the y-axis).
Each case is then placed in a cell of the table that represents the combination of values that corresponds to its scores on the variables.
What is a Crosstabulation? • Example: We would like to know if presidential vote choice in 2000 was related to race. • Vote choice = Gore or Bush • Race = White, Hispanic, Black
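The cell-by-cell placement of cases can be sketched in a few lines. The six cases below are hypothetical and exist only to show the mechanics; they are not real 2000 election data, and the variable names are illustrative.

```python
from collections import Counter

# HYPOTHETICAL cases (race, vote) for illustration only; not real survey data.
cases = [
    ("White", "Bush"), ("White", "Gore"), ("Black", "Gore"),
    ("Hispanic", "Gore"), ("White", "Bush"), ("Hispanic", "Bush"),
]

races = ["White", "Hispanic", "Black"]   # column categories
votes = ["Gore", "Bush"]                 # row categories

# Each case falls into the cell matching its pair of values.
cells = Counter(cases)

# One row per vote choice, one column per race (0 where no case falls).
crosstab = {vote: [cells[(race, vote)] for race in races] for vote in votes}

for vote in votes:
    print(vote, crosstab[vote])
```

Each row of the printed table is one vote choice, and each position in the row is one race category.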
Measures of Association for Crosstabulations • Purpose – to determine if nominal/ordinal variables are related in a crosstabulation • At least one nominal variable: Lambda, Chi-Square, Cramer’s V • Two ordinal variables: Tau, Gamma
Measures of Association for Crosstabulations • These measures of association provide us with correlation coefficients that summarize data from a table into one number. • This is extremely useful when dealing with several tables or very complex tables. • These coefficients measure both the strength and direction of an association.
Coefficients for Nominal Data • When one or both of the variables are nominal, ordinal coefficients cannot be used because there is no underlying ordering. • Instead we use PRE tests
Lambda (PRE coefficient) • PRE – Proportional Reduction in Error • Two Rules • 1.) Make a prediction about the value of an observation in the absence of any prior information. • 2.) Given information on a second variable, take it into account in making the prediction.
Lambda PRE • If the two variables are associated then the use of rule two should lead to fewer errors in your predictions than rule one. • How many fewer errors depends upon how closely the variables are associated. • PRE = (E1 – E2) / E1, where E1 and E2 are the numbers of errors made under rules one and two • The scale runs from 0 to 1
Lambda • Lambda is a PRE coefficient and it relies on rules 1 & 2 above. • When applying rule one all we have to go on is what proportion of the population fit into one category as opposed to another. • So, without any other information, guessing that every observation is in the modal category would give you the best chance of getting the most correct.
Why? • Think of it like this. If you knew that I tended to make exams where the most often used answer was B, then, without any other information, you would be best served to pick B every time.
But if you know each case’s value on another variable, rule two directs you to look only at the cases in that category of the new variable and predict the modal category of the dependent variable within that group.
Example • Suppose we have a sample of 100 voters and need to predict how they will vote in the general election. • Assume we know that overall 30% voted Democrat, 30% voted Republican, and 40% were Independent. • Now suppose we take one person out of the group (John Smith). Our best guess would be that he would vote Independent.
Now suppose we take another person (Larry Mendez) and again we would assume he voted independent. • As a result our best guess is to predict that all of the voters (all 100) were independent. • We are sure to get some wrong but it’s the best we can do over the long run.
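Rule one can be sketched as "always guess the modal category". The counts come from the example above (30 Democrat, 30 Republican, 40 Independent out of 100); the function name is just an illustrative choice.

```python
# Marginal distribution of the dependent variable (counts out of 100 voters),
# taken from the example above.
party_counts = {"Democrat": 30, "Republican": 30, "Independent": 40}

def rule_one_prediction(counts):
    """Return the modal category and the errors made by always guessing it."""
    modal = max(counts, key=counts.get)
    errors = sum(counts.values()) - counts[modal]
    return modal, errors

modal, errors = rule_one_prediction(party_counts)
print(modal, errors)  # Independent 60
```

Every voter who is not in the modal category counts as one prediction error.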
How many do we get wrong? 60. • Suppose now that we know something about the voters’ regions (where they are from) and we know how many voters came from each region. • NE – 30, MW – 20, SO – 30, WE – 20
Lambda – Rule 1 (prediction based solely on knowledge of marginal distribution of dependent variable – partisanship)
Lambda – Rule 2 (prediction based on knowledge provided by independent variable)
Lambda – Calculation of Errors • Errors w/Rule 1: 18 + 12 + 14 + 16 = 60 • Errors w/Rule 2: 16 + 10 + 14 + 10 = 50 • Lambda = (Errors R1 – Errors R2) / Errors R1 • Lambda = (60 – 50)/60 = 10/60 = .17
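The error counts can be reproduced with a short sketch. One loud assumption: the party-by-region table itself does not survive in this text, so the cell counts below are reconstructed from the marginal totals, the per-region error counts, and the chi-square terms given on these slides. Treat them as a consistent reconstruction rather than the original table.

```python
# Rows: party (dependent variable). Columns: region NE, MW, SO, WE.
# ASSUMPTION: cell counts reconstructed from the totals and error counts
# on the slides, not copied from the original table.
table = {
    "Republican":  [4, 10, 6, 10],
    "Independent": [12, 8, 16, 4],
    "Democrat":    [14, 2, 8, 6],
}

def lambda_pre(table):
    """Return (errors under rule 1, errors under rule 2, lambda)."""
    rows = list(table.values())
    n = sum(sum(row) for row in rows)
    # Rule 1: guess the overall modal category for every case.
    e1 = n - max(sum(row) for row in rows)
    # Rule 2: within each column, guess that column's modal category.
    e2 = sum(sum(col) - max(col) for col in zip(*rows))
    return e1, e2, (e1 - e2) / e1

e1, e2, lam = lambda_pre(table)
print(e1, e2, round(lam, 2))  # 60 50 0.17
```

Under this reconstruction the per-region errors for rule two are 16, 10, 14, and 10, matching the sum shown above.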
Lambda • PRE measure • Ranges from 0 to 1 • Potential problems with Lambda • Underestimates relationship when variables (one or both) are highly skewed • Always 0 when modal category of Y is the same across all categories of X
Chi-Square (χ²) • Also appropriate for any crosstabulation with at least one nominal variable (and another nominal/ordinal variable) • Based on the difference between the empirically observed crosstab and what we would expect to observe if the two variables are statistically independent
Background for χ² • Statistical Independence – A property of two variables in which the probability that an observation is in a particular category of one variable and also in a particular category of the other variable equals the product of the simple or marginal probabilities of being in those categories. • Plays a large role in data analysis • Is another way to view the strength of a relationship
Example • Suppose we have two nominal or categorical variables, X and Y. We label the categories of the first variable (a,b,c) and those of the second (r,s,t). • Let P(X = a) stand for the probability that a randomly selected case has property a on variable X and P(Y = r) stand for the probability that a randomly selected case has property r on variable Y.
These two probabilities are called marginal probabilities and simply refer to the chance that an observation has a particular value on a particular variable irrespective of its value on another variable.
Finally, let us assume that P(X = a, Y = r) stands for the joint probability that a randomly selected observation has both property a and property r simultaneously. • Statistical Independence – The two variables are statistically independent only if the chance of observing each combination of categories is equal to the marginal probability of one category times the marginal probability of the other.
Background for χ² • P(X = a, Y = r) = [P(X = a)] [P(Y = r)] • For example, if men are as likely to vote as women, then the two variables (gender and voter turnout) are statistically independent because the probability of observing a male nonvoter in the sample is equal to the probability of observing a male times the probability of observing a nonvoter.
Example • If 100 of 300 respondents are men and 210 of 300 voted, then the marginal probabilities are P(X = m) = 100/300 = .33 and P(Y = v) = 210/300 = .7. • Their product, .33 x .7 = .23, is the joint probability we would expect under independence.
If we know that 70 of the voters are male and divide that count by the sample size (70/300), we also get .23. • Because the observed joint probability equals the product of the marginal probabilities, we can say that the two variables are independent.
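The independence check in this example can be written out directly. The counts (300 respondents, 100 men, 210 voters, 70 male voters) are the ones from the example; the variable names are illustrative.

```python
import math

n = 300            # sample size
men = 100          # respondents who are men
voters = 210       # respondents who voted
male_voters = 70   # respondents who are both male and voters

p_male = men / n                     # marginal probability P(X = m)
p_voter = voters / n                 # marginal probability P(Y = v)
expected_joint = p_male * p_voter    # joint probability under independence
observed_joint = male_voters / n     # joint probability seen in the sample

print(round(expected_joint, 2), round(observed_joint, 2))  # 0.23 0.23
print(math.isclose(expected_joint, observed_joint))        # True
```

The comparison uses `math.isclose` because the two values are floating-point results of different calculations.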
The chi-squared statistic essentially compares an observed result (the table produced by the sample) with a hypothetical table that would occur if (in the population) the variables were statistically independent. • A value of 0 implies statistical independence which means no association.
Chi-squared increases as the departures of the observed values from the expected values grow. There is no upper limit to how big the difference can become, but if it is past a critical value then there is reason to reject the null hypothesis that the two variables are independent.
How Do We Calculate χ²? • The observed frequencies are already in the crosstab. • The expected frequencies in each table cell are found by multiplying the row and the column marginal totals and dividing by the sample size.
Calculating Expected Frequencies • To calculate the expected cell frequency for NE Republicans: • E/30 = 30/100, therefore E=(30*30)/100 = 9
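As a rough sketch, the same rule gives the expected frequency for every cell at once. The marginal totals are the ones stated in the voting example (30 Republican, 40 Independent, 30 Democrat; NE 30, MW 20, SO 30, WE 20); the dictionary layout is only an illustrative choice.

```python
# Marginal totals from the voting example (sample of 100).
row_totals = {"Republican": 30, "Independent": 40, "Democrat": 30}
col_totals = {"NE": 30, "MW": 20, "SO": 30, "WE": 20}
n = 100

# Expected frequency = (row total * column total) / sample size.
expected = {
    (party, region): row_total * col_total / n
    for party, row_total in row_totals.items()
    for region, col_total in col_totals.items()
}

print(expected[("Republican", "NE")])   # 9.0
print(expected[("Independent", "MW")])  # 8.0
```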
Calculating the Chi-Square Statistic • The chi-square statistic is calculated as: $\chi^2 = \sum_{i}\sum_{k} \frac{(O_{ik} - E_{ik})^2}{E_{ik}}$, where $O_{ik}$ is the observed frequency and $E_{ik}$ the expected frequency in row i, column k • χ² = (25/9)+(16/6)+(9/9)+(16/6)+(0)+(0)+(16/12)+(16/8)+(25/9)+(16/6)+(1/9)+(0) = 18
The value 9 is the expected frequency in the first cell of the table and is what we would expect in a sample of 100 (with 30 Republicans and 30 Northeasterners) if there is statistical independence in the population. • This is more than the observed count in our sample, so there is a difference between the observed and expected frequencies.
Just Like a Hypothesis Test • Null: X and Y are statistically independent • Alternative: X and Y are not independent.
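The sum can be checked with a small sketch. As a loud assumption, the observed counts below are reconstructed from the marginal totals and the twelve terms shown above, since the original table does not survive in this text; the row order (Republican, Independent, Democrat) and column order (NE, MW, SO, WE) are chosen to match the order of those terms.

```python
# ASSUMPTION: observed counts reconstructed from the slide totals and the
# chi-square terms, not copied from the original table.
# Rows: Republican, Independent, Democrat. Columns: NE, MW, SO, WE.
observed = [
    [4, 10, 6, 10],
    [12, 8, 16, 4],
    [14, 2, 8, 6],
]

def chi_square(observed):
    """Sum (observed - expected)^2 / expected over every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for k, obs in enumerate(row):
            exp = row_totals[i] * col_totals[k] / n
            stat += (obs - exp) ** 2 / exp
    return stat

print(round(chi_square(observed), 2))  # 18.0
```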
Interpreting the Chi-Square Statistic • The Chi-Square statistic ranges from 0 to infinity • 0 = perfect statistical independence • Even though two variables may be statistically independent in the population, in a sample the Chi-Square statistic may be > 0 • Therefore it is necessary to determine statistical significance for a Chi-Square statistic (given a certain level of confidence)
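One way to judge significance is to convert the statistic into a p-value. The sketch below makes several assumptions not stated on the slides: degrees of freedom are taken as (rows – 1) x (columns – 1), which for the 3 x 4 party-by-region table is 6; the significance level of .05 is an arbitrary illustrative choice; and the helper name is hypothetical. It uses the closed-form upper-tail probability that holds for a chi-square distribution with an even number of degrees of freedom, so it needs only the standard library.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2
    return math.exp(-half) * sum(
        half ** j / math.factorial(j) for j in range(df // 2)
    )

chi2 = 18.0                 # statistic from the party-by-region example
df = (3 - 1) * (4 - 1)      # 3 party rows, 4 region columns
alpha = 0.05                # ASSUMPTION: illustrative significance level

p_value = chi2_upper_tail_even_df(chi2, df)
print(df, round(p_value, 4))   # 6 0.0062
print(p_value < alpha)         # True
```

Under these assumptions the p-value falls below .05, so the null hypothesis of statistical independence would be rejected for this table.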