Relationships in Categorical Data with Intro to Probability

Concepts in Statistics

The Big Picture

Variables
Recall the difference between quantitative and categorical variables:
• Quantitative variables have numeric values that can be averaged. A quantitative variable is frequently a measurement, for example, a person’s height in inches.
• Categorical variables can take one of a limited number of values, or labels. Examples include a person’s eye color, gender, or home state; a vehicle’s body style (sedan, SUV, minivan, etc.); and a dog’s breed (bulldog, greyhound, beagle, etc.).

Two-way Tables
As we organize and analyze data from two categorical variables, we make extensive use of two-way tables. Two-way tables for two categorical variables are in some ways like scatterplots for two quantitative variables: they give us a useful snapshot of all of the data organized in terms of the two variables of interest. This will be helpful in finding and comparing patterns.

Categorical Variables
The relationship between two categorical variables may be summarized using both
• Two-way tables: compactly summarize the counts and totals across the groups.
• Conditional percentages: show the distribution of the response variable within each value of the explanatory variable. They are calculated separately for each value of the explanatory variable.
When we try to understand the relationship between two categorical variables, we compare the distributions of the response variable across the values of the explanatory variable. In particular, we look at how the pattern of conditional percentages differs between the values of the explanatory variable.

Example of a Two-way Table
What proportion of the total number of students are Bus-Econ students? 0.077 (or 7.7%).
There is about a 0.08 probability that a randomly selected student is a Bus-Econ major.

Probability
• A marginal probability is the probability of a categorical variable taking on a particular value without regard to the other categorical variable, such as P = 6198/12000.
• A conditional probability is the probability of a categorical variable taking on a particular value given the condition that the other categorical variable has some particular value, such as P = 435/925.
• A joint probability is the probability that the two categorical variables each take on a specific value, such as P = 435/12000.
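The three kinds of probability differ mainly in which count goes in the numerator and which total goes in the denominator. The short Python sketch below reuses the counts quoted in the definitions. Because the text does not say which table cells those counts come from, the variable names (grand_total, margin_total, condition_total, cell_count) are generic placeholders, not labels taken from the slides.

```python
# Counts quoted in the definitions above. The variable names are generic
# placeholders (an assumption): the text does not identify the table cells.
grand_total = 12000     # everyone in the two-way table
margin_total = 6198     # row (or column) total for one value of one variable
condition_total = 925   # total for the group we condition on
cell_count = 435        # count in the cell where both values occur together

# Marginal: one variable only, out of everyone.
marginal = margin_total / grand_total

# Conditional: the cell count, out of the conditioning group only.
conditional = cell_count / condition_total

# Joint: the cell count, out of everyone.
joint = cell_count / grand_total

print(f"marginal    = {marginal:.4f}")
print(f"conditional = {conditional:.4f}")
print(f"joint       = {joint:.4f}")
```

Note that the conditional and joint probabilities share the same numerator; only the denominator changes.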

Calculate the probability of a negative outcome
When we calculate the probability of a negative outcome like a heart attack, we often refer to the probability as a risk.

Calculate the probability of a negative outcome
Calculate the probability of a heart attack. The categorical variables in this case are
• Explanatory variable: Treatment (aspirin or placebo)
• Response variable: Medical outcome (heart attack or no heart attack)

Calculate the probability of a negative outcome
Does aspirin lower the risk of having a heart attack? Compare two conditional probabilities:
• The probability of a heart attack given aspirin was taken every other day.
• The probability of a heart attack given a placebo was taken every other day.
• P(heart attack | aspirin) = 139 / 11,037 ≈ 0.013
• P(heart attack | placebo) = 239 / 11,034 ≈ 0.022
The result shows that taking aspirin reduced the risk from 0.022 to 0.013.
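The two risks can be reproduced with a few lines of Python. This is a minimal sketch using the counts from the study as quoted above; the helper name risk is an illustrative choice, not something defined in the slides.

```python
def risk(events: int, group_size: int) -> float:
    """Conditional probability of the outcome within one treatment group."""
    return events / group_size

# Counts quoted in the text: heart attacks and group sizes.
risk_aspirin = risk(139, 11037)   # P(heart attack | aspirin)
risk_placebo = risk(239, 11034)   # P(heart attack | placebo)

print(f"P(heart attack | aspirin) = {risk_aspirin:.3f}")
print(f"P(heart attack | placebo) = {risk_placebo:.3f}")
```

Each risk is computed within its own treatment group, which is what makes it a conditional probability rather than a joint one.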

Calculate the probability of a negative outcome
The result from the heart attack study shows that taking aspirin reduced the risk from 0.022 to 0.013. We often compare two risks by calculating the percentage change: we take the difference (how much the risk changed) and divide by the risk for the placebo group. Here is the calculation:
(0.013 − 0.022) / 0.022 = −0.009 / 0.022 ≈ −0.41
We conclude that taking aspirin results in a 41% reduction in risk.
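The percentage change in risk can be sketched in Python as follows. The function name percent_change is an illustrative choice. The first call uses the rounded risks from the text, which gives the 41% figure; the second uses the unrounded study counts and is included only to show how rounding affects the result.

```python
def percent_change(new_risk: float, baseline_risk: float) -> float:
    """Change in risk relative to the baseline (placebo) risk, in percent."""
    return (new_risk - baseline_risk) / baseline_risk * 100

# Rounded risks, as used in the text.
change_rounded = percent_change(0.013, 0.022)

# Unrounded risks computed directly from the study counts.
change_exact = percent_change(139 / 11037, 239 / 11034)

print(f"using rounded risks:   {change_rounded:.1f}%")
print(f"using unrounded risks: {change_exact:.1f}%")
```

A negative value indicates a reduction. With the unrounded risks the reduction comes out slightly larger (close to 42%), so the reported percentage depends on when the rounding is done.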

Create a hypothetical two-way table to answer more complex questions
Will it be a Boy or a Girl? Assume the following facts are known:
• Fact 1: 48% of the babies born are female.
• Fact 2: The proportion of girls correctly identified is 9 out of 10.
• Fact 3: The proportion of boys correctly identified is 3 out of 4.
Here are the questions we want to answer:
• Question 1: If the examination predicts a girl, how likely is it that the baby will be a girl?
• Question 2: If the examination predicts a boy, how likely is it that the baby will be a boy?

Will it be a Boy or a Girl (continued)
Assume we have ultrasound predictions for 1,000 random babies. Let’s consider Fact 1: 48% of the babies born are female. The bottom row of the table gives the distribution of the categorical variable gender of baby. We can use this fact to compute the total number of girls and boys.
• 48% girls means that 0.48(1,000) = 480 are girls.
• 100% − 48% = 52% are boys, so 0.52(1,000) = 520 are boys.

Will it be a Boy or a Girl (continued)
Fact 2: The proportion of girls correctly identified is 9 out of 10.
• 9 out of 10 is 90%
• 90% of the girls are correctly identified: 0.90(480) = 432
• 10% of the girls are misidentified (predicted to be a boy): 0.10(480) = 48
Fact 3: The proportion of boys correctly identified is 3 out of 4.
• 3 out of 4 is 75%
• 75% of the boys are correctly identified: 0.75(520) = 390
• 25% of the boys are misidentified (predicted to be a girl): 0.25(520) = 130

Will it be a Boy or a Girl (continued)
Question 1: If the exam predicts a girl, how likely is it the baby will be a girl?
Answer: This is the conditional probability P(girl | predict girl). A girl is predicted for 432 + 130 = 562 babies, and 432 of them are girls.
• So our answer to Question 1 is P(girl | predict girl) = 432 / 562 ≈ 0.769.
Question 2: If the exam predicts a boy, how likely is it the baby will be a boy?
Answer: This is the conditional probability P(boy | predict boy). A boy is predicted for 390 + 48 = 438 babies, and 390 of them are boys.
• So our answer to Question 2 is P(boy | predict boy) = 390 / 438 ≈ 0.890.
Conclusion: If an ultrasound examination predicts a girl, the prediction is correct about 77% of the time. In contrast, when the prediction is a boy, it is correct about 89% of the time.
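The whole hypothetical-table procedure can be sketched in Python: build the four cell counts from the three facts, then divide a cell by the total for its prediction. The function names hypothetical_table and prob_given_prediction and the dictionary layout are illustrative assumptions, not part of the original material.

```python
def hypothetical_table(n_babies, share_girls, girl_accuracy, boy_accuracy):
    """Return the four cell counts, keyed by (actual, prediction)."""
    girls = share_girls * n_babies
    boys = n_babies - girls
    return {
        ("girl", "predict girl"): girl_accuracy * girls,
        ("girl", "predict boy"): (1 - girl_accuracy) * girls,
        ("boy", "predict boy"): boy_accuracy * boys,
        ("boy", "predict girl"): (1 - boy_accuracy) * boys,
    }

def prob_given_prediction(table, actual, prediction):
    """P(actual | prediction): cell count over the total for that prediction."""
    prediction_total = sum(
        count for (_, pred), count in table.items() if pred == prediction
    )
    return table[(actual, prediction)] / prediction_total

# Facts 1-3 from the text, applied to 1,000 hypothetical babies.
table = hypothetical_table(1000, 0.48, 0.90, 0.75)

p_girl = prob_given_prediction(table, "girl", "predict girl")
p_boy = prob_given_prediction(table, "boy", "predict boy")

print(f"P(girl | predict girl) = {p_girl:.3f}")
print(f"P(boy | predict boy)  = {p_boy:.3f}")
```

The choice of 1,000 babies is arbitrary: the conditional probabilities are ratios of counts, so any table size gives the same answers.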

Probability
To summarize, we defined three kinds of probabilities related to a two-way table:
• A marginal probability is the probability of a categorical variable taking on a particular value without regard to the other categorical variable.
• A conditional probability is the probability of a categorical variable taking on a particular value given the condition that the other categorical variable has some particular value.
• A joint probability is the probability that the two categorical variables each take on a specific value.
When we calculate the probability of a negative outcome, we often refer to the probability as a risk.

Quick Review
• What is joint probability?
• When we calculate the probability of a negative outcome, what do we refer to the probability as?
• What is conditional probability?
• What do we create to compute complex probabilities?
• When we investigate the relationship between two categorical variables, what do we use to define the comparison groups?
• What can be used to summarize the relationship between two categorical variables?
