BUSH 632: Getting Beyond Fear and Loathing of Statistics

BUSH 632: Getting Beyond Fear and Loathing of Statistics Lecture 1 Spring, 2007

Don’t Panic • Motivation: this course is about the connection between theoretical claims and empirical data • What we’ll cover (after a very brief review): • Part 1: bivariate regression • Part 2: multivariate regression • Part 3: logit analysis and factor analysis

The place of statistical analysis • Programs, policies, legislation typically consist of sets of normative claims and a (sketchy?) theory about how to achieve objectives • Policies typically attempt to map a set of beliefs and empirical claims into society, the economy, international relations. (E.g., welfare reform) • Policy analysts need to be able to identify the values served, distill the theory, and evaluate its empirical claims.

The place of statistical analysis • Ingredients of strong empirical research • Theory → claims for policy (and counter-claims) • Hypotheses → measurement → analysis • Findings → back to theory… • Implications for policy • Characterizing data • Data quality: Valid? Reliable? Relevant? • Appropriate model design and execution • Are statistical models appropriate to test hypotheses? • Are models appropriately specified? • Do data conform to statistical assumptions?

How to survive this class • Use the webpage • http://www.tamu.edu/classes/bush/hjsmith/courses/bush632.html • Lectures and book: kept as close to each other as possible • Readings: Read ’em or weep. • Questions: Bring ’em to class or office hours • Stata: Use it a lot • In-class examples and exercises • Download exercises and data in advance • The place of exercises in Bush 632 • No late work; don’t miss class…

Class Exams • Three Take-Home Exams • Characteristics and Grading Criteria • Connection to theory • Clear hypotheses • Appropriate statistical analyses • Clear and succinct explanations • Class Data Will Be Provided • From the text • www.aw-bc.com/stock_watson • From Us • On the Class Webpage

A Brief Refresher on Functions and Sampling • Statistical models involve relationships • Relationships imply functions • E.g.: coffee consumption and productivity • Functions are ubiquitous (or chaos prevails) • Most general expression: $Y = f(X_1, X_2, \ldots, X_n, e)$
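As a rough illustration of the general expression, the sketch below writes productivity as a function of coffee consumption plus an error term, once in a linear form and once in a non-linear form. The functional forms, the coefficients, and the names `linear_productivity` and `nonlinear_productivity` are invented for illustration, and reading e as a random error term is an assumption; none of this comes from the lecture or from data.

```python
import random

def linear_productivity(cups, e=0.0):
    """Hypothetical linear function: Y = 50 + 5*X + e."""
    return 50.0 + 5.0 * cups + e

def nonlinear_productivity(cups, e=0.0):
    """Hypothetical inverted-U function: Y = 50 + 12*X - 2*X^2 + e."""
    return 50.0 + 12.0 * cups - 2.0 * cups ** 2 + e

random.seed(632)
for cups in range(0, 7):
    e = random.gauss(0.0, 1.0)  # assumed: e is an unobserved random error
    print(cups,
          round(linear_productivity(cups, e), 1),
          round(nonlinear_productivity(cups, e), 1))
```

In the linear version each extra cup adds the same amount to Y; in the non-linear version the effect of an extra cup depends on how many cups came before it.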

Linear Functions

Non-Linear Functions

More Non-Linear Functions

Functions in Policy • Welfare and work incentives • Employment = f(welfare programs, …) Pretty complex • Nuclear deterrence • Major power military conflict = f(nuclear capabilities, proliferation, …) • Educational Attainment • Test Scores = f(class size, institutional incentives, …) • Successful Program Implementation • Implementation = f(clarity, public support, complexity…)

Sampling is also ubiquitous • “Knowing” a person: we sample • “Knowing” places: we sample • Samples are necessary to identify functions • Samples must cover relevant variables, contexts, etc. • Strategies for sampling • Soup and temperature: stir it • Stratify sample: observations in appropriate “cells” • Randomize
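The difference between randomizing and stratifying can be sketched in a few lines. The population below (90 urban and 10 rural respondents), the seed, and the function names are hypothetical; the point is only that a simple random sample leaves the count in each cell to chance, while a stratified sample fixes it in advance.

```python
import random
from collections import Counter

def simple_random_sample(units, k, rng):
    """Draw k units with equal probability, without replacement."""
    return rng.sample(units, k)

def stratified_sample(units, stratum_of, per_stratum, rng):
    """Draw per_stratum units at random from within each stratum ("cell")."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], per_stratum))
    return sample

# Hypothetical population: 90 urban and 10 rural respondents.
population = [("urban", i) for i in range(90)] + [("rural", i) for i in range(10)]
rng = random.Random(2007)

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(population, lambda u: u[0], 5, rng)

srs_counts = Counter(label for label, _ in srs)
strat_counts = Counter(label for label, _ in strat)
print("simple random:", dict(srs_counts))
print("stratified:   ", dict(strat_counts))
```

With a rare group such as the rural respondents here, a simple random sample of 10 may contain very few of them, which is the situation stratification is meant to prevent.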

Statistics Refresher: Topics • Characteristics of sampling distributions • Class Data • 2005 National Security Survey (phone and web) • Stata application • Means, Variance, Standard Deviations • The Normal Distribution • Medians and IQRs • Box Plots and Symmetry Plots • Central tendency • Expected value and means • Dispersion • Population variance, sample variance, standard deviations • Measures of relations • Covariation • covariance matrices • Correlations • Sampling distributions

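Measures of Central Tendency • In general: $E[Y] = \mu_Y$ • For discrete functions: $E[Y] = \sum_{i} y_i \, P(Y = y_i)$ • For continuous functions: $E[Y] = \int_{-\infty}^{\infty} y \, f(y) \, dy$ • An unbiased estimator of the expected value: $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$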
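A small worked case may help connect the expected value to its estimator. The sketch below uses a fair six-sided die (an example chosen here, not taken from the lecture): the expected value is computed exactly as a probability-weighted sum, and the sample mean of simulated rolls is then compared with it. The function names and the seed are illustrative.

```python
import random
from fractions import Fraction

def expected_value(pmf):
    """E[Y] = sum over outcomes of y * P(Y = y), for a discrete pmf given as {y: p}."""
    return sum(y * p for y, p in pmf.items())

def sample_mean(values):
    """Y-bar = (1/n) * sum of Y_i, the usual estimator of E[Y]."""
    return sum(values) / len(values)

die = {face: Fraction(1, 6) for face in range(1, 7)}  # fair six-sided die
mu = expected_value(die)                              # exact value, as a fraction

rng = random.Random(1)
rolls = [rng.randint(1, 6) for _ in range(10_000)]
print(float(mu), sample_mean(rolls))  # the sample mean should land close to mu
```

The sample mean will not equal the expected value exactly in any one run; unbiasedness is a statement about its average over repeated samples.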

Rules for Expected Value (a, b constants; X, W random variables) • E[a] = a: the expected value of a constant is the constant itself • E[bX] = bE[X] • E[X+W] = E[X] + E[W] • E[a + bX] = E[a] + E[bX] = a + bE[X]

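Measures of Dispersion • $\mathrm{Var}[X] = \mathrm{Cov}[X,X] = E\big[(X - E[X])^2\big]$ • Sample variance: $s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$ • Standard deviation: $\sigma_X = \sqrt{\mathrm{Var}[X]}$ • Sample Std. Dev: $s_X = \sqrt{s_X^2}$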
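To make the sample formulas concrete, the sketch below computes a sample variance and standard deviation by hand, using the usual n - 1 divisor, and contrasts them with the divide-by-n version that treats the data as the whole population. The eight data values are made up so that the arithmetic is easy to follow.

```python
import math
import statistics

def sample_variance(xs):
    """s^2 = sum((x - xbar)^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]      # made-up values with mean 5
s2 = sample_variance(data)           # squared deviations sum to 32, so 32 / 7
s = math.sqrt(s2)                    # sample standard deviation
sigma2 = statistics.pvariance(data)  # divides by n instead: 32 / 8

print(s2, s, sigma2)
```

The hand-written function should agree with `statistics.variance`, which also divides by n - 1.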

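Rules for Variance Manipulation • $\mathrm{Var}[a] = 0$ • $\mathrm{Var}[bX] = b^2\,\mathrm{Var}[X]$ • From which we can deduce: $\mathrm{Var}[a+bX] = \mathrm{Var}[a] + \mathrm{Var}[bX] = b^2\,\mathrm{Var}[X]$ • $\mathrm{Var}[X + W] = \mathrm{Var}[X] + \mathrm{Var}[W] + 2\,\mathrm{Cov}[X,W]$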
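These rules also hold exactly for sample variances and covariances, which makes them easy to check numerically. The sketch below does so on simulated data; the constants a and b, the seed, and the way W is constructed to be correlated with X are arbitrary choices made for illustration.

```python
import math
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ws):
    """Sample covariance with the n - 1 divisor; cov(xs, xs) is the sample variance."""
    xbar, wbar = mean(xs), mean(ws)
    return sum((x - xbar) * (w - wbar) for x, w in zip(xs, ws)) / (len(xs) - 1)

def var(xs):
    return cov(xs, xs)

rng = random.Random(35)
x = [rng.gauss(0, 1) for _ in range(500)]
w = [0.5 * xi + rng.gauss(0, 1) for xi in x]  # W is correlated with X by construction
a, b = 3.0, -2.0                              # arbitrary constants

lhs_scale = var([a + b * xi for xi in x])     # Var[a + bX]
rhs_scale = b ** 2 * var(x)                   # b^2 Var[X]

lhs_sum = var([xi + wi for xi, wi in zip(x, w)])  # Var[X + W]
rhs_sum = var(x) + var(w) + 2 * cov(x, w)         # Var[X] + Var[W] + 2 Cov[X, W]

print(lhs_scale, rhs_scale)
print(lhs_sum, rhs_sum)
```

Each printed pair should match up to floating-point rounding. Dropping the covariance term from the last line would only be harmless if X and W were uncorrelated.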

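Measures of Association • $\mathrm{Cov}[X,Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$ • Sample covariance: $s_{XY} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$ • Correlation: $\rho_{XY} = \frac{\mathrm{Cov}[X,Y]}{\sigma_X \sigma_Y}$, estimated by $r_{XY} = \frac{s_{XY}}{s_X s_Y}$ • Correlation is bounded between -1 and +1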
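One reason to prefer the correlation over the covariance is that the covariance changes with the units of measurement while the correlation does not. The sketch below illustrates this with a handful of hypothetical education and earnings values (invented here, not drawn from the class data), converting earnings from dollars to cents.

```python
import math

def covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Sample correlation: covariance divided by the product of standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

educ = [10, 12, 12, 14, 16, 18]              # hypothetical years of education
wage = [9.0, 11.5, 10.0, 15.0, 14.0, 22.0]   # hypothetical hourly earnings, dollars
wage_cents = [100 * w for w in wage]         # same earnings, different units

cov_dollars = covariance(educ, wage)
cov_cents = covariance(educ, wage_cents)     # 100 times larger
r = correlation(educ, wage)
r_cents = correlation(educ, wage_cents)      # unchanged by the rescaling

print(cov_dollars, cov_cents)
print(r, r_cents)
```

A variable that is an exact increasing linear function of another has a correlation of +1 with it, and an exact decreasing one has -1, which are the two ends of the range.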

Rules of Covariance Manipulation • Cov[a,Y] = 0 (why?) • Cov[bX,Y] = bCov[X,Y] (why?) • Cov[X + W,Y] = Cov[X,Y] + Cov[W,Y]
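One way to answer the "why?" prompts is to substitute into the definition Cov[X,Y] = E[(X - E[X])(Y - E[Y])] and apply the rules for expected values. The derivation below is a sketch of that argument, with a and b constants.

```latex
\begin{align*}
\mathrm{Cov}[a, Y] &= E\big[(a - E[a])(Y - E[Y])\big]
                    = E\big[0 \cdot (Y - E[Y])\big] = 0 \\
\mathrm{Cov}[bX, Y] &= E\big[(bX - E[bX])(Y - E[Y])\big] \\
                    &= E\big[b\,(X - E[X])(Y - E[Y])\big]
                     = b\,\mathrm{Cov}[X, Y] \\
\mathrm{Cov}[X + W, Y] &= E\big[\big((X - E[X]) + (W - E[W])\big)(Y - E[Y])\big] \\
                       &= \mathrm{Cov}[X, Y] + \mathrm{Cov}[W, Y]
\end{align*}
```

The first line uses E[a] = a, the second uses E[bX] = bE[X], and the third uses E[X + W] = E[X] + E[W].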

Covariance Matrices
Correlation Matrices (Example)

    . correlate ahe yrseduc
    (obs=2950)

                 |      ahe  yrseduc
    -------------+------------------
             ahe |   1.0000
         yrseduc |   0.3610   1.0000

Figure 5.3 Annual Hourly Earnings and Years of Education (Stock & Watson p. 165)

Characterizing Data • Rolling in the data (before modeling) • A Cautionary Tale • Sample versus population statistics

Concept              Sample Statistic    Population Parameter
Mean                 $\bar{Y}$           $\mu_Y$
Variance             $s_Y^2$             $\sigma_Y^2$
Standard Deviation   $s_Y$               $\sigma_Y$

Properties of Standard Normal (Gaussian) Distributions • Can differ dramatically from sample frequency distributions, especially in small samples (Stata) • Tails go to plus/minus infinity • The density of the distribution is key: ±1.96 standard deviations cover 95% of the distribution; ±2.58 standard deviations cover 99% of the distribution • Student’s t distribution converges on the Gaussian as the sample size grows
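The 95% and 99% figures can be checked directly from the standard normal distribution function. The sketch below uses `NormalDist` from Python's standard library rather than Stata, which is a substitution made here for the sake of a self-contained example; the helper name `central_coverage` is illustrative.

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)  # standard normal

def central_coverage(k):
    """Probability that a standard normal draw falls within +/- k standard deviations."""
    return z.cdf(k) - z.cdf(-k)

cover_196 = central_coverage(1.96)
cover_258 = central_coverage(2.58)
z_975 = z.inv_cdf(0.975)           # cutoff that leaves 2.5% in each tail

print(round(cover_196, 4), round(cover_258, 4), round(z_975, 3))
```

The inverse calculation on the last line runs the logic the other way: it starts from the desired tail area and recovers the cutoff.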

Standard Normal (Gaussian) Distributions [Figure: three panels of sample distributions, labeled n_i = 300, n_i = 100, and n_i = 20] • So what? • Only the mean and standard deviation are needed to characterize the data and test simple hypotheses • Large-sample characteristics: homing in on normal

Order Statistics • Medians • Order statistic for central tendency • The value positioned at the middle, or (n+1)/2, rank • Robustness compared to the mean • Basis for “robust estimators” • Quartiles • Q1, Q2 (the median), and Q3 are the 25th, 50th, and 75th percentiles; they cut the ordered data into four quarters: 0-25%, 25-50%, 50-75%, 75-100% • Percentiles • List of hundredths (say that fast 20 times)
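The robustness of the median is easiest to see by changing a single extreme value. In the sketch below, five made-up values are summarized before and after the largest one is replaced by an outlier. Quartiles are computed as the three cut points between the four quarters, using the default convention of Python's `statistics.quantiles`; other software, Stata included, may interpolate slightly differently.

```python
import statistics

incomes = [2, 3, 5, 7, 11]         # made-up values, already sorted
with_outlier = [2, 3, 5, 7, 1100]  # same data, largest value replaced

def middle_rank(n):
    """Rank of the median, (n + 1) / 2; a half-rank means averaging two neighbours."""
    return (n + 1) / 2

med_before = statistics.median(incomes)
med_after = statistics.median(with_outlier)
mean_before = statistics.mean(incomes)
mean_after = statistics.mean(with_outlier)

# Three cut points that split the data into four equal-count groups.
q1, q2, q3 = statistics.quantiles(incomes, n=4)

print(middle_rank(len(incomes)), med_before, med_after)
print(mean_before, mean_after)
print(q1, q2, q3)
```

The median stays at the third-ranked value in both versions of the data, while the mean is pulled far toward the outlier.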

Distributional Shapes • Positive Skew • Negative Skew • Approximate Symmetry [Figures: one panel per shape, each marking the median, Md_Y]

Using the Interquartile Range (IQR) • IQR = Q3 - Q1 • Spans the middle 50% of the data • A measure of dispersion (or spread) • Robustness of IQR (relative to variance) • If Y is normally distributed, then $s_Y \approx \mathrm{IQR}/1.35$ • So: if $Md_Y \approx \bar{Y}$ and $s_Y \approx \mathrm{IQR}/1.35$, then Y is approximately normally distributed (a rough diagnostic rather than a formal test)
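The 1.35 divisor comes from the width of the middle 50% of a standard normal distribution, and the two-part check can be tried on simulated data. The sketch below is one possible implementation; the function name `normality_rough_check`, the seeds, and the two simulated samples (one normal, one exponential and therefore positively skewed) are choices made here for illustration.

```python
import random
import statistics
from statistics import NormalDist

# Theoretical IQR of a standard normal: the source of the 1.35 divisor.
z = NormalDist()
normal_iqr = z.inv_cdf(0.75) - z.inv_cdf(0.25)

def iqr(xs):
    """Q3 - Q1, using the default quartile convention of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

def normality_rough_check(xs):
    """Return (mean - median, sd / (IQR / 1.35)); values near (0, 1) are consistent with normality."""
    sd = statistics.stdev(xs)
    return statistics.mean(xs) - statistics.median(xs), sd / (iqr(xs) / 1.35)

rng = random.Random(53)
normal_sample = [rng.gauss(10, 2) for _ in range(5000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(5000)]  # positively skewed

print(round(normal_iqr, 3))
print(normality_rough_check(normal_sample))
print(normality_rough_check(skewed_sample))
```

For the skewed sample the mean sits above the median and the standard deviation exceeds IQR/1.35, so both parts of the check flag the departure from normality.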

Example: The Observed Distribution of Annual Household Income (Distribution of income by gender: men=1, women=2)

Interpreting Box Plots Median Income = 15.38 (men), 14.34 (women)

Quantile Normal Plots • Allow comparison between an empirical distribution and the Gaussian distribution • Plot the observed percentiles against those expected under a normal distribution • Most intuitive: • Normal QQ plots • Evaluate
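The points behind a quantile-normal plot can be computed without any plotting library. The sketch below pairs each sorted observation with the value a normal distribution (with the sample's own mean and standard deviation) would put at the same rank. The plotting position (i - 0.5)/n is one common convention and is an assumption here; Stata's `qnorm` may use a different one. The `max_gap` summary is an illustrative device, not a standard statistic.

```python
import random
from statistics import NormalDist, mean, stdev

def normal_qq_points(xs):
    """Pair each sorted observation with the normal quantile expected at its rank."""
    n = len(xs)
    ref = NormalDist(mean(xs), stdev(xs))  # normal with the sample's mean and sd
    ordered = sorted(xs)
    # (i - 0.5) / n is one common plotting position; other conventions exist.
    return [(ref.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(ordered, start=1)]

def max_gap(points):
    """Largest vertical distance from the 45-degree line, in units of the data."""
    return max(abs(observed - expected) for expected, observed in points)

rng = random.Random(59)
normal_pts = normal_qq_points([rng.gauss(0, 1) for _ in range(2000)])
skewed_pts = normal_qq_points([rng.expovariate(1.0) for _ in range(2000)])

print(max_gap(normal_pts), max_gap(skewed_pts))
```

If the data are roughly normal the points fall near the 45-degree line; for the skewed sample they bend away from it, most visibly in the tails.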

Data Exploration in Stata • Access the Guns dataset from the replication data on the Stock and Watson webpage • Using incarceration rate: univariate analysis (Stata) • Using incarceration rate: split by shall-issue laws (Stata) • Exercises: • Graphing: Produce • Histograms • Box plots • Q-Normal plots

For Next Week • Read Stock and Watson • Chapter 4 • Homework Assignment on Webpage
