**Advanced Research Methods: Regression Analysis Theory and Modeling** By Erlan Bakiev, Ph.D.

**Faculty and Text** • Textbooks: • Keith, T. Z. (2006). Multiple Regression and Beyond. Pearson. • Montgomery, Peck, and Vining (2006). Introduction to Linear Regression Analysis, 4th ed.

**Lecture Outline** • Overview of Regression Analysis • Example • Guidelines for Group Project (due week 5 class)

**Goals of Today’s Lecture** • What is regression analysis? • Introduction to the most widely used statistical models • Linear regression • Logistic regression • How these models are used to analyze data and inform decisions • When different models are appropriate • How to fit, interpret, and assess different models • Practice with different data sets

**A Note on SPSS** • SPSS stands for Statistical Package for the Social Sciences, and it is one of the most widely used programs for statistical analysis in the social sciences • Widely used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations, and others • User-friendly • You will use a basic set of commands for model fitting • Current version of SPSS is 18

**Regression Analysis** • Regression analysis is a statistical tool for the investigation of relationships between variables • The investigator seeks to understand the causal effect of one variable upon another, for example: • the effect of a price increase upon demand, • the effect of changes in the money supply upon the inflation rate.

**Types of Regression Models** • Two types of regression models • Linear regression • Simple linear regression (continuous Y, one X) • Multiple linear regression (continuous Y, several Xs) • Logistic regression • Binary Y, several Xs • Linear regression forms the basis for understanding most regression techniques

**Steps in a Regression Analysis** • Explore the data graphically • Select a tentative structure • Estimate the structure and its uncertainty • Assess the plausibility of the tentative structure • Use the estimated structure for your inferences (suitably qualified by the estimated uncertainty)

**Simple Regression** • Regression analysis with a single explanatory variable is termed “simple regression.” • Ex: Education-income relationship • I = α + βE + ε, where • I = income and E = years of education; • α = a constant amount (what one earns with zero education); • β = the effect in dollars of an additional year of schooling on income, hypothesized to be positive; and • ε = the “noise” term reflecting other factors that influence earnings.

**Multiple Regression** • It is a technique that allows additional factors to enter the analysis separately so that the effect of each can be estimated. • Several independent variables can influence a dependent variable simultaneously. • Ex: Influence of education and experience on income • The model is as follows: I = α + βE + γX + ε, where X is experience and γ is expected to be positive

**Model-Based Data Analysis** • Data sets consist of a set of observations • Each observation contains • One dependent variable (“Y”) • One or more independent variables (“Xs”) • We want to determine if there is a structure to the data set • Relationship between the response and one or more predictors

**Response = structure(predictors) + “error”** • Structure • Varies from observation to observation • Function of predictors • Systematic, deterministic • “Error” • The error of an observation is the deviation of the observed value from the (unobservable) true function value.

**Modeling Process** • Fitting a model then requires: • Selecting a tentative structure • Estimating the structure and its uncertainty • Typically as statistics and their sampling distributions • Assessing plausibility of the choice of structure • Model diagnostics

**Characterizing the Relationship Between Two Variables** • Outline • Types of variables • Correlation • Graphical Methods • Linear Regression

**Independent and Dependent Variables** • Each observation has two parts: dependent Y and one or more independent Xs • We are interested in the relationship of X and Y • At a minimum, X and Y vary together • X and Y are associated • Statistical relationship does not imply causation (Gujarati, D. N. (2003). Basic Econometrics, International Edition, 4th ed. McGraw-Hill Higher Education, pp. 22-24. ISBN 0-07-112342-3.)

**Examples** • X: income Y: consumption • X: parents’ SES Y: education level • X: education level Y: income • X: range to target Y: probability of hit/kill • The statistical methods we will study establish association • Association does not entail causality

**Types of Variables** • Continuous: variable can take any value in a (possibly infinite) range • Money, height, blood pressure, weight • Discrete: variable takes on a countable set of numerical values (often finite and small) • People in a queue, hits on a target • Ordinal: Variable has a finite set of non-numeric, but ordered values (categorical) • Level of schooling, rating • Nominal: Variable is finite, non-numeric, non-ordered • Religion, gender

**Methods of Examining Relationships**

| Method | X | Y |
| --- | --- | --- |
| Simple Regression | continuous | continuous |
| Multiple Regression | continuous, discrete | continuous |
| Logistic Regression | continuous, discrete | nominal |

• Other combinations are possible • These methods form the foundation of most other methods

**Continuous X, Continuous Y** • Simplest case • Correlation • Closely related to simple regression • Widely used in some substantive areas to characterize association • Graphical methods • Essential exploratory and diagnostic tool

**A Graphic Illustration**

**Correlation** • Definition • Attributes • Measures how X and Y vary together linearly • Scale-free • Ranges from –1 to +1 • Zero correlation is not necessarily independence • Note that this is pure association, no specification of response/dependent and predictor/independent variables
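
The attributes listed above can be checked numerically. The following is a minimal sketch of the usual sample (Pearson) correlation coefficient, assuming that is the definition intended on this slide; the function name `pearson_r` and the small data set are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, symmetric around zero.
x = [-2, -1, 0, 1, 2]
linear_y = [3 * v + 1 for v in x]         # exact straight-line relationship
rescaled_y = [100 * v for v in linear_y]  # same relationship, different units
curved_y = [v ** 2 for v in x]            # Y fully determined by X, but not linearly

print("linear:  ", pearson_r(x, linear_y))
print("rescaled:", pearson_r(x, rescaled_y))
print("curved:  ", pearson_r(x, curved_y))
```

The curved case is the point of the "zero correlation is not necessarily independence" bullet: Y depends entirely on X, yet the linear measure of association is zero.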

**Examples of Correlation**

**Problems with Correlation** • Correlation is a single number summary of association • But • Outliers can dominate the value (it is not robust) • Zero correlation does not mean no relation • By itself, it does not allow the prediction of Y from X • This is usually of interest
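
To illustrate the lack of robustness, the sketch below adds a single extreme point to a small data set. The numbers are hypothetical and deliberately exaggerated, and `pearson_r` is an illustrative helper rather than an SPSS function.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Five hypothetical points with a perfectly decreasing relationship.
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
r_without_outlier = pearson_r(x, y)

# The same points plus one extreme observation far from the rest.
r_with_outlier = pearson_r(x + [30], y + [30])

print("without outlier:", round(r_without_outlier, 3))
print("with outlier:   ", round(r_with_outlier, 3))
```

In this constructed case one observation out of six moves the correlation from -1 to roughly +0.97, which is why a scatterplot should accompany any reported correlation.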

**Graphical Methods: Two-Way Scatterplot (1993 Car Data)**

    infile _skip(11) EngSiz _skip(12) Wt _skip(1) using 93cars.dat
    scatter EngSiz Wt, ytitle("Engine Size (l)") xtitle("Weight (lbs)")

**Linear Regression** • In simple linear regression, we use the following model for the expected value of y given x: • E(y|x) = β0 + β1x • Relationship characterized by two numbers • Straight line • Wide applicability in practical situations • Many relationships approximately linear (or can be made so) • Forms the basis for more sophisticated analysis

**The “Best” Straight Line** • How do we construct the best line? • Some ideas • Minimize sum of distances from each Y to the line • Minimize sum of absolute values of distances from each Y to the line • Minimize sum of squared distances from each Y to the line (least squares)
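
The third idea, minimizing the sum of squared vertical distances, has a closed-form solution. The sketch below is one way to compute it in plain Python; the function names and the five data points are hypothetical, and in the course itself SPSS performs this arithmetic.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_distances(xs, ys, intercept, slope):
    """Sum of squared vertical distances from each y to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data scattered around a straight line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
best = sum_squared_distances(x, y, b0, b1)

# Any other line, such as one with a slightly shifted slope, does no better.
other = sum_squared_distances(x, y, b0, b1 + 0.1)
print("fitted line:", b0, b1)
print("criterion at fitted line:", best, "at shifted line:", other)
```

Squared distances are the conventional choice partly because they lead to this simple formula; the absolute-value criterion has no comparable closed form.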

**Regression Line for Car Data**

    regress EngSiz Wt
    predict pengsiz
    scatter EngSiz Wt, ytitle("Engine Size (l)") xtitle("Weight (lbs)") || line pengsiz Wt, clc(black)

Fitted line: EngSiz = -1.90 + 0.0015 Weight

**Interpretation of Coefficients** • In many applications we are interested in the coefficients directly • β1 is the change in expected value of y for a unit change in x (slope) • β1 = 0 means that x is not related to change in mean of y • β0 is the value of E(y) for x=0 (intercept) • Very often of little interest because it reflects choice of origin • Often outside range of x data values
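
As a worked illustration, the sketch below applies these interpretations to the fitted car equation reported earlier, EngSiz = -1.90 + 0.0015 Weight. The coefficients come from that slide; the function name and the example weights are hypothetical.

```python
# Coefficients from the fitted line for the 1993 car data.
INTERCEPT = -1.90  # liters, the fitted value at Weight = 0
SLOPE = 0.0015     # liters of engine size per additional pound of weight

def predicted_engine_size(weight_lbs):
    """Expected engine size in liters for a car of the given weight."""
    return INTERCEPT + SLOPE * weight_lbs

# Slope: two hypothetical cars that differ by 1000 lbs.
lighter = predicted_engine_size(2500)
heavier = predicted_engine_size(3500)
print("difference for 1000 lbs:", round(heavier - lighter, 2), "liters")

# Intercept: the fitted value at zero weight is negative, which is not a
# meaningful engine size. Zero is far outside the range of observed weights.
print("fitted value at weight 0:", predicted_engine_size(0))
```

The slope says that cars 1000 lbs heavier have, on average, engines about 1.5 liters larger, while the intercept only anchors the line and has no physical reading here.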

**Fit a Regression Line: We Can Graph the Result**

**What We Need to Have to Do Inference** • Construct 95% confidence interval for the slope • Get p-values for the t-statistic • This will allow us to state whether there is a statistically significant association between the independent and dependent variables • Test the hypothesis that β1 = 0 • SPSS, of course, does the arithmetic • In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. The lower the p-value, the less likely the result is if the null hypothesis is true, and consequently the more "significant" the result is, in the sense of statistical significance.
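
For reference, the quantities behind these steps can be written out for simple linear regression. This is the standard textbook form, given here as a sketch under the usual assumptions (independent errors with constant variance, approximately normal), using the β0, β1 notation of the earlier slides; hats denote estimates and n is the number of observations.

```latex
\begin{aligned}
\hat{\sigma}^2 &= \frac{1}{n-2}\sum_{i=1}^{n}\bigl(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\bigr)^2, \\
\operatorname{SE}(\hat{\beta}_1) &= \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}, \\
t &= \frac{\hat{\beta}_1}{\operatorname{SE}(\hat{\beta}_1)}
  \quad \text{(test of } \beta_1 = 0 \text{ with } n-2 \text{ degrees of freedom)}, \\
\text{95\% CI:}\quad & \hat{\beta}_1 \pm t_{0.975,\,n-2}\,\operatorname{SE}(\hat{\beta}_1).
\end{aligned}
```

The p-value is then the probability, under the null hypothesis, of a t-statistic at least as large in absolute value as the one observed.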

**What does it mean to “explain” variation?** • General idea • Want to assess whether the regression line “explains” the data better than simpler alternative models • Variation left after model fit is “unexplained” • R² indicates what percent of variability in Y is accounted for by the regression model • Some properties of R² • 0 ≤ R² ≤ 1 • R² = 0: Regression line is horizontal • R² = 1: Data fits perfectly on regression line
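
The two boundary cases can be reproduced with a short calculation. The sketch below computes R² as one minus the ratio of unexplained to total variation, assuming that common definition; the function name and data are hypothetical.

```python
def r_squared(ys, fitted):
    """Share of the variation in ys accounted for by the fitted values."""
    mean_y = sum(ys) / len(ys)
    total = sum((y - mean_y) ** 2 for y in ys)                  # variation around the mean
    unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # variation left after the fit
    return 1 - unexplained / total

# Hypothetical responses.
y = [1, 3, 5, 7]

# Case 1: the fitted values reproduce the data exactly.
perfect_fit = list(y)

# Case 2: a horizontal line at the mean of y, the simplest alternative model.
mean_y = sum(y) / len(y)
horizontal_fit = [mean_y] * len(y)

print("perfect fit:   ", r_squared(y, perfect_fit))
print("horizontal fit:", r_squared(y, horizontal_fit))
```

The horizontal line at the mean is the "simpler alternative model" against which the regression line is judged.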

**Multiple Regression Models** • Extension of simple regression: E(y|x1, …, xk) = β0 + β1x1 + … + βkxk • Together the βi are called the regression coefficients
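
One way to see how the regression coefficients are estimated together is to solve the least-squares normal equations directly. The sketch below does this in plain Python for the education and experience example used earlier; the helper names and the six data rows are hypothetical, and the incomes are constructed without noise so the known coefficients are recovered exactly.

```python
def solve_linear_system(a, b):
    """Solve a * z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - known) / m[r][r]
    return z

def fit_multiple_regression(rows, ys):
    """Return [b0, b1, ..., bk] from predictor rows [x1, ..., xk]."""
    design = [[1.0] + list(row) for row in rows]  # column of ones for b0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical (education, experience) pairs in years.
predictors = [(12, 2), (12, 10), (16, 1), (16, 8), (18, 5), (20, 3)]
# Incomes built as 5 + 2 * education + 1 * experience, with no noise.
income = [5 + 2 * e + 1 * x for e, x in predictors]

coefficients = fit_multiple_regression(predictors, income)
print("intercept, education, experience:", coefficients)
```

With real data the incomes would include noise and the estimates would only approximate the underlying coefficients; statistical packages use numerically more careful methods than this direct solution.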

**In Policy Analysis Applications Independent Variables Are Very Important** • Independent / Categorical / Dummy Variables • Often at least part of our data is qualitative • Subject is male or female • Location is urban or rural • Others include ethnic group, insured vs. uninsured, high school vs. post-high school • Etc., etc., etc. • In much of the modeling in health most of the variables are qualitative • We need to generate the 0,1 indicators
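
Generating the 0,1 indicators is mechanical. The sketch below shows one way to do it in plain Python; the function names, category labels, and the choice of baseline are hypothetical. For a variable with more than two categories, a common convention (assumed here) is to create one indicator per category except a baseline, so that the indicators are not redundant with the intercept.

```python
def indicator(values, level):
    """1 where the observation equals the given level, else 0."""
    return [1 if v == level else 0 for v in values]

def dummy_columns(values, baseline):
    """One 0/1 column per category other than the baseline category."""
    levels = sorted(set(values) - {baseline})
    return {level: indicator(values, level) for level in levels}

# Two-category variable: a single indicator is enough.
location = ["urban", "rural", "urban", "rural"]
urban = indicator(location, "urban")
print("urban:", urban)

# Three-category variable (hypothetical group labels): two indicators,
# with group "A" serving as the baseline.
group = ["A", "B", "C", "A"]
print(dummy_columns(group, baseline="A"))
```

Each indicator's coefficient is then read as the difference in the expected response between that category and the baseline, holding the other predictors fixed.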

**Will the LM Approach Work for Other Types of Data?** • Suppose we have 0,1 data • Success/failure, die/live, leave military/stay, ... • Suppose we want to link the probability of success, p, with covariates (age, income, disease, etc.) • Least squares won’t work nicely … let’s take a look!

**Example** • We want to know whether the presence of coronary heart disease (CHD) is related to age. • If we take a group of people of a given age • What is the fraction that have CHD? • Equivalently, what is the probability that a person of a given age has CHD? • We expect the probability to depend on age.

**Example: Coronary Heart Disease**

**Linear Regression on 0,1 CHD Data** • Fitted values fall outside the 0 to 1 range, so they cannot all be read as probabilities
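
The problem is easy to reproduce. The sketch below fits an ordinary least-squares line to a small set of hypothetical ages and 0/1 CHD indicators (invented for illustration, not the course data set) and inspects the fitted values at the youngest and oldest ages.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: age in years and CHD status (1 = has CHD, 0 = does not).
age = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
chd = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = fit_line(age, chd)
fitted = [b0 + b1 * a for a in age]

print("fitted value at age 20:", round(fitted[0], 3))
print("fitted value at age 70:", round(fitted[-1], 3))
```

With these numbers the straight line gives a negative "probability" for the youngest person and a value above 1 for the oldest, which is the motivation for a model whose fitted values stay between 0 and 1.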

**Why A Different Type of Regression?** • Logistic regression is the most commonly used generalization of multiple linear regression • Output data is categorical with 2 categories • Categorical: no metric, no order • Usually coded as 0/1 • Terminology: failure/success • Typical examples: dies/lives, does not/does have condition, does not/does marry, etc. • As we’ve seen, linear regression can be inappropriate

**Interpretation** • A one-year increase in age raises the odds of having CHD by about 11% relative to someone a year younger • But the probability of having CHD differs greatly depending on age • Logistic regression is often used to see how much additional risk is contributed by some risk factor (e.g., smoking) • The coefficient shows how much the risk factor increases the odds of having some condition • But the probability may still be small
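
The distinction between a constant multiplicative effect and a changing probability can be sketched numerically. The code below reads the 11% figure as an odds ratio of 1.11 per year of age, which is an assumption since the fitted output is not reproduced here, and pairs it with a purely hypothetical intercept of -5.3 so that probabilities can be computed.

```python
import math

ODDS_RATIO_PER_YEAR = 1.11            # the 11% figure, read as an odds ratio (assumption)
SLOPE = math.log(ODDS_RATIO_PER_YEAR)  # logistic coefficient for age
INTERCEPT = -5.3                       # hypothetical value, for illustration only

def chd_probability(age):
    """Probability of CHD at a given age under the assumed logistic model."""
    log_odds = INTERCEPT + SLOPE * age
    return 1 / (1 + math.exp(-log_odds))

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

for a in (30, 60):
    p_now = chd_probability(a)
    p_next = chd_probability(a + 1)
    ratio = odds(p_next) / odds(p_now)
    print(f"age {a}: p = {p_now:.3f}, p one year later = {p_next:.3f}, odds ratio = {ratio:.2f}")
```

Under these assumed values the odds ratio is the same at every age, while the probability itself, and the size of the one-year change in probability, differ substantially between a 30-year-old and a 60-year-old.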

**Group Project** • Each of you will join a group to perform and report on a regression analysis of a data set you select • Multiple linear regression • At least 100 observations (i.e., 100 y’s and 100 x1’s, x2’s, …, xk’s) • No more than 1000 observations • At least 5 x’s • Data set and analysis goal must have my approval • Analysis report due March 15, 2011 • Teams of up to three students • Must do model fitting, interpretation, and write up results • The paper is 10 pages maximum • This includes all text, tables, graphics • Write as a memo to explain • What you are trying to do • What you found out • Content will depend on the data set and what you find

**SPSS Commands for the Group Project** • Menu clear • Analyze • Regression • Linear • Move dependent and independent variables into boxes • Statistics • Descriptives • Part and Partial Correlations • Plots • Histogram • Normal probability plot • Continue • Stepwise • OK • Resources to help you learn and use SPSS: http://www.ats.ucla.edu/stat/spss