Regression Shibin Liu SAS Beijing R&D
Agenda • 0. Lesson overview • 1. Exploratory Data Analysis • 2. Simple Linear Regression • 3. Multiple Regression • 4. Model Building and Interpretation • 5. Summary 2
Lesson overview Response Variable Predictor Variable + ANOVA 4
Lesson overview Continuous Continuous Correlation analysis Linear regression 5
Lesson overview Continuous response Continuous predictor Correlation analysis • Measure linear association • Examine the relationship • Screen for outliers • Interpret the correlation 6
Lesson overview Continuous response Continuous predictor Linear regression • Define the linear association • Determine the equation for the line • Explain or predict variability 7
What do you want to examine? Lesson overview Descriptive Statistics Inferential Statistics The location, spread, and shape of the data’s distribution The relationship between variables The difference between groups on one or more variables How many groups? Summary statistics or graphics? Which kind of variables? Categorical response variable Summary statistics Both Continuous only Two Two or more ONE-WAY FREQUENCIES & TABLE ANALYSIS SUMMARY STATISTICS CORRELATIONS TTEST DISTRIBUTION ANALYSIS Frequency tables, chi-square test Descriptive Statistics Descriptive Statistics, histogram, normal, probability plots LINEAR MODELS LOGISTIC REGRESSION LINEAR REGRESSION Analysis of variance 8
Agenda • 0. Lesson overview • 1. Exploratory Data Analysis • 2. Simple Linear Regression • 3. Multiple Regression • 4. Model Building and Interpretation • 5. Summary 9
Exploratory Data Analysis: Introduction Height Weight Continuous variable Continuous variable Linear regression Scatter plot Correlation analysis Exploratory data analysis 10
Exploratory Data Analysis: Objective • Examine the relationship between continuous variables using a scatter plot • Quantify the degree of association between two continuous variables using correlation statistics • Avoid potential misuses of the correlation coefficient • Obtain Pearson correlation coefficients 11
Exploratory Data Analysis: Using Scatter Plots to Describe Relationships between Continuous Variables Scatter plot Correlation analysis Relationship Trend Range Outlier Communicate analysis result X: Predictor variable Y: Response variable Coordinate: values of X and Y Exploratory data analysis 12
Exploratory Data Analysis: Using Scatter Plots to Describe Relationships between Continuous Variables ? Model Terms2 Squared Quadratic 13
Exploratory Data Analysis: Using Correlation to Measure Relationships between Continuous Variables Scatter plot Correlation analysis Correlation analysis Linear association Negative Zero Positive Exploratory data analysis 14
Exploratory Data Analysis: Using Correlation to Measure Relationships between Continuous Variables Pearson correlation coefficient: For population: ρ For sample: r 15
Exploratory Data Analysis: Using Correlation to Measure Relationships between Continuous Variables Correlation analysis Pearson correlation coefficient r: r near -1: strong negative linear relationship; r near 0: no linear relationship; r near +1: strong positive linear relationship 16
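The sample Pearson coefficient described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pearson_r` and the height/weight values are hypothetical and do not come from the slides or from SAS.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical height (cm) and weight (kg) values, for illustration only
heights = [150, 160, 165, 172, 180, 188]
weights = [52, 58, 63, 70, 77, 85]
r = pearson_r(heights, weights)
print(f"r = {r:.3f}")
```

A value close to +1 here reflects the nearly straight-line pattern in the made-up data; it says nothing about cause and effect.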
Exploratory Data Analysis: Hypothesis testing for a Correlation Correlation Coefficient Test H0: ρ = 0 Ha: ρ ≠ 0 • A p-value does not measure the magnitude of the association. • Sample size affects the p-value. • Rejecting the null hypothesis only means that you can be confident that the true population correlation is not 0. A small p-value can occur (as with many statistics) because of a very large sample size. Even a correlation coefficient of 0.01 can be statistically significant with a large enough sample size. Therefore, it is important to also look at the value of r itself to see whether it is meaningfully large. 17
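The point about sample size can be checked numerically. The sketch below uses the Fisher z transformation as a large-sample approximation to the test of H0: ρ = 0; this is a stand-in chosen for simplicity, not necessarily the statistic SAS reports, and the function name `correlation_p_value` and the sample sizes are assumptions.

```python
import math
from statistics import NormalDist

def correlation_p_value(r, n):
    """Approximate two-sided p-value for H0: rho = 0 using the Fisher z transformation."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("the approximation needs n > 3")
    # atanh(r) is approximately normal with standard error 1 / sqrt(n - 3)
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same tiny correlation, r = 0.01, at two hypothetical sample sizes
p_small_sample = correlation_p_value(0.01, 100)
p_large_sample = correlation_p_value(0.01, 1_000_000)
print(p_small_sample, p_large_sample)
```

With 100 observations the p-value is far from significant, while with a million observations it is essentially zero, even though r is the same negligible 0.01 in both cases.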
Exploratory Data Analysis: Hypothesis testing for a Correlation [Figure: correlation scale from -1 to +1 with sample r values, including 0.81 and 0.72] 18
Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect Correlation does not imply causation Besides causality, could other reasons account for strong correlation between two variables? 19
Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect Correlation does not imply causation Weight Height A strong correlation between two variables does not mean change in one variable causes the other variable to change, or vice versa. 20
Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect Correlation does not imply causation 21
Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect Correlation does not imply causation 22
Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect Is the SAT score tied to college entrance or not? X: the percentage of students in a given state who take the SAT exam Y: SAT scores 23
Exploratory Data Analysis: Avoiding Common Errors: Types of Relationships ? Pearson correlation coefficient: r -> 0 curvilinear parabolic quadratic 24
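A small sketch of why r can be near 0 for a curvilinear relationship: the points below lie exactly on a parabola, so Y is completely determined by X, yet the Pearson coefficient is 0. The data and the helper `pearson_r` are hypothetical illustrations, not taken from the slides.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfect quadratic relationship on x values symmetric around 0
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r = pearson_r(xs, ys)
print(f"r = {r:.3f}")
```

The coefficient only measures linear association, which is why a scatter plot should accompany it.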
Exploratory Data Analysis: Avoiding Common Errors: outliers Data one Data two r=0.02 r=0.82 25
Exploratory Data Analysis: Avoiding Common Errors: outliers What to do with an outlier? Why is it an outlier? Valid Compute two correlation coefficients Error Collect data Report both coefficients Replicate data 26
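The advice to compute and report two correlation coefficients, one with the unusual point and one without it, can be sketched as follows. The data set and the outlier at (30, 30) are made up for illustration and are not the data behind the slide's r values.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with no linear trend
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5, 3, 6, 4, 5, 3, 6, 4]
r_without = pearson_r(xs, ys)

# The same data plus one hypothetical outlier far from the rest
r_with = pearson_r(xs + [30], ys + [30])

print(f"without outlier: r = {r_without:.2f}")
print(f"with outlier:    r = {r_with:.2f}")
```

A single extreme point is enough to turn an essentially uncorrelated cloud into an apparently strong correlation, which is the reason for reporting both numbers.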
Exploratory Data Analysis: Scenario: Exploring Data Using Correlation and Scatter Plots Fitness oxygen consumption ? 27
Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots 28
Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots What’s the Pearson correlation coefficient of Oxygen_Consumption with Run_Time? What’s the p-value for the correlation of Oxygen_Consumption with Performance? 29
Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots 30
Exploratory Data Analysis: Examining Correlations between Predictor Variables 31
Exploratory Data Analysis: Examining Correlations between Predictor Variables What are the two highest Pearson correlation coefficients? 32
Exploratory Data Analysis Question 1. The correlation between tuition and rate of graduation at U.S. colleges is 0.55. What does this mean? a. The way to increase graduation rates at your college is to raise tuition. b. Increasing graduation rates is expensive, causing tuition to rise. c. Students who are richer tend to graduate more often than poorer students. d. None of the above. Answer: d 33
Agenda • 0. Lesson overview • 1. Exploratory Data Analysis • 2. Simple Linear Regression • 3. Multiple Regression • 4. Model Building and Interpretation • 5. Summary 34
Simple Linear Regression: Introduction -1 0 +1 Variable A Variable B Variable C Variable D Linear relationships 36
Simple Linear Regression: Introduction r Same r Different 37
Simple Linear Regression: Introduction Simple Linear Regression Y: variable of primary interest Regression Line X: explains variability in Y 38
Simple Linear Regression: Objective • Explain the concepts of Simple Linear Regression • Fit a Simple Linear Regression using the Linear Regression task • Produce predicted values and confidence intervals. 39
Simple Linear Regression: Scenario: Performing Simple Linear Regression Simple Linear Regression Fitness Run_Time Oxygen_Consumption Linear regression 40
Simple Linear Regression: The Simple Linear Regression Model 41
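For reference, the standard form of the simple linear regression model is sketched below. The symbols follow the usual textbook convention and are an assumption about what the original slide showed.

```latex
Y = \beta_0 + \beta_1 X + \varepsilon
```

Here $\beta_0$ is the intercept parameter, $\beta_1$ is the slope parameter, $X$ is the predictor variable, and $\varepsilon$ is the error term, that is, the variation of $Y$ around the regression line.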
Simple Linear Regression: The Simple Linear Regression Model Question 2. What does epsilon represent? a. The intercept parameter b. The predictor variable c. The variation of X around the line d. The variation of Y around the line Answer: d 42
Simple Linear Regression: How SAS Performs Linear Regression Method of least squares: minimize the sum of squared differences between the observed and predicted values of Y. The resulting estimates are Best Linear Unbiased Estimators (BLUE): • They are unbiased estimators • They have minimum variance among linear unbiased estimators 43
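The least squares idea can be sketched with the closed-form estimates for one predictor. This is a minimal illustration, not how SAS implements it internally; the function name and the Run_Time / Oxygen_Consumption values are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the predictor must take at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the fitted line passes through the means
    return intercept, slope

# Hypothetical run times (minutes) and oxygen consumption values
run_time = [8, 9, 10, 11, 12]
oxygen_consumption = [60, 55, 50, 47, 40]
intercept, slope = fit_simple_linear_regression(run_time, oxygen_consumption)
print(f"Oxygen_Consumption = {intercept:.1f} + ({slope:.1f}) * Run_Time")
```

The negative slope in this made-up example means that longer run times go with lower predicted oxygen consumption.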
Simple Linear Regression: Measuring How Well a Model Fits the Data Regression model Baseline model VS. 44
Simple Linear Regression: Comparing the Regression Model to a Baseline Model Baseline model: predicts the mean of Y for every value of X Better model: explains more variability 45
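One way to make the comparison concrete is to compute the squared error of each model. The sketch below assumes the baseline model predicts the mean of Y for every observation; the function name and data are hypothetical.

```python
def compare_to_baseline(xs, ys):
    """Return (sst, sse, r_squared) for a least squares line versus the mean of Y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Baseline model error: squared deviations of Y from its mean
    sst = sum((y - mean_y) ** 2 for y in ys)
    # Regression model error: squared deviations of Y from the fitted line
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return sst, sse, 1 - sse / sst

# Hypothetical run times (minutes) and oxygen consumption values
run_time = [8, 9, 10, 11, 12]
oxygen_consumption = [60, 55, 50, 47, 40]
sst, sse, r_squared = compare_to_baseline(run_time, oxygen_consumption)
print(f"baseline error = {sst:.1f}, regression error = {sse:.1f}, R^2 = {r_squared:.3f}")
```

The share of baseline variability that the regression line removes is the R-square value; the closer it is to 1, the more variability the regression model explains.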
Simple Linear Regression: Hypothesis Testing for Linear Regression Linear regression 46
Simple Linear Regression: Assumptions of Simple Linear Regression Linear regression Assumptions: 1. The mean of Y is linearly related to X. 2. Errors are normally distributed. 3. Errors have equal variances. 4. Errors are independent. 47
Simple Linear Regression: Performing Simple Linear Regression Task >Regression>Linear Regression 48
Simple Linear Regression: Performing Simple Linear Regression Task >Regression>Linear Regression 49
Simple Linear Regression: Performing Simple Linear Regression Question 3. In the model Y=X, if the parameter estimate (slope) of X is 0, then which of the following is the best guess (predicted value) for Y when X is equal to 13? a. 13 b. The mean of Y c. A random number d. The mean of X e. 0 Answer: b 50
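The reasoning behind this answer can be written out with the usual least squares notation, where $b_0$ and $b_1$ denote the estimated intercept and slope (notation assumed here, not taken from the slides):

```latex
\hat{Y} = b_0 + b_1 X, \qquad b_0 = \bar{Y} - b_1 \bar{X}
\quad\Longrightarrow\quad
b_1 = 0 \;\Rightarrow\; \hat{Y} = \bar{Y} \text{ for every } X, \text{ including } X = 13 .
```

With a slope of 0 the fitted line is horizontal at the mean of Y, so the regression model reduces to the baseline model.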