Regression for Data Mining. Mgt. 2206 – Introduction to Analytics. Matthew Liberatore, Thomas Coghlan.

Learning Objectives • To understand the application of regression analysis in data mining • Linear/nonlinear • Logistic (Logit) • To understand the key statistical measures of fit • To learn how to run and interpret regression analyses using SAS Enterprise Miner software

Analysis of Association • In business problems, interest often goes beyond the statistical testing of differences (e.g., female versus male preferences). • We are often interested in the degree of association between variables. • Regression is one of the techniques that helps uncover those relationships.

Linear Regression Analysis • Analysis of the strength of the linear relationship between predictor (independent) variables and outcome (dependent/criterion) variables. • In two dimensions (one predictor, one outcome variable) data can be plotted on a scatter diagram. • Regression equation: $E(y) = \beta_0 + \beta_1 x$, where $E(y)$ is the expected value of y (outcome), $\beta_0$ is the intercept term, and $\beta_1$ is the predictor variable coefficient.

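Estimation Process • Regression Model: $y = \beta_0 + \beta_1 x + \varepsilon$ • Regression Equation: $E(y) = \beta_0 + \beta_1 x$ • Unknown Parameters: $\beta_0$, $\beta_1$ • Sample Data: pairs $(x_1, y_1), \ldots, (x_n, y_n)$ • Estimated Regression Equation: $\hat{y} = b_0 + b_1 x$ • Sample Statistics: $b_0$, $b_1$ • $b_0$ and $b_1$ provide estimates of $\beta_0$ and $\beta_1$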

Example • List Variables we have • Determine a DV of interest • Is there a way to predict DV?

Least Squares Method • Least Squares Criterion: minimize error (distance between actual data & estimated line): $\min \sum (y_i - \hat{y}_i)^2$ where: $y_i$ = observed value of the dependent variable for the ith observation, $\hat{y}_i$ = estimated value of the dependent variable for the ith observation

Least Squares Method • y-Intercept for the Estimated Regression Equation: $b_0 = \bar{y} - b_1 \bar{x}$ where: $x_i$ = value of independent variable for ith observation, $y_i$ = value of dependent variable for ith observation, $\bar{x}$ = mean value for independent variable, $\bar{y}$ = mean value for dependent variable, n = total number of observations
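The least squares calculation can be sketched in a few lines of Python. This is a minimal illustration, not the course's SAS workflow: the slope formula $b_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ is the standard closed form rather than something stated on the slide, and the function name and sample data are hypothetical.

```python
def least_squares_fit(xs, ys):
    """Return (b0, b1) for the estimated regression equation y_hat = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")

    # Sample means of the independent and dependent variables
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Sums of cross-products and squared deviations about the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")

    b1 = sxy / sxx             # slope
    b0 = y_bar - b1 * x_bar    # y-intercept, as on the slide
    return b0, b1


# Hypothetical data lying exactly on the line y = 2 + 3x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]
b0, b1 = least_squares_fit(xs, ys)
print(f"y_hat = {b0:.2f} + {b1:.2f} x")
```

Because the made-up points fall exactly on a line, the fit recovers that line's intercept and slope; with real data the fitted line only minimizes the squared errors.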

Least Squares Estimation Procedure • Least Squares Criterion: The sum of the squared vertical deviations (y axis) of the points from the line is minimal. [Figure: actual data points plotted against the predicted line]

Example Results Let X = Temp, Y = Kwatts Y = 319.04 + 185.27 X
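To show how the estimated equation is used, the short Python sketch below plugs a temperature into the fitted line. The coefficients come from the example results; the function name and the temperature value of 70 are hypothetical, chosen only for illustration.

```python
B0 = 319.04   # intercept from the example results
B1 = 185.27   # Temp coefficient from the example results


def predict_kwatts(temp):
    """Predicted Kwatts for a given Temp using Y = 319.04 + 185.27 X."""
    return B0 + B1 * temp


example_temp = 70.0  # hypothetical input, not taken from the slides
predicted = predict_kwatts(example_temp)
print(f"Predicted Kwatts at Temp = {example_temp}: {predicted:.2f}")
```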

Coefficient of Determination • How “strong” is the relationship between predictor & outcome? (Fraction of observed variance of the outcome variable explained by the predictor variables.) • Relationship among SST, SSR, SSE: SST = SSR + SSE where: SST = total sum of squares, SSR = sum of squares due to regression, SSE = sum of squares due to error • Coefficient of determination: $r^2 = SSR / SST$

Kwatts vs. Temp Example
Source | df | SS
Regression | 1 | 58784708.31
Residual | 10 | 38696916.69
Total | 11 | 97481625
r2 = 0.603033734 • Does the linear regression provide a good fit?
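The reported r2 can be checked directly from the sums of squares, using the standard definition of r2 as SSR divided by SST. The sketch below uses the values from the example; the variable names are illustrative.

```python
# Sums of squares from the Kwatts vs. Temp example
ssr = 58784708.31   # regression
sse = 38696916.69   # residual (error)
sst = 97481625.0    # total

# SST = SSR + SSE should hold up to rounding in the reported values
decomposition_holds = abs((ssr + sse) - sst) < 0.01

# Fraction of the variance in Kwatts explained by Temp
r_squared = ssr / sst
print(f"SST = SSR + SSE holds: {decomposition_holds}")
print(f"r2 = {r_squared:.4f}")
```

An r2 of about 0.60 means Temp accounts for roughly 60% of the observed variation in Kwatts; whether that counts as a good fit depends on the application.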

Assumptions About the Error Term $\varepsilon$ 1. The error $\varepsilon$ is a random variable with mean of zero. 2. The variance of $\varepsilon$, denoted by $\sigma^2$, is the same for all values of the independent variable. 3. The values of $\varepsilon$ are independent. 4. The error $\varepsilon$ is a normally distributed random variable.

Significance Test for Regression • Is the value of $\beta_1$ zero? • Two tests are commonly used: the t Test and the F Test. • Both the t test and the F test require an estimate of the variance ($\sigma^2$) of the error ($\varepsilon$). • As in most of our statistical work, we are working with a sample, not the population, so we use the mean square error ($s^2$) as that estimate.

Testing for Significance • An Estimate of $\sigma$ • To estimate $\sigma$ we take the square root of $s^2$. • The resulting s is called the standard error of the estimate.

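Testing for Significance: t Test • Hypotheses: $H_0: \beta_1 = 0$ (the coefficient is 0, i.e., no relationship between predictor & outcome) versus $H_a: \beta_1 \neq 0$ • Calculating the t Statistic: $t = b_1 / s_{b_1}$, where $s_{b_1}$ is the standard error of $b_1$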

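Testing for Significance: t Test 1. Determine the hypotheses: $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$. 2. Specify the level of significance: $\alpha$ = .05. 3. Select the test statistic: $t = b_1 / s_{b_1}$ ($b_1$ divided by its standard error). 4. State the rejection rule: reject $H_0$ if p-value < .05 or |t| > 3.182 (with 3 degrees of freedom).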

Testing for Significance: F Test • Test statistic: F = MSR/MSE • Reject $H_0$ if: p-value < $\alpha$ or F > $F_\alpha$ where: $F_\alpha$ is based on an F distribution with 1 degree of freedom in the numerator and n - 2 degrees of freedom in the denominator

Testing for Significance: F Test 1. Determine the hypotheses: $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$. 2. Specify the level of significance: $\alpha$ = .05. 3. Select the test statistic: F = MSR/MSE. 4. State the rejection rule: reject $H_0$ if p-value < .05 or F > 10.13 (with 1 d.f. in numerator and 3 d.f. in denominator).

Standard Error of the Estimate • The standard error of the estimate ($s_{est}$) has properties analogous to those of the standard deviation. • How “good” is our “fit”? • Interpretation is similar: • ~68% of outcomes/predictions within one $s_{est}$. • ~95% of outcomes/predictions within two $s_{est}$.
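As a rough sketch of this interpretation, the Python below derives the standard error of the estimate from the residual sum of squares and degrees of freedom of the Kwatts vs. Temp example, then builds approximate one- and two-standard-error bands around a prediction. The predicted value of 13287.94 is a hypothetical input and the helper name is illustrative; the 68% and 95% figures are the usual normal-distribution approximations, not exact prediction intervals.

```python
import math

# Residual (error) sum of squares and degrees of freedom from the example
sse = 38696916.69
df_residual = 10            # n - 2 for the 12 observations in the example

mse = sse / df_residual     # mean square error, the estimate s^2
s_est = math.sqrt(mse)      # standard error of the estimate


def error_band(y_hat, k):
    """Approximate band of k standard errors around a predicted value."""
    return (y_hat - k * s_est, y_hat + k * s_est)


y_hat = 13287.94            # hypothetical predicted Kwatts, for illustration only
band_68 = error_band(y_hat, 1)   # roughly 68% of outcomes
band_95 = error_band(y_hat, 2)   # roughly 95% of outcomes
print(f"s_est = {s_est:.2f}")
print(f"~68% band: {band_68[0]:.2f} to {band_68[1]:.2f}")
print(f"~95% band: {band_95[0]:.2f} to {band_95[1]:.2f}")
```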

Kwatts vs. Temp Example
ANOVA
Source | df | SS | MS | F | Significance F
Regression | 1 | 58784708.31 | 58784708.31 | 15.19 | 0.002972726
Residual | 10 | 38696916.69 | 3869691.669 | |
Total | 11 | 97481625 | | |
Coefficients
Term | Coefficient | Standard Error | t Stat | P-value
Intercept | 319.0414124 | 3260.412811 | 0.097853073 | 0.923982528
Temp | 185.2702073 | 47.53479059 | 3.897570706 | 0.002972726
• Is the regression model statistically significant? • Is the coefficient of Temp significant?
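The two questions can be worked through from the reported output. The sketch below recomputes F = MSR/MSE and t as the Temp coefficient divided by its standard error, then compares each with a critical value at the .05 level. The critical values (2.228 for t with 10 d.f., 4.96 for F with 1 and 10 d.f.) are standard table values supplied here as assumptions; they do not appear in the slides, whose earlier rejection rules use 3 d.f.

```python
# Values from the ANOVA and coefficient output of the example
ssr, df_regression = 58784708.31, 1
sse, df_residual = 38696916.69, 10
b1 = 185.2702073        # Temp coefficient
se_b1 = 47.53479059     # standard error of the Temp coefficient

msr = ssr / df_regression
mse = sse / df_residual
f_stat = msr / mse      # F = MSR / MSE
t_stat = b1 / se_b1     # t = b1 / s_b1

# Assumed critical values at alpha = .05 (from standard tables, not the slides)
T_CRIT = 2.228          # two-tailed t, 10 d.f.
F_CRIT = 4.96           # F with 1 and 10 d.f.

model_significant = f_stat > F_CRIT
temp_significant = abs(t_stat) > T_CRIT

print(f"F = {f_stat:.2f}, model significant: {model_significant}")
print(f"t = {t_stat:.3f}, Temp coefficient significant: {temp_significant}")
```

With a single predictor the two tests agree: the square of the t statistic equals F, which is why the Significance F and the Temp P-value in the output are the same number.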

Cautions about Interpreting Significance Tests • Statistical significance does not by itself mean that the relationship between x and y is linear. • A significant relationship between x and y does not mean a cause-and-effect relationship is present between x and y.

SAS Enterprise Miner • These results can be obtained using Excel or using a data mining package such as SAS Enterprise Miner 5.3 • Using SAS Enterprise Miner requires the following steps: • Convert your data (usually in an Excel file) into a SAS data file using SAS 9.1 • Create a project in Enterprise Miner • Within the project: • Create a data source using your SAS data file • Create a diagram that includes a data node, a regression node, and a multiplot node for graphs • Run the model in the diagram and review the results

Change Role of KWatts to target (outcome variable); change Level of both KWatts and Temp to interval (continuous values); then click Next (Other levels are possible, such as binary). You can click on Explore if you wish to look at some basic stats – we will do this later