
Regression Models



  1. Regression Models • Fit data • Time-series data: Forecast • Other data: Predict

  2. Use in Data Mining • One of major analytic models • Linear regression • The standard – ordinary least squares regression • Can use for discriminant analysis • Can apply stepwise regression • Nonlinear regression • More complex (but less reliable) data fitting • Logistic regression • When data are categorical (usually binary)

  3. OLS (Ordinary Least Squares) Model

  4. OLS Regression • Uses intercept and slope coefficients (b) to minimize squared error terms over all i observations • Fits the data with a linear model • Time-series data: • Observations over past periods • Best-fit line (in terms of minimizing the sum of squared errors)
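As a sketch of what the OLS fit does, the following uses NumPy's least-squares solver on hypothetical weekly data (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical weekly observations (illustrative data, not the slides' data)
weeks = np.arange(1.0, 11.0)
noise = np.array([0.2, -0.3, 0.1, 0.4, -0.2, 0.3, -0.1, 0.2, -0.4, 0.1])
requests = 1.0 + 5.0 * weeks + noise

# Design matrix: a column of ones for the intercept, plus the week numbers
X = np.column_stack([np.ones_like(weeks), weeks])

# Solve for (intercept b0, slope b1) minimizing the sum of squared errors
(b0, b1), *_ = np.linalg.lstsq(X, requests, rcond=None)
print(round(b0, 2), round(b1, 2))
```

Because the data were generated around the line 1 + 5*week, the recovered intercept lands near 1 and the slope near 5.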

  5. Regression Output (page 101) • R2 = 0.987 • Intercept: 0.642 (t = 0.286, P = 0.776) • Week: 5.086 (t = 53.27, P ≈ 0) • Requests = 0.642 + 5.086*Week
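The fitted line from this output can be applied directly to forecast a future week; for example (the week number here is an arbitrary illustration):

```python
# Fitted model from the regression output: Requests = 0.642 + 5.086 * Week
def forecast_requests(week):
    return 0.642 + 5.086 * week

print(forecast_requests(25))  # ≈ 127.79 requests forecast for week 25
```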

  6. Time-Series Forecast

  7. Regression Tests • FIT: • SSE – sum of squared errors • Synonym: SSR – sum of squared residuals • R2 – proportion of variance explained by the model • Adjusted R2 – adjusts the calculation to penalize for the number of independent variables • Significance • F-test – test of overall model significance • t-test – test of significant difference between a model coefficient and zero • P – probability that the coefficient is zero (or on the other side of zero from the estimate)

  8. Regression Model Tests • SSE (sum of squared errors) • For each observation, subtract model value from observed, square difference, total over all observations • By itself means nothing • Can compare across models (lower is better) • Can use to evaluate proportion of variance in data explained by model • R2 • Ratio of explained squared dependent variable values (MSR) to sum of squares (SST) • SST = MSR plus SSE • 0 ≤ R2 ≤ 1
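The SSE and R2 calculations above can be sketched in a few lines; the observed and fitted values here are made up for illustration:

```python
import numpy as np

# Hypothetical observed values and model predictions (illustrative, not the slides' data)
y = np.array([5.5, 10.9, 16.2, 21.0, 26.3])
y_hat = np.array([5.7, 10.8, 15.9, 21.0, 26.1])

sse = np.sum((y - y_hat) ** 2)     # subtract, square, total over all observations
sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - sse / sst                 # equivalent to MSR / SST, since SST = MSR + SSE
print(round(float(sse), 2), round(float(r2), 3))  # SSE 0.18 and an R2 close to 1
```

As the slide notes, SSE alone means nothing; it only becomes informative when compared across models or converted to R2 against the total sum of squares.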

  9. Multiple Regression • Can include more than one independent variable • Trade-off: • Too many variables – many spurious, overlapping information • Too few variables – miss important content • Adding variables will always increase R2 • Adjusted R2 penalizes for additional independent variables
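The adjusted-R2 penalty is a one-line formula. With n = 20 observations and 5 independent variables it reproduces the values on slide 11 (n = 20 is inferred from the slide's reported numbers, not stated on the slide):

```python
def adjusted_r2(r2, n, k):
    # Penalize R2 for the number of independent variables k, given n observations
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.252, 20, 5), 3))  # ≈ -0.015 (slide 11)
print(round(adjusted_r2(0.218, 20, 3), 3))  # ≈ 0.071 (slide 12 reports 0.070)
```

Note that adjusted R2 can go negative, as on slide 11: the penalty for five weak variables outweighs the variance they explain.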

  10. Example: Hiring Data • Dependent Variable – Sales • Independent Variables: • Years of Education • College GPA • Age • Gender • College Degree

  11. Regression Model Sales = 269025 -17148*YrsEd P = 0.175 -7172*GPA P = 0.812 +4331*Age P = 0.116 -23581*Male P = 0.266 +31001*Degree P = 0.450 R2 = 0.252 Adj R2 = -0.015 • Weak model, no IV significant at 0.10

  12. Improved Regression Model Sales = 173284 - 9991*YrsEd P = 0.098* +3537*Age P = 0.141 -18730*Male P = 0.328 R2 = 0.218 Adj R2 = 0.070

  13. Logistic Regression • Data often ordinal or nominal • Regression based on continuous numbers not appropriate • Need dummy variables • Binary – either are or are not • LOGISTIC REGRESSION (probability of either 1 or 0) • Two or more categories • DISCRIMINANT ANALYSIS (perform a regression for each outcome; pick the one that fits best)

  14. Logistic Regression • For dependent variables that are nominal or ordinal • Probability of assigning case i to class j • Sigmoidal function (in English, an S curve from 0 to 1)
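The S curve here is the standard logistic function; a minimal version:

```python
import math

def logistic(score):
    # Sigmoidal S curve: maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-score))

print(logistic(0.0))  # 0.5 (a score of zero gives even odds)
print(logistic(4.0))  # ≈ 0.982
```

In logistic regression the score fed into this function is the linear combination of the coefficients and the case's variable values, which is what "running the score through the logistic formula" means on the next slide.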

  15. Insurance Claim Model Fraud = 81.824 -2.778 * Age P = 0.789 -75.893 * Male P = 0.758 + 0.017 * Claim P = 0.757 -36.648 * Tickets P = 0.824 + 6.914 * Prior P = 0.935 -29.362 * Attorney Smith P = 0.776 Can get probability by running score through logistic formula

  16. Linear Discriminant Analysis • Group objects into predetermined set of outcome classes • Regression one means of performing discriminant analysis • 2 groups: find cutoff for regression score • More than 2 groups: multiple cutoffs

  17. Centroid Method (NOT regression) • Binary data • Divide training set into two groups by binary outcome • Standardize data to remove scales • Identify means for each independent variable by group (the CENTROID) • Calculate distance function
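A minimal sketch of the centroid method on made-up standardized data (the points and group labels are hypothetical, not the slides' fraud data):

```python
import numpy as np

# Hypothetical standardized training data, already split by binary outcome
fraud = np.array([[1.2, 0.8], [0.9, 1.1], [1.4, 0.6]])
legit = np.array([[-0.7, -0.5], [-1.0, -0.9], [-0.4, -0.8]])

# The centroid is the mean of each independent variable by group
centroid_fraud = fraud.mean(axis=0)
centroid_legit = legit.mean(axis=0)

def classify(x):
    # Assign the case to the group with the nearer centroid (Euclidean distance)
    d_fraud = np.linalg.norm(x - centroid_fraud)
    d_legit = np.linalg.norm(x - centroid_legit)
    return "fraud" if d_fraud < d_legit else "legit"

print(classify(np.array([1.0, 0.9])))    # near the fraud centroid
print(classify(np.array([-0.8, -0.6])))  # near the legitimate centroid
```

Standardizing first matters: without it, whichever variable has the largest scale would dominate the distance calculation.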

  18. Fraud Data

  19. Standardized & Sorted Fraud Data

  20. Distance Calculations

  21. Discriminant Analysis with Regression (standardized data, binary outcomes) Intercept 0.430 P = 0.670 Age -0.421 P = 0.671 Gender 0.333 P = 0.733 Claim -0.648 P = 0.469 Tickets 0.584 P = 0.566 Prior Claims -1.091 P = 0.399 Attorney 0.573 P = 0.607 • R2 = 0.804 • Cutoff: average of the two group averages = 0.429
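Classification with this regression-as-discriminant approach then reduces to comparing each case's predicted score against the cutoff (0.429 above); the example scores are hypothetical:

```python
def classify_case(score, cutoff=0.429):
    # Cutoff from slide 21: the average of the two group-average predicted scores
    return 1 if score >= cutoff else 0

print(classify_case(0.81))  # 1: score above the cutoff, assigned to the fraud group
print(classify_case(0.10))  # 0: score below the cutoff
```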

  22. Case: Stepwise Regression • Stepwise Regression • Automatic selection of independent variables • Look at F scores of simple regressions • Add variable with greatest F statistic • Check partial F scores for adding each variable not in model • Delete variables no longer significant • If no external variables significant, quit • Considered inferior to selection of variables by experts
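A forward-stepwise sketch of the selection loop, using the partial F statistic to decide which variable to add (the entry threshold and demo data are illustrative assumptions; a full stepwise procedure would also delete variables that lose significance):

```python
import numpy as np

def sse(y, X):
    # Sum of squared errors for an OLS fit of y on X
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

def partial_f(y, X_small, X_big):
    # Partial F statistic for the extra columns in X_big versus X_small
    extra = X_big.shape[1] - X_small.shape[1]
    resid_df = len(y) - X_big.shape[1]
    return ((sse(y, X_small) - sse(y, X_big)) / extra) / (sse(y, X_big) / resid_df)

def forward_stepwise(y, X, f_enter=4.0):
    # Repeatedly add the candidate variable with the largest partial F,
    # stopping when no remaining candidate clears the entry threshold
    n = len(y)
    current = np.ones((n, 1))  # start from the intercept-only model
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        f_by_var = {j: partial_f(y, current, np.column_stack([current, X[:, j]]))
                    for j in remaining}
        best = max(f_by_var, key=f_by_var.get)
        if f_by_var[best] < f_enter:
            break
        chosen.append(best)
        current = np.column_stack([current, X[:, best]])
        remaining.remove(best)
    return chosen

# Demo: y depends on column 0 only; column 1 is pure noise
rng = np.random.default_rng(0)
x0 = np.arange(30.0)
X = np.column_stack([x0, rng.normal(size=30)])
y = 3.0 * x0 + rng.normal(scale=0.5, size=30)
print(forward_stepwise(y, X))  # column 0 is selected first
```

The automation is exactly the weakness the slide notes: the procedure will happily admit any variable whose F statistic clears the threshold, spurious or not, which is why expert selection of variables is considered superior.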

  23. Credit Card Bankruptcy Prediction (Foster & Stine, 2004, Journal of the American Statistical Association) • Data on 244,000 credit card accounts • 12-month period • 1 percent default • Cost of granting a loan that defaults: almost $5,000 • Cost of denying a loan that would have paid: about $50

  24. Data Treatment • Divided observations into 5 groups • Used one for training • Any smaller would have problems due to insufficient default cases • Used 80% of data for detailed testing • Regression performed better than the C5 model • Even though C5 used costs and regression didn't

  25. Summary • Regression a basic classical model • Many forms • Logistic regression very useful in data mining • Often have binary outcomes • Also can use on categorical data • Can use for discriminant analysis • To classify
