
Regression Models



  1. Regression Models • Fit data • Time-series data: Forecast • Other data: Predict

  2. Use in Data Mining • One of major analytic models • Linear regression • The standard – ordinary least squares regression • Can use for discriminant analysis • Can apply stepwise regression • Nonlinear regression • More complex (but less reliable) data fitting • Logistic regression • When data are categorical (usually binary)

  3. OLS (Ordinary Least Squares) Model

  4. OLS Regression • Uses intercept and slope coefficients (b) to minimize squared error terms over all i observations • Fits the data with a linear model • Time-series data: • Observations over past periods • Best-fit line (in terms of minimizing the sum of squared errors)
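As a sketch of what the OLS fit does, the following uses NumPy's least-squares solver on hypothetical weekly data (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical weekly observations (illustrative data, not the slides' data)
weeks = np.arange(1.0, 11.0)
noise = np.array([0.2, -0.3, 0.1, 0.4, -0.2, 0.3, -0.1, 0.2, -0.4, 0.1])
requests = 1.0 + 5.0 * weeks + noise

# Design matrix: a column of ones for the intercept, plus the week numbers
X = np.column_stack([np.ones_like(weeks), weeks])

# Solve for (intercept b0, slope b1) minimizing the sum of squared errors
(b0, b1), *_ = np.linalg.lstsq(X, requests, rcond=None)
print(round(b0, 2), round(b1, 2))
```

Because the data were generated around the line 1 + 5*week, the recovered intercept lands near 1 and the slope near 5.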

  5. Regression Output (page 101) • R2 = 0.987 • Intercept: 0.642 (t = 0.286, P = 0.776) • Week: 5.086 (t = 53.27, P ≈ 0) • Requests = 0.642 + 5.086*Week
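The fitted line from this output can be applied directly to forecast a future week; for example (the week number here is an arbitrary illustration):

```python
# Fitted model from the regression output: Requests = 0.642 + 5.086 * Week
def forecast_requests(week):
    return 0.642 + 5.086 * week

print(forecast_requests(25))  # ≈ 127.79 requests forecast for week 25
```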

  6. Time-Series Forecast

  7. Regression Tests • FIT: • SSE – sum of squared errors • Synonym: SSR – sum of squared residuals • R2 – proportion of variance explained by the model • Adjusted R2 – adjusts the calculation to penalize for the number of independent variables • Significance • F-test – test of overall model significance • t-test – test of significant difference between a model coefficient and zero • P – probability that the coefficient is zero (or on the other side of zero from the estimate)

  8. Regression Model Tests • SSE (sum of squared errors) • For each observation, subtract model value from observed, square difference, total over all observations • By itself means nothing • Can compare across models (lower is better) • Can use to evaluate proportion of variance in data explained by model • R2 • Ratio of explained squared dependent variable values (MSR) to sum of squares (SST) • SST = MSR plus SSE • 0 ≤ R2 ≤ 1
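The SSE and R2 calculations above can be sketched in a few lines; the observed and fitted values here are made up for illustration:

```python
import numpy as np

# Hypothetical observed values and model predictions (illustrative, not the slides' data)
y = np.array([5.5, 10.9, 16.2, 21.0, 26.3])
y_hat = np.array([5.7, 10.8, 15.9, 21.0, 26.1])

sse = np.sum((y - y_hat) ** 2)     # subtract, square, total over all observations
sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - sse / sst                 # equivalent to MSR / SST, since SST = MSR + SSE
print(round(float(sse), 2), round(float(r2), 3))  # SSE 0.18 and an R2 close to 1
```

As the slide notes, SSE alone means nothing; it only becomes informative when compared across models or converted to R2 against the total sum of squares.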

  9. Multiple Regression • Can include more than one independent variable • Trade-off: • Too many variables – many spurious, overlapping information • Too few variables – miss important content • Adding variables will always increase R2 • Adjusted R2 penalizes for additional independent variables
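The adjusted-R2 penalty is a one-line formula. With n = 20 observations and 5 independent variables it reproduces the values on slide 11 (n = 20 is inferred from the slide's reported numbers, not stated on the slide):

```python
def adjusted_r2(r2, n, k):
    # Penalize R2 for the number of independent variables k, given n observations
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.252, 20, 5), 3))  # ≈ -0.015 (slide 11)
print(round(adjusted_r2(0.218, 20, 3), 3))  # ≈ 0.071 (slide 12 reports 0.070)
```

Note that adjusted R2 can go negative, as on slide 11: the penalty for five weak variables outweighs the variance they explain.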

  10. Example: Hiring Data • Dependent Variable – Sales • Independent Variables: • Years of Education • College GPA • Age • Gender • College Degree

  11. Regression Model Sales = 269025 -17148*YrsEd P = 0.175 -7172*GPA P = 0.812 +4331*Age P = 0.116 -23581*Male P = 0.266 +31001*Degree P = 0.450 R2 = 0.252 Adj R2 = -0.015 • Weak model, no IV significant at 0.10

  12. Improved Regression Model Sales = 173284 - 9991*YrsEd P = 0.098* +3537*Age P = 0.141 -18730*Male P = 0.328 R2 = 0.218 Adj R2 = 0.070

  13. Logistic Regression • Data often ordinal or nominal • Regression based on continuous numbers not appropriate • Need dummy variables • Binary – either are or are not • LOGISTIC REGRESSION (probability of either 1 or 0) • Two or more categories • DISCRIMINANT ANALYSIS (perform a regression for each outcome; pick the one that fits best)

  14. Logistic Regression • For dependent variables that are nominal or ordinal • Probability of assigning case i to class j • Sigmoidal function (in English, an S curve from 0 to 1)
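The S curve here is the standard logistic function; a minimal version:

```python
import math

def logistic(score):
    # Sigmoidal S curve: maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-score))

print(logistic(0.0))  # 0.5 (a score of zero gives even odds)
print(logistic(4.0))  # ≈ 0.982
```

In logistic regression the score fed into this function is the linear combination of the coefficients and the case's variable values, which is what "running the score through the logistic formula" means on the next slide.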

  15. Insurance Claim Model Fraud = 81.824 -2.778 * Age P = 0.789 -75.893 * Male P = 0.758 + 0.017 * Claim P = 0.757 -36.648 * Tickets P = 0.824 + 6.914 * Prior P = 0.935 -29.362 * Attorney Smith P = 0.776 Can get probability by running score through logistic formula

  16. Linear Discriminant Analysis • Group objects into predetermined set of outcome classes • Regression one means of performing discriminant analysis • 2 groups: find cutoff for regression score • More than 2 groups: multiple cutoffs

  17. Centroid Method (NOT regression) • Binary data • Divide training set into two groups by binary outcome • Standardize data to remove scales • Identify means for each independent variable by group (the CENTROID) • Calculate distance function
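A minimal sketch of the centroid method on made-up standardized data (the points and group labels are hypothetical, not the slides' fraud data):

```python
import numpy as np

# Hypothetical standardized training data, already split by binary outcome
fraud = np.array([[1.2, 0.8], [0.9, 1.1], [1.4, 0.6]])
legit = np.array([[-0.7, -0.5], [-1.0, -0.9], [-0.4, -0.8]])

# The centroid is the mean of each independent variable by group
centroid_fraud = fraud.mean(axis=0)
centroid_legit = legit.mean(axis=0)

def classify(x):
    # Assign the case to the group with the nearer centroid (Euclidean distance)
    d_fraud = np.linalg.norm(x - centroid_fraud)
    d_legit = np.linalg.norm(x - centroid_legit)
    return "fraud" if d_fraud < d_legit else "legit"

print(classify(np.array([1.0, 0.9])))    # near the fraud centroid
print(classify(np.array([-0.8, -0.6])))  # near the legitimate centroid
```

Standardizing first matters: without it, whichever variable has the largest scale would dominate the distance calculation.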

  18. Fraud Data

  19. Standardized & Sorted Fraud Data

  20. Distance Calculations

  21. Discriminant Analysis with Regression (standardized data, binary outcomes) Intercept 0.430 P = 0.670 Age -0.421 P = 0.671 Gender 0.333 P = 0.733 Claim -0.648 P = 0.469 Tickets 0.584 P = 0.566 Prior Claims -1.091 P = 0.399 Attorney 0.573 P = 0.607 • R2 = 0.804 • Cutoff: average of the two group averages = 0.429
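Classification with this regression-as-discriminant approach then reduces to comparing each case's predicted score against the cutoff (0.429 above); the example scores are hypothetical:

```python
def classify_case(score, cutoff=0.429):
    # Cutoff from slide 21: the average of the two group-average predicted scores
    return 1 if score >= cutoff else 0

print(classify_case(0.81))  # 1: score above the cutoff, assigned to the fraud group
print(classify_case(0.10))  # 0: score below the cutoff
```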

  22. Case: Stepwise Regression • Stepwise Regression • Automatic selection of independent variables • Look at F scores of simple regressions • Add variable with greatest F statistic • Check partial F scores for adding each variable not in model • Delete variables no longer significant • If no external variables significant, quit • Considered inferior to selection of variables by experts
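A forward-stepwise sketch of the selection loop, using the partial F statistic to decide which variable to add (the entry threshold and demo data are illustrative assumptions; a full stepwise procedure would also delete variables that lose significance):

```python
import numpy as np

def sse(y, X):
    # Sum of squared errors for an OLS fit of y on X
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

def partial_f(y, X_small, X_big):
    # Partial F statistic for the extra columns in X_big versus X_small
    extra = X_big.shape[1] - X_small.shape[1]
    resid_df = len(y) - X_big.shape[1]
    return ((sse(y, X_small) - sse(y, X_big)) / extra) / (sse(y, X_big) / resid_df)

def forward_stepwise(y, X, f_enter=4.0):
    # Repeatedly add the candidate variable with the largest partial F,
    # stopping when no remaining candidate clears the entry threshold
    n = len(y)
    current = np.ones((n, 1))  # start from the intercept-only model
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        f_by_var = {j: partial_f(y, current, np.column_stack([current, X[:, j]]))
                    for j in remaining}
        best = max(f_by_var, key=f_by_var.get)
        if f_by_var[best] < f_enter:
            break
        chosen.append(best)
        current = np.column_stack([current, X[:, best]])
        remaining.remove(best)
    return chosen

# Demo: y depends on column 0 only; column 1 is pure noise
rng = np.random.default_rng(0)
x0 = np.arange(30.0)
X = np.column_stack([x0, rng.normal(size=30)])
y = 3.0 * x0 + rng.normal(scale=0.5, size=30)
print(forward_stepwise(y, X))  # column 0 is selected first
```

The automation is exactly the weakness the slide notes: the procedure will happily admit any variable whose F statistic clears the threshold, spurious or not, which is why expert selection of variables is considered superior.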

  23. Credit Card Bankruptcy Prediction (Foster & Stine, 2004, Journal of the American Statistical Association) • Data on 244,000 credit card accounts • 12-month period • 1 percent default • Cost of granting a loan that defaults: almost $5,000 • Cost of denying a loan that would have paid: about $50

  24. Data Treatment • Divided observations into 5 groups • Used one for training • Any smaller would have problems due to insufficient default cases • Used 80% of data for detailed testing • Regression performed better than the C5 model • Even though C5 used costs and regression didn't

  25. Summary • Regression a basic classical model • Many forms • Logistic regression very useful in data mining • Often have binary outcomes • Also can use on categorical data • Can use for discriminant analysis • To classify
