

Presentation Transcript


  1. DSCI 4520/5240 (DATA MINING) DSCI 4520/5240 Lecture 6 Regression Modeling Some slide material taken from: SAS Education

  2. Objectives • Overview of Linear Regression Models • The Stepwise Procedure • Overview of Logistic Regression Models • Interpretation of Logistic Regression coefficients

  3. DATA MINING AT WORK: Telstra Mobile Combats Churn with SAS® As Australia's largest mobile service provider, Telstra Mobile is reliant on highly effective churn management. In most industries the cost of retaining a customer, subscriber or client is substantially less than the initial cost of obtaining that customer. Protecting this investment is the essence of churn management. It really boils down to understanding customers -- what they want now and what they're likely to want in the future, according to SAS. "With SAS Enterprise Miner we can examine customer behaviour on historical and predictive levels, which can then show us what 'group' of customers are likely to churn and the causes," says Trish Berendsen, Telstra Mobile's head of Customer Relationship Management (CRM).

  4. Data Mining in the telecom industry: RingaLing Telecom Until recently, RingaLing, a large public telecommunications company, held the monopoly for the entire telecommunications market. Now privatized and without the advantages of the monopolistic situation, competition is coming from consortiums of foreign denationalized companies, new entrants, and cable companies who are offering very tempting proposals to consumers. RingaLing is losing 40,000 customers every month, and winning only a few of those customers back. They are painfully aware that the cost of keeping an existing customer can be up to ten times lower than the cost of acquiring a new one. They desperately need a cost-effective way of decreasing the customer churn rate.

  5. Data Mining in the telecom industry: RingaLing Telecom The CEO of RingaLing is worried by his company's falling share price and the rate at which customers are leaving. Despite substantial general price reductions, loyal customers are leaving by the thousands. The CEO gives the marketing director six months to bring the situation under control. Subsequently, Martin Miner, a young and promising marketing analyst, is summoned to the Marketing Director's office and asked to solve this problem. Working with the IT department, Martin explains that he needs a way to access and analyze all the company data.

  6. Getting the lines crossed -- the difficulty of data access RingaLing has over 50 million customer files in addition to billions of call records, and data from both the customer service and the billing departments. They also have some competitive information, including competitor pricing policies and market share by area. The data resides in different offices across the globe, in 12 different file formats, and on seven different platforms. Martin decides the first step is the development of a SAS data warehouse, enabling him to access all the data he needs in one place. Using SAS Institute's Rapid Warehousing Methodology, this takes only a matter of weeks. Thanks to the data warehouse, the quality and consistency of the data is much better. All the data is summarized and grouped in a way that makes it easy to get a single view of each individual RingaLing customer. Even if a customer has multiple accounts, for example a mobile phone as well as a fixed phone, the database is smart enough to know that this is one customer instead of two.

  7. The Data Mining Process Now, Martin is ready to start mining. Initially, he is interested in the probability that a given customer will cancel their contract within the next year and the controllable variables that might influence them. If he knows this, he will be able to manage the churn rate (the rate at which customers cancel their subscriptions). A sample of the data is taken using the sampling capabilities of SAS software. This ensures that the 10% sample accurately represents the customer base as a whole. Initially, Martin plots the probability that a customer will leave over the next year, and finds it to be fairly consistent and somewhat depressing. Using geographical visualization he is able to highlight certain areas where customers are most likely to leave, which appear to be around certain major cities. He then decides to explore the data using a 3D scatter plot to see the relation between bill size, area of residence, and likelihood of leaving. He notes that most of the customers at risk of leaving tend to be those who have the highest and the lowest bills.

  8. The Data Mining Process Martin decides to integrate some more data. First, he looks at his company's competitors and areas in which they provide services, as well as the range of services provided (e.g. business, domestic, international). He then looks at their pricing policy and product details. Martin now uses several data mining techniques to model this information so he can predict whether a customer is likely to leave or not. He uses decision trees to eliminate variables which are not important, and surprisingly finds that income plays a much less important role than he would have expected. Having identified several key variables, he then uses neural networks to build a model which will predict whether a person is likely to leave or not, given their characteristics. Following this, he identifies that the people who are likely to leave are typically either those who have very large bills or very low ones. It appears that those who make international calls are more likely to leave. Another point is that people who make frequent calls to the same numbers are more prone to leaving. At this stage he decides to review his findings and concludes that he needs to introduce more data on the pricing policies of the competitors for different types of products.

  9. The Data Mining Process • Martin then creates a new model and suggests the following strategies: • Special tariff for frequent international callers, enabling them to pay a lower cost per time unit. • Low usage tariffs, giving a lower fixed price for the line rental and then possibly higher costs for the actual calls. • High usage tariffs, giving a higher fixed price for line rental and lower costs for the actual calls. • Special prices for frequently called numbers. Following the implementation of these strategies, there was a drastic reduction in the number of customers leaving. After only three months, customer churn fell from 40,000 to only 10,000. On top of this, they were able to target products to customers who fit certain optimal profiles. This resulted in a gain of another 20,000 customers a month. Martin Miner has played a key role in this process and is rewarded with a large bonus and pay raise. He is subsequently promoted, and it becomes obvious that the marketing director is grooming him to become his replacement.

  10. Linear Regression Models

  11. Simple Linear Regression Model with one predictor variable: Y = b0 + b1X + residual. Here b0 + b1X is the explained part of Y, and the residual is the unexplained part (error). The fitted line is a straight line, since the model is of 1st order: Ŷ = b0 + b1X.
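The fitted line above has a closed-form least-squares solution, b1 = Sxy/Sxx and b0 = mean(Y) - b1·mean(X). A minimal pure-Python sketch (the data points are made up purely for illustration):

```python
# Closed-form least-squares fit of Y-hat = b0 + b1*X.
def fit_line(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)                      # Sxx
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))   # Sxy
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x, illustrative data
b0, b1 = fit_line(x, y)
print(round(b0, 2), round(b1, 2))
```

In practice these estimates come out of PROC REG or the Enterprise Miner Regression node; the sketch only shows the arithmetic behind them.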

  12. Quadratic Regression Quadratic Regression model: Y = b0 + b1X + b2X²

  13. Polynomial Regression 3rd-order Regression model: Y = b0 + b1X + b2X² + b3X³
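Polynomial regression is still linear regression: the powers of X are just extra columns in the design matrix, and the coefficients solve the normal equations (XᵀX)b = Xᵀy. A self-contained sketch, with data chosen to lie exactly on 1 + x², so the fit should recover the true coefficients:

```python
# Fit Y = b0 + b1*X + ... + bk*X^k by solving the normal equations.
def solve(a, b):
    # Gaussian elimination with partial pivoting for the small system a*x = b.
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_poly(xs, ys, degree):
    X = [[xi ** p for p in range(degree + 1)] for xi in xs]   # design matrix
    k = degree + 1
    xtx = [[sum(X[i][a] * X[i][b] for i in range(len(xs))) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * ys[i] for i in range(len(xs))) for a in range(k)]
    return solve(xtx, xty)

xs = [0, 1, 2, 3, 4]
ys = [1, 2, 5, 10, 17]        # exactly 1 + x^2, illustrative data
b = fit_poly(xs, ys, 2)       # recovers roughly [1.0, 0.0, 1.0]
```

The same mechanism handles the 3rd-order model on the slide; only `degree` changes.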

  14. Indicator (Dummy) Variables Dummy, or indicator, variables allow for the inclusion of qualitative variables in the model. For example: I1 = 1 if female, 0 if male.

  15. Indicator (Dummy) Variables Model with Indicator variable: Y = b0 + b1X + b2I • Rewrite the model as: • For I = 0: Y = b0 + b1X • For I = 1: Y = (b0 + b2) + b1X

  16. Indicator Variables with interaction Model with Indicator variable and interaction term: Y = b0 + b1X + b2I + b3XI • Rewrite the model as: • For I = 0: Y = b0 + b1X • For I = 1: Y = (b0 + b2) + (b1 + b3)X
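The indicator-with-interaction model collapses into two straight lines, one per group: the dummy shifts the intercept and the interaction term shifts the slope. A tiny numeric check (the coefficient values are illustrative, not from any fitted model):

```python
# Y = b0 + b1*X + b2*I + b3*X*I reduces to two lines, one per group.
b0, b1, b2, b3 = 2.0, 0.5, 1.0, 0.25    # illustrative coefficients

def predict(x, i):
    return b0 + b1 * x + b2 * i + b3 * x * i

# Group I = 0: intercept b0, slope b1.
assert predict(4, 0) == b0 + b1 * 4
# Group I = 1: intercept b0 + b2, slope b1 + b3.
assert predict(4, 1) == (b0 + b2) + (b1 + b3) * 4
```

This is why a single regression with a dummy and an interaction term is equivalent to fitting the two groups separately with a shared error variance.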

  17. Hypothesis Test on the Slope of the Regression Line Two-tailed test: H0: β1 = 0 (X provides no information); Ha: β1 ≠ 0 (X does provide information). Test statistic: t = b1 / s_b1. For large data sets, reject H0 if |t| > 2.
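The test statistic can be computed by hand: s_b1 = sqrt(MSE / Sxx), where MSE is the residual sum of squares divided by n − 2. A sketch with made-up data where X clearly carries information, so |t| comfortably exceeds 2:

```python
import math

# Illustrative data with a strong linear relationship.
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx

# MSE = SSE / (n - 2); standard error of the slope s_b1 = sqrt(MSE / Sxx).
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
t = b1 / math.sqrt(mse / sxx)

print(abs(t) > 2)   # rule of thumb from the slide: reject H0 if |t| > 2
```

For small samples the exact cutoff comes from the t distribution with n − 2 degrees of freedom; 2 is the large-sample approximation the slide uses.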

  18. Model Assumptions and Residual Analysis Residuals (Y − Ŷ) should have: • Randomness • Constant Variance • Normal Distribution. These are checked with a residual plot of Y − Ŷ against Ŷ.

  19. Residual Analysis • A residual plot of Y − Ŷ against Ŷ reveals violation of the constant variance assumption. • How to fix it: Transformation.

  20. Residual Analysis • A sequence plot of the residuals Y − Ŷ reveals violation of the randomness assumption. • How to fix it: Add more predictor variables to explain patterns. • In time series data, add lags of Y or X as predictors: Yt−1, X1,t−1, X1,t−2, X2,t−1, etc.

  21. Residual Analysis • A frequency histogram of the residuals reveals violation of the normality assumption. • How to fix it: Transformation (start with easy transformations, such as Log(Y), then continue with bucket transformations, etc.)

  22. Logistic Regression Models and Stepwise Procedures

  23. Linear versus Logistic Regression • Linear Regression: Target is an interval variable. Input variables have any measurement level. Predicted values are the mean of the target variable at the given values of the input variables. • Logistic Regression: Target is a discrete (binary or ordinal) variable. Input variables have any measurement level. Predicted values are the probability of a particular level(s) of the target variable at the given values of the input variables.

  24. Parametric Models A parametric model fitted to the training data: E(Y | X=x) = g(x;w). Generalized Linear Model: E(Y | X=x) = p(x) = g⁻¹(w0 + w1x1 + … + wpxp), where g⁻¹ is the inverse link function.

  25. Logistic Regression Models: the logit transformation logit(p) = log( p / (1 - p) ) = w0 + w1x1 + … + wpxp. The log(odds) is linear in the inputs, while p itself follows the S-shaped logistic curve between 0 and 1.
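The logit and its inverse (the logistic, or sigmoid, function) translate between the probability scale and the linear-predictor scale. A minimal sketch:

```python
import math

def logit(p):
    # log-odds: maps p in (0, 1) onto the whole real line.
    return math.log(p / (1 - p))

def inv_logit(z):
    # logistic (sigmoid) function: maps the real line back into (0, 1).
    return 1 / (1 + math.exp(-z))

print(logit(0.5))                          # even odds give log-odds of 0
print(round(inv_logit(logit(0.8)), 6))     # round trip recovers p
```

In a fitted logistic regression, `inv_logit` applied to w0 + w1x1 + … + wpxp gives the predicted probability p(x).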

  26. Changing the Odds Increasing x1 by one unit: log( p′ / (1 - p′) ) = w0 + w1(x1 + 1) + … + wpxp = w1 + (w0 + w1x1 + … + wpxp) = w1 + log( p / (1 - p) ). So the new odds equal exp(w1) × the old odds; exp(w1) is the odds ratio.

  27. Logistic Regression Assumption Assumption: The logit transformation of the probabilities of the target value results in a linear relationship with the input variables.

  28. Changing the Odds log( p′ / (1 - p′) ) = w0 + w1(x1 + 1) + … + wpxp = w1 + log( p / (1 - p) ), so increasing x1 by one unit multiplies the odds by exp(w1), the odds ratio.

  29. Interpretation of Parameter Estimates The logit link function provides the most natural interpretation of the estimated coefficients: • The odds of a reference event is the ratio of P(event) to P(not event). The estimated coefficient of a predictor is the estimated change in the log of P(event)/P(not event) for each unit change in the predictor, assuming the other predictors remain constant. • Therefore, the odds ratio coefficients are multipliers that modify the odds P(event)/P(not event) when a certain predictor variable is increased by one unit.

  30. Stepwise Procedures These procedures either choose or eliminate variables, one at a time, in an effort to avoid including variables that either have no predictive ability or are highly correlated with other predictor variables. • Forward selection: Add one variable at a time until the next contribution is insignificant. • Backward elimination: Remove one variable at a time, starting with the "worst", until R² drops significantly. • Stepwise selection: Forward selection with the ability to remove variables that become insignificant.
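Forward selection from the list above can be sketched as a greedy loop: at each step, add the candidate variable that most improves the model score, and stop when no addition clears an improvement threshold. The score function here is a stand-in (a made-up table of R² values for each variable subset); in a real run it would refit the regression for every candidate set, as Enterprise Miner does internally.

```python
# Hypothetical R^2 for each variable subset -- illustrative numbers only.
SCORES = {
    frozenset(): 0.00,
    frozenset({"x1"}): 0.40,
    frozenset({"x2"}): 0.35,
    frozenset({"x3"}): 0.30,
    frozenset({"x1", "x2"}): 0.42,
    frozenset({"x1", "x3"}): 0.55,
    frozenset({"x2", "x3"}): 0.50,
    frozenset({"x1", "x2", "x3"}): 0.56,
}

def score(vars_):
    return SCORES.get(frozenset(vars_), 0.0)

def forward_select(candidates, threshold=0.05):
    chosen = set()
    while True:
        # Pick the unused variable whose addition gives the best score.
        best = max((v for v in candidates if v not in chosen),
                   key=lambda v: score(chosen | {v}), default=None)
        # Stop when nothing is left or the improvement is insignificant.
        if best is None or score(chosen | {best}) - score(chosen) < threshold:
            return chosen
        chosen.add(best)

print(sorted(forward_select({"x1", "x2", "x3"})))
```

With these scores the loop adds x1, then x3, and stops because adding x2 improves R² by only 0.01. Real stepwise selection uses p-value entry/stay cutoffs rather than a raw R² threshold, and can also drop variables; this sketch shows only the greedy forward skeleton.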

  31. Stepwise Regression An example run of the stepwise procedure: Include X3 → Include X2 → Include X5 → Remove X2 (when X5 was inserted into the model, X2 became unnecessary) → Include X6 → Include X7 → Remove X7 (it is insignificant) → Stop. Final model includes X3, X5 and X6.

  32. Forward Selection (Chart: input p-value per step against the entry cutoff, with average profit curves for the Training and Validation data.)

  33. Backward Selection (Chart: input p-value per step against the stay cutoff, with average profit curves for the Training and Validation data.)

  34. Stepwise Selection (Chart: input p-value per step against the entry and stay cutoffs, with average profit curves for the Training and Validation data.)

  35. Logistic Regression in Enterprise Miner Refer to our DONOR_RAW data. A logistic regression model for TARGET_B was fit (see the PR3 assignment for details), with stepwise selection applied. The data set was split into Training and Validation sets. 24.99% of the validation data were misclassified, i.e., classified as a donor when in fact a non-donor, or vice versa. Average profit in the validation data was $0.26 per person included in that set, which yielded a total profit of $2,280.76 from all persons in the validation set.

  36. Logistic Regression in Enterprise Miner The Lift and Response charts compare the model's performance using (1) Training and (2) Validation data. The baseline (Lift = 1) corresponds to a 5% response rate.

  37. Interpretation of Regression Coefficients: Linear Regression When the target is continuous (Linear Regression), the standard interpretation of a slope coefficient is as follows: The slope tells you the change in Y when a particular input X increases by one unit, while all other inputs are kept constant. Example: If a linear regression coefficient is equal to -2.5, that indicates that when the predictor increases, the target variable decreases (since the coefficient is negative). Specifically, for each unit increase in the predictor variable, the target variable decreases by 2.5 units.

  38. Interpretation of Regression Coefficients: Logistic Regression When the target is binary (Logistic Regression), the standard interpretation of an odds ratio coefficient is as follows: The odds ratio coefficient is a multiplier on the odds of the target event T (= probability of T / probability of non-T) when a particular input X increases by one unit, while all other inputs are kept constant. Example: If an odds ratio coefficient is equal to 1.05, that indicates that when the predictor increases, the probability of the target event also increases. Specifically, for each unit increase in the predictor variable, the odds P(event)/P(non-event) get multiplied by 1.05.
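The arithmetic behind that interpretation, with illustrative numbers: start from a probability, convert to odds, apply the odds ratio, and convert back.

```python
# Worked example of the odds-ratio interpretation (numbers are illustrative).
odds = 0.25 / 0.75            # initial odds when P(event) = 0.25
odds_ratio = 1.05             # per-unit multiplier from the fitted model

new_odds = odds * odds_ratio          # odds after a one-unit increase in X
new_p = new_odds / (1 + new_odds)     # convert odds back to a probability
print(round(new_p, 4))                # slightly above the original 0.25
```

Note that the odds ratio multiplies the odds, not the probability: here the probability moves from 0.25 to about 0.2593, not to 0.25 × 1.05.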

  39. Suggested readings • Read the SAS GSEM 5.3 text, chapter 5 (pp. 103-134) • Read the Sarma text, chapter 6 (pp. 235-304)
