## The Marriage of Market Basket Analysis to Predictive Modeling

Sanford Gayle

**Market Basket Analysis identifies the rule**

/our_company/bboard/hr/café/ … but

• How do you use this information?
• Can the information be used to develop a predictive model?
• More generally, how do you develop predictive models using transactional tables?

**Data Mining Software Objectives**

• Predictive Modeling
• Clustering
• Market Basket Analysis
• Feature Discovery; that is, improve the predictive accuracy of existing models

**Agenda**

• Converting a transactional table to a modeling table
• The curse of dimensionality & possible fixes
• A feature discovery process: using market basket analysis output as an input to predictive modeling
• A dimensional reduction scheme using confidence

**DM Table Structures**

• Transactional tables (Market Basket Analysis)

| Trans-id | page | spend | count |
|----------|-------|--------|-------|
| id-1 | page1 | $0 | 1 |
| id-1 | page2 | $0 | 1 |
| id-1 | page3 | $0 | 1 |
| id-1 | page4 | $19.99 | 1 |
| id-1 | page5 | $0 | 1 |
| id-2 | page1 | $0 | 1 |

• Modeling tables (modeling & clustering tools)

| Trans-id | page | spend | count |
|----------|------|--------|-------|
| id-1 | . | $19.99 | 5 |
| id-2 | . | $0 | 1 |

**Converting Transactional Into Modeling Data**

• Continuous variable case: easy
  • Collapse the spend or count columns via the sum, mean, or frequency statistic for each transaction-id value
  • `proc sql; create table new as select id, sum(amount) as total from old group by id; quit;`
• Categorical variable case: challenging
  • It seems the page-level detail is lost when the rows are rolled up or collapsed
  • However, with transposition you collapse the rows onto a single row for each id; each distinct page becomes a column in the modeling table and takes the count or sum statistic as its value

**The Input Discovery Process**

• The existing modeling table contains: id-1, age, income, job-category, married, recency, frequency, zip-code …
• The new potential predictors from the transpose contain: id-1, spend on page1, spend on page2, spend on page3, spend on page4, spend on page5
• Augment the existing modeling table with the new inputs and, hopefully, discover new, significant predictors that improve predictive accuracy

**Problem with Transpose Method**

• Suppose the server has 1,000 distinct pages; the transpose method now produces 1,000 new columns instead of 5
• Sparsity: the new columns have a preponderance of missing values; e.g., id-2 will have 5 missing values and the 1 non-missing
• Regression, Neural, and Cluster tools struggle with this many variables, especially when one value dominates (e.g., zeros or missing)

**The Curse of Dimensionality**

• Suppose interest lies in a second classification column too; e.g., both time (hour) and page visited
• The transpose method now produces 1,000 + 24 new variables, assuming no interest in interactions
• If interactions are of interest, then there will be 24,000 (1,000 × 24) new variables generated

**General Fix**

• Reduce the number of levels of the categorical variable (e.g., using confidence)
• Use the transpose method to convert the transactional table to a modeling table
• Add the new inputs to the traditional modeling table in an effort to improve predictive accuracy

**Creating Rules-Based Dummy Variables**

• Obtain rules using market basket analysis
• Choose the rule of interest
• Identify folks having the rule of interest in their market basket
• Create a dummy variable flagging them
• Augment the traditional modeling table with the dummy variable
• Use the dummy variable as an input or target in a predictive modeling tool

**Using SQL to Identify Folks Having a Rule of Interest in Their Market Basket**

**Possible Sub-setting Criteria**

• Any rule of interest
• The confidence; e.g., all rules having confidence >= 100 (optimal level of confidence?)
• The support; e.g., all rules having support >= 10 (optimal level of support?)
• The lift; e.g., all rules having lift >= 5 (optimal level of lift?)

**Using Confidence as the Basis for a Reclassification Scheme**

• Suppose diapers → beer has a confidence of 100%
• Then the two levels “diapers” & “beer” can be mapped into the value “diapersbeer”, it seems
• Actually, both the rule and its reverse must have a confidence of 100%

**The Confidence Reclassification Scheme**

• If the confidence for both the rule and its reverse is > 80, then combine the two levels into the rule-based level
• e.g., “page1” & “page2” are both mapped into “page1page2”
• Using 80 instead of 100 will introduce inaccuracy, but an analyst overwhelmed by too many levels will likely be willing to trade a little accuracy for dimensional reduction

**The Confidence Reclassification Scheme**

• Use the transpose method to generate candidate predictors
• Augment the traditional modeling table with the new candidate predictors table
• Develop an enhanced model using some of the candidate predictors in the hope of improving predictive accuracy

**Contact Information**

Sanford.Gayle@sas.com
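
As a rough illustration of the confidence reclassification scheme followed by the transpose method, here is a minimal Python sketch. It is not the presenter's SAS implementation. The toy rows, the function names (`build_baskets`, `confidence`, `reclassify`, `transpose`), and the choice to fill unvisited pages with 0 rather than a missing value are all assumptions made for illustration. Confidence for a rule A → B is computed as the percentage of baskets containing A that also contain B.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy transactional table: (transaction id, page visited).
rows = [
    ("id-1", "page1"), ("id-1", "page2"), ("id-1", "page3"),
    ("id-2", "page1"), ("id-2", "page2"),
    ("id-3", "page1"), ("id-3", "page2"), ("id-3", "page4"),
    ("id-4", "page3"),
    ("id-5", "page1"), ("id-5", "page2"), ("id-5", "page3"),
]


def build_baskets(rows):
    """Group pages by transaction id (one basket per id)."""
    baskets = defaultdict(set)
    for trans_id, page in rows:
        baskets[trans_id].add(page)
    return dict(baskets)


def confidence(baskets, lhs, rhs):
    """Confidence of the rule lhs -> rhs, as a percentage."""
    with_lhs = [b for b in baskets.values() if lhs in b]
    if not with_lhs:
        return 0.0
    return 100.0 * sum(rhs in b for b in with_lhs) / len(with_lhs)


def reclassify(baskets, threshold=80.0):
    """Map two levels to one combined level when the rule and its reverse
    both exceed the threshold. Simplification: each level is merged at
    most once, so chains of three or more levels are not combined."""
    levels = sorted(set().union(*baskets.values()))
    mapping = {level: level for level in levels}
    for a, b in combinations(levels, 2):
        if mapping[a] != a or mapping[b] != b:
            continue  # one of the two levels already belongs to a pair
        if (confidence(baskets, a, b) > threshold
                and confidence(baskets, b, a) > threshold):
            mapping[a] = mapping[b] = a + b
    return mapping


def transpose(rows, mapping):
    """One row per id, one column per reclassified level, holding a count.
    Unvisited levels get 0 here (an assumption; the slides speak of missing)."""
    columns = sorted(set(mapping.values()))
    table = defaultdict(lambda: dict.fromkeys(columns, 0))
    for trans_id, page in rows:
        table[trans_id][mapping[page]] += 1
    return dict(table)


baskets = build_baskets(rows)
mapping = reclassify(baskets)
modeling_table = transpose(rows, mapping)

for trans_id in sorted(modeling_table):
    print(trans_id, modeling_table[trans_id])
```

In this toy data only page1 and page2 clear the threshold in both directions, so four page levels collapse to three columns. The resulting per-id rows could then be joined to an existing modeling table by transaction id as candidate predictors.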