
Some slide material taken from: Groth, Han and Kamber, SAS Institute


Presentation Transcript


  1. Data Mining Some slide material taken from: Groth, Han and Kamber, SAS Institute

  2. Overview of this Presentation • Introduction to Data Mining • The SEMMA Methodology • Regression/Logistic Regression • Decision Trees • SAS EM Demo: The Home Equity Loan Case • Important DM Techniques Not Covered Today: • Market Basket Analysis • Memory-Based Reasoning • Web Link Analysis

  3. The UNT/SAS® joint Data Mining Certificate • Requires: • DSCI 2710 • DSCI 3710 • BCIS 4660 • DSCI 4520

  4. Introduction to DM “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” (Sir Arthur Conan Doyle: Sherlock Holmes, "A Scandal in Bohemia")

  5. What Is Data Mining? • Data mining (knowledge discovery in databases): • A process of identifying hidden patterns and relationships within data (Groth) • Data mining: • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases

  6. Data • hospital patient registries • electronic point-of-sale data • remote sensing images • tax returns • stock trades • OLTP • telephone calls • airline reservations • credit card charges • catalog orders • bank transactions

  7. Multidisciplinary • Data mining (KDD) draws on: • Statistics • Pattern Recognition • Neurocomputing • Machine Learning / AI • Databases

  8. Data Mining: A KDD Process • Data mining is the core of the knowledge discovery process. • [Diagram: Databases → Data Cleaning & Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]

  9. Data Mining and Business Intelligence • [Pyramid diagram, increasing potential to support business decisions from bottom to top: Data Sources (Paper, Files, Information Providers, Database Systems, OLTP; DBA) → Data Warehouses / Data Marts (OLAP, MDA) → Data Exploration (Statistical Analysis, Querying and Reporting; Data Analyst) → Data Mining (Information Discovery) → Data Presentation (Visualization Techniques; Business Analyst) → Making Decisions (End User / Manager)]

  10. Architecture of a Typical Data Mining System • [Diagram: Databases and Data Warehouse feed, via data cleaning, data integration, and filtering, a database or data warehouse server; above it sit the data mining engine and knowledge-base, then pattern evaluation, and on top the graphical user interface]

  11. Introducing SAS Enterprise Miner (EM)

  12. The SEMMA Methodology • Introduced by SAS Institute • Implemented in SAS Enterprise Miner (EM) • Organizes a DM effort into 5 activity groups: Sample, Explore, Modify, Model, Assess

  13. Sample • Nodes: Input Data Source, Sampling, Data Partition

  14. Explore • Nodes: Distribution Explorer, Multiplot, Insight, Association, Variable Selection, Link Analysis

  15. Modify • Nodes: Data Set Attributes, Transform Variables, Filter Outliers, Replacement, Clustering, Self-Organizing Maps / Kohonen Networks, Time Series

  16. Model • Nodes: Regression, Tree, Neural Network, Princomp/Dmneural, User Defined Model, Ensemble, Memory-Based Reasoning, Two-Stage Model

  17. Assess • Nodes: Assessment, Reporter

  18. Other Types of Nodes • Scoring Nodes: Score, C*Score • Utility Nodes: Group Processing, Data Mining Database, SAS Code, Control Point, Subdiagram

  19. DATA MINING AT WORK: Detecting Credit Card Fraud • Credit card companies want a way to monitor new transactions and detect those made on stolen credit cards. Their goal is to detect the fraud while it is taking place. • Within a few weeks after each transaction they will know which transactions were fraudulent and which were not, and they can then use this data to validate their fraud detection and prediction scheme.

  20. DATA MINING AT WORK: Strategic Pricing Solutions at MCI MCI now has a solution for making strategic pricing decisions, driving effective network analysis, enhancing segment reporting and creating data for sales leader compensation. Before implementing SAS, the process of inventorying MCI's thousands of network platforms and IT systems – determining what each one does, who runs them, how they help business and which products they support – was completely manual. The model created with SAS has helped MCI to catalog all that information and map the details to products, customer segments and business processes. "That's something everyone is excited about," says Leslie Mote, director of MCI corporate business analysis. "Looking at the cost of a system and what it relates to helps you see the revenue you're generating from particular products or customers. I can see what I'm doing better."

  21. Our own example: The Home Equity Loan Case • HMEQ Overview • Determine who should be approved for a home equity loan. • The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan. • The input variables include the amount of the loan, the amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.

  22. HMEQ case overview • The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built with predictive modeling tools, but it must be sufficiently interpretable to provide a reason for any adverse actions (rejections). • The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.
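As a quick check of the figures above, the quoted 20% adverse-outcome rate follows directly from the counts (a minimal sketch; the 12 input names are the ones that appear on the later schema slides):

```python
# Counts quoted above: 1,189 adverse outcomes among 5,960 loans.
n_total, n_bad = 5960, 1189
prior_bad_rate = n_bad / n_total
print(f"Prior BAD rate: {prior_bad_rate:.1%}")  # ≈ 19.9%, i.e. the quoted 20%

# The 12 recorded input variables (named as on the later schema slides)
inputs = ["LOAN", "MORTDUE", "VALUE", "REASON", "JOB", "YOJ",
          "DEROG", "DELINQ", "CLAGE", "NINQ", "CLNO", "DEBTINC"]
assert len(inputs) == 12
```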

  23. The HMEQ Loan process • An applicant comes forward with a specific property and a reason for the loan (Home-Improvement, Debt-Consolidation) • Background info related to job and credit history is collected • The loan gets approved or rejected • Upon approval, the Applicant becomes a Customer • Information related to how the loan is serviced is maintained, including the Status of the loan (Current, Delinquent, Defaulted, Paid-Off)

  24. The HMEQ Loan Transactional Database • Entity Relationship Diagram (ERD), Logical Design: [Diagram: an APPLICANT applies for an HMEQ Loan on a PROPERTY, handled by an OFFICER; upon Approval the APPLICANT becomes a CUSTOMER, whose ACCOUNT has a HISTORY; loan attributes include Status, Balance, Reason, MonthlyPayment, Approval, and Date]

  25. HMEQ Transactional Database: the Relations • Entity Relationship Diagram (ERD), Physical Design: • HMEQLoanApplication(OFFICERID, APPLICANTID, PROPERTYID, LOAN, REASON, DATE, APPROVAL) • Applicant(APPLICANTID, NAME, JOB, DEBTINC, YOJ, DEROG, CLNO, DELINQ, CLAGE, NINQ) • Property(PROPERTYID, ADDRESS, VALUE, MORTDUE) • Account(ACCOUNTID, CUSTOMERID, PROPERTYID, ADDRESS, BALANCE, MONTHLYPAYMENT, STATUS) • History(HISTORYID, ACCOUNTID, PAYMENT, DATE) • Customer(CUSTOMERID, APPLICANTID, NAME, ADDRESS) • Officer(OFFICERID, OFFICERNAME, PHONE, FAX)

  26. The HMEQ Loan Data Warehouse Design • We have some slowly changing attributes: HMEQLoanApplication: Loan, Reason, Date; Applicant: Job and Credit Score related attributes; Property: Value, Mortgage, Balance • An applicant may reapply for a loan, by which time some of these attributes may have changed • We need to introduce “Key” attributes and make them primary keys

  27. The HMEQ Loan Data Warehouse Design • STAR 1 – Loan Application facts • Fact Table: HMEQApplicationFact • Dimensions: Applicant, Property, Officer, Time • STAR 2 – Loan Payment facts • Fact Table: HMEQPaymentFact • Dimensions: Customer, Property, Account, Time

  28. Two Star Schemas for HMEQ Loans • HMEQApplicationFact(APPLICANTKEY, PROPERTYKEY, OFFICERKEY, TIMEKEY, LOAN, REASON, APPROVAL) • HMEQPaymentFact(CUSTOMERKEY, PROPERTYKEY, ACCOUNTKEY, TIMEKEY, BALANCE, PAYMENT, STATUS) • Dimensions: Applicant(APPLICANTKEY, APPLICANTID, NAME, JOB, DEBTINC, YOJ, DEROG, CLNO, DELINQ, CLAGE, NINQ), Property(PROPERTYKEY, PROPERTYID, ADDRESS, VALUE, MORTDUE), Officer(OFFICERKEY, OFFICERID, OFFICERNAME, PHONE, FAX), Customer(CUSTOMERKEY, CUSTOMERID, APPLICANTID, NAME, ADDRESS), Account(ACCOUNTKEY, LOAN, MATURITYDATE, MONTHLYPAYMENT), Time(TIMEKEY, DATE, MONTH, YEAR)

  29. The HMEQ Loan DW: Questions asked by management • How many applications were filed each month during the last year? What percentage of them were approved each month? • How has the monthly average loan amount been fluctuating during the last year? Is there a trend? • Which customers were delinquent in their loan payment during the month of September? • How many loans have defaulted each month during the last year? Is there an increasing or decreasing trend? • How many defaulting loans were approved last year by each loan officer? Who are the officers with the largest number of defaulting loans?

  30. The HMEQ Loan DW: Some more involved questions • Are there any patterns suggesting which applicants are more likely to default on their loan after it is approved? • Can we relate loan defaults to applicant job and credit history? Can we estimate probabilities to default based on applicant attributes at the time of application? Are there applicant segments with higher probability? • Can we look at relevant data and build a predictive model that will estimate such probability to default on the HMEQ loan? If we make such a model part of our business policy, can we decrease the percentage of loans that eventually default by applying more stringent loan approval criteria?

  31. Selecting Task-relevant Attributes • [Diagram: the HMEQApplicationFact and HMEQPaymentFact star schemas from slide 28, from which the attributes relevant to the modeling task are selected]

  32. HMEQ final task-relevant data file

  33. HMEQ: Modeling Goal • The credit scoring model should compute the probability that a given loan applicant will default on loan repayment. A threshold is to be selected such that all applicants whose probability of default exceeds the threshold are recommended for rejection. • Using the HMEQ task-relevant data file, three competing models will be built: a logistic regression model, a decision tree, and a neural network • Model assessment will allow us to select the best of the three alternative models
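The build-and-assess step can also be sketched outside SAS EM. The sketch below is a hypothetical illustration using scikit-learn and a synthetic stand-in for the HMEQ file (the real data is not included here); it fits the three competing model types and assesses each on a holdout split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the HMEQ task-relevant file:
# 12 inputs, with the rare class (BAD = 1) around 20%.
X, y = make_classification(n_samples=2000, n_features=12,
                           weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

# The three competing model types named on the slide
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "neural network": MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                                    random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]   # estimated P(BAD) per applicant
    print(f"{name}: holdout AUC = {roc_auc_score(y_te, p):.3f}")
```

In SAS EM the same comparison is done with the Assessment node; the holdout AUC here plays the role of the assessment statistic used to pick the best model.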

  34. Predictive Modeling • [Diagram: a rectangular data matrix; the rows are Cases, the columns are the Inputs plus a single Target column]

  35. Modeling Tools: Logistic Regression

  36. Modeling Techniques: Separate Sampling • Benefits: • Helps detect rare target levels • Speeds processing • Risks: • Biases predictions (correctable) • Increases prediction variability
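The "correctable bias" mentioned above can be undone after scoring. SAS EM handles this with prior-probability adjustments; the sketch below is a hypothetical plain-Python illustration of the same correction, rescaling a predicted probability from the oversampled event rate back to the population prior:

```python
def correct_for_oversampling(p_model, rho1, pi1):
    """Adjust a predicted P(target=1) from a separately sampled
    training set (event rate rho1) back to the population prior pi1."""
    rho0, pi0 = 1 - rho1, 1 - pi1
    num = p_model * pi1 / rho1
    den = num + (1 - p_model) * pi0 / rho0
    return num / den

# Example: the model was trained on a 50/50 separate sample, but the
# true event rate is 20% (as in HMEQ). A raw score of 0.50 then
# corresponds to a much lower population probability.
print(correct_for_oversampling(0.50, rho1=0.5, pi1=0.2))  # → 0.2
```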

  37. Logistic Regression Models • log(odds): logit(p) = log( p / (1 - p) ) = w0 + w1x1 + … + wpxp • Inverting the link, p = g⁻¹(w0 + w1x1 + … + wpxp), the logistic curve that rises from 0 to 1 and passes through p = 0.5 where logit(p) = 0. [Chart: the logistic curve fitted to the Training Data]
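In code, the logit link and its inverse look like this (a minimal numpy sketch with made-up weights, just to make the formula concrete):

```python
import numpy as np

def logit(p):
    """Log-odds: logit(p) = log(p / (1 - p))."""
    return np.log(p / (1 - p))

def inv_logit(z):
    """Logistic curve g⁻¹: maps the linear score back into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear score w0 + w1*x1 + ... + wp*xp with illustrative weights
w = np.array([-1.0, 0.8, 0.3])   # w0, w1, w2 (made up)
x = np.array([1.0, 2.0, -1.0])   # 1 for the intercept, then x1, x2
z = w @ x                        # the linear score (log-odds)
p = inv_logit(z)
print(p)                         # the fitted probability
assert np.isclose(logit(p), z)   # logit(p) recovers the linear score
```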

  38. Changing the Odds • Increasing x1 by one unit shifts the log-odds by w1: log( p / (1 - p) ) = w0 + w1(x1 + 1) + … + wpxp = w1 + (w0 + w1x1 + … + wpxp) • Equivalently, the odds are multiplied by exp(w1): the odds ratio for x1.
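The odds-ratio identity is easy to verify numerically (numpy sketch with made-up weights): increasing x1 by one unit multiplies the odds by exp(w1), regardless of the other inputs.

```python
import numpy as np

w0, w1, w2 = -1.0, 0.8, 0.3            # illustrative weights

def odds(x1, x2):
    z = w0 + w1 * x1 + w2 * x2         # linear score (log-odds)
    return np.exp(z)                   # odds = p / (1 - p)

ratio = odds(x1=3.0, x2=5.0) / odds(x1=2.0, x2=5.0)
print(np.isclose(ratio, np.exp(w1)))   # → True: odds ratio = exp(w1)
```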

  39. Modeling Tools: Decision Trees

  40. Divide and Conquer the HMEQ Data • [Tree diagram: root node n = 5,000, 10% BAD; binary split on Debt-to-Income Ratio < 45; "yes" child n = 3,350, 5% BAD; "no" child n = 1,650, 21% BAD] The tree is fitted to the data by recursive partitioning. Partitioning refers to segmenting the data into subgroups that are as homogeneous as possible with respect to the target. In this case, the binary split (Debt-to-Income Ratio < 45) was chosen. The 5,000 cases were split into two groups, one with a 5% BAD rate and the other with a 21% BAD rate. The method is recursive because each subgroup results from splitting a subgroup from a previous split. Thus, the 3,350 cases in the left child node and the 1,650 cases in the right child node are split again in similar fashion.
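One step of the recursive partitioning described above can be reproduced on toy data (a hypothetical stand-in in pure Python): partition the cases on a debt-to-income threshold and compare the BAD rates in the two child nodes.

```python
# Toy stand-in cases: (debt_to_income, bad) pairs
cases = [(30, 0), (35, 0), (40, 1), (42, 0), (20, 0),
         (50, 1), (55, 1), (47, 0), (60, 1), (48, 0)]

def bad_rate(subset):
    """Proportion of BAD = 1 cases in a node."""
    return sum(bad for _, bad in subset) / len(subset)

def split(cases, threshold):
    """One recursive-partitioning step: binary split on debt-to-income."""
    left = [c for c in cases if c[0] < threshold]    # "yes" child
    right = [c for c in cases if c[0] >= threshold]  # "no" child
    return left, right

left, right = split(cases, threshold=45)
print(f"left:  n={len(left)},  BAD rate={bad_rate(left):.0%}")
print(f"right: n={len(right)}, BAD rate={bad_rate(right):.0%}")
# Each child node could then be split again, recursively.
```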

  41. The Cultivation of Trees • Split Search • Which splits are to be considered? • Splitting Criterion • Which split is best? • Stopping Rule • When should the splitting stop? • Pruning Rule • Should some branches be lopped off?

  42. Possible Splits to Consider: an Enormous Number • [Chart: number of candidate splits vs. number of input levels (1 to 20); the count climbs into the hundreds of thousands, with nominal inputs far outpacing ordinal inputs]

  43. Splitting Criteria How is the best split determined? In some situations, the worth of a split is obvious. If the expected target is the same in the child nodes as in the parent node, no improvement was made, and the split is worthless! In contrast, if a split results in pure child nodes, the split is undisputedly best. For classification trees, the three most widely used splitting criteria are based on the Pearson chi-squared test, the Gini index, and entropy. All three measure the difference in class distributions across the child nodes. The three methods usually give similar results.
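The Gini index and entropy mentioned above are easy to compute directly (a minimal sketch): both are zero for a pure node and maximal for an even class mix, and the worth of a split is the parent impurity minus the weighted child impurities.

```python
from math import log2

def gini(p):
    """Gini index of a binary node with event proportion p."""
    return 2 * p * (1 - p)

def entropy(p):
    """Entropy (in bits) of a binary node with event proportion p."""
    if p in (0.0, 1.0):
        return 0.0                     # a pure node has zero impurity
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Worth of the HMEQ split from the earlier slide:
# parent 10% BAD (n=5,000) -> children 5% (n=3,350) and 21% (n=1,650)
w_left, w_right = 3350 / 5000, 1650 / 5000
for name, impurity in [("gini", gini), ("entropy", entropy)]:
    reduction = impurity(0.10) - (w_left * impurity(0.05)
                                  + w_right * impurity(0.21))
    print(f"{name} reduction: {reduction:.4f}")   # positive: a useful split
```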

  44. Benefits of Trees • Interpretability • tree-structured presentation • Mixed Measurement Scales • nominal, ordinal, interval • Robustness (tolerance to noise) • Handling of Missing Values • Regression trees, Consolidation trees

  45. Summary • Data Mining provides new opportunities to discover previously unknown relationships • SAS EM employs the SEMMA Methodology to extract “knowledge” • Alternative approaches for KD include: • Regression/Logistic Regression • Decision Trees
