
Data Mining Process Source: CRISP-DM (SPSS.com website)


Presentation Transcript


  1. Data Mining Process Source: CRISP-DM (SPSS.com website)

  2. Data Cleaning • MIS Issues • (Source: Article by Ralph Kimball) • Analyst Issues

  3. MIS Issues • Elementizing (Parsing) • Standardizing • Verifying • Matching • Householding • Documenting

  4. Elementising
       Ralph B and Julianne Kimball Trustees for Kimball Fred C
       Ste. 116
       13150 Hiway 9
       Box 1234 Boulder Crk
       Colo 95006

  5. Addressee First Name(1): Ralph
       Addressee Middle Initial(1): B
       Addressee Last Name(1): Kimball
       Addressee First Name(2): Julianne
       Addressee Last Name(2): Kimball
       Addressee Relationship: Trustees for
       Relationship Person First Name: Fred
       Relationship Person Middle Name: C
       Relationship Person Last Name: Kimball
       Street Address Number: 13150
       Street Name: Hiway 9
       Suite Number: 116
       Post Office Box Number: 1234
       City: Boulder Crk
       State: Colo
       Five Digit Zip: 95006
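A minimal sketch of the elementizing step, assuming the label arrives as five separate lines (the line breaks are an assumption about the original layout) and handling only the address fields. The patterns and field names are hypothetical; splitting the addressee names and the trustee relationship needs considerably more logic than shown here.

```python
import re

# Assumed line-by-line layout of the raw label (hypothetical split).
RAW_LINES = [
    "Ralph B and Julianne Kimball Trustees for Kimball Fred C",
    "Ste. 116",
    "13150 Hiway 9",
    "Box 1234 Boulder Crk",
    "Colo 95006",
]

# One illustrative pattern per line type; named groups become field names.
PATTERNS = [
    re.compile(r"Ste\.?\s+(?P<suite_number>\d+)"),
    re.compile(r"(?P<street_number>\d+)\s+(?P<street_name>.+)"),
    re.compile(r"Box\s+(?P<po_box>\d+)\s+(?P<city>.+)"),
    re.compile(r"(?P<state>[A-Za-z]+)\s+(?P<zip5>\d{5})"),
]


def elementize(lines):
    """Split address lines into named elements; unmatched lines are skipped."""
    fields = {}
    for line in lines:
        for pattern in PATTERNS:
            match = pattern.fullmatch(line.strip())
            if match:
                fields.update(match.groupdict())
                break
    return fields


parsed = elementize(RAW_LINES)
```

The values are kept exactly as written ("Hiway 9", "Colo"); cleaning them up belongs to the standardizing and verifying steps.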

  6. Standardizing • Ste = Suite • Hiway 9 = Highway 9 • Other example: Grade “D” = Distinction in Australia
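One way to sketch standardizing is a lookup table of variant spellings. The table below holds only the two mappings from the slide; the function name and the tokenizing rule are assumptions for illustration.

```python
import re

# Variant -> standard form; only the two examples from the slide.
STANDARD_FORMS = {
    "ste": "Suite",
    "hiway": "Highway",
}


def standardize(text):
    """Replace known variant words (with or without a trailing period)."""
    def replace_word(match):
        word = match.group(0)
        return STANDARD_FORMS.get(word.rstrip(".").lower(), word)

    return re.sub(r"[A-Za-z]+\.?", replace_word, text)


suite = standardize("Ste. 116")
street = standardize("13150 Hiway 9")
```

A real table would be much larger and domain specific, as the grade example shows: the same code ("D") can mean different things in different sources.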

  7. Verification • Zip code 95006 is CA, not Colorado
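Verification can be sketched as a cross-check of one field against another. The three-digit ZIP prefix ranges below are an illustrative subset covering only California and Colorado; a production check would use a complete postal reference file. Function names are hypothetical.

```python
# Illustrative subset of ZIP prefix ranges (first three digits), not complete.
ZIP3_RANGES = {
    "CA": (900, 961),
    "CO": (800, 816),
}


def state_for_zip(zip5):
    """Return the state whose prefix range contains the ZIP, or None."""
    prefix = int(zip5[:3])
    for state, (low, high) in ZIP3_RANGES.items():
        if low <= prefix <= high:
            return state
    return None


def zip_matches_state(state, zip5):
    """True only when the ZIP is known and belongs to the stated state."""
    return state_for_zip(zip5) == state


looked_up = state_for_zip("95006")
consistent = zip_matches_state("CO", "95006")
```

For the example record, the lookup returns CA, so the stated "Colo" is flagged as inconsistent.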

  8. Matching/Householding • Match record with other customer records containing Ralph and Julianne Kimball • Establish that they are part of the same household
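A rough sketch of householding, assuming records have already been elementized and standardized. The match key (last name, street, ZIP) and the record layout are assumptions; real matching usually needs fuzzy comparison as well. The third record is invented for contrast.

```python
from collections import defaultdict


def household_key(record):
    """Key on which two customer records count as the same household."""
    return (
        record["last_name"].strip().lower(),
        record["street"].strip().lower(),
        record["zip5"],
    )


def group_households(records):
    """Group first names by household key."""
    households = defaultdict(list)
    for record in records:
        households[household_key(record)].append(record["first_name"])
    return dict(households)


records = [
    {"first_name": "Ralph", "last_name": "Kimball",
     "street": "13150 Highway 9", "zip5": "95006"},
    {"first_name": "Julianne", "last_name": "KIMBALL",
     "street": "13150 highway 9", "zip5": "95006"},
    # Hypothetical unrelated customer.
    {"first_name": "Pat", "last_name": "Smith",
     "street": "1 Main St", "zip5": "80301"},
]
households = group_households(records)
```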

  9. Analyst Issues • Physical data problems • Data Dictionaries • Validation (Frequencies) • Missing Data • The “zero” value problem • Inappropriate (Future) data for modeling • Unavailable data

  10. Physical • Cannot access data • ASCII vs EBCDIC • On a medium that you can’t use (certain type of tape, for instance)
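The ASCII vs EBCDIC issue can be illustrated with Python's built-in `cp037` codec, one common EBCDIC code page. Which code page a given mainframe extract actually uses is an assumption to confirm with the data provider.

```python
def ebcdic_to_text(raw, codec="cp037"):
    """Decode EBCDIC bytes to a Python string (cp037 is an assumed code page)."""
    return raw.decode(codec)


# Simulate a field as it would appear in an EBCDIC file.
ebcdic_bytes = "KIMBALL 95006".encode("cp037")
ascii_bytes = "KIMBALL 95006".encode("ascii")

decoded = ebcdic_to_text(ebcdic_bytes)
same_bytes = ebcdic_bytes == ascii_bytes
```

The two byte strings differ even though they encode the same text, which is why an EBCDIC file read as ASCII looks like garbage.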

  11. Data Dictionaries • What are the fields? • Where are they located? • What format are they stored in?

  12. Missing Data • Ignore • Find the right values if you can • Use the average for that variable • Replace with a number that matches its characteristics (What do the missing people look like in terms of the dependent variable? Who else looks like that?)
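Two of the options above, sketched on toy data. The second function is one possible reading of "matches its characteristics": it fills missing values with the known value whose average outcome is closest to the average outcome of the missing group. Both function names and the data are hypothetical, and each function assumes at least one known and one missing value.

```python
from statistics import mean


def impute_mean(values):
    """Replace None with the average of the known values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_by_outcome(values, outcomes):
    """Replace None with the known value whose outcome rate looks most alike."""
    missing_rate = mean(o for v, o in zip(values, outcomes) if v is None)
    outcomes_by_value = {}
    for v, o in zip(values, outcomes):
        if v is not None:
            outcomes_by_value.setdefault(v, []).append(o)
    fill = min(
        outcomes_by_value,
        key=lambda v: abs(mean(outcomes_by_value[v]) - missing_rate),
    )
    return [fill if v is None else v for v in values]


values = [0, 0, 2, 2, None, None]   # predictor with two missing entries
outcomes = [0, 0, 1, 1, 1, 1]       # dependent variable (1 = bad)

by_mean = impute_mean(values)
by_outcome = impute_by_outcome(values, outcomes)
```

Here the missing cases all have outcome 1, like the cases with value 2, so the outcome-based fill is 2 while the mean fill is 1.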

  13. The zero problem • What does 0 mean? • If “Number of Revolving Bankcard Trades Currently Past Due” = 0, what does that mean?

  14. # of Bank Rev. Trds Currently Past Due

      BRPSTD                  Frequency  Percent  Cum. Frequency  Cum. Percent
      ------------------------------------------------------------------------
      0                           13485     96.0           13485          96.0
      1                             486      3.5           13971          99.5
      2                              57      0.4           14028          99.9
      3                              12      0.1           14040         100.0
      4                               2      0.0           14042         100.0

  15. # of Trds

      TRADES                  Frequency  Percent  Cum. Frequency  Cum. Percent
      ------------------------------------------------------------------------
      0                            1606     11.4            1606          11.4
      1                            1080      7.7            2686          19.1
      2                            1056      7.5            3742          26.6
      3                            1007      7.2            4749          33.8
      4                             949      6.8            5698          40.6
      5                             911      6.5            6609          47.1
      6                             849      6.0            7458          53.1
      7                             793      5.6            8251          58.8
      8                             682      4.9            8933          63.6
      9                             622      4.4            9555          68.0
      10+                          4487     32.0           14042         100.0

  16. # of Bank Rev. Trds

      BRTRDS                  Frequency  Percent  Cum. Frequency  Cum. Percent
      ------------------------------------------------------------------------
      INQS. & PR ONLY                64      0.5              64           0.5
      PR ONLY                        22      0.2              86           0.6
      INQS. ONLY                    960      6.8            1046           7.4
      NO RECORD                     560      4.0            1606          11.4
      0                            6183     44.0            7789          55.5
      1                            2616     18.6           10405          74.1
      2                            1427     10.2           11832          84.3
      3                             831      5.9           12663          90.2
      4                             496      3.5           13159          93.7
      5                             287      2.0           13446          95.8
      6                             188      1.3           13634          97.1
      7                             142      1.0           13776          98.1
      8                              92      0.7           13868          98.8
      9                              60      0.4           13928          99.2
      10+                           114      0.8           14042         100.0

  17. # of Bank Rev. Trds Currently Past Due

      BRPSTD                  Frequency  Percent  Cum. Frequency  Cum. Percent
      ------------------------------------------------------------------------
      NO TRADES OF THIS TYPE       6183     44.0            6183          44.0
      INQS. & PR ONLY                64      0.5            6247          44.5
      PR ONLY                        22      0.2            6269          44.6
      INQS. ONLY                    960      6.8            7229          51.5
      NO RECORD                     560      4.0            7789          55.5
      MISSING                      3475     24.7           11264          80.2
      0                            2221     15.8           13485          96.0
      1                             486      3.5           13971          99.5
      2                              57      0.4           14028          99.9
      3                              12      0.1           14040         100.0
      4                               2      0.0           14042         100.0
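The counts on slides 14 and 17 can be reconciled with a few lines of arithmetic. The sketch below uses only the frequencies shown on those slides; the variable names are illustrative.

```python
# Slide 14: everything coded as 0 in the raw BRPSTD field.
RAW_ZERO_COUNT = 13485

# Slide 17: the same records after the special cases are separated out.
RECODED_ZERO_GROUP = {
    "NO TRADES OF THIS TYPE": 6183,
    "INQS. & PR ONLY": 64,
    "PR ONLY": 22,
    "INQS. ONLY": 960,
    "NO RECORD": 560,
    "MISSING": 3475,
    "0": 2221,  # has bank revolving trades, none currently past due
}

recoded_total = sum(RECODED_ZERO_GROUP.values())
true_zero_share = RECODED_ZERO_GROUP["0"] / RAW_ZERO_COUNT
```

The recoded groups add back up to the 13,485 raw zeros, and only about one in six of those raw zeros is a true "no trades past due".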

  18. Inappropriate Data Used • Future data used to build a great-looking model • Example: payments through month end were used instead of payments up to the cycle date
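A small sketch of guarding against future data, assuming each payment carries a date and the model should only see what was known on the cycle date. The dates, amounts, and function name are invented for illustration.

```python
from datetime import date


def payments_as_of(payments, cutoff):
    """Keep only payments dated on or before the cutoff."""
    return [p for p in payments if p["date"] <= cutoff]


# Hypothetical account history for one month.
payments = [
    {"date": date(2024, 1, 10), "amount": 50.0},
    {"date": date(2024, 1, 20), "amount": 75.0},
    {"date": date(2024, 1, 31), "amount": 25.0},
]
cycle_date = date(2024, 1, 15)
month_end = date(2024, 1, 31)

total_at_cycle = sum(p["amount"] for p in payments_as_of(payments, cycle_date))
total_at_month_end = sum(p["amount"] for p in payments_as_of(payments, month_end))
```

Using the month-end total as a predictor would give the model information that did not exist on the cycle date.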

  19. Unavailable Data • Data on Rejected Applicants • Would they have been Good or Bad had they been accepted? • Use “Reject Inferencing” techniques.
