Data Mining At Tech Journal

Data Mining At Tech Journal. Agenda: Background. Questions of Interest. Data Overview. Selected Approach. Potential Issues. Current Status. First Results.

Presentation Transcript


  1. Data Mining At Tech Journal

  2. Agenda
     • Background
     • Questions of Interest
     • Data Overview
     • Selected Approach
     • Potential Issues
     • Current Status
     • First Results

  4. The Company
     • A US company ("TechJournal") publishes an online journal ("TechPub") with content aimed specifically at IT professionals
     • TechJournal is 15 years old; TechPub is 5 years old
     • Content for TechPub comes from three sources:
         • Aggregated content from public sources
         • TechJournal-created content
         • Peer-contributed content
     • TechJournal's core business is to produce a high-end list product for the marketing departments of IT manufacturers

  5. The Journal
     • The content on the publication website is available to both anonymous and registered users
     • Registered users also get access to some premium services
     • Most content is free; some whitepapers are for sale
     • Three unique features of the site:
         • Peer-contributed content
         • Auction system: readers get paid to contribute content
         • New: personalized content for each reader

  6. The Readers
     • Target: IT professionals involved in their organization's technology purchasing decisions
     • The company continuously tries to stimulate new readership through e-mail campaigns
     • Different levels of "readership" (figure: number of individuals at each level):
         • E-mail recipients
         • Anonymous visits
         • E-mail recipients who visited the site
         • E-mail recipients who are repeat visitors
         • Registered light readers
         • Registered heavy readers

  7. The Business Model
     • "Active Readers Produce Better Lists" Loop
     • "Success Breeds Success" Loop
     • "Known Readers Make For Better Journal" Loop
     • "Buzz Marketing" Loop

  9. Focal Areas For Data Mining
     Loops in focus: "Active Readers Produce Better Lists", "Success Breeds Success", "Known Readers Make For Better Journal"
     • Given e-mail recipient attributes, what is the likelihood of a visit to the website?
     • Which content headlines would maximize that visit likelihood?
     • Given registered readers' attributes, which stories will they be interested in?
     • Given past stories read, what is a registered reader most likely to also read?
     • Given registered readers' attributes, which readers will be most active?
     • Is TechJournal's current content taxonomy effective, or would some other content taxonomy be more useful?

  11. The Data
      My "Chunk of Data" to Mine:

      Table                       Records
      Issues                      713,110
      Issues - Content Linker     2,185,664
      Content Items               590
      Page Visit                  43,580
      Recipients                  195,455
      Taxonomy Click              9,385

  12. Attributes to Work With: features that can be used directly, or from which other features can be derived, for classification

  13. Creating Content Classes
      TechJournal's current taxonomy for classifying content:
      • Manually derived
      • Aggregation of other credible taxonomy fragments
      • From a content provider point of view
      • Goes out to 21 levels in some cases; others are as shallow as three

      Level    Classes
      1        1
      2        5
      3        46
      4        798
      5        1909
      ...      ...
      21       5000+

      9,750 visits spread over 31 classes

  15. A Variety of Approaches
      PREDICTIVE MODELING
      • Given e-mail recipient attributes, what is the likelihood of a visit to the website?
      • Which content headline would maximize that visit likelihood?
      • Given registered readers' attributes, which readers will be most active?
      • Given registered readers' attributes, which types of content will they read?
      CLUSTER ANALYSIS
      • Is TechJournal's current content taxonomy effective, or would some other taxonomy be more useful?
      ASSOCIATION ANALYSIS
      • Given past stories read, what is a registered reader most likely to also read?
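The association-analysis question ("given past stories read, what is a reader most likely to also read?") can be illustrated with a minimal sketch. It estimates the confidence of a rule A -> B as the share of readers of A who also read B. The function name `also_read_rules`, the `min_support` cutoff, and the reading histories are all hypothetical; the slides do not say which algorithm or data layout was used.

```python
from collections import Counter
from itertools import combinations


def also_read_rules(histories, min_support=2):
    """Return {(a, b): confidence}, where confidence estimates P(b read | a read).

    `histories` is a list of per-reader story lists (hypothetical layout).
    Pairs read together by fewer than `min_support` readers are dropped.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for stories in histories:
        unique = set(stories)              # count each story once per reader
        item_counts.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair_counts[(a, b)] += 1

    rules = {}
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_support:
            continue
        rules[(a, b)] = n_ab / item_counts[a]
        rules[(b, a)] = n_ab / item_counts[b]
    return rules


# Hypothetical reading histories, one list per registered reader.
histories = [
    ["security", "storage"],
    ["security", "storage", "networking"],
    ["security", "networking"],
    ["storage"],
]
rules = also_read_rules(histories)
```

Here every reader of "networking" also read "security", so that rule gets confidence 1.0, while "networking" and "storage" co-occur only once and fall below the support cutoff.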

  17. Potential Issues
      • Database evolution produces noisy, dirty, unevenly populated data
      • Data comes from multiple sources; producing consistent data has been a challenge
      • Still not clear if we will end up with enough data to see anything meaningful
      • Content taxonomy is relatively new and most likely has real problems with how it is structured
      • Taxonomy measures article subject matter, but behavior-stimulating content may be in headlines
      • Features are somewhat related
      • Features have a high number of discrete values and need to be put into meaningful groupings
      • Under-representation of several feature and class values
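One common way to handle features with many discrete values and under-represented values is to fold rare values into a single catch-all group. The sketch below assumes that approach; the function name `group_rare_values`, the threshold of 30, and the sample values are hypothetical and are not taken from the TechJournal data.

```python
from collections import Counter


def group_rare_values(values, min_count=30, other_label="Other"):
    """Replace values seen fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical location values with two under-represented categories.
locations = ["CA"] * 50 + ["NY"] * 40 + ["WY"] * 3 + ["VT"] * 2
grouped = group_rare_values(locations, min_count=30)
grouped_counts = Counter(grouped)
```

After grouping, the two rare values share one "Other" bucket of five records, which gives a model more examples per category at the cost of some detail.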

  18. Feature Grouping - Location
      [Map figure: locations grouped into regions numbered 1 to 11, plus "Other"]

  19. Feature Grouping - Title
      • Start with ~1000 distinct self-reported titles in the database
      • Most interested in title as it correlates with impact and influence on IT buying decisions
      • Reclassify them based on three concepts: seniority, function, employees in company

      Result: 24 categories

      Code         Level                   Breakdown
      1            Owner, Chairman/CEO
      2, 20 - 29   Manager of Managers     Functional Area 1 ... Functional Area N
      3, 30 - 39   Manager of Doers        Functional Area 1 ... Functional Area 10
      4            Doer
                   Assistant
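One reading of the code ranges on this slide is that a manager's code is the seniority digit followed by a functional-area digit, so 2 becomes 20 to 29 and 3 becomes 30 to 39. The sketch below encodes that reading only; the dictionary keys, the `title_code` function, and the 0-to-9 area index are hypothetical, and Assistant is left out because the slide does not show a code for it.

```python
# Seniority digits as they appear on the slide; the key names are hypothetical.
SENIORITY_CODES = {
    "owner": 1,
    "manager_of_managers": 2,
    "manager_of_doers": 3,
    "doer": 4,
}
LEVELS_WITH_AREAS = ("manager_of_managers", "manager_of_doers")


def title_code(level, functional_area=None):
    """Return the title code for a seniority level and optional area index (0-9)."""
    base = SENIORITY_CODES[level]
    if functional_area is None:
        return base
    if level not in LEVELS_WITH_AREAS:
        raise ValueError("only manager levels are split by functional area")
    if not 0 <= functional_area <= 9:
        raise ValueError("functional_area must be between 0 and 9")
    return base * 10 + functional_area


# Enumerate every code this scheme can produce.
all_codes = sorted(
    {title_code(level) for level in SENIORITY_CODES}
    | {title_code(level, area) for level in LEVELS_WITH_AREAS for area in range(10)}
)
```

Under this assumption the scheme yields 4 base codes plus 20 area codes, which matches the 24 categories stated on the slide.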

  21. Where I Am In The Process
      Problem Definition, Data Gathering, Data Prep, Data Mining, Results Analysis, Visualization, Sum Up Insights

  23. First Results
      Q: Given registered readers' attributes, which readers will be most active?
      Method: Decision Tree Induction (training set 599 records, test set 187 records)
      MSE on training set = .1313
      MSE on test set = .1451
      Tree leaves shown: 0.1429 (n = 7), 0.7037 (n = 27)
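The mean squared error reported here can be sketched in a few lines. The example assumes that the target is a 0/1 "active" indicator and that a leaf value is the fraction of active readers in that leaf; neither is stated on the slide, although 0.7037 with n = 27 is consistent with 19 active readers out of 27.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length, non-empty sequences."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Assumed reading of the leaf "0.7037, n = 27": 19 active (1) and 8 inactive (0).
leaf_targets = [1] * 19 + [0] * 8
leaf_prediction = sum(leaf_targets) / len(leaf_targets)   # 19/27
leaf_mse = mse(leaf_targets, [leaf_prediction] * len(leaf_targets))
```

For a 0/1 target, a leaf that predicts its own proportion p contributes an error of p(1 - p), so purer leaves contribute less to the overall MSE.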

  24. First Results
      Q: Given the attributes of a registered reader, which content types will they read?
      Method: Decision Tree Induction

      n= 786
      node), split, n, deviance, yval
            * denotes terminal node

       1) root 786 223508.000 29.44402
         2) LocGrpID< 1.5 96 23784.990 24.01042
           4) RIC>=70.5 53 10433.890 19.66038 *
           5) RIC< 70.5 43 11112.050 29.37209
            10) RIC< 66 33 8432.545 25.27273 *
            11) RIC>=66 10 294.900 42.90000 *
         3) LocGrpID>=1.5 690 196494.400 30.20000
           6) RIC< 71.5 438 127844.900 28.34475
            12) RIC>=14.5 411 120569.000 27.69586 *
            13) RIC< 14.5 27 4468.667 38.22222 *
           7) RIC>=71.5 252 64521.570 33.42460
            14) Title_Code>=38 20 4712.950 20.45000 *
            15) Title_Code< 38 232 56151.570 34.54310 *

      Tree leaves shown: 20.45 (n = 20), 34.54 (n = 232)
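Reading the printed tree is easier when the splits are written out as nested conditions. The sketch below transcribes the splits and terminal-node values from the output above; the function name `predict` and its argument names are illustrative, and the slide does not define what RIC or the predicted value measures.

```python
def predict(loc_grp_id, ric, title_code):
    """Return the terminal-node value for one reader, following the printed splits."""
    if loc_grp_id < 1.5:                 # node 2
        if ric >= 70.5:
            return 19.66038              # node 4
        if ric < 66:
            return 25.27273              # node 10
        return 42.90000                  # node 11 (66 <= RIC < 70.5)
    if ric < 71.5:                       # node 6
        if ric >= 14.5:
            return 27.69586              # node 12
        return 38.22222                  # node 13
    if title_code >= 38:                 # node 7
        return 20.45000                  # node 14
    return 34.54310                      # node 15


# Hypothetical reader: location group 2, RIC 80, title code 38.
example = predict(loc_grp_id=2, ric=80, title_code=38)
```

Only three attributes appear in the splits (LocGrpID, RIC, Title_Code), and Title_Code matters only for readers with LocGrpID >= 1.5 and RIC >= 71.5.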

  25. First Results
      Q: Given registered reader attributes, which types of content will they read?
      Method: Kernel SVM with Gaussian Kernel
      Overall Training Error = .569975
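The two ingredients named on this slide can be sketched without training a full SVM: the Gaussian (RBF) kernel that measures similarity between two feature vectors, and a training error computed as a misclassification rate. The `gamma` parameter, the function names, and the reading of "training error" as a misclassification rate are assumptions. Under that reading, .569975 is consistent with 448 of 786 records misclassified.

```python
import math


def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


same_point = gaussian_kernel([0.0, 0.0], [0.0, 0.0])       # identical vectors
unit_apart = gaussian_kernel([0.0], [1.0], gamma=1.0)      # distance 1 apart
toy_error = error_rate(["a", "b", "c", "a"], ["a", "c", "c", "b"])
assumed_slide_error = round(448 / 786, 6)                  # assumption, see lead-in
```

The kernel equals 1 for identical points and decays toward 0 as points move apart; a larger `gamma` makes that decay faster and the decision boundary more local.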

  26. Defining Project Success
      Success for this project could come in different forms:
      • Insights gained on any of the six questions within the project's scope
        and/or
      • Insight into how TechJournal should modify its data capture policies to facilitate data mining for the answers to these questions in the future

  27. Questions/Comments
