
Presentation Transcript


  1. An Introduction to Data Mining Padhraic Smyth, Information and Computer Science, University of California, Irvine, July 2000

  2. Today’s talk: An introduction to data mining • General concepts • Focus on current practice of data mining: the main message is to be aware of the “hype factor” Wednesday’s talk: Application of ideas in data mining to problems in atmospheric/environmental science

  3. Outline of Today’s Talk • What is Data Mining? • Computer Science and Statistics: a Brief History • Models and Algorithms • Hot Topics in Data Mining • Conclusions

  4. The Data Revolution • Context • “.. drowning in data, but starving for knowledge” • Ubiquitous in business, science, medicine, military • Analyzing/exploring data manually becomes difficult with massive data sets • Viewpoint: data as a resource • Data themselves are not of direct use • How can we leverage data to make better decisions?

  5. Technology is a Driving Factor • Larger, cheaper memory • Moore’s law for magnetic disk density “capacity doubles every 18 months” (Jim Gray, Microsoft) • storage cost per byte falling rapidly • Faster, cheaper processors • can analyze more data • fit more complex models • invoke massive search techniques • more powerful visualization

  6. Massive Data Sets [Figure: a data matrix with rows 1, 2, …, N (cases) and columns 1, 2, …, d (variables)] • Characteristics • very large N (billions) • very large d (thousands or millions) • heterogeneous • dynamic • (Note: in scientific applications there is often a temporal and/or spatial dimension)

  7. High-dimensional data (David Scott, Multivariate Density Estimation, Wiley, 1992) • Hypercube in d dimensions vs. hypersphere in d dimensions • Volume of the sphere relative to the cube in d dimensions?
     Dimension:    2     3  4  5  6  7
     Rel. Volume:  0.79  ?  ?  ?  ?  ?

  8. High-dimensional data • Hypercube in d dimensions vs. hypersphere in d dimensions
     Dimension:    2     3     4     5     6     7
     Rel. Volume:  0.79  0.53  0.31  0.16  0.08  0.04
     • in high d, with uniformly distributed data, most data points will be “out” at the corners • high-d space is sparse and non-intuitive
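
A quick way to reproduce the table is to compute the volume of the unit-radius sphere inscribed in the cube [-1, 1]^d. The short Python sketch below is one assumed way of generating the numbers (using the standard closed-form volume of a d-ball); it matches the table to within rounding.

import math

def sphere_to_cube_ratio(d):
    """Volume of the unit-radius d-dimensional sphere divided by the volume
    of the enclosing cube [-1, 1]^d: pi^(d/2) / Gamma(d/2 + 1) / 2^d."""
    sphere_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    cube_volume = 2.0 ** d
    return sphere_volume / cube_volume

for d in range(2, 8):
    print(f"d = {d}: relative volume = {sphere_to_cube_ratio(d):.2f}")
# d = 2: 0.79, d = 3: 0.52, d = 4: 0.31, d = 5: 0.16, d = 6: 0.08, d = 7: 0.04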

  9. What is data mining?

  10. What is data mining? “Data-driven discovery of models and patterns from massive observational data sets”

  11. What is data mining? “The magic phrase to put in every funding proposal you write to NSF, DARPA, NASA, etc”

  12. What is data mining? “The magic phrase you use to sell your… - database software - statistical analysis software - parallel computing hardware - consulting services”

  13. What is data mining? “Data-driven discovery of models and patterns from massive observational data sets” Statistics, Inference

  14. What is data mining? “Data-driven discovery of models and patterns from massive observational data sets” Statistics, Inference Languages and Representations

  15. What is data mining? “Data-driven discovery of models and patterns from massive observational data sets” Engineering, Data Management Statistics, Inference Languages, Representations

  16. What is data mining? “Data-driven discovery of models and patterns from massive observational data sets” Engineering, Data Management Languages, Representations Statistics, Inference Applications

  17. Who is involved in Data Mining? • Business Applications • customer-based, transaction-oriented applications • very specific applications in fraud, marketing, credit-scoring • in-house applications (e.g., AT&T, Microsoft, etc) • consulting firms: considerable hype factor! • largely involve the application of existing statistical ideas, scaled up to massive data sets (“engineering”) • Academic Researchers • mainly in computer science • extensions of existing ideas, significant “bandwagon effect” • largely focused on prediction with multivariate data • Bottom Line: • primarily computer scientists, often with little knowledge of statistics, main focus is on algorithms

  18. Myths and Legends in Data Mining • “Data analysis can be fully automated” • human judgement is critical in almost all applications • “semi-automation” is however very useful

  19. Myths and Legends in Data Mining • “Data analysis can be fully automated” • human judgement is critical in almost all applications • “semi-automation” is however very useful • “Association rules are useful” • association rules are essentially lists of correlations • no documented successful application • compare with decision trees (numerous applications)

  20. Myths and Legends in Data Mining • “Data analysis can be fully automated” • human judgement is critical in almost all applications • “semi-automation” is however very useful • “Association rules are useful” • association rules are essentially lists of correlations • no documented successful application • compare with decision trees (numerous applications) • “With massive data sets you don’t need statistics” • massiveness brings heterogeneity - even more statistics

  21. Current Data Mining Software 1. General purpose tools • software systems for data mining (IBM, SGI, etc) • just simple statistical algorithms with SQL? • limited support for temporal, spatial data • some successes (difficult to validate) • banking, marketing, retail • mainly useful for large-scale EDA? • “mining the miners” (Jerry Friedman): • similar to expert systems/neural networks hype in 80’s?

  22. Transaction Data and Association Rules [Figure: a sparse transactions-by-items matrix, with an x marking each item bought in each transaction] • Supermarket example (Srikant and Agrawal, 1997) • #items = 500,000, #transactions = 1.5 million

  23. Transaction Data and Association Rules • Example of an Association Rule: If a customer buys beer they will also buy chips • p(chips|beer) = “confidence” • p(beer) = “support”
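
As a concrete illustration of these two quantities (a minimal sketch over a made-up list of baskets, not the Srikant and Agrawal algorithm):

# Support and confidence for the rule "beer -> chips", using the slide's
# definitions: support = p(beer), confidence = p(chips | beer).
# The five baskets are invented for illustration.
transactions = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"chips", "salsa"},
]

n = len(transactions)
beer_count = sum(1 for t in transactions if "beer" in t)
beer_and_chips = sum(1 for t in transactions if {"beer", "chips"} <= t)

support = beer_count / n                   # p(beer)         = 3/5 = 0.60
confidence = beer_and_chips / beer_count   # p(chips | beer) = 2/3 ~ 0.67
print(f"support = {support:.2f}, confidence = {confidence:.2f}")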

  24. Current Data Mining Software • 2. Special purpose (“niche”) applications • fraud detection, direct-mail marketing, credit-scoring, etc. • often solve high-dimensional classification/regression problems • Telephone industry applications • fraud • Direct-mail advertising • find new customers • increase # home-equity loans • common theme: “track the customer!” • difficult to validate claims of success (few publications)

  25. Advanced Scout • Background • every NBA game is annotated (each pass, shot, foul, etc.) • potential competitive advantage for coaches • Problem: over a season, this generates a lot of data! • Solution (Bhandari et al, IBM, 1997) • “attribute focusing” finds conditional ranges on attributes where the distributions differ from the norm • generates descriptions of interesting patterns, e.g., “Player X made 100% of his shots when Player Y was in the game: X normally makes only 50% of his shots” • Status • used by 28 of the 29 teams in the NBA • an intelligent assistant
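
The attribute-focusing algorithm itself is described in Bhandari et al.; the sketch below is only an assumed illustration of the flavor of its output: compare a conditional success rate against the overall rate and flag large deviations. The records and the 0.25 threshold are invented.

# Illustrative sketch of the "compare a conditional rate to the overall
# rate" idea behind attribute focusing; data and threshold are invented.
shots = [
    {"player_y_on_court": True,  "made": True},
    {"player_y_on_court": True,  "made": True},
    {"player_y_on_court": False, "made": True},
    {"player_y_on_court": False, "made": False},
    {"player_y_on_court": False, "made": False},
    {"player_y_on_court": False, "made": False},
]

overall_rate = sum(s["made"] for s in shots) / len(shots)
subset = [s for s in shots if s["player_y_on_court"]]
conditional_rate = sum(s["made"] for s in subset) / len(subset)

if abs(conditional_rate - overall_rate) > 0.25:
    print(f"Interesting pattern: {conditional_rate:.0%} with Player Y on court "
          f"vs {overall_rate:.0%} overall")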

  26. AT&T Classification of Telephone Numbers • Background • AT&T has about 100 million customers • It logs 300 million calls per day, 40 attributes each • 350 million unique telephone numbers • Which are business and which are residential? • Solution (Pregibon and Cortes, AT&T,1997) • Proprietary model, using a few attributes, trained on known business customers to adaptively track p(business|data) • Significant systems engineering: data are downloaded nightly, model updated (20 processors, 6Gb RAM, terabyte disk farm) • Status: • invaluable evolving “snapshot” of phone usage in US for AT&T • basis for fraud detection, marketing, and other applications

  27. Bad Debt Prediction • Background • Bank has 120,000 accounts which are delinquent • employs 500 collectors • process is expensive and inefficient • Predictive Modeling • target variable: amount repaid within 6 months • input variables: 2000 different variables derived from credit history • model outputs are used to “score” each debtor based on likelihood of paying • Results • decision trees, “bump-hunting” used to score customers • non-trivial software issues in handling such large data sets • “scoring” system in routine use • estimated savings to the bank are in the millions per annum
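
A minimal sketch of the scoring step, assuming a present-day tree library (scikit-learn, which is not what this project used) and random stand-in data:

# Sketch of "scoring" delinquent accounts by predicted repayment.
# scikit-learn and the random data are stand-ins for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))      # stand-in for credit-history variables
y = np.maximum(0.0, 500 * X[:, 0] + rng.normal(scale=200, size=1000))  # amount repaid

tree = DecisionTreeRegressor(max_depth=5).fit(X, y)
scores = tree.predict(X)             # predicted repayment per account
priority = np.argsort(-scores)       # work the highest expected repayments first
print("first five accounts to pursue:", priority[:5])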

  28. Outline • What is Data Mining? • Computer Science and Statistics: a Brief History

  29. Historical Context: Statistics • Gauss, Fisher, and all that • least-squares, maximum likelihood • development of fundamental principles • The Mathematical Era • 1950’s: Neyman, etc: the mathematicians take over • The Computational Era • steadily growing since the 1960’s • note: “data mining/fishing” viewed very negatively! • 1970’s: EDA, Bayesian estimation, flexible models, EM, etc • a growing awareness of the power and role of computing in data analysis

  30. Historical Context: Computer Science • Pattern Recognition and AI • focus on perceptual problems (e.g., speech, images) • 1960’s: bifurcation into statistical and non-statistical approaches, e.g., grammars • convergence of applied statistics and engineering • e.g., statistical image analysis: Geman, Grenander, etc • Machine Learning and Neural Networks • 1980’s: failure of non-statistical learning approaches • emergence of flexible models (trees, networks) • convergence of applied statistics and learning • e.g., work of Friedman, Spiegelhalter, Jordan, Hinton

  31. The Emergence of Data Mining • Distinct threads of evolution • AI/machine learning • 1989 KDD workshop -> ACM SIGKDD 2000 • focus on “automated discovery, novelty” • Database Research • focus on massive data sets • e.g., SIGMOD -> association rules, scalable algorithms • “Data Owners” • what can we do with all this data in our RDBMS? • primarily customer-oriented transaction data owners • industry dominated, applications-oriented

  32. The Emergence of Data Mining • The “Mother in Law” phenomenon • even your mother-in-law has heard about data mining • Beware of the hype! • remember expert systems, neural nets, etc • basically sound ideas that were oversold creating a backlash

  33. [Diagram: Statistics and Computer Science drawn as two overlapping fields]

  34. [Diagram: subfields arranged along the spectrum from Statistics to Computer Science: Statistical Inference, Statistical Pattern Recognition, Neural Networks, Machine Learning, Data Mining, Databases]

  35. Where Work is Published [Diagram: publication venues placed along the same Statistics-to-Computer-Science spectrum] • Statistical Inference: JASA, JRSS • Statistical Pattern Recognition: IEEE PAMI, ICPR, ICCV • Neural Networks: NIPS, Neural Comp. • Machine Learning: ICML, COLT, ML Journal • Data Mining: KDD, IJDMKD • Databases: SIGMOD, VLDB

  36. Focus Areas [Diagram: focus areas overlaid on the same spectrum: Nonlinear Regression, Flexible Classification Models, Pattern Finding, Computer Vision / Signal Recognition, Scalable Algorithms, Graphical Models, Hidden Variable Models]

  37. General Characteristics [Diagram: the focus areas above arranged along an axis from More Statistical to More Algorithmic]

  38. General Characteristics [Diagram, continued: a second contrast, Continuous Signals vs. Categorical Data]

  39. General Characteristics [Diagram, continued: a third contrast, Model-Based vs. “Model-free”]

  40. General Characteristics [Diagram, continued: a fourth contrast, Time/Space Modeling vs. Multivariate Data]

  41. “Hot Topics” [Diagram: current hot topics overlaid on the focus areas: Classification Trees, Belief Networks, Deformable Templates, Mixture/Factor Models, Association Rules, Hidden Markov Models, Support Vector Machines, Model Combining]

  42. Implications • The “renaissance data miner” is skilled in: • statistics: theories and principles of inference • modeling: languages and representations for data • optimization and search • algorithm design and data management • The educational problem • is it necessary to know all these areas in depth? • Is it possible? • Do we need a new breed of professionals? • The applications viewpoint: • How does a scientist or business person keep up with all these developments? • How can they choose the best approach for their problem?

  43. Outline • What is Data Mining? • Computer Science and Statistics: a Brief History • Models and Algorithms

  44. Data Set E.g., multivariate, continuous/categorical, temporal, spatial, combinations, etc

  45. Data Set Task E.g., Exploration, Prediction, Clustering, Density Estimation, Pattern Discovery

  46. Data Set Task Model Language/Representation: Underlying functional form used for representation, e.g., linear functions, hierarchies, rules/boxes, grammars, etc

  47. Data Set Task Model Score Function Statistical Inference: How well a model fits data, e.g., square-error, likelihood, classification loss, query match, interpretation
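
For concreteness, two of the score functions named above, squared error and (Gaussian) log-likelihood, evaluated on a few made-up predictions; the noise scale sigma is an assumption of the sketch.

# Two example score functions on made-up predictions: squared error and
# the log-likelihood under an assumed Gaussian noise model.
import math

y_true = [2.0, 0.5, 1.5]
y_pred = [1.8, 0.7, 1.4]
sigma = 0.2          # assumed noise standard deviation for the likelihood

squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

log_likelihood = sum(
    -0.5 * math.log(2 * math.pi * sigma ** 2) - (t - p) ** 2 / (2 * sigma ** 2)
    for t, p in zip(y_true, y_pred)
)
print(f"squared error = {squared_error:.3f}, log-likelihood = {log_likelihood:.3f}")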

  48. Data Set Task Model Score Function Modeling Optimization Computational method used to optimize score function, given the model and score function, e.g., hill-climbing, greedy search, linear programming
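
One of the optimization methods mentioned, hill-climbing, in a minimal sketch: repeatedly move a single model parameter to whichever neighbouring value improves the score, shrinking the step when no neighbour helps. The data and step schedule are invented.

# Minimal hill-climbing sketch: tune one parameter (a slope) to minimise a
# squared-error score; the (x, y) pairs and step schedule are invented.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]     # roughly y = 2x

def score(slope):
    return sum((y - slope * x) ** 2 for x, y in data)

slope, step = 0.0, 0.5
while step > 1e-4:
    candidates = [slope - step, slope, slope + step]
    best = min(candidates, key=score)
    if best == slope:
        step /= 2            # no neighbour improves the score: refine the step
    else:
        slope = best
print(f"estimated slope = {slope:.3f}, score = {score(slope):.4f}")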

  49. Data Set Task Model Score Function Modeling Optimization Data Access Algorithm: Actual instantiation as an algorithm, with data structures, efficient implementation, etc.

  50. Data Set Task Model Score Function Modeling Optimization Data Access Algorithm Human Evaluation/Decisions
