August 12, 2007 KDD-07 Invited Innovation Talk Research Usama Fayyad, Ph.D. Chief Data Officer & Executive VP Yahoo! Inc.
Thanks and Gratitude • My family: my wife Kristina and my 4 kids; my parents and my sisters • My academic roots: The University of Michigan, Ann Arbor – my Ph.D. committee, including Ramasamy Uthurusamy (then at GM Research Labs), grad student colleagues (Jie Cheng), Internships at GM Research and at NASA’s JPL • My Mentors and Collaborators • Caltech Astronomy (G. Djorgovski, Nick Weir), Pietro Perona and M.C. Burl • JPL/NASA Colleagues: Padhraic Smyth, Rich Doyle, Steve Chien, Paul Stolorz, Peter Cheeseman, David Atkinson, many others… • Microsoft Colleagues: Decision Theory Group, Surajit Chaudhuri, Jim Gray, Paul Bradley, Bassel Ojjeh, Nick Besbeas, Heikki Mannila, Rick Rashid, many others • Fellows in KDD: Gregory Piatetsky-Shapiro, Daryl Pregibon, Christos Faloutsos, Geoff Webb, Bob Grossman, Jiawei Han, Eric Tsui, Tharam Dillon, Chengqi Zhang, many, many colleagues • My Business Partners • Bassel Ojjeh, Nick Besbeas, many VC’s, many advisers and strategic clients including Microsoft SQL Server and sales teams • My Yahoo! Colleagues: • Zod Nazem, Jerry Yang, David Filo, Yahoo! exec team, Prabhakar Raghavan, Pavel Berkhin, Nick Weir, Hunter Madsen, Nitin Sharma, Raghu Ramakrishnan, Y! Research folks, many at Yahoo SDS and current and previous Yahoo! employees
Personal Observations of a Data Mining Disciple A Data Miner’s Story – Getting to Know the Grand Challenges
Overview • The setting • Why data mining is a must? • Why data mining is not happening? • A Data Miner’s Story • Grand Challenges: Pragmatic • Grand Challenges: Technical • Some case studies • Concluding Remarks
The data gap… • The Machinery Moves on: • Moore’s law: processing “capacity” doubles every 18 months: CPU, cache, memory • Its more aggressive cousin: Disk storage “capacity” doubles every 9 months • The Demand is exploding: • Every business is an eBusiness • Scientific Instruments and Moore’s law • Government • The Internet – the ubiquity of the Web • The Talent Shortage
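As a rough worked example of the gap between the two doubling rates quoted above, the following sketch compares growth over a three-year horizon. The horizon and the function name are assumptions chosen for illustration; only the 18-month and 9-month doubling periods come from the slide.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Capacity multiple after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)


horizon = 36  # illustrative horizon in months (an assumption, not from the talk)
cpu = growth_factor(horizon, 18)   # processing "capacity" doubles every 18 months
disk = growth_factor(horizon, 9)   # disk storage "capacity" doubles every 9 months
print(f"processing x{cpu:.0f}, storage x{disk:.0f}, gap x{disk / cpu:.0f}")
# prints "processing x4, storage x16, gap x4"
```

Under these rates, stored data can outgrow the capacity to process it by a factor of four every three years.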
What is Data Mining? Finding interesting structure in data • Structure: refers to statistical patterns, predictive models, hidden relationships • Interesting: ? • Examples of tasks addressed by Data Mining • Predictive Modeling (classification, regression) • Segmentation (Data Clustering) • Affinity (Summarization) • relations between fields, associations, visualization
Beyond Data Analysis • Scaling analysis to large databases • How to deal with data without having to move it out? • Are there abstract primitive accesses to the data, in database systems, that can provide mining algorithms with the information to drive the search for patterns? • How do we minimize--or sometimes even avoid--having to scan the large database in its entirety? • Automated search • Enumerate and create numerous hypotheses • Fast search • Useful data reductions • More emphasis on understandable models • Finding patterns and models that are “interesting” or “novel” to users. • Scaling to high-dimensional data and models.
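One candidate for such an abstract primitive is a table of co-occurrence counts between attribute values and a class label. This is sufficient for scoring decision-tree splits or fitting a naive Bayes model over categorical attributes, and it can be accumulated in a single scan. The sketch below is a minimal illustration, not a method stated on the slide; the function name, column names, and toy rows are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_table(rows: Iterable[Mapping[str, str]], target: str) -> Counter:
    """Single pass over rows: count (attribute, value, class) co-occurrences."""
    counts: Counter = Counter()
    for row in rows:
        label = row[target]
        for attr, value in row.items():
            if attr != target:
                counts[(attr, value, label)] += 1
    return counts


# Toy data invented for illustration; a real source would be a database cursor.
rows = [
    {"plan": "basic", "region": "west", "churned": "yes"},
    {"plan": "basic", "region": "east", "churned": "no"},
    {"plan": "premium", "region": "west", "churned": "no"},
]
table = count_table(iter(rows), target="churned")
print(table[("plan", "basic", "yes")], table[("region", "west", "no")])
```

The mining algorithm then works from the small count table rather than from the data itself, so the data never has to leave the database.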
Data Mining and Databases Many interesting analysis queries are difficult to state precisely • Examples: • which records represent fraudulent transactions? • which households are likely to prefer a Ford over a Toyota? • Who’s a good credit risk in my customer DB? • Yet database contains the information • good/bad customer, profitability • did/did not respond to mailout/survey/...
ACME CORP ULTIMATE DATA MINING BROWSER Data Mining Grand Vision What’s New? What’s Interesting? Predict for me
The myths… • Companies have built up some large and impressive data warehouses • Data mining is pervasive nowadays • Large corporations know how to do it • There are tools and applications that discover valuable information in enterprise databases
The truths… • Data is a shambles • most data mining efforts end up not benefiting from existing data infrastructure • Corporations care a lot about data, and are obsessed with customer behavior and understanding it • They talk a lot about it… • An extremely small number of businesses are successfully mining data • The successful efforts are “one-off”, “lucky strikes”
Current state of Databases Ancient Egypt • Data navigation, exploration, & exploitation technology is fairly primitive: • we know how to build massive data stores • we do not know how to exploit them • we do the book-keeping really well (OLTP) • Inadequate basic understanding of navigation /systems • many large data stores are write-only (= data tomb)
A Data Miner’s Story • Started out in pure research • Professional student • Math and algorithms
Researcher view Algorithms and Theory Database Systems
Practitioner view Systems and integration Customer Database Algorithms
Business view Customer Database Systems $$$’s Algorithms
A Data Miner’s Story • Started out in pure research • At NASA-JPL did basic research and applied techniques to Science Data Analysis problems • Worked with top scientists in several fields: astronomy, planetary geology, atmospherics, space science, remote sensing imagery • Great results, strong group, lots of funding, high demand… • So why move to Microsoft Research?
Data Mining Based Solution • 94% accuracy in recognizing sky objects • Speed up catalog generation by one to two orders of magnitude (unrealistic to perform manually). • Classify objects that are at least one magnitude fainter than catalogs to-date. • Tripled the “data yield” • Generate sky catalogs with much richer content: • on order of billions of objects: > 2x10^7 galaxies, > 2x10^8 stars, 10^5 quasars • Discovered new quasars 40 times more efficiently
A Data Miner’s Story • Started out in pure research • At NASA-JPL • At Microsoft Research • Basic research in algorithms and scalability • Began to worry about building products and integrating with database server • Two groups established: research and product • So why move out to a start-up?
Working with Large Databases • One scan (or less) of the database • terminate early if appropriate • Work within confines of a given limited RAM buffer • Cluster a Gigabyte or Terabyte in, say 10 or 100 Megabytes RAM • “Anytime” algorithm • best answer always handy • Pause/resume enabled, incremental • Operate on forward-only cursor over a view (essentially a data stream)
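The constraints above (one scan, a fixed memory budget, an answer available at any time, a forward-only cursor) can be illustrated with sequential k-means, which keeps only k centers and k counters in memory. This is a textbook technique swapped in for illustration, not the scalable clustering method developed by the speaker's group. The class name, seeding rule, and toy data are assumptions.

```python
from typing import List, Sequence


class OnlineKMeans:
    """Sequential k-means: memory is O(k * d), independent of the row count."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.centers: List[List[float]] = []  # current best answer, always available
        self.counts: List[int] = []           # rows absorbed by each center

    def update(self, row: Sequence[float]) -> None:
        """Absorb one row from the cursor; the row itself is not retained."""
        if len(self.centers) < self.k:
            # Seed the centers with the first k rows (a simplistic choice).
            self.centers.append([float(v) for v in row])
            self.counts.append(1)
            return
        # Find the nearest center by squared Euclidean distance.
        nearest = min(
            range(self.k),
            key=lambda i: sum((c - v) ** 2 for c, v in zip(self.centers[i], row)),
        )
        self.counts[nearest] += 1
        step = 1.0 / self.counts[nearest]
        # Move the center toward the row; this keeps it at the running mean.
        self.centers[nearest] = [
            c + step * (v - c) for c, v in zip(self.centers[nearest], row)
        ]


# Toy forward-only "cursor"; in practice this would iterate over a database view.
cursor = iter([(0.0, 0.0), (10.0, 10.0), (0.2, 0.0), (10.0, 9.8), (0.1, 0.3)])
model = OnlineKMeans(k=2)
for row in cursor:
    model.update(row)  # the loop can stop at any point and still have an answer
print(model.counts, model.centers)
```

Because all state lives in `centers` and `counts`, pausing and resuming amounts to saving and restoring those two lists.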
Business Results Gap [Figure: technical tools (Neural Networks, CART, Segmentation, OLAP, Decision Trees, Logistic Regressions, Genetic Algorithms, Bayesian Networks, CHAID) on one side; business challenges (Acquisition, Conversion, Average Order, Retention, Loyalty) on the other] Business users are unable to apply the power of existing data mining tools to achieve results
Business Results Gap [Figure: the same diagram with specialists added among the tools: Statisticians, Data Mining PhDs, DBAs, Consultants] Business users are unable to apply the power of existing data mining tools to achieve results
Evolving Data Mining • Evolution on the technical front: • New algorithms • Embedded applications • Make the analyst life easier • Evolution on the usability front • New metaphors • Vertical applications embedding • Used by the business user • In both cases, success means invisibility…
Grand Challenges • Pragmatic: • Achieving integration and invisibility • Research/Technical: • Solving some serious unaddressed problems
Pragmatic Grand Challenge 1 Where is the data? • There is a glut of stored data • Very little of that data is ready for mining • Data warehousing has proven that it will not solve the problem for us • Solution: • integration with operational systems • Take a serious database approach to solving the storage management problem
digiMine Background Started as Venture Capital-funded company: digiMine, Inc. in March 2000. Built, operated and hosted data warehouses with built-in data mining apps • Headquartered in Bellevue, Washington • $45 million in funding – Mayfield, Mohr Davidow, American Express, Deutsche Bank • Grew to over 120 employees • 50+ patents in technology and processes • Both technology and services
A Data Miner’s Story • Started out in pure research • At NASA-JPL • At Microsoft Research • At digiMine • Lots of VC funding, great team, great press coverage, and fast moving • great customers • So why move to DMX Group?
Why DMX Group? • At digiMine, we grew a large “Professional Services” organization • We learned a lot from these engagements • VC-funded companies cannot do much consulting • A fork in the road appeared… • digiMine re-focused on a market vertical: behavioral targeting for media and publishers • Renamed to Revenue Science, Inc. • Formed DMX Group… which was eventually acquired by Yahoo!
DMX Group Mission • Make enterprise data a working asset in the enterprise: • Data strategy for the business • Implementation of Business Intelligence and data mining capabilities • Business issues around data • What is possible? • How to expose it to business users • How to train people and change processes • Integration with operational systems
Data Strategy • How can your data influence your revenues? • How do you optimize operations based on data? • How do you increase customer retention based on data? • How do you utilize enterprise data assets to spot new opportunities: • Cross-sell to existing customers • Grow new markets • Avoid problems such as fraud, abuse, churn, etc?
A Data Miner’s Story • Started out in pure research • At NASA-JPL • At Microsoft Research • At digiMine/Revenue Science Inc. • At DMX Group…
Pragmatic Grand Challenge 2 Embedding within Operational Systems • We all worry about algorithms, they are fascinating • Most of us know that data mining in practice is mostly data prep work • Go where the data is when the data does not come to you • But how much of the problem is “data mining”? • facts: • The effort in embedding an application is huge, and often not discussed • Without it, all the algorithms are useless
Churn Modelling and Prediction Case Study – Wireless Telco
Modeling Process [Figure: numbered process steps including Sample Database, Assign Customer Value, Build Churn Model and Score Database; input labels: Customer Interaction, Base Value, CDR, Billing, SMS, WAP; output: a 3x3 grid of customer value (High/Med/Low Val) by churn risk (High/Med/Low Risk)]
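The value-by-risk grid at the end of the modeling process can be sketched as two independent binning steps. The cut points, units, and function names below are hypothetical; the talk does not give the thresholds used in the case study.

```python
from typing import Tuple


def tier(x: float, low_cut: float, high_cut: float) -> str:
    """Bin a score into Low / Med / High using two cut points."""
    if x >= high_cut:
        return "High"
    if x >= low_cut:
        return "Med"
    return "Low"


def segment(
    value: float,
    churn_risk: float,
    value_cuts: Tuple[float, float] = (200.0, 1000.0),  # hypothetical cut points
    risk_cuts: Tuple[float, float] = (0.2, 0.5),        # hypothetical probabilities
) -> str:
    """Place a customer in one cell of the 3x3 value-by-risk grid."""
    return f"{tier(value, *value_cuts)} Val / {tier(churn_risk, *risk_cuts)} Risk"


print(segment(1500.0, 0.7))  # prints "High Val / High Risk"
print(segment(500.0, 0.1))   # prints "Med Val / Low Risk"
```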
LTV and Its Application • A customer’s life-time value (LTV) is the net value that a customer brings in to a business by the end of their service, i.e., their profit contribution. • LTV allows decisions for individual customers that optimize the return-on-investment (ROI). Examples: • Aggressive retention programs, such as equipment upgrade and contract renewal for high LTV. • Differentiated customer care treatment for reactivations by customers with low LTV
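To make the link between LTV and per-customer ROI concrete, here is a minimal expected-value sketch. It assumes a retention offer either saves a would-be churner (with probability `save_rate`) or has no effect. That decision model, the parameter names, and the figures are assumptions for illustration, not taken from the talk.

```python
def expected_offer_gain(
    ltv: float, churn_prob: float, save_rate: float, offer_cost: float
) -> float:
    """Expected net gain of making a retention offer to one customer.

    Assumes the offer only matters if the customer would otherwise churn,
    and that a saved customer goes on to deliver the forecasted LTV.
    """
    return churn_prob * save_rate * ltv - offer_cost


# Same churn risk and offer cost, different forecasted LTV.
print(expected_offer_gain(ltv=1200.0, churn_prob=0.5, save_rate=0.25, offer_cost=100.0))
print(expected_offer_gain(ltv=200.0, churn_prob=0.5, save_rate=0.25, offer_cost=100.0))
```

With identical churn risk, the offer pays off for the high-LTV customer and loses money on the low-LTV one, which is why retention spend is targeted by value as well as by risk.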
What is Required? • Detailed data • Integration of CDR, WIG, SMS, Billing • Maintained at detailed level • Integrated data mining • Algorithms tuned to model thousands of variables and millions of rows • Accurate Forecasts • System Robustness • Massively scalable back end system • Flexible architecture to create new variables quickly and easily • Collaborative Service Model • Service model which guarantees success • Combined IQ Model to optimize science and business knowledge • Low cost to create and maintain models
Map Segments to Actions [Figure: grid of Churn Probability (Low to High) against Forecasted LTV (Negative, Low, High); action labels: Save Program, Cautiously Defend, Aggressively Defend, Let them go, Contract Renewal, Equipment Upgrade, Cost Reducing Programs, Feature Add, Elite Program, Change Bad Behavior, Grow Margin, Nurture / Maintain, Plan Migration, Loyalty Programs, Feature Use]
Cost Rules Applied… Cost Rules are introduced to define scoring For Example: • Network System Usage Cost • Mobile to Land Connections Costs • Technical Operations/Support Costs • Long Distance Costs • Inter-Carrier /International subsidy costs • Roaming Costs • Bad Debt Allocation • Many others…
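One way to picture cost rules is as a set of small functions, each mapping a customer's usage record to a cost, with the customer's margin being revenue minus the sum of all rule outputs. The sketch below follows that reading. The rule names echo the list above, but every rate, field name, and figure is invented for illustration.

```python
from typing import Callable, Dict

Usage = Dict[str, float]
CostRule = Callable[[Usage], float]

# Hypothetical per-unit rates; real rules would come from finance and operations.
COST_RULES: Dict[str, CostRule] = {
    "network_usage": lambda u: 0.02 * u["minutes"],
    "long_distance": lambda u: 0.05 * u["long_distance_minutes"],
    "roaming": lambda u: 0.25 * u["roaming_minutes"],
    "support": lambda u: 4.00 * u["support_calls"],
}


def monthly_margin(revenue: float, usage: Usage) -> float:
    """Revenue minus the total cost attributed by all cost rules."""
    return revenue - sum(rule(usage) for rule in COST_RULES.values())


usage = {
    "minutes": 500.0,
    "long_distance_minutes": 100.0,
    "roaming_minutes": 20.0,
    "support_calls": 1.0,
}
margin = monthly_margin(60.0, usage)
print(round(margin, 2))
```

Keeping each rule separate makes it easy to add or retire a cost component without touching the scoring code.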
Cost Rules for a Bank? Cost Rules are introduced to define value For Example: • Deposit Value • Product mix • Average daily balance • Monthly service fees • Technical operations/Support costs • Branch/teller usage • Late payment/Overdraft history • Interest rate • Contract term • Credit Score • Employment history/Income
Pragmatic Grand Challenge 3 Integrating domain knowledge • Data mining algorithms are knowledge free • There is no notion of “common sense reasoning” • Do we have to solve an AI-hard problem? • Robust and deep domain knowledge utilization • solution: • Very deep and very narrow integration • Ability to “model” business strategy • Reasoning capability just evolves (cf. chess players)
Cross-Sell / Up-Sell Example [Figure: Context Sensitive Approach for a customer looking for pants; labels: Complete the Assortment, Any Related Products, Help Me Decide, Recommendations, Collaborative Filtering, Impulse Buy, Alternates, Up Sells, Complement, Add-on]
Pragmatic Grand Challenge 4 Managing and maintaining models • When was the last time you thought about the lifetime of a mining model? • What happens when a model is changed? • Have you tried to merge the results of two different clustering models over time? • How many “data droppings” (aka temp files, quick transformations, quick fixes) do you generate in an analysis session? • A framework for managing, updating, and retiring mining models • solution: use techniques that have been invented for this: databases, systems management, software engineering, etc…
Pragmatic Grand Challenge 5 Effectiveness Measurement • How do we measure [honestly] the effectiveness of a model in a context? • Return on Investment (ROI) measurement • Evaluation in the context of the application • A framework and methodology for measurement and evaluation • Build the measurement method as part of the design of the model • An engineering recipe for measurements, and a set of metrics
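A common way to build measurement into the design is to withhold the model-driven treatment from a randomly chosen control group and compare outcomes. The sketch below computes ROI for a retention campaign under that setup. The holdout design, function name, and all figures are assumptions for illustration rather than a methodology given in the talk.

```python
def campaign_roi(
    treated_n: int,
    treated_retained: int,
    control_n: int,
    control_retained: int,
    value_per_retained: float,
    campaign_cost: float,
) -> float:
    """ROI of a campaign, measured against a randomly held-out control group."""
    # Uplift: the retention rate gained because of the campaign itself.
    uplift = treated_retained / treated_n - control_retained / control_n
    incremental_value = uplift * treated_n * value_per_retained
    return (incremental_value - campaign_cost) / campaign_cost


roi = campaign_roi(
    treated_n=1000,
    treated_retained=900,
    control_n=1000,
    control_retained=850,
    value_per_retained=300.0,
    campaign_cost=10_000.0,
)
print(f"ROI = {roi:.2f}")
```

Without the control group, the 90% retention in the treated group would be credited entirely to the model, although 85% would have stayed anyway.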
Technical Challenges
Technical Challenges 0. Public benchmark data sets • As a field we have failed to define a common data collection • Very difficult to judge research and systems advances • Not an easy task, but not impossible • A mix of • synthetic (but realistic) data sets • and real datasets
Technical Challenges 1. How does the data grow? • A theory for how large data sets get to be large • Definitely not IID sampling from a static distribution • Inappropriateness of a “single-population” model • 2. Complexity/understandability tradeoff • Explaining how, when and why a model works • Explaining when a model fails • A “Tuning Dial” for reducing the complex into the understandable
Technical Challenges 3. Interestingness • What is an “interesting” pattern or summary? • How do you measure “novelty”? • What is “unusual”? When is it worthy of attention? • Is it low probability events? High summarization ability? Outliers? Good fits? Bad fits?
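As one concrete candidate (and only a candidate, since the slide leaves the question open), association-rule mining often scores a pattern by its lift: how much more often two items co-occur than they would if independent. The sketch below is a minimal illustration with invented baskets; it does not settle what "interesting" should mean.

```python
from typing import FrozenSet, Sequence


def lift(transactions: Sequence[FrozenSet[str]], a: str, b: str) -> float:
    """Lift of the pair (a, b): P(a and b) / (P(a) * P(b))."""
    n = len(transactions)
    p_a = sum(a in t for t in transactions) / n
    p_b = sum(b in t for t in transactions) / n
    p_ab = sum(a in t and b in t for t in transactions) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("lift is undefined when an item never occurs")
    return p_ab / (p_a * p_b)


# Toy baskets invented for illustration.
baskets = [
    frozenset({"pants", "belt"}),
    frozenset({"pants", "belt"}),
    frozenset({"pants"}),
    frozenset({"socks"}),
]
print(round(lift(baskets, "pants", "belt"), 3))
```

A lift above 1 flags a positive association, but it says nothing about novelty: a pattern the business already knows can score just as high as a genuine discovery.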