Data Mining Course

Presentation Transcript

  1. Data Mining Course, Chapter 1: Introduction to Data Mining (Data Mining Course, Sharif University of Technology)

  2. Introduction to Data Mining • Examples of Data Mining • Bank of America • 13 million customers contact Bank of America’s call center each month • In the past, customers listened to same marketing message • Whether relevant to customer or not • Kelly, VP Database Marketing states, “...we want to be as relevant as possible to each customer” • Customer profiles available to service representatives • May suggest applicable products or services • Data mining helps identify marketing approach based on customer’s profile

  3. Introduction to Data Mining (cont’d) • Homeland Security • Shortly after 9/11/2001 events, FBI announced identification of five terrorists in consumer database records • One had 30 credit cards with $250,000 debt • Another had 12 different addresses • Former President Clinton concluded data should be proactively searched • Clinton said, “...they have 12 homes, they’re either really rich or up to no good...shouldn’t be hard to figure out which.”

  4. Introduction to Data Mining (cont’d) • Gene Expression Database • In children, brain tumors represent deadly form of cancer • 3,000 cases diagnosed per year • Children’s Memorial Hospital building gene expression database • Goal is developing more effective treatment • Bremer, Director of Brain Research, uses Clementine as initial step in tumor identification • Classification identifies one of 12 different tumor types

  5. Trends leading to Data Flood • More data is generated: • Bank, telecom, other business transactions ... • Scientific data: astronomy, biology, etc. • Web, text, and e-commerce

  6. Big data examples • Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session • storage and analysis a big problem • AT&T handles billions of calls per day • so much data, it cannot all be stored; analysis has to be done “on the fly”, on streaming data

  7. Largest databases in 2003 • Commercial databases: • Winter Corp. 2003 Survey: France Telecom has largest decision-support DB, ~30 TB; AT&T ~26 TB • Web • Alexa internet archive: 7 years of data, 500 TB • Google searches 4+ billion pages, many hundreds of TB • IBM WebFountain, 160 TB (2003) • Internet Archive (www.archive.org), ~300 TB

  8. Data growth rate • Twice as much information was created in 2002 as in 1999 (~30% growth rate) • Other growth rate estimates even higher • Very little data will ever be looked at by a human • Knowledge Discovery is NEEDED to make sense and use of data.
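The growth figure on this slide can be checked with a line of arithmetic. The sketch below assumes "growth rate" means a compound annual rate and that 1999 to 2002 counts as three years; the function name `annual_growth_rate` is illustrative and does not come from the slides. Under those assumptions a doubling works out to about 26% per year, the same order as the slide's rounded ~30%.

```python
def annual_growth_rate(total_factor: float, years: float) -> float:
    """Compound annual growth rate implied by growing total_factor-fold over `years` years."""
    return total_factor ** (1.0 / years) - 1.0


# "Twice as much information ... in 2002 as in 1999": factor 2 over 3 years.
rate = annual_growth_rate(2.0, 3)
print(f"{rate:.1%}")  # 26.0%
```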

  9. What is Data Mining? • “…the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data…” (Gartner Group) • “…the analysis of observational data sets to find unsuspected relationships and to summarize data in novel ways…” (Hand et al.) • “Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data” (Han et al.) • “…is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization…” (Cabena et al.) • Alternative names: • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

  10. What is Data Mining? (cont’d) • Data mining chosen as one of top 10 emerging technologies (MIT Technology Review) • “Data mining expertise is most sought after...” (1999 Information Week Survey) • Brown, from BridgeGate LLC said, “Many companies have implemented a data warehouse...starting to look at what they can do with all that data”

  11. What is Data Mining? (cont’d) • How widespread is data mining? • Boston Celtics listed employment position in 12/2003 • Statistics Intern: Work with Basketball Operations • “Responsibilities include: ...data mining, etc.” • New York Knicks already using IBM’s Advanced Scout data mining software • Software includes NBA’s game data in form of “events” • Each game includes statistics such as shots, passes, points, rebounds, etc. • Against Chicago Bulls, software discovered pattern coaching staff missed • 16 of 29 NBA teams have turned to Advanced Scout to mine play-by-play data

  12. Why Data Mining? • “...we are drowning in information but starved for knowledge.” (Naisbitt, author Megatrends) • Not enough trained analysts available to translate data into knowledge • Data mining fueled by several factors • Explosive growth in data collection • The storage of enterprise-wide data in data warehouses • Increased availability of Web clickstream data • The tremendous growth in computing power and storage capacity • Development of off-the-shelf commercial data mining software products

  13. Potential applications • Data analysis and decision support • Market analysis and management • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation • Forecasting, customer retention, improved underwriting, quality control, competitive analysis • Fraud detection and detection of unusual patterns • Other applications • Text mining (news group, email, documents) and Web mining • Stream data mining • DNA and bio-data analysis

  14. Market Analysis and Management • Where does the data come from? • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies • Target marketing • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. • Determine customer purchasing patterns over time • Cross-market analysis • Associations/correlations between product sales, and prediction based on such associations • Customer profiling • What types of customers buy what products (clustering or classification) • Customer requirement analysis • identifying the best products for different customers • predicting what factors will attract new customers • Provision of summary information • multidimensional summary reports • statistical summary information (data central tendency and variation)
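One way to make "associations between product sales" concrete is to compute the standard support and confidence measures over a set of shopping baskets. The slide does not name these measures, so treating them as the intended notion of association is an assumption; the basket contents and the helper names `support` and `confidence` are hypothetical.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(basket) for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    base = support(baskets, antecedent)
    if base == 0:
        return 0.0  # antecedent never occurs, so the rule has no evidence
    return support(baskets, set(antecedent) | set(consequent)) / base


# Hypothetical transactions, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))       # 0.5
print(confidence(baskets, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets also hold milk
```

A rule such as "bread implies milk" would typically be kept only if both numbers clear analyst-chosen thresholds.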

  15. Corporate Analysis & Risk Management • Finance planning and asset evaluation • cash flow analysis and prediction • contingent claim analysis to evaluate assets • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) • Resource planning • summarize and compare the resources and spending • Competition • monitor competitors and market directions • group customers into classes and set a class-based pricing procedure • set pricing strategy in a highly competitive market

  16. Fraud Detection & Mining Unusual Patterns • Approaches: Clustering & model construction for frauds, outlier analysis • Applications: Health care, retail, credit card service, telecomm. • Auto insurance: ring of collisions • Money laundering: suspicious monetary transactions • Medical insurance • Professional patients, ring of doctors, and ring of references • Unnecessary or correlated screening tests • Telecommunications: phone-call fraud • Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm • Retail industry • Analysts estimate that 38% of retail shrink is due to dishonest employees • Anti-terrorism
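The idea of flagging calls that "deviate from an expected norm" can be illustrated with a simple z-score rule on call durations. This is a minimal sketch, not the method any particular carrier uses: the durations, the threshold of 2.0, and the function name `flag_outliers` are all assumptions made for the example.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=2.0):
    """Return indices of values lying more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical call durations in minutes; the last call is unusually long.
durations = [3, 4, 5, 4, 3, 5, 4, 120]
print(flag_outliers(durations))  # [7]
```

A single extreme value inflates both the mean and the standard deviation, so practical systems often prefer robust statistics such as the median; the z-score version is kept here only because it is short.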

  17. Other Applications • Sports • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat • Astronomy • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining • Internet Web Surf-Aid • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze effectiveness of Web marketing, improve Web site organization, etc.

  18. Data Mining: A KDD Process • Data mining: core of knowledge discovery process • [Figure: KDD process pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]

  19. Steps of a KDD Process • Learning the application domain • relevant prior knowledge and goals of application • Creating a target data set: data selection • Data cleaning and preprocessing: (may take 60% of effort!) • Data reduction and transformation • Find useful features, dimensionality/variable reduction, invariant representation. • Choosing functions of data mining • summarization, classification, regression, association, clustering. • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Pattern evaluation and knowledge presentation • visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge

  20. The Need for Human Direction of Data Mining • Some early data mining definitions described process as “automatic” • “…this has misled many people into believing data mining is product that can be bought rather than a discipline that must be mastered.” (Berry, Linoff) • Automation no substitute for human input • Data mining is easy to do badly • Understanding statistical and mathematical model structures of underlying software required • Humans need to be actively involved in every phase of data mining process • Task of data mining should be integrated into human process of problem solving

  21. Cross Industry Standard Process: CRISP-DM • Cross-Industry Standard Process for Data Mining (CRISP-DM) developed in 1996 • Contributors include DaimlerChrysler, SPSS, and NCR • Developed to fit data mining into general business strategy • Process vendor and tool-neutral • Non-proprietary and freely available • Data mining projects follow iterative, adaptive life cycle consisting of 6 phases • Phase sequences are adaptive • Next, Figure 1.1 illustrates CRISP-DM lifecycle

  22. Cross Industry Standard Process: CRISP-DM (cont’d) • [Figure 1.1: CRISP-DM life cycle with six phases: Business / Research Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment] • Iterative CRISP-DM process shown in outer circle • Most significant dependencies between phases shown • Next phase depends on results from preceding phase • Returning to earlier phase possible before moving forward

  23. Cross Industry Standard Process: CRISP-DM (cont’d) • (1) Business Understanding Phase • Define business requirements and objectives • Translate objectives into data mining problem definition • Prepare initial strategy to meet objectives • (2) Data Understanding Phase • Collect data • Assess data quality • Perform exploratory data analysis (EDA) • (3) Data Preparation Phase • Cleanse, prepare, and transform data set • Prepares for modeling in subsequent phases • Select cases and variables appropriate for analysis

  24. Cross Industry Standard Process: CRISP-DM (cont’d) • (4) Modeling Phase • Select and apply one or more modeling techniques • Calibrate model settings to optimize results • If necessary, additional data preparation may be required • (5) Evaluation Phase • Evaluate one or more models for effectiveness • Determine whether defined objectives achieved • Make decision regarding data mining results before deploying to field

  25. Cross Industry Standard Process: CRISP-DM (cont’d) • (6) Deployment Phase • Make use of models created • Simple deployment: generate report • Complex deployment: implement additional data mining effort in another department • In business, customer often carries out deployment based on model • See http://www.crisp-dm.org for more information

  26. Case Study 1 • Analyzing Automobile Warranty Claims • Business Understanding • Objectives include improving customer satisfaction and reducing costs • Manufacturing engineers consulted to formulate business problems • Data mining techniques used to uncover possible issues: • Warranty claim interdependencies? • Past claims associated with future claims? • Association between claim and repair facility?

  27. Case Study 1 (cont’d) • Data Understanding • 40GB QUIS database containing 7 million vehicle records used • Vehicle records include manufacturing location, warranty claims, and additional codes • Database unintelligible to non-domain experts • Costly effort to consult with domain experts from different departments • Data Preparation • QUIS discovered to have limited SQL access • Cases and variables manually extracted • Additional variables derived for modeling phase

  28. Case Study 1 (cont’d) • Proprietary data mining software used • Data format requirements varied for different algorithms • Resulted in exhaustive pre-processing of data • Modeling Phase • Applied Bayesian networks and association rules to uncover dependencies between warranty claims • Discovered specific combination of construction specifications doubles probability of electrical cable claim • Investigated whether some garages had more claims than others • Remaining results confidential

  29. Case Study 1 (cont’d) • Evaluation • Researchers disappointed in results • Association rules could not be generalized • Rules “not interesting” according to domain experts • Data models fell short of business objectives • Legacy databases not suited to data mining • Proposal suggested database redesign for future data mining efforts • Deployment • Foregoing effort identified as pilot project, models not deployed • Future data mining efforts planned to integrate more closely to database systems at DaimlerChrysler

  30. Case Study 1 (cont’d) • Summary • Uncovering hidden nuggets very difficult • During each phase, researchers encountered roadblocks • Applying new data mining effort problematic • Data mining effort requires management support • Substantial human participation required at every stage • Installation, configuration, and data mining modeling not magic • Wrong analysis leads to possibly expensive policy recommendations • No guarantee data mining effort delivers actionable results • However, used properly, data mining may provide profitable results

  31. Fallacies of Data Mining • Four Fallacies of Data Mining (Louie, Nautilus Systems, Inc.) • Fallacy 1 • Set of tools can be turned loose on data repositories • Finds answers to all business problems • Reality 1 • No automatic data mining tools solve problems • Rather, data mining is process (CRISP-DM) • Integrates into overall business objectives • Fallacy 2 • Data mining process is autonomous • Requires little oversight

  32. Fallacies of Data Mining (cont’d) • Reality 2 • Requires significant intervention during every phase • After model deployment, new models require updates • Continuous evaluative measures monitored by analysts • Fallacy 3 • Data mining quickly pays for itself • Reality 3 • Return rates vary • Depending on startup, personnel, data preparation costs, etc. • Fallacy 4 • Data mining software easy to use

  33. Fallacies of Data Mining (cont’d) • Reality 4 • Ease of use varies across projects • Analysts must combine subject matter knowledge with specific problem domain • Two Additional Fallacies (Larose) • Fallacy 5 • Data mining identifies causes of business problems • Reality 5 • Knowledge discovery process uncovers patterns of behavior • Humans interpret results and identify causes

  34. Fallacies of Data Mining (cont’d) • Fallacy 6 • Data mining automatically cleans data in databases • Reality 6 • Data mining often uses data from legacy systems • Data possibly not examined or used in years • Organizations starting data mining efforts confronted with huge data preprocessing task

  35. What Tasks Can Data Mining Accomplish? • Six common data mining tasks • Description • Estimation • Prediction • Classification • Clustering • Association • (1) Description • Describes patterns or trends in data • For example, pollster may uncover patterns suggesting those laid-off less likely to support incumbent • Descriptions of patterns often suggest possible explanations

  36. What Tasks Can Data Mining Accomplish? (cont’d) • For example, those laid-off now less financially secure; therefore, prefer alternate candidate • Data mining models should be transparent • That is, results should be interpretable by humans • Some data mining methods more transparent than others • For example, Decision Trees (transparent) versus Neural Networks (opaque) • High-quality description accomplished using Exploratory Data Analysis (EDA) • Graphical method of exploring patterns and trends in data

  37. What Tasks Can Data Mining Accomplish? (cont’d) • (2) Estimation • Similar to Classification task, except target variable numeric • Models built from complete data records • Records include values for each predictor field and numeric target variable in training set • For new observations, estimate of target variable made • For example, estimate a patient’s systolic blood pressure, based on patient’s age, gender, body-mass index, and sodium levels • Here, estimation model built from training set records • Model then estimates value for new case

  38. What Tasks Can Data Mining Accomplish? (cont’d) • Estimation Tasks in Business and Research: • Estimate amount of money family of four will spend on back-to-school shopping • Estimate percentage decrease in rotary movement sustained by NFL player with knee injury • Estimate number of points basketball player scores when double-teamed in playoffs • Estimate GPA of graduate student, based on student’s undergraduate GPA

  39. What Tasks Can Data Mining Accomplish? (cont’d) • Figure 1.2 shows scatter plot of graduate GPA against undergraduate GPA • Linear regression finds line (blue) best approximating relationship between two variables • Regression line estimates student’s graduate GPA based on their undergraduate GPA
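A line "best approximating" a scatter plot is usually found by ordinary least squares, which is assumed here because the slide does not name the criterion. The sketch below computes the closed-form intercept and slope in plain Python; the GPA pairs are hypothetical and are not the data behind Figure 1.2.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope is the ratio of the x-y co-variation to the variation in x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical (undergraduate GPA, graduate GPA) pairs, for illustration only.
undergrad = [2.5, 3.0, 3.2, 3.6, 3.9]
graduate = [2.9, 3.3, 3.3, 3.7, 3.8]

intercept, slope = fit_line(undergrad, graduate)
print(f"graduate GPA is approximately {intercept:.2f} + {slope:.2f} * undergraduate GPA")
```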

  40. What Tasks Can Data Mining Accomplish? (cont’d) • Minitab statistical software produces regression equation ŷ = 1.24 + 0.67x • Therefore, estimated student’s graduate GPA = 1.24 plus 0.67 times their undergraduate GPA • For example, suppose student’s undergraduate GPA = 3.0 • According to estimation model • Estimated student’s graduate GPA = 1.24 + 0.67(3.0) = 3.25 • Point (x = 3.0, ŷ = 3.25) lies on regression line • Statistical Analysis uses several estimation methods: point estimation, confidence interval estimation, linear regression and correlation, and multiple regression
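The worked example on this slide can be reproduced directly. The coefficients 1.24 and 0.67 come from the slide's regression equation; the function name `estimate_graduate_gpa` is introduced only for illustration.

```python
def estimate_graduate_gpa(undergrad_gpa: float,
                          intercept: float = 1.24,
                          slope: float = 0.67) -> float:
    """Point estimate from the slide's regression equation: y-hat = 1.24 + 0.67 * x."""
    return intercept + slope * undergrad_gpa


print(round(estimate_graduate_gpa(3.0), 2))  # 3.25
```

The result is a point estimate only; it says nothing about how far an individual student's graduate GPA may fall from the line.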

  41. What Tasks Can Data Mining Accomplish? (cont’d) • (3) Prediction • Similar to classification and estimation, except results lie in the future • Prediction Tasks in Business and Research: • Predict price of stock 3 months into future, based on past performance • [Figure: stock price by quarter (Q1 to Q4), with future values marked “?”]

  42. What Tasks Can Data Mining Accomplish? (cont’d) • Predict percentage increase in traffic deaths next year, if speed limit increased • Predict whether molecule in newly discovered drug leads to profitable pharmaceutical drug • Methods used for classification and estimation applicable to prediction • Includes point estimation, confidence interval estimation, linear regression and correlation, and multiple regression

  43. What Tasks Can Data Mining Accomplish? (cont’d) • (4) Classification • Classification requires categorical target variable such as Income Bracket • Three values include “High”, “Middle”, “Low” • Data model examines records containing input fields and target field • Table shows several records from data set

  44. What Tasks Can Data Mining Accomplish? (cont’d) • Records of persons in data set used to “train” classification model • First, model built from data records, where value of categorical target variable (Income Bracket) already known • Algorithm first “learns about” which combinations of input fields are associated with Income Bracket values in training set • For example, algorithm may determine that older females associated with high income • Next, trained model examines new records • Information regarding Income Bracket not available

  45. What Tasks Can Data Mining Accomplish? (cont’d) • Based on classifications in training set, new records classified • For example, 63-year old female professor might be classified in “High” income bracket • Classification Tasks in Business and Research: • Determine whether credit card transaction fraudulent • Assessing mortgage application to determine “good” or “bad” credit risk • Diagnosing whether particular disease present

  46. What Tasks Can Data Mining Accomplish? (cont’d) • Determine whether will was written by deceased, or fraudulently by someone else • Identify whether certain financial behavior represents terrorist threat • Scatter plot shows Na/K ratio against Age for 200 patients • For example, classify drug type to prescribe based on patient’s age and sodium/potassium ratio

  47. What Tasks Can Data Mining Accomplish? (cont’d) • Actual drug type prescribed symbolized by shade (light, medium, dark) of points • Suppose prescription for new patient is based on this data set • Prescribe which drug for young patient with high Na/K ratio? • Young patients plotted on left • High Na/K plotted on upper half • Upper-left quadrant of graph shows light points • Recommended drug = Y (corresponds to light points) • Prescribe which drug for older patient with low Na/K ratio? • Lower-right part of graph shows patients prescribed different drug types

  48. What Tasks Can Data Mining Accomplish? (cont’d) • Definitive classification cannot be made • More information required to make decision • Examples show graphs are helpful for understanding two-dimensional data • However, classification often requires many input attributes • More sophisticated methods of classification required • Commonly used algorithms for classification include k-Nearest Neighbor, Decision Trees, and Neural Networks
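Of the algorithms named on this slide, k-Nearest Neighbor is the shortest to sketch: a new record receives the majority label among its k closest training records. The patient records below are hypothetical, as are the drug labels "A" and "B" (only drug Y appears in the slides), and the function name `knn_classify` is illustrative.

```python
from collections import Counter
from math import dist


def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training records (Euclidean distance)."""
    neighbors = sorted(train, key=lambda record: dist(record[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training records: ((age, Na/K ratio), prescribed drug).
train = [
    ((23, 25.0), "Y"),
    ((28, 30.0), "Y"),
    ((25, 27.5), "Y"),
    ((60, 9.0), "A"),
    ((65, 8.0), "B"),
    ((58, 10.0), "A"),
]

print(knn_classify(train, (24, 28.0)))  # Y: young patient with high Na/K ratio
print(knn_classify(train, (61, 9.0)))   # A: two of the three nearest records are drug A
```

Age and Na/K ratio are on different scales, so in practice the inputs would normally be normalized before distances are computed; that step is omitted to keep the sketch short.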

  49. What Tasks Can Data Mining Accomplish? (cont’d) • (5) Clustering • Refers to grouping records into classes of similar objects • Clustering algorithm seeks to segment data set into homogeneous subgroups • Where similarity of records in clusters maximized, and similarity to records outside clusters minimized • Target variable not specified • For example, Claritas, Inc.’s PRIZM software clusters demographic profiles for different geographic areas according to zip code
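Segmenting records into homogeneous subgroups without a target variable can be illustrated with k-means, a common clustering algorithm. The slides do not say which algorithm PRIZM uses, so k-means is a stand-in chosen for brevity; the points, the starting centroids, and the function name `kmeans` are assumptions.

```python
from math import dist


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move centroids to cluster means."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical two-attribute records forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```

Starting centroids are passed in explicitly so the run is deterministic; implementations more commonly choose them at random and repeat the procedure several times.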

  50. What Tasks Can Data Mining Accomplish? (cont’d) • Table shows 62 distinct “lifestyle” types used by PRIZM