Data Mining Chapter 26 - PowerPoint PPT Presentation

data mining chapter 26 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Data Mining Chapter 26 PowerPoint Presentation
Download Presentation
Data Mining Chapter 26

play fullscreen
1 / 31
Data Mining Chapter 26
245 Views
Download Presentation
paul2
Download Presentation

Data Mining Chapter 26

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Data Mining Chapter 26

  2. Chapter 1. Introduction • Motivation: Why data mining? • What is data mining? • Data Mining: On what kind of data? • Data mining functionality • Are all the patterns interesting? • Major issues in data mining

  3. Motivation: “Necessity is the Mother of Invention” • Data explosion problem • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories • We are drowning in data, but starving for knowledge! • Solution: Data warehousing and data mining • Data warehousing andon-line analytical processing • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

  4. Evolution of Database Technology • 1960s: • Data collection, database creation, IMS and network DBMS • 1970s: • Relational data model, relational DBMS implementation • 1980s: • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.) • 1990s—2000s: • Data mining and data warehousing, multimedia databases, and Web databases

  5. What Is Data Mining? • Data mining (knowledge discovery in databases): • Extraction of interesting (non-trivial,implicit, previously unknown and potentially useful)information or patterns from data in large databases • Alternative names: • Data mining: a misnomer? • Knowledge discovery(mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. • What is not data mining? • (Deductive) query processing. • Expert systems or small ML/statistical programs

  6. Why Data Mining? — Potential Applications • Database analysis and decision support • Market analysis and management • target marketing, customer relation management, market basket analysis, cross selling, market segmentation • Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis • Fraud detection and management • Other Applications • Text mining (news group, email, documents) • Stream data mining • Web mining. • DNA data analysis

  7. Market Analysis and Management (1) • Where are the data sources for analysis? • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies • Target marketing • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. • Determine customer purchasing patterns over time • Conversion of single to a joint bank account: marriage, etc. • Cross-market analysis • Associations/co-relations between product sales • Prediction based on the association information

  8. Market Analysis and Management (2) • Customer profiling • data mining can tell you what types of customers buy what products (clustering or classification) • Identifying customer requirements • identifying the best products for different customers • use prediction to find what factors will attract new customers • Provides summary information • various multidimensional summary reports • statistical summary information (data central tendency and variation)

  9. Corporate Analysis and Risk Management • Finance planning and asset evaluation • cash flow analysis and prediction • contingent claim analysis to evaluate assets • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) • Resource planning: • summarize and compare the resources and spending • Competition: • monitor competitors and market directions • group customers into classes and a class-based pricing procedure • set pricing strategy in a highly competitive market

  10. Fraud Detection and Management (1) • Applications • widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. • Approach • use historical data to build models of fraudulent behavior and use data mining to help identify similar instances • Examples • auto insurance: detect a group of people who stage accidents to collect on insurance • money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) • medical insurance: detect professional patients and ring of doctors and ring of references

  11. Fraud Detection and Management (2) • Detecting inappropriate medical treatment • Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr). • Detecting telephone fraud • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm. • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud. • Retail • Analysts estimate that 38% of retail shrink is due to dishonest employees.

  12. Other Applications • Sports • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat • Astronomy • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining • Internet Web Surf-Aid • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

  13. Data Mining: A KDD Process Knowledge Pattern Evaluation • Data mining: the core of knowledge discovery process. Data Mining Task-relevant Data Selection Data Warehouse Data Cleaning Data Integration Databases

  14. Steps of a KDD Process • Learning the application domain: • relevant prior knowledge and goals of application • Creating a target data set: data selection • Data cleaning and preprocessing: (may take 60% of effort!) • Data reduction and transformation: • Find useful features, dimensionality/variable reduction, invariant representation. • Choosing functions of data mining • summarization, classification, regression, association, clustering. • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Pattern evaluation and knowledge presentation • visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge

  15. Data Mining: On What Kind of Data? • Relational databases • Data warehouses • Transactional databases • Advanced DB and information repositories • Object-oriented and object-relational databases • Spatial and temporal data • Time-series data and stream data • Text databases and multimedia databases • Heterogeneous and legacy databases • WWW

  16. Data Mining Functionalities

  17. Association Rule Mining • Association rule mining: • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. • Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database • Motivation: finding regularities in data • What products were often purchased together? — Beer and diapers?! • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug? • Can we automatically classify web documents?

  18. Customer buys both Customer buys diapers Customer buys beer Association Rule Mining (cont.) • Itemset X={x1, …, xk} • Find all the rules XYwith min confidence and support • support, s, probability that a transaction contains XY • confidence, c,conditional probability that a transaction having X also contains Y. • Let min_support = 50%, min_conf = 50%: • A  C (50%, 66.7%) • C  A (50%, 100%)

  19. Mining Association Rules—an Example Min. support 50% Min. confidence 50% For rule AC: support = support({A}{C}) = 50% confidence = support({A}{C})/support({A}) = 66.6%

  20. Apriori: A Candidate Generation-and-test Approach • Any subset of a frequent itemset must be frequent • if {beer, diaper, nuts} is frequent, so is {beer, diaper} • every transaction having {beer, diaper, nuts} also contains {beer, diaper} • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! • Method: • generate length (k+1) candidate itemsets from length k frequent itemsets, and • test the candidates against DB • The performance studies show its efficiency and scalability

  21. The Apriori Algorithm — An Example L1 Database TDB C1 1st scan C2 C2 2nd scan L2 L3 C3 3rd scan

  22. The Apriori Algorithm • Pseudo-code: Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for(k = 1; Lk !=; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end returnkLk;

  23. Important Details of Apriori • How to generate candidates? • Step 1: self-joining Lk • Step 2: pruning • Example of Candidate-generation • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4={abcd}

  24. How to Generate Candidates? • Suppose the items in Lk-1 are listed in an order • Step 1: self-joining Lk-1 insert intoCk select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1 • Step 2: pruning forall itemsets c in Ckdo forall (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck

  25. Classification and Prediction • Finding models (functions) that describe and distinguish classes or concepts for future prediction • E.g., classify countries based on climate, or classify cars based on gas mileage • Presentation: decision-tree, classification rule, neural network • Prediction: Predict some unknown or missing numerical values

  26. Training Data Classifier (Model) Classification Process: Model Construction Classification Algorithms IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

  27. Classifier Testing Data Unseen Data Classification Process: Use the Model in Prediction (Jeff, Professor, 4) Tenured?

  28. Decision Trees Training set

  29. Output: A Decision Tree for “buys_computer” age? <=30 overcast >40 30..40 student? credit rating? yes no yes fair excellent no yes no yes

  30. Cluster and outlier analysis • Cluster analysis • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity • Outlier analysis • Outlier: a data object that does not comply with the general behavior of the data • It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis

  31. Clusters and Outliers