1 / 26

Introduction Data 資料探勘方法 關聯性法則 (Association Rules) 分類法 (Classification) 叢集法 (Clustering)

大綱. Introduction Data 資料探勘方法 關聯性法則 (Association Rules) 分類法 (Classification) 叢集法 (Clustering) Applications 結論. What Is Data Mining?. Data mining: knowledge discovery from data)

gabby
Download Presentation

Introduction Data 資料探勘方法 關聯性法則 (Association Rules) 分類法 (Classification) 叢集法 (Clustering)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 大綱 • Introduction • Data • 資料探勘方法 • 關聯性法則 (Association Rules) • 分類法 (Classification) • 叢集法 (Clustering) • Applications • 結論

  2. What Is Data Mining? • Data mining: knowledge discovery from data) • Extraction of interesting (non-trivial,previously unknown and potentially useful) patterns or knowledge from huge amount of data • Alternative names • Knowledge discovery in databases (KDD),

  3. Database Technology Statistics Data Mining Visualization Machine Learning A.I. Other Disciplines Algorithm Data Mining: Confluence of Multiple Disciplines

  4. Why Not Traditional Data Analysis? • Tremendous amount of data • High complexity of data

  5. Why Data Mining? • “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets.

  6. Business Objectives [PPS 03] [BL 97] [MMPH 03][BM 98] [LPSHG 01] [SM 00][FCJ 01][MPT 99]

  7. Data Mining: On What Kinds of Data? • Database-oriented data sets and applications • Relational database, data warehouse, transactional database • Advanced data sets and advanced applications • Data streams and sensor data • Time-series data, temporal data, sequence data (incl. bio-sequences) • Structure data, graphs, social networks and multi-linked data • Object-relational databases • Heterogeneous databases and legacy databases • Spatial data and spatiotemporal data • Multimedia database • Text databases • The World-Wide Web

  8. Steps of Data Mining

  9. Steps of a KDD Process • Learning the application domain • relevant prior knowledge and goals of application • Creating a target data set: data selection • Data cleaning and preprocessing: (may take 60% of effort!) • Data reduction and transformation • Find useful features, dimensionality/variable reduction, invariant representation. • Choosing functions of data mining • summarization, classification, regression, association, clustering. • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Pattern evaluation and knowledge presentation • visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge

  10. Data Mining Functionalities • Outlier analysis • Outlier: Data object that does not comply with the general behavior of the data • Noise or exception? Useful in fraud detection, rare events analysis • Trend and evolution analysis • Trend and deviation: e.g., regression analysis • Sequential pattern mining: e.g., digital camera  large SD memory • Periodicity analysis • Similarity-based analysis

  11. Association Rule Mining • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket transactions Example of Association Rules {Diaper}  {Beer},{Milk, Bread}  {Eggs,Coke},{Beer, Bread}  {Milk},

  12. Classification: Definition • Given a collection of records (training set ) • Each record contains a set of attributes, one of the attributes is the class. • Find a model for class attribute as a function of the values of other attributes. • Goal: previously unseen records should be assigned a class as accurately as possible. • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.

  13. Classification Task Decision Tree

  14. Weather Data: Play or not Play? Note: Outlook is the Forecast, no relation to Microsoft email program

  15. Example Tree for “Play?” Outlook sunny rain overcast Yes Humidity Windy high normal false true No Yes No Yes

  16. Examples of Clustering Applications • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Land use: Identification of areas of similar land use in an earth observation database • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost • City-planning: Identifying groups of houses according to their house type, value, and geographical location • Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults

  17. Data Mining for Retail Industry • Retail industry: huge amounts of data on sales, customer shopping history, etc. • Applications of retail data mining • Identify customer buying behaviors • Discover customer shopping patterns and trends • Improve the quality of customer service • Achieve better customer retention and satisfaction • Enhance goods consumption ratios • Design more effective goods transportation and distribution policies

  18. Data Mining in Retail Industry: Examples • Design and construction of data warehouses based on the benefits of data mining • Multidimensional analysis of sales, customers, products, time, and region • Analysis of the effectiveness of sales campaigns • Customer retention: Analysis of customer loyalty • Use customer loyalty card information to register sequences of purchases of particular customers • Use sequential pattern mining to investigate changes in customer consumption or loyalty • Suggest adjustments on the pricing and variety of goods • Purchase recommendation and cross-reference of items

  19. Financial Data Mining • Classification and clustering of customers for targeted marketing • multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer to an appropriate customer group • Detection of money laundering and other financial crimes • integration of from multiple DBs (e.g., bank transactions, federal/state crime history DBs) • Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)

  20. Financial Data Mining • Classification and clustering of customers for targeted marketing • multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer to an appropriate customer group • Detection of money laundering and other financial crimes • integration of from multiple DBs (e.g., bank transactions, federal/state crime history DBs) • Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)

  21. Data Mining for Telecomm. Industry (1) • A rapidly expanding and highly competitive industry and a great demand for data mining • Understand the business involved • Identify telecommunication patterns • Catch fraudulent activities • Make better use of resources • Improve the quality of service • Multidimensional analysis of telecommunication data • Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc.

  22. Data Mining for Telecomm. Industry (2) • Fraudulent pattern analysis and the identification of unusual patterns • Identify potentially fraudulent users and their atypical usage patterns • Detect attempts to gain fraudulent entry to customer accounts • Discover unusual patterns which may need special attention • Multidimensional association and sequential pattern analysis • Find usage patterns for a set of communication services by customer group, by month, etc. • Promote the sales of specific services • Improve the availability of particular services in a region • Use of visualization tools in telecommunication data analysis

  23. Biomedical and DNA Data Analysis • DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T). • Gene: a sequence of hundreds of individual nucleotides arranged in a particular order • Humans have around 30,000 genes • Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes • Semantic integration of heterogeneous, distributed genome databases • Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data • Data cleaning and data integration methods developed in data mining will help

  24. DNA Analysis: Examples • Similarity search and comparison among DNA sequences • Compare the frequently occurring patterns of each class (e.g., diseased and healthy) • Identify gene sequence patterns that play roles in various diseases • Association analysis: identification of co-occurring gene sequences • Most diseases are not triggered by a single gene but by a combination of genes acting together • Association analysis may help determine the kinds of genes that are likely to co-occur together in target samples • Path analysis: linking genes to different disease development stages • Different genes may become active at different stages of the disease • Develop pharmaceutical interventions that target the different stages separately • Visualization tools and genetic data analysis

  25. Other Applications • Sports • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat • Astronomy • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining • Internet Web Surf-Aid • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

  26. 結論 • Data mining is a young discipline with wide and diverse applications • There is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications

More Related