Introduction Data 資料探勘方法關聯性法則 (Association Rules) 分類法 (Classification) 叢集法 (Clustering)

大綱 • Introduction • Data • 資料探勘方法 • 關聯性法則 (Association Rules) • 分類法 (Classification) • 叢集法 (Clustering) • Applications • 結論

What Is Data Mining? • Data mining: knowledge discovery from data) • Extraction of interesting (non-trivial,previously unknown and potentially useful) patterns or knowledge from huge amount of data • Alternative names • Knowledge discovery in databases (KDD),

Database Technology Statistics Data Mining Visualization Machine Learning A.I. Other Disciplines Algorithm Data Mining: Confluence of Multiple Disciplines

Why Not Traditional Data Analysis? • Tremendous amount of data • High complexity of data

Why Data Mining? • “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets.

Business Objectives [PPS 03] [BL 97] [MMPH 03][BM 98] [LPSHG 01] [SM 00][FCJ 01][MPT 99]

Data Mining: On What Kinds of Data? • Database-oriented data sets and applications • Relational database, data warehouse, transactional database • Advanced data sets and advanced applications • Data streams and sensor data • Time-series data, temporal data, sequence data (incl. bio-sequences) • Structure data, graphs, social networks and multi-linked data • Object-relational databases • Heterogeneous databases and legacy databases • Spatial data and spatiotemporal data • Multimedia database • Text databases • The World-Wide Web

Steps of Data Mining

Steps of a KDD Process • Learning the application domain • relevant prior knowledge and goals of application • Creating a target data set: data selection • Data cleaning and preprocessing: (may take 60% of effort!) • Data reduction and transformation • Find useful features, dimensionality/variable reduction, invariant representation. • Choosing functions of data mining • summarization, classification, regression, association, clustering. • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Pattern evaluation and knowledge presentation • visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge

Data Mining Functionalities • Outlier analysis • Outlier: Data object that does not comply with the general behavior of the data • Noise or exception? Useful in fraud detection, rare events analysis • Trend and evolution analysis • Trend and deviation: e.g., regression analysis • Sequential pattern mining: e.g., digital camera  large SD memory • Periodicity analysis • Similarity-based analysis

Association Rule Mining • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket transactions Example of Association Rules {Diaper}  {Beer},{Milk, Bread}  {Eggs,Coke},{Beer, Bread}  {Milk},

Classification: Definition • Given a collection of records (training set ) • Each record contains a set of attributes, one of the attributes is the class. • Find a model for class attribute as a function of the values of other attributes. • Goal: previously unseen records should be assigned a class as accurately as possible. • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.

Classification Task Decision Tree

Weather Data: Play or not Play? Note: Outlook is the Forecast, no relation to Microsoft email program

Example Tree for “Play?” Outlook sunny rain overcast Yes Humidity Windy high normal false true No Yes No Yes

Examples of Clustering Applications • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Land use: Identification of areas of similar land use in an earth observation database • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost • City-planning: Identifying groups of houses according to their house type, value, and geographical location • Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults

Data Mining for Retail Industry • Retail industry: huge amounts of data on sales, customer shopping history, etc. • Applications of retail data mining • Identify customer buying behaviors • Discover customer shopping patterns and trends • Improve the quality of customer service • Achieve better customer retention and satisfaction • Enhance goods consumption ratios • Design more effective goods transportation and distribution policies

Data Mining in Retail Industry: Examples • Design and construction of data warehouses based on the benefits of data mining • Multidimensional analysis of sales, customers, products, time, and region • Analysis of the effectiveness of sales campaigns • Customer retention: Analysis of customer loyalty • Use customer loyalty card information to register sequences of purchases of particular customers • Use sequential pattern mining to investigate changes in customer consumption or loyalty • Suggest adjustments on the pricing and variety of goods • Purchase recommendation and cross-reference of items

Financial Data Mining • Classification and clustering of customers for targeted marketing • multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer to an appropriate customer group • Detection of money laundering and other financial crimes • integration of from multiple DBs (e.g., bank transactions, federal/state crime history DBs) • Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)

Data Mining for Telecomm. Industry (1) • A rapidly expanding and highly competitive industry and a great demand for data mining • Understand the business involved • Identify telecommunication patterns • Catch fraudulent activities • Make better use of resources • Improve the quality of service • Multidimensional analysis of telecommunication data • Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc.

Data Mining for Telecomm. Industry (2) • Fraudulent pattern analysis and the identification of unusual patterns • Identify potentially fraudulent users and their atypical usage patterns • Detect attempts to gain fraudulent entry to customer accounts • Discover unusual patterns which may need special attention • Multidimensional association and sequential pattern analysis • Find usage patterns for a set of communication services by customer group, by month, etc. • Promote the sales of specific services • Improve the availability of particular services in a region • Use of visualization tools in telecommunication data analysis

Biomedical and DNA Data Analysis • DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T). • Gene: a sequence of hundreds of individual nucleotides arranged in a particular order • Humans have around 30,000 genes • Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes • Semantic integration of heterogeneous, distributed genome databases • Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data • Data cleaning and data integration methods developed in data mining will help

DNA Analysis: Examples • Similarity search and comparison among DNA sequences • Compare the frequently occurring patterns of each class (e.g., diseased and healthy) • Identify gene sequence patterns that play roles in various diseases • Association analysis: identification of co-occurring gene sequences • Most diseases are not triggered by a single gene but by a combination of genes acting together • Association analysis may help determine the kinds of genes that are likely to co-occur together in target samples • Path analysis: linking genes to different disease development stages • Different genes may become active at different stages of the disease • Develop pharmaceutical interventions that target the different stages separately • Visualization tools and genetic data analysis

Other Applications • Sports • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat • Astronomy • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining • Internet Web Surf-Aid • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

結論 • Data mining is a young discipline with wide and diverse applications • There is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications

Introduction Data 資料探勘方法關聯性法則 (Association Rules) 分類法 (Classification) 叢集法 (Clustering)