
Section 5



Presentation Transcript


  1. Section 5 Data Mining

  2. Section Content • 5.1 Introduction • 5.2 Knowledge Discovery • 5.3 Association Rules • 5.4 Sequential Patterns • 5.5 Classification and Regression • 5.6 Other Forms of Data Mining • 5.7 Applications of Data Mining CA306 Data Mining

  3. 5.1 Data Mining Introduction • Data mining: • the discovery of new information in terms of patterns or rules from huge amounts of data • mining tools should identify these patterns, rules and trends with minimal user input • data mining is related to • statistics: exploratory data analysis • artificial intelligence: knowledge discovery and machine learning • techniques from machine learning, statistics, neural networks and genetic algorithms are used • due to the vastness of the amount of data, efficiency/scalability of data mining algorithms is a key issue CA306 Data Mining

  4. Data Mining and Data Warehousing • The goal of data warehousing is to support decision making with data. • Used in conjunction with a data warehouse, data mining can help with certain types of decisions. • Data mining helps to extract new patterns/rules that cannot be found by merely querying or processing data. • Aggregated or summarised collections of data in warehouses improve the efficiency of data mining in these cases. • The potential use of data mining needs to be considered early in the design of a data warehouse. CA306 Data Mining


  6. 5.2 Knowledge Discovery • Data mining is part of the knowledge discovery process: • data selection • data cleansing • enrichment • data transformation / encoding • data mining • reporting and display • Example: • Database: Transaction database for a goods retailer • Client data: name, zip code, phone, date of purchase, item code, price, quantity, total amount CA306 Data Mining

  7. Knowledge Discovery - Example • New knowledge can be discovered from the client data • data selection: • data about specific items or categories of items • items from stores in specific regions • data cleansing: • correct incorrect zip codes • eliminate records with incorrect phone numbers • enrichment: add additional information • age, income, credit rating of client • data transformation: reduce the amount of data • group items into product categories • group zip codes into regions CA306 Data Mining

  8. Data Mining - Knowledge Discovery • Data mining might discover • co-occurrences - items that are typically bought together • association rules - when a customer buys video equipment, he/she also buys another electronic gadget • sequential patterns - when a customer buys a camera, then within 3 months he/she buys photographic supplies • classification trees - customers can be classified by frequency of visits, types of finance used, etc., combined with statistics about the classes • This information can then be used, for example, to • optimise store locations • run promotions • plan seasonal marketing strategies CA306 Data Mining

  9. Goals of Data Mining • Prediction • show how certain attributes within the data will behave in the future • example: predict what customers will buy under certain discounts • example: predict sales volume for some period • Identification • data patterns can be used to identify the existence of an item, an event, or an activity • example: detecting intruders by the commands they execute CA306 Data Mining

  10. Goals of Data Mining • Classification • partition data such that different classes or categories can be identified • example: customers can be categorised into regular and infrequent shoppers, into discount-seeking customers etc. • categorisation - e.g. into food categories - can reduce the complexity of data mining • Optimisation • optimise the use of limited resources (time, space, money, etc.) • example: what are the best products to spend our money on over the next three months? CA306 Data Mining

  11. Types of Knowledge Discovered • Co-occurrences • collection of items/actions/events that occur together • example: items that are bought together by a consumer in a shop • Association rules • correlate the presence of a set of items with a range of values for another set of variables • example: when someone buys bread, he/she is likely to buy cheese • Classification hierarchies • create a hierarchy of classes from an existing set of events or transactions • example: customers might be divided into a creditworthiness hierarchy based on their previous credit transactions CA306 Data Mining

  12. Types of Knowledge Discovered • Sequential patterns • search for a sequence of events or actions • example: a patient who underwent cardiac surgery and later developed high blood urea is likely to suffer from kidney problems • Patterns within time series • detection of similarities within positions of the time series • example: a pattern in a time series of stock market prices may be used to predict employment rates • Categorisation and segmentation • partition a set of events or items into segments/categories/classes • example: treatment data on a disease can be partitioned into groups based on the side effects that are caused CA306 Data Mining

  13. Counting Co-occurrences • The problem is to count co-occurring itemsets - motivated by market basket analysis. • A database of consumer transactions forms the basis • transaction: a single visit to a store, an order at a virtual store (Web site), or a single order through a mail-order catalog • a transaction consists of a transaction ID, customer ID, date, item and quantity • The goal is to identify items that are typically purchased together. • This can be used to improve the layout of shops or catalogs. CA306 Data Mining

  14. Frequent Itemsets (1) • Consider the following transaction table:

      Transaction  Customer  Date   Items bought
      101          12        11/09  milk, bread, juice
      792          13        12/09  milk, juice
      1130         14        14/09  milk, eggs
      1735         13        14/09  bread, coffee, biscuits

      Items bought in one visit are already grouped together into itemsets. • Support of an itemset: the fraction of transactions that contain all items in the itemset • Examples • {milk, juice} has a support of 50% • {bread, coffee} has a support of 25% CA306 Data Mining
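
The support definition above can be checked directly against the four transactions in the table. The following is a minimal sketch only; the names `TRANSACTIONS` and `support` are illustrative and do not come from the slides.

```python
from fractions import Fraction

# Transaction table from the slide: transaction ID -> items bought.
TRANSACTIONS = {
    101: {"milk", "bread", "juice"},
    792: {"milk", "juice"},
    1130: {"milk", "eggs"},
    1735: {"bread", "coffee", "biscuits"},
}


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for items in transactions.values() if wanted <= items)
    return Fraction(hits, len(transactions))


print(support({"milk", "juice"}))    # 1/2, i.e. 50%
print(support({"bread", "coffee"}))  # 1/4, i.e. 25%
```

A `Fraction` is used so that the result can be compared exactly with the percentages on the slide.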

  15. Frequent Itemsets (2) • Large itemsets are itemsets that have a certain minimum support, i.e. are itemsets that occur frequently. • Example: • for a minimum support of 40%, the large itemsets are {milk, juice}, {milk}, {juice}, {bread} • Proposition: • every subset of a large itemset is also a large itemset • Algorithm: • large itemsets can be computed incrementally • start with itemsets of cardinality 1 that have the required support CA306 Data Mining


  17. 5.3 Association Rules • A database can be regarded as a collection of transactions. • Each transaction involves a set of items. • Example: the items in a basket that a shopper uses in a supermarket

      Transaction  Time  Items bought
      101          6:35  milk, bread, juice
      792          7:38  milk, juice
      1130         8:05  milk, eggs
      1735         8:40  bread, coffee, biscuits

      CA306 Data Mining

  18. Association Rules • An association rule is of the form X => Y where X and Y are two disjoint sets of items • Example: • for sets of goods as itemsets X and Y, the expression X => Y means that if a customer buys X, he/she is also likely to buy Y. • if the customer buys milk, he/she is also likely to buy juice. • The support for a rule X => Y is the percentage of transactions that hold all of the items in the union X ∪ Y. • Examples: • Milk => Juice has 50% support • Bread => Juice has 25% support CA306 Data Mining

  19. Association Rules • The confidence of a rule X => Y is the percentage (fraction) of all transactions including X that also include Y. • Example: • the rule Milk => Juice has confidence 66.7% • that means that 2/3 of all transactions with milk also include juice • Note that the support and the confidence of a rule can differ. • The goal is to discover rules with a certain minimum support and confidence. • These rules can be used for prediction: for a rule Pen => Ink, offering discounts on pens might increase ink sales. CA306 Data Mining

  20. Association Rules • How to compute these rules? • Generate large itemsets (itemsets with a certain minimum support) • For each large itemset X, generate all rules with a certain minimum confidence (mconf):
      for X and Y ⊂ X, let Z = X - Y (divide X into Y and Z)
      if support(X) / support(Y) > mconf then Y => Z is a valid rule
      the confidence of rule Y => Z is defined as support(X) / support(Y)
      • Example: for X = {milk, juice} and Y = {milk} ⊂ {milk, juice}, let Z = {juice}. X, Y, Z have support 50%, 75% and 50%, resp. (support for itemsets 5.14). For mconf = 40%, {milk} => {juice} is a valid rule with confidence 66.7% (50/75). CA306 Data Mining
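
The rule-generation step can be sketched in a few lines, using the supermarket transactions from the earlier slides. This is an illustration under assumptions, not the course's reference implementation: the helper names `support` and `rules_from` are made up here, and the input is assumed to be a large itemset, so that no subset has zero support.

```python
from itertools import combinations

# The four supermarket transactions used throughout the slides.
TRANSACTIONS = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]


def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    hits = sum(1 for t in TRANSACTIONS if set(itemset) <= t)
    return hits / len(TRANSACTIONS)


def rules_from(large_itemset, mconf):
    """Yield (Y, Z, confidence) for each split of X into Y => Z above mconf.

    Assumes X is a large itemset, so every subset Y has non-zero support.
    """
    x = frozenset(large_itemset)
    for size in range(1, len(x)):            # non-empty proper subsets Y of X
        for y in combinations(sorted(x), size):
            y = frozenset(y)
            confidence = support(x) / support(y)
            if confidence > mconf:
                yield y, x - y, confidence


for y, z, conf in rules_from({"milk", "juice"}, mconf=0.4):
    print(sorted(y), "=>", sorted(z), f"confidence {conf:.1%}")
```

For X = {milk, juice} this reproduces the rule {milk} => {juice} with confidence 50/75, and it also reports the reverse rule {juice} => {milk}, which the slide does not discuss.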

  21. Generating Association Rules • In principle, generating rules based on large itemsets and their support is straightforward. • Computing all large itemsets and their support creates an efficiency problem if the number of items is very high. • If m is the number of items, then 2^m is the number of different itemsets. • Example: a typical supermarket might have several thousands of items. • Computing the support of all itemsets might take a long time. • Reducing the combinatorial search space is therefore important - the following properties can be used: • subsets of large itemsets are large • extensions of small itemsets are small CA306 Data Mining

  22. Association Rules - Algorithms • Outline of an algorithm that finds large itemsets: • Step 1: • test the support for itemsets of length 1 - called 1-itemsets - by scanning the database; • discard those that do not meet the minimum requirement. • Step 2: • extend large 1-itemsets into 2-itemsets by appending one item each time (this generates all candidate itemsets of length two); • test the support and eliminate all 2-itemsets that do not meet the minimum support. • Step 3: • repeat the above steps: extend (k-1)-itemsets into k-itemsets. CA306 Data Mining
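
The three steps can be sketched as a level-wise search. The slides do not name the algorithm; the pruning step below follows the well-known Apriori idea and relies on the property that subsets of large itemsets are large. The function name `large_itemsets` is an assumption made for this sketch.

```python
from itertools import combinations

# The four supermarket transactions used throughout the slides.
TRANSACTIONS = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]


def large_itemsets(transactions, min_support):
    """Level-wise search: build k-itemsets from the large (k-1)-itemsets."""
    n = len(transactions)

    def is_large(candidate):
        return sum(1 for t in transactions if candidate <= t) / n >= min_support

    # Step 1: keep the 1-itemsets that meet the minimum support.
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if is_large(frozenset([i]))]
    result = list(level)

    # Steps 2 and 3: extend (k-1)-itemsets by one item, keep the large ones.
    while level:
        previous = set(level)
        level_items = sorted({i for s in level for i in s})
        candidates = {s | {i} for s in level for i in level_items if i not in s}
        # Prune: every (k-1)-subset of a large k-itemset must itself be large.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in previous
                   for sub in combinations(c, len(c) - 1))
        }
        level = [c for c in candidates if is_large(c)]
        result.extend(level)
    return result


for itemset in large_itemsets(TRANSACTIONS, min_support=0.4):
    print(sorted(itemset))
```

With a minimum support of 40% this yields the same four large itemsets as the Frequent Itemsets (2) slide. Each level rescans the transaction list, which is acceptable for a sketch but is exactly the cost that matters on large databases.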

  23. Association Rules among Hierarchies • Items might be divided among disjoint hierarchies based on some classification, e.g. Beverage can be divided into Juice and Milk. • Associations might occur among the hierarchies of items. • Example: healthy frozen yoghurt => bottled water • Particularly interesting are associations across hierarchies. • This kind of information can be used to arrange different kinds of items in a supermarket. CA306 Data Mining

  24. Negative Associations • Negative associations are more difficult to detect than positive associations. • Example: 60% of customers who buy crisps do not buy bottled water. • There are usually more negative associations than positive ones. • The majority of itemset combinations do not occur in databases. • Finding interesting negative associations can be difficult. CA306 Data Mining

  25. Association Rules - Additional Considerations • Sampling: • For very large databases, sampling improves efficiency. • Truly representative samples can help to find most of the rules. • The danger is that • false positives might be discovered (large itemsets that are not truly large); • true positives might be missing. • Other problems: • Cardinality of itemsets and volume of transactions can be very high. • Variability of transactions (geographical, seasonal) makes sampling difficult. • Multiple classifications along different dimensions. CA306 Data Mining


  27. 5.4 Sequential Patterns • Sequential patterns are based on sequences of itemsets. • Assume transactions to be ordered by time. • Example: • transactions in a supermarket • {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} may be based on three visits of a customer • A subsequence of a sequence is obtained by deleting one or more itemsets. • Example: • let {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} be the original sequence • {milk, bread, juice} ; {bread, eggs} is a subsequence • {milk, bread, juice} ; {milk, coffee, biscuits} is a subsequence CA306 Data Mining

  28. Support for Sequences • A sequence {a1, ..., an} is contained in another sequence S if S has a subsequence {b1, ..., bn} such that ai ⊆ bi for 1 <= i <= n • Example: • {milk, bread} ; {coffee, biscuits} is contained in {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} • The support of a sequence S is the percentage of a given set of sequences that contain S. CA306 Data Mining
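
The containment test and the support of a sequence can be sketched as follows. The function names are illustrative, and the second customer history in the usage lines is hypothetical data added only to make the support value non-trivial.

```python
def contains(sequence, pattern):
    """True if `pattern` is contained in `sequence` (both are lists of itemsets).

    Each itemset of the pattern must be a subset of a distinct itemset of the
    sequence, and the order must be preserved.
    """
    position = 0
    for itemset in pattern:
        # Greedily take the earliest remaining itemset that is a superset.
        while position < len(sequence) and not set(itemset) <= set(sequence[position]):
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def sequence_support(pattern, sequences):
    """Fraction of the given sequences that contain `pattern`."""
    return sum(1 for s in sequences if contains(s, pattern)) / len(sequences)


# Sequence from the slide, plus one hypothetical second customer.
customer_a = [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"milk", "coffee", "biscuits"}]
customer_b = [{"milk"}, {"bread"}]
pattern = [{"milk", "bread"}, {"coffee", "biscuits"}]

print(contains(customer_a, pattern))
print(sequence_support(pattern, [customer_a, customer_b]))
```

Matching each pattern itemset against the earliest possible position is sufficient here: if any valid matching exists, the greedy one also succeeds.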

  29. Discovery of Patterns in Time Series • Time series are sequences of events. • An event might be a fixed type of transaction. • Example: • closing price of a stock or fund each day. • Analysis of time series: • find period of time in which the stock did not fluctuate more than 1% • find period (week/month/quarter) with the greatest loss • identify stocks with similar behaviour CA306 Data Mining
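
As one possible reading of the first analysis task, the sketch below finds the longest stretch of days in which the closing price never moves by more than 1% from one day to the next. Both this interpretation of "fluctuate" and the price list are assumptions made for illustration; the slides do not fix either.

```python
def longest_calm_run(prices, tolerance=0.01):
    """Return (start, end) indices of the longest run in which every
    day-to-day change stays within `tolerance` of the previous day's price."""
    best = (0, 0)
    start = 0
    for i in range(1, len(prices)):
        change = abs(prices[i] - prices[i - 1]) / prices[i - 1]
        if change > tolerance:
            start = i                      # the calm run is broken; restart here
        if i - start > best[1] - best[0]:
            best = (start, i)
    return best


# Hypothetical daily closing prices.
closing = [100.0, 100.5, 101.0, 105.0, 105.2, 105.1, 105.3, 110.0]
print(longest_calm_run(closing))
```

Other readings, such as bounding the difference between the highest and lowest price in a window, would need a different loop.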


  31. 5.5 Classification and Regression • Classification Rules • Regression • Tree-structured Rules CA306 Data Mining

  32. Discovery of Classification Rules • Classification means defining/identifying a function that maps an object into one of many possible classes. • Example: a bank wants to classify loan applicants into “loanworthy” and “not loanworthy” • a classification rule could define the classification • not loanworthy: current monthly debt obligation exceeds 25% of monthly net income • loanworthy: otherwise • loanworthiness is a dependent, categorical attribute • In general there is one rule (set) per class: (var1 in range1) and ... and (varn in rangen) => object O in class C1, where var1, ..., varn are the predictor attributes CA306 Data Mining

  33. Support and Confidence • Again we can define support and confidence for these rules. • The support for a classification condition C is the percentage of tuples that satisfy C. • The support for a rule C1 => C2 is the support for the condition C1 AND C2. (C1 AND C2 is satisfied by the objects that satisfy both C1 and C2.) • Consider those tuples that satisfy condition C1. The confidence for a rule C1 => C2 is the percentage of such tuples that also satisfy condition C2. CA306 Data Mining
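
These definitions can be tried out on the loan example. The applicant tuples below are entirely hypothetical, as are the function names; only the 25% threshold comes from the earlier slide.

```python
# Hypothetical applicant tuples: (monthly debt, monthly net income, recorded class).
APPLICANTS = [
    (300, 2000, "loanworthy"),
    (900, 3000, "not loanworthy"),
    (200, 2500, "not loanworthy"),
    (800, 2400, "loanworthy"),
    (500, 4000, "loanworthy"),
]


def c1(applicant):
    """Condition C1: monthly debt does not exceed 25% of monthly net income."""
    debt, income, _ = applicant
    return debt <= 0.25 * income


def c2(applicant):
    """Condition C2: the applicant is recorded as loanworthy."""
    return applicant[2] == "loanworthy"


def support(condition, tuples):
    """Fraction of tuples that satisfy `condition`."""
    return sum(1 for t in tuples if condition(t)) / len(tuples)


def rule_support(cond1, cond2, tuples):
    """Support of the rule cond1 => cond2, i.e. of cond1 AND cond2."""
    return support(lambda t: cond1(t) and cond2(t), tuples)


def confidence(cond1, cond2, tuples):
    """Fraction of the tuples satisfying cond1 that also satisfy cond2."""
    matching = [t for t in tuples if cond1(t)]
    return sum(1 for t in matching if cond2(t)) / len(matching)


print(support(c1, APPLICANTS))
print(rule_support(c1, c2, APPLICANTS))
print(confidence(c1, c2, APPLICANTS))
```

On this made-up data three of five applicants satisfy C1 and two of those three are recorded as loanworthy, so the rule's support and confidence differ, as they did for association rules.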

  34. Regression • Regression is similar to classification, except that the dependent variable is numerical (and not categorical). • Rules (such as classification rules) can be regarded as functions. • A regression rule is a function that maps variables into a target variable. • Example: LabTest(patientID, test1, ... , testn) • the values in that relation result from a series of lab tests • the target variable P is the probability of survival - a numerical variable • the regression rule: (test1 in range1) and ... and (testn in rangen) => P = x • the regression function is P = f(test1, ... , testn) CA306 Data Mining

  35. Regression (2) • If P appears as a function y = f(x1, ... , xn) and f is linear in the domain variables, then the process of deriving f from a given set of tuples <x1, ... , xn, y> is called linear regression. • Linear regression is a common statistical technique. CA306 Data Mining
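
For the simplest case of a single domain variable, y = f(x) = slope * x + intercept, the least-squares fit has a closed form. The sketch below covers only that one-variable case, which is a simplification of the n-variable setting on the slide; the function name and the sample points are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to the given points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical tuples <x, y> lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

The fit is undefined when all x values are equal (sxx is zero); a production version would have to handle that case and the general case of several variables.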

  36. Tree-Structured Rules • Specific classification and regression rules shall now be examined. • These are rules that can be represented as trees - called classification trees or decision trees. • These trees are typically the output of the data mining activity. • Each path from a root to a leaf node represents one classification rule. • Example: Insurance risk determination for motor insurance

      Age
        <= 25: Car Type
                 sports: YES
                 family: NO
        > 25:  NO

      CA306 Data Mining

  37. Decision Trees • A decision tree is a graphical representation of a collection of classification rules. • Each internal node in the tree is labelled with a predictor or splitting attribute. • Each outgoing edge of an internal node is labelled with a predicate that involves the splitting attribute. • Each leaf node is labelled with a value of the dependent attribute. • A classification rule can be associated with each leaf node - constructed as the conjunction of the predicates: • Age <= 25 and Car Type = sports for the YES-leaf • Decision trees are constructed in two phases: • growth phase: create tree based on specialised rules from an input database (relation) • pruning phase: reduce tree size by generalising rules CA306 Data Mining
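
The motor-insurance tree can be written down as data and walked from the root to a leaf. This sketch only evaluates an already constructed tree; it does not cover the growth or pruning phases. The tuple-based representation and the names `TREE` and `classify` are assumptions made for illustration.

```python
# Internal node: (splitting attribute, [(predicate, child), ...]).
# Leaf: the value of the dependent attribute ("YES" or "NO" on the slide).
TREE = ("age", [
    (lambda age: age <= 25, ("car_type", [
        (lambda car: car == "sports", "YES"),
        (lambda car: car == "family", "NO"),
    ])),
    (lambda age: age > 25, "NO"),
])


def classify(node, record):
    """Follow the edge whose predicate holds until a leaf is reached.

    Raises StopIteration if no edge of a node matches the record.
    """
    while not isinstance(node, str):
        attribute, branches = node
        value = record[attribute]
        node = next(child for predicate, child in branches if predicate(value))
    return node


print(classify(TREE, {"age": 22, "car_type": "sports"}))
print(classify(TREE, {"age": 40, "car_type": "sports"}))
```

The first call follows the path Age <= 25 and Car Type = sports, which is exactly the conjunction of predicates given for the YES-leaf.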


  39. 5.6 Other Forms of Data Mining • Neural Networks • Genetic Algorithms • Clustering and Segmentation CA306 Data Mining

  40. Neural Networks • Techniques from artificial intelligence can be used to generalise regression. • Neural networks provide an iterative method to carry out this generalised regression. • Neural networks use a curve-fitting approach to infer a function from a set of samples. • This process is based on learning: a test sample is the initial input; the system then incrementally infers functions based on more samples. • Neural networks can be applied to classification problems. • Modelling time series with neural networks is difficult. CA306 Data Mining

  41. Genetic Algorithms (1) • Genetic algorithms (GA) are a class of randomised search procedures for adaptive and robust search over a wide range of search topologies. • Principle: • Genetic algorithms extend the idea of characterising human DNA by a four-letter alphabet (A,C,T,G). • Construction: • Devise an alphabet that allows the encoding of a solution to the decision problem in terms of strings of that alphabet. • Usage: • Study the cutting and combination of strings (compare natural reproduction and evolution). • New generations of individuals (solutions) are generated and assessed - survival of the fittest. CA306 Data Mining

  42. Genetic Algorithms (2) • Generation of solutions - comparison with other techniques. • GA search uses a set of solutions during each generation rather than a single solution. • The search in the string-space represents a much larger parallel search in the space of encoded solutions. • The memory of the search completed is represented solely by the set of solutions available for generation. • A GA is a randomised algorithm since search mechanisms use probabilistic operators. • While progressing from one generation to the next, a GA finds a near-optimal balance between knowledge acquisition and exploitation by manipulating encoded solutions. CA306 Data Mining
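
The ideas on the two GA slides (an encoding alphabet, cutting and combining strings, a set of solutions per generation, probabilistic operators) can be illustrated with a toy example. Everything specific below is an assumption chosen for brevity: the alphabet is {0, 1}, the fitness simply counts the 1s, and the population size, mutation rate and selection scheme are arbitrary.

```python
import random


def fitness(bits):
    """Toy objective (an assumption): the number of 1s in the string."""
    return sum(bits)


def evolve(length=20, population_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Return the best string found and the best fitness seen per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        parents = population[: population_size // 2]   # survival of the fittest
        children = [population[0][:]]                   # carry the best over unchanged
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)              # cutting point
            child = mother[:cut] + father[cut:]         # combination of two strings
            # Probabilistic operator: flip each symbol with a small probability.
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    return max(population, key=fitness), best_per_generation


best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best string of each generation is copied unchanged into the next one, the best fitness never decreases between generations; the fixed seed only makes the run repeatable.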

  43. Clustering and Segmentation • Clustering is about identification and classification. • Clustering tries to identify categories (or clusters) to which a data object can be mapped. • The categories can be disjoint or might overlap; they might be organised into trees. • A related problem: multivariate probability density functions. CA306 Data Mining


  45. 5.7 Applications of Data Mining • Decision-making contexts: • marketing: • analysis of customer behaviour based on buying patterns; • determination of marketing strategies (store locations, advertising campaigns, etc); • segmentation of customers, stores, products. • finance: • analysis of creditworthiness of clients; • performance analysis of finance investments; • evaluation of financing options; • fraud detection. CA306 Data Mining

  46. Applications • Manufacturing: • optimisation of resources (machines, manpower, material); • optimal design of manufacturing process, shop-floor layout, etc. • Health care: • analysis of effectiveness of certain treatments; • optimisation of processes in a hospital; • analysing side effects of drugs; • relating patient wellness and doctor qualifications. CA306 Data Mining
