  1. COMP527: Data Mining
  ARM: Advanced Techniques
  M. Sulaiman Khan (mskhan@liv.ac.uk)
  Dept. of Computer Science, University of Liverpool
  March 11, 2009

  2. COMP527: Data Mining
  Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing;
  Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2;
  Regression, Prediction; Input Preprocessing; Attribute Selection;
  Association Rule Mining; ARM: Apriori and Data Structures; ARM: Improvements; ARM: Advanced Techniques;
  Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining;
  Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam

  3. Today's Topics
  • Parallelization
  • Constraints
  • Multi-Level Rule Mining
  • Other Issues

  4. Parallelization
  Task-based distribution vs. data-based distribution of processing.
  Data parallelism divides the database into partitions, one for each node. Task parallelism has each node count a different candidate set (e.g. node 1 counts the 1-itemsets, node 2 counts the 2-itemsets, and so on).
  Main advantage: by using multiple machines we can avoid extra database scans, as there is more memory to use -- the total size of all candidates is more likely to fit into the combined memory across N machines.

  5. Data Parallelism
  The database is divided into N partitions. Each partition can have a different number of records, depending on the capabilities of its node. Each node counts the candidates against its own partition, then broadcasts its counts to all other nodes. As counts are received, each node sums them into global support counts, which it uses to determine the candidates for the next level (e.g. from 2-itemsets to 3-itemsets).
  This is the Count Distribution Algorithm:

  6. Count Distribution
  Very rough pseudo-code for the CDA approach...
  At each processor p:
      while potential frequent itemsets remain:
          count candidate supports in partition Dp of database D
          broadcast the local counts
          on receive(counts): globalCounts += counts
          determine the candidates for level k+1
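  Below is a minimal, single-machine Python sketch of the count-distribution idea: the "nodes" are just partitions of a transaction list, and the broadcast step is simulated by summing the per-partition counters. The function names, the toy data and the naive candidate generation are assumptions for illustration, not part of the original algorithm description.

    from itertools import combinations
    from collections import Counter

    def count_partition(partition, candidates):
        # Each "node" counts candidate supports in its own partition only.
        counts = Counter()
        for transaction in partition:
            t = set(transaction)
            for cand in candidates:
                if set(cand) <= t:
                    counts[cand] += 1
        return counts

    def count_distribution(partitions, min_support):
        # Level-1 candidates: every single item seen anywhere.
        items = sorted({i for p in partitions for t in p for i in t})
        candidates = [(i,) for i in items]
        total = sum(len(p) for p in partitions)
        frequent = {}
        k = 1
        while candidates:
            # "Broadcast" step: merge the per-partition counts into global counts.
            global_counts = Counter()
            for p in partitions:
                global_counts.update(count_partition(p, candidates))
            level = {c: n for c, n in global_counts.items() if n / total >= min_support}
            frequent.update(level)
            # Generate level k+1 candidates from the frequent k-itemsets (naive join).
            freq_items = sorted({i for c in level for i in c})
            candidates = [c for c in combinations(freq_items, k + 1)
                          if all(sub in level for sub in combinations(c, k))]
            k += 1
        return frequent

    # Example: two partitions, as if held by two nodes.
    partitions = [
        [("bread", "milk"), ("bread", "butter"), ("milk",)],
        [("bread", "milk", "butter"), ("milk", "butter")],
    ]
    print(count_distribution(partitions, min_support=0.4))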

  7. Task Parallelism
  The candidates, as well as the database, are distributed amongst the processors. Each processor counts the candidates assigned to it, using the database subset it was given. Each processor then broadcasts its database partition to the other processors so they can compute global counts, and those counts are broadcast again so that every processor can find the globally frequent itemsets. The candidates for the next level are then shared out amongst the available processors.
  Yes, that's a lot of broadcasting, which is a lot of network traffic, which is a lot of SLOW! (We won't go through the algorithm for this one.)

  8. Constraints
  Constrained association rule mining simply means specifying, up front, more constraints on what counts as an interesting rule. For example:
  • Statistics: support, confidence, lift, correlation
  • Data: specify the task-relevant data to include in transactions
  • Dimensions: dimensions of the hierarchical data to be used (next time)
  • Meta-rules: the form of the useful rules to be found

  9. Meta-Rules
  Examples:
  • Rule templates
  • Max/min number of predicates in the antecedent/consequent
  • Types of relationship among attributes or attribute values
  E.g. we are interested only in pairs of attributes for a customer that buys a certain type of item:
  P(x, a) AND Q(x, b) => R(x, c)
  e.g. age(x, 20..30) AND income(x, 20k..30k) => buys(x, computer)
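  As a small illustration of how a meta-rule template can act as a post-filter on mined rules, here is a Python sketch; the (antecedent, consequent) rule representation and the matches_template helper are assumptions made for the example, not a prescribed format.

    # A mined rule is represented here as (antecedent, consequent), where each
    # side is a set of (predicate, value) pairs.  This layout is an assumption.
    def matches_template(rule, antecedent_preds=2, consequent_pred="buys"):
        # Template P(x,a) AND Q(x,b) => R(x,c): exactly two predicates in the
        # antecedent and a single 'buys' predicate in the consequent.
        antecedent, consequent = rule
        return (len(antecedent) == antecedent_preds
                and len(consequent) == 1
                and next(iter(consequent))[0] == consequent_pred)

    rules = [
        ({("age", "20..30"), ("income", "20k..30k")}, {("buys", "computer")}),
        ({("age", "20..30")}, {("buys", "computer")}),
    ]
    print([r for r in rules if matches_template(r)])   # keeps only the first rule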

  10. Item Level Thresholds
  Also known as the Rare Item problem. If an item is very rare, its support may fall below (or only just reach) the minimum support set for an interesting rule. For example, 48" plasma TVs are sold very infrequently, but rules involving them could still be interesting, especially if they tell us what makes someone more likely to buy a big TV.
  Solution: multiple minimum support thresholds. Simply give rare items a lower threshold than the rest of the dataset. This can be extended to one threshold per item...

  11. MISapriori
  Minimum Item Support Apriori. The minimum support required for an itemset is the minimum of the MIS values of the items in the itemset.
  This breaks our lovely Apriori downward closure principle :(
  e.g. minimum item supports: {A 20%, B 3%, C 4%}; actual supports: {A 18%, B 4%, C 3%}
  A is infrequent (18% < 20%), but AB can still be frequent, because the threshold for AB is min(20%, 3%) = 3% and AB's support may well reach that.
  Solution: sort items by ascending MIS value, and let candidate generation only join an item with items that come after it in this list.
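  A minimal Python sketch of the per-item threshold check; the mis and support values mirror the toy numbers above, and the data layout and helper names are assumptions for illustration.

    from itertools import combinations

    mis = {"A": 0.20, "B": 0.03, "C": 0.04}        # per-item minimum item supports
    support = {("A",): 0.18, ("B",): 0.04, ("C",): 0.03,
               ("A", "B"): 0.035}                   # toy supports, illustrative only

    def itemset_threshold(itemset):
        # The minimum support an itemset must reach is the smallest MIS of its items.
        return min(mis[i] for i in itemset)

    def is_frequent(itemset):
        return support.get(tuple(sorted(itemset)), 0.0) >= itemset_threshold(itemset)

    # Sort items by ascending MIS; candidate generation only joins an item with
    # items that come after it in this order.
    order = sorted(mis, key=mis.get)                # ['B', 'C', 'A']
    pairs = list(combinations(order, 2))            # [('B','C'), ('B','A'), ('C','A')]

    print(is_frequent(("A",)))        # False: 0.18 < MIS(A) = 0.20
    print(is_frequent(("A", "B")))    # True: 0.035 >= min(0.20, 0.03) = 0.03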

  12. Multi-level Rule Mining
  Our examples have been supermarket baskets. But you don't buy 'bread'; you buy a certain brand of bread, with a certain flavour and thickness, e.g. white Warburton's toast bread, or a 2-litre bottle of Tesco's semi-skimmed milk rather than just 'milk'.
  We could collapse all of the 'milks' and 'breads' together before data mining, but what if buying 'white bread' and 'semi-skimmed milk' together is an interesting rule, as compared to 'skim milk' and 'whole grain bread'? Or Tesco's milk and Tesco's bread? Or ...
  We need a hierarchy of products to capture these different levels.

  13. Multi-level Rule Mining
  We could have a large tree of how the products inter-relate:
  All Products
      Bread
          White
              White/Toast
          Brown
              Brown/Tesco
      Milk
          Whole
              Whole/2Litre
          Semi-skim

  14. Multi-level Rule Mining
  We can count support for the items at the bottom level and propagate the counts upwards, or count each level for frequency in a top-down approach.
  Note that what we really need is some sort of clever cluster system with different axes: bread has color, size, brand, thickness... milk, on the other hand, has size, brand, skimmed-ness... beer has a totally different set of properties. But maybe those axes share values: Tesco has a milk range and a bread range... but not a beer range... Let's leave that alone :)
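  A small Python sketch of the bottom-up counting idea on the toy tree from the previous slide: each basket is counted against its leaf items and every ancestor inherits the count once per basket. The child-to-parent dictionary encoding is an assumption for the sketch.

    from collections import Counter

    # Child -> parent links for the toy product tree (assumed encoding).
    parent = {
        "Bread": "All Products", "Milk": "All Products",
        "White": "Bread", "Brown": "Bread",
        "Whole": "Milk", "Semi-skim": "Milk",
        "White/Toast": "White", "Brown/Tesco": "Brown", "Whole/2Litre": "Whole",
    }

    def ancestors(item):
        # Walk up the tree, yielding every ancestor of a leaf-level item.
        while item in parent:
            item = parent[item]
            yield item

    def propagate_counts(transactions):
        # Count each leaf item, then add the same count to all of its ancestors.
        counts = Counter()
        for basket in transactions:
            seen = set()
            for leaf in basket:
                seen.add(leaf)
                seen.update(ancestors(leaf))
            counts.update(seen)          # count each node at most once per basket
        return counts

    baskets = [["White/Toast", "Whole/2Litre"], ["Brown/Tesco"], ["White/Toast"]]
    print(propagate_counts(baskets))
    # e.g. Bread appears in 3 baskets, Milk in 1, All Products in 3.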

  15. Multi-level Rule Mining
  To avoid the rare item problem, each level in the tree could have a reduced minimum support threshold. E.g. level 1 could be 8%, level 2 (more specific) gets a lower threshold of 5%, then 3%, 2%, etc. (And in a graph rather than a tree, it would be path distance rather than tree level.)
  We need some search strategies for crawling the tree against the transaction database.

  16. Multi-level Rule Mining
  Level-by-level independent: full breadth search. May examine a lot of infrequent items!
  Cross-filtering by itemset: a k-itemset at level i is examined only if the corresponding k-itemset at level (i-1) is frequent. Might filter out valuable patterns (e.g. the 20% / 3% issue).
  Cross-filtering by item: an item at level i is examined only if its parent node at level (i-1) is frequent. A compromise between the previous two (a small sketch follows below).
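  A hedged Python sketch combining the reduced per-level thresholds with cross-filtering by item, on the same toy hierarchy; the support numbers are made up for illustration.

    # Cross-filtering by item: an item is only examined if its parent at the
    # level above was found frequent; each level has its own reduced threshold.
    parent = {"White": "Bread", "Brown": "Bread", "Whole": "Milk", "Semi-skim": "Milk",
              "White/Toast": "White", "Brown/Tesco": "Brown", "Whole/2Litre": "Whole"}
    levels = [["Bread", "Milk"],
              ["White", "Brown", "Whole", "Semi-skim"],
              ["White/Toast", "Brown/Tesco", "Whole/2Litre"]]
    support = {"Bread": 0.30, "Milk": 0.06, "White": 0.20, "Brown": 0.10,
               "Whole": 0.04, "Semi-skim": 0.02, "White/Toast": 0.12,
               "Brown/Tesco": 0.04, "Whole/2Litre": 0.03}     # toy supports
    level_minsup = [0.08, 0.05, 0.03]                          # reduced per level

    frequent = set()
    for depth, items in enumerate(levels):
        for item in items:
            if depth > 0 and parent[item] not in frequent:
                continue                          # cross-filtering by item
            if support.get(item, 0.0) >= level_minsup[depth]:
                frequent.add(item)

    print(sorted(frequent))
    # Milk misses the 8% level-1 threshold, so Whole and Semi-skim are never examined.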

  17. Multi-level Rule Mining
  Controlled cross-filtering by single item: two thresholds at each level, one for frequency at that level and one called a level passage threshold, which controls which items can pass down to the next level. If an item doesn't make the passage threshold, it doesn't pass down. The passage threshold is typically set between the two levels' support thresholds.
  None of these strategies address cross-level association rules, e.g. rules that link buying items at one level with items at a different level.

  18. Multi-level Rule Mining
  Many similar rules can be generated between different levels, e.g. white bread -> skim milk is similar to bread -> milk, and to white toast bread -> 2l skim milk, and ... If we allow cross-level rules, the numbers become astronomical, and we can get totally redundant rules:
  • milk -> bread
  • skim milk -> bread
  • tesco milk -> bread

  19. Multi-dimensional Rule Mining
  We could mine dimensions other than 'buys', assuming we have some knowledge about the buyer. For example:
  • age(20..29) & buys(milk) => buys(bread)
  • occupation(student) & buys(laptop) => buys(blank dvds)
  This isn't necessarily any more difficult; it just involves putting these extra items into the transaction to be mined. It can be usefully combined with meta-rules or constraints.
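  A tiny Python sketch of "putting the extra items into the transaction": customer attributes become pseudo-items alongside the purchased items, so an ordinary single-dimension miner can handle them. The attribute:value encoding shown is an assumption, not a prescribed format.

    def to_transaction(customer, basket):
        # Encode each customer attribute as a pseudo-item ("attribute:value"),
        # then add the purchased items as "buys:item" pseudo-items.
        items = [f"{attr}:{val}" for attr, val in customer.items()]
        items += [f"buys:{item}" for item in basket]
        return set(items)

    customer = {"age": "20..29", "occupation": "student"}
    basket = ["milk", "bread", "laptop"]
    print(to_transaction(customer, basket))
    # e.g. {'age:20..29', 'occupation:student', 'buys:milk', ...} (set order may vary)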

  20. Discretization
  We have the same 'range' problem that we have with numeric data elsewhere, but in spades. We don't want to classify by a numeric attribute; we want to find arbitrary rules using arbitrary ranges. For example, we might want age() somehow linked to buying... but we don't know in advance how to discretize it.
  Equally, we might want some sort of distance-based association rule, where the distance between data points is important: either physical (item A is spatially close to item B) or similarity (item A is similar to item B).
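  One simple (assumed, not prescribed by the slide) way to tackle the range problem is equal-width binning, turning a numeric attribute into range pseudo-items before mining:

    def bin_value(attr, value, width):
        # Equal-width binning: map a numeric value to a range pseudo-item,
        # e.g. age 24 with width 10 becomes "age:20..29".
        low = (value // width) * width
        return f"{attr}:{low}..{low + width - 1}"

    ages = [19, 24, 31, 45]
    print([bin_value("age", a, 10) for a in ages])
    # ['age:10..19', 'age:20..29', 'age:30..39', 'age:40..49']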

  21. Quantity
  Not only could we discretize single numeric attributes, we can also have a quantity attached to each item: I might buy 10 cans of cat food, 2 bottles of coke, 3 packets of chicken pieces... We could then look for rules that use this quantity, orthogonally to all of the other dimensions we've looked at. E.g.:
  buys(cat food, 5+) -> buys(cat litter, 1)
  buys(soda, 2) -> buys(potato chips, 2+)
  (I feel sympathy for your encroaching headaches!)

  22. Time
  (But not that much sympathy!)
  You could use association rule mining techniques to find episodic rules, for example that I buy cheese every 3 weeks, milk and bread every week, and DVDs apparently at random. The metric could be the number of transactions rather than calendar days or weeks. If the items form a sequence of events, then the order within the transaction is important and could itself be mined for rules. Trend rules examine the same attribute over time, e.g. trends in the stock market, and could be applied to many attributes concurrently.

  23. Classification ARM
  A final note: once association rules have been discovered, they can be used to form a classifier, for example by adding the constraint that the consequent must be one of the attributes specified as a class.
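  A hedged sketch of that constraint in Python: keep only rules whose consequent is a class label and classify a new instance with the highest-confidence matching rule. This is in the spirit of associative classifiers such as CBA; the rule format, confidence values and default-class fallback are assumptions of the sketch.

    # Rules as (antecedent set, consequent item, confidence); illustrative values.
    rules = [
        ({"buys:laptop", "occupation:student"}, "class:tech-buyer", 0.80),
        ({"buys:milk"}, "buys:bread", 0.70),          # consequent is not a class
        ({"age:20..29"}, "class:tech-buyer", 0.55),
    ]

    # The constraint: keep only rules whose consequent is a class attribute.
    class_rules = sorted((r for r in rules if r[1].startswith("class:")),
                         key=lambda r: -r[2])

    def classify(instance, default="class:unknown"):
        # Fire the highest-confidence rule whose antecedent is satisfied.
        for antecedent, consequent, _conf in class_rules:
            if antecedent <= instance:
                return consequent
        return default

    print(classify({"buys:laptop", "occupation:student", "age:20..29"}))
    # class:tech-buyer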

  24. Further Reading
  The rest of Zhang!
  Berry and Browne, Chapters 15, 16
  Han 5.3, 5.5
  Dunham 6.4, 6.7
