# Parallel Association Rule Mining


### Parallel Association Rule Mining

Presented by: Ramoza Ahsan and Xiao Qin

November 5th, 2013

### Outline
• Background of Association Rule Mining
• Apriori Algorithm
• Parallel Association Rule Mining
• Count Distribution
• Data Distribution
• Candidate Distribution
• FP-tree Mining and Growth
• Fast Parallel Association Rule Mining without Candidate Generation
### Association Rule Mining
• Association rule mining finds interesting patterns in data; analysis of past transaction data, for example, can provide valuable information on customer buying behavior.
• A record usually contains the transaction date and the items bought.
• The literature has focused mostly on serial mining.
• Support and confidence are the parameters of association rule mining.
### Association Rule Mining Parameters
• The support, supp(X), of an itemset X is the proportion of transactions in the data set that contain X.
• The confidence of a rule X -> Y is the fraction of transactions containing X that also contain Y, i.e. supp(X ∪ Y) / supp(X).
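
These definitions are straightforward to compute. Below is a minimal Python sketch (not from the slides; the function names and the toy transactions are invented for illustration) that derives both quantities from a list of transactions:

```python
# Minimal sketch: support and confidence over transactions stored as sets.

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X -> Y, i.e. supp(X u Y) / supp(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Toy data, invented for illustration.
transactions = [{"bread", "milk"},
                {"bread", "diapers", "beer"},
                {"milk", "diapers", "beer"},
                {"bread", "milk", "diapers"}]
print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 0.666...
```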

### Outline
• Background of Association Rule Mining
• Apriori Algorithm
• Parallel Association Rule Mining
• Count Distribution
• Data Distribution
• Candidate Distribution
• FP-tree Mining and Growth
• Fast Parallel Association Rule Mining without Candidate Generation
### Apriori Algorithm

Apriori alternates between two steps:

• Generation of candidate itemsets
• Pruning of itemsets that are infrequent

Frequent itemsets are generated level-wise: the frequent (k-1)-itemsets found in one pass are used to generate the candidate k-itemsets for the next, as in the sketch below.

Apriori principle:

• If an itemset is frequent, then all of its subsets must also be frequent. (Equivalently, any superset of an infrequent itemset is infrequent, which is what justifies the pruning step.)
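
To make the level-wise loop concrete, here is a hedged Python sketch of Apriori (names are my own; this unoptimized version rescans the transaction list at every level rather than using the paper's hash-tree counting):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining: generate candidates from the
    previous level, prune by the Apriori principle, then count support."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets.
    L = {frozenset([i]) for i in items
         if sum(i in t for t in transactions) / n >= min_support}
    frequent, k = set(L), 2
    while L:
        # Candidate generation: join L_{k-1} with itself ...
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # ... then prune any candidate with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Counting pass: keep the candidates that meet minimum support.
        L = {c for c in candidates
             if sum(c <= t for t in transactions) / n >= min_support}
        frequent |= L
        k += 1
    return frequent
```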
### Parallel Association Rule Mining
• The paper presents parallel algorithms for generating frequent itemsets.
• Each of the N processors has its own private memory and disk.
• Data is distributed evenly across the processors' disks.
• The Count Distribution algorithm focuses on minimizing communication.
• The Data Distribution algorithm makes effective use of the system's aggregate memory.
• The Candidate Distribution algorithm reduces synchronization between processors.
### Algorithm 1: Count Distribution
• Each processor generates the complete candidate set Ck from the complete frequent itemset Lk-1.
• Each processor traverses its local data partition and develops local support counts.
• Processors exchange local counts to develop global counts; this is where synchronization is needed.
• Each processor computes Lk from Ck.
• Each processor independently reaches the same decision to continue or stop. (A sketch of one pass follows.)
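
A minimal sketch of one such pass, assuming an MPI setting via mpi4py (the library choice and the helper name are illustrative assumptions, not prescribed by the paper):

```python
# Sketch of one Count Distribution pass with mpi4py (an assumed setting).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def count_distribution_pass(candidates, local_transactions, min_count):
    # Every processor holds the complete candidate set C_k.
    local_counts = np.zeros(len(candidates), dtype=np.int64)
    for t in local_transactions:          # scan only the local partition
        for i, c in enumerate(candidates):
            if c <= t:
                local_counts[i] += 1
    # A single all-reduce turns local counts into identical global counts;
    # this is the only synchronization point in the pass.
    global_counts = np.zeros_like(local_counts)
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)
    # Every processor can now compute the same L_k independently.
    return [c for c, cnt in zip(candidates, global_counts) if cnt >= min_count]
```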
### Algorithm 2: Data Distribution
• Partition the dataset into N small chunks.
• Partition the set of candidate k-itemsets into N disjoint subsets.
• Each of the N nodes takes one candidate subset and counts the frequency of its itemsets in one data chunk at a time, until it has counted through all the chunks.
• Aggregate the counts. (See the sketch after this list.)
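
One way to realize this is a logical ring in which each rank counts its candidate subset against whichever chunk it currently holds, then passes the chunk on. The mpi4py sketch below is an assumption about the mechanics; the paper itself overlaps communication with computation rather than rotating whole chunks in lockstep:

```python
# Sketch of one Data Distribution pass as a ring rotation (assumed design).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def data_distribution_pass(my_candidates, my_chunk, min_count):
    counts = {c: 0 for c in my_candidates}
    chunk = my_chunk
    for _ in range(size):                  # visit all N chunks in turn
        for t in chunk:
            for c in my_candidates:
                if c <= t:
                    counts[c] += 1
        # Pass the current chunk to the next rank, receive from the previous.
        chunk = comm.sendrecv(chunk, dest=(rank + 1) % size,
                              source=(rank - 1) % size)
    local_Lk = [c for c in my_candidates if counts[c] >= min_count]
    # Every rank learns the full L_k before the next candidate generation.
    return [c for part in comm.allgather(local_Lk) for c in part]
```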
### Algorithm 2: Data Distribution (illustration)

[Diagram: the dataset and the candidate set Ck are each split into N parts, so every node holds 1/N of the data and 1/N of Ck. The data chunks are passed from node to node, with a synchronization step between rounds, until each node has counted its candidate subset against all of the data.]

### Algorithm 3: Candidate Distribution
• If the workload is not balanced, every pass can force all processors to wait for whichever processor finishes last.
• The Candidate Distribution algorithm tries to remove this dependency by partitioning both the data and the candidates.
### Algorithm 3: Candidate Distribution (illustration)

[Diagram: both the data and the frequent itemsets Lk-1 are partitioned across the processors; processor i receives data partition Data_i and generates its own candidate subset Ck_i from its share Lk-1_i, so the processors can proceed independently.]

### Data Partition and L Partition
• Data: in each pass, every node grabs the necessary tuples from the dataset.
• L: let L3 = {ABC, ABD, ABE, ACD, ACE}.
• The items within each itemset are lexicographically ordered.
• Partition the itemsets based on common prefixes of length k-1; here {ABC, ABD, ABE} share the prefix AB and {ACD, ACE} share the prefix AC, as the sketch below shows.
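
A small Python sketch of this prefix-based partition (the function name is illustrative); applied to the L3 above, it yields the two groups noted in the last bullet:

```python
from collections import defaultdict

def partition_by_prefix(Lk, k):
    """Group k-itemsets by their common (k-1)-length prefix, assuming the
    items inside each itemset are kept lexicographically sorted."""
    groups = defaultdict(list)
    for itemset in Lk:
        prefix = tuple(sorted(itemset))[:k - 1]
        groups[prefix].append(itemset)
    return dict(groups)

L3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "B", "E"),
      ("A", "C", "D"), ("A", "C", "E")]
print(partition_by_prefix(L3, 3))
# {('A', 'B'): [ABC, ABD, ABE], ('A', 'C'): [ACD, ACE]}
```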
### Rule Generation
• Example: take the frequent itemsets ABCDE and AB.
• The rule that can be generated from this pair is:

AB => CDE

Support: Sup(ABCDE)

Confidence: Sup(ABCDE) / Sup(AB)
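
Given stored supports for the frequent itemsets, the rule's confidence is just a division; the support values below are invented purely for illustration:

```python
def rule_confidence(supports, antecedent, itemset):
    """Confidence of the rule (antecedent => itemset - antecedent), given a
    dict that maps frozen itemsets to their supports."""
    return supports[frozenset(itemset)] / supports[frozenset(antecedent)]

# Hypothetical support values for the example above.
supports = {frozenset("AB"): 0.40, frozenset("ABCDE"): 0.25}
print(rule_confidence(supports, "AB", "ABCDE"))  # 0.625 = Sup(ABCDE)/Sup(AB)
```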

### Outline
• Background of Association Rule Mining
• Apriori Algorithm
• Parallel Association Rule Mining
• Count Distribution
• Data Distribution
• Candidate Distribution
• FP-tree Mining and Growth
• Fast Parallel Association Rule Mining without Candidate Generation
### FP-Tree Algorithm

Allows frequent itemset discovery without candidate itemset generation:

• Step 1: Build a compact data structure called the FP-tree, using two passes over the data set.
• Step 2: Extract frequent itemsets directly from the FP-tree.

The parallel version proceeds in two phases:

• Phase 1:
• Each processor is given an equal number of transactions.
• Each processor locally counts the items.
• The local counts are summed to get global counts.
• Infrequent items are pruned, and the frequent items are stored in a header table in descending order of frequency.
• Each processor then constructs its own frequent pattern tree.
• Phase 2: the FP-trees are mined much as in the FP-growth algorithm, using the global counts in the header table. (A sketch of the tree construction follows.)
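
As a sequential sketch of the tree-building step (Phase 1 as it would run on one processor; the class and function names are my own):

```python
class FPNode:
    """One FP-tree node: an item, a count, a parent link, and children."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(transactions, frequent_order):
    """Insert each transaction's frequent items in the global frequency
    order taken from the header table; shared prefixes share tree nodes."""
    root = FPNode(None)
    for t in transactions:
        ordered = [i for i in frequent_order if i in t]  # prune + reorder
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root
```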
### Example with min support = 4

[Figure: step-by-step construction of the FP-tree on processor P0, with node counts growing as transactions are inserted (e.g., B:1 → B:2 → B:3, A:1 → A:2, D:1 → D:2, F:1, G:1) after infrequent items have been pruned.]

### Frequent Pattern Strings
• All frequent pattern trees are shared by all processors.
• Each processor generates conditional pattern bases for its assigned items from the header table.
• Merging all conditional pattern bases of the same item yields that item's frequent pattern string, as sketched below.
• If an item's support is below the threshold, it is not added to the final frequent string.
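
Building on the FPNode sketch above, here is a hedged sketch of extracting one item's conditional pattern base from a single tree; merging the bases collected from every processor's tree then gives the global patterns for that item:

```python
def conditional_pattern_base(root, item):
    """Collect (prefix-path, count) pairs for every occurrence of `item`,
    by walking the tree and following parent links back to the root."""
    bases = []
    def walk(node):
        if node.item == item:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            bases.append((list(reversed(path)), node.count))
        else:  # an item occurs at most once on any root-to-leaf path
            for child in node.children.values():
                walk(child)
    walk(root)
    return bases
```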