
# Parallel Association Rule Mining


### Parallel Association Rule Mining

Presented by: Ramoza Ahsan and Xiao Qin

November 5th, 2013

### Outline

• Background of Association Rule Mining

• Apriori Algorithm

• Parallel Association Rule Mining

• Count Distribution

• Data Distribution

• Candidate Distribution

• FP tree Mining and growth

• Fast Parallel Association Rule mining without candidate generation

• Association rule mining: finding interesting patterns in data. (Analysis of past transaction data can provide valuable information on customer buying behavior.)

• A record usually contains the transaction date and the items bought.

• The literature has focused more on serial mining.

• Support and confidence are the parameters of association rule mining.

• The support, supp(X), of an itemset X is the proportion of transactions in the data set that contain X.

• The confidence of a rule X → Y is the fraction of transactions containing X that also contain Y, i.e. supp(X ∪ Y)/supp(X).
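These two measures can be checked on a toy data set. A minimal Python sketch (the transactions and names are made up for illustration):

```python
# Toy transaction data (made up for illustration, not from the slides).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, transactions):
    """supp(X): fraction of transactions that contain every item of X."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X -> Y) = supp(X U Y) / supp(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"bread", "milk"}, transactions))                 # 0.6
print(round(confidence({"bread"}, {"milk"}, transactions), 2))  # 0.75
```

Note that confidence is just a ratio of two supports, so one pass that records supports is enough to score any rule afterwards.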

### Apriori Algorithm

Apriori runs in two steps:

• Generation of candidate itemsets

• Pruning of itemsets that are infrequent

Frequent itemsets are generated level-wise.

Apriori principle:

• If an itemset is frequent, then all of its subsets must also be frequent.

• Example with minimum support = 2.
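A minimal sketch of the two steps in Python, using a small made-up transaction list and an absolute support count of 2 (the data and variable names are ours, not the paper's):

```python
from itertools import combinations

# Illustrative transactions; min_support is an absolute count.
transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"},
]
min_support = 2

def apriori(transactions, min_support):
    # L1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_support}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Candidate generation: join Lk-1 with itself, keep size-k unions.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Support counting over the full data set.
        Lk = {c for c in Ck
              if sum(c <= t for t in transactions) >= min_support}
        frequent |= Lk
        k += 1
    return frequent

print(sorted("".join(sorted(s)) for s in apriori(transactions, min_support)))
# ['A', 'AB', 'ABC', 'AC', 'B', 'BC', 'C']
```

The pruning line is the Apriori principle in action: a size-k candidate survives only if all of its (k-1)-subsets were found frequent in the previous pass.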

• The paper presents parallel algorithms for generating frequent itemsets.

• Each of the N processors has private memory and disk.

• Data is distributed evenly across the disks of the processors.

• The Count Distribution algorithm focuses on minimizing communication.

• The Data Distribution algorithm uses the aggregate memory efficiently.

• The Candidate Distribution algorithm reduces synchronization between processors.

• Each processor generates the complete Ck, using the complete frequent itemset Lk-1.

• Each processor traverses its local data partition and develops local support counts.

• The processors exchange counts with each other to develop the global counts; synchronization is needed here.

• Each processor computes Lk from Ck.

• Each processor independently makes the decision to continue or stop.
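One pass of these steps can be simulated in a single process. In the sketch below, each list of transactions stands in for one processor's local partition, and the count exchange becomes a plain sum (all names and data are illustrative):

```python
from collections import Counter

def count_distribution_pass(partitions, Ck, min_support):
    # Step 1: each "processor" counts every candidate over its own partition.
    local_counts = []
    for part in partitions:
        counts = Counter()
        for t in part:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        local_counts.append(counts)
    # Step 2: exchange the counts to form global counts; this sum is the
    # only synchronization point of the pass.
    global_counts = Counter()
    for counts in local_counts:
        global_counts.update(counts)
    # Step 3: every processor derives the identical Lk independently.
    return {c for c, n in global_counts.items() if n >= min_support}

partitions = [
    [{"A", "B"}, {"A", "C"}],       # processor 1's data
    [{"A", "B", "C"}, {"B", "C"}],  # processor 2's data
]
Ck = [frozenset("AB"), frozenset("AC"), frozenset("BC")]
print(count_distribution_pass(partitions, Ck, min_support=2))
```

Because only the (small) count vectors are exchanged, not the data or the candidates, communication stays minimal, which is exactly the design goal stated above.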

• Partition the dataset into N small chunks.

• Partition the set of candidate k-itemsets into N exclusive subsets.

• Each of the N nodes takes one subset. Each node counts the frequency of its itemsets in one chunk at a time until it has counted through all the chunks.

• Aggregate the counts.
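A sketch of this scheme in the same simulation style: each candidate subset stands in for one node, and the circulation of data chunks is reduced to a simple loop (all names and data are ours):

```python
from collections import Counter

def data_distribution_pass(chunks, candidate_subsets, min_support):
    Lk = set()
    for my_candidates in candidate_subsets:  # one iteration per node
        counts = Counter()
        for chunk in chunks:                 # the node sees every chunk once
            for t in chunk:
                for c in my_candidates:
                    if c <= t:
                        counts[c] += 1
        # Aggregation: each node contributes the frequent members of its
        # own exclusive candidate subset.
        Lk |= {c for c, n in counts.items() if n >= min_support}
    return Lk

chunks = [[{"A", "B"}, {"B", "C"}], [{"A", "B", "C"}]]
candidate_subsets = [[frozenset("AB")], [frozenset("BC"), frozenset("AC")]]
print(data_distribution_pass(chunks, candidate_subsets, min_support=2))
```

Since the candidate subsets are exclusive, each node can hold a 1/N share of Ck in memory, which is how the algorithm exploits the aggregate memory of the system.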

(Diagram: the data set and Ck are each split into N slices; every processor holds a 1/N slice of the data and a 1/N slice of Ck, and the processors synchronize after each pass.)

• If the workload is not balanced, all processors may end up waiting for whichever processor finishes last in every pass.

• The Candidate Distribution algorithm tries to remove these dependencies by partitioning both the data and the candidates.

(Diagram: both the data and the itemsets are partitioned; processor i receives Data_i, Lk-1_i, and Ck_i, and proceeds largely independently of the other processors.)

• Data: in each pass, every node grabs the tuples it needs from the dataset.

• L: the frequent itemsets are partitioned as well. Let L3 = {ABC, ABD, ABE, ACD, ACE}.

• The items in the itemsets are lexicographically ordered.

• The itemsets are partitioned based on common prefixes of length k-1.
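The prefix-based partitioning can be shown directly on the L3 of the slides, grouping the 3-itemsets on their length-2 prefixes (encoding each itemset as a sorted string is our simplification):

```python
from collections import defaultdict

# L3 from the slides; items inside each itemset are lexicographically ordered.
L3 = ["ABC", "ABD", "ABE", "ACD", "ACE"]

groups = defaultdict(list)
for itemset in L3:
    groups[itemset[:2]].append(itemset)  # group on the common length-2 prefix

print(dict(groups))
# {'AB': ['ABC', 'ABD', 'ABE'], 'AC': ['ACD', 'ACE']}
```

Each prefix group can be assigned to one processor; since all candidates generated from a group share its prefix, the processors no longer need each other's itemsets.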

• Example: suppose ABCDE and AB are frequent itemsets. A rule that can be generated from them is

AB → CDE

Support: supp(ABCDE)

Confidence: supp(ABCDE)/supp(AB)

### FP-tree Mining and Growth

FP-growth allows frequent itemset discovery without candidate itemset generation:

• Step 1: build a compact data structure called the FP-tree, using two passes over the data set.

• Step 2: extract frequent itemsets directly from the FP-tree.

Example with minimum support = 3.

• Phase 1:

• Each processor is given an equal number of transactions.

• Each processor locally counts the items.

• The local counts are summed to get the global counts.

• Infrequent items are pruned, and frequent items are stored in a header table in descending order of frequency.

• Each processor then constructs its local frequent pattern (FP) tree.

• Phase 2: the FP-trees are mined as in the FP-growth algorithm, using the global counts in the header table.
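The counting part of Phase 1 can be sketched as follows, with lists standing in for processors and a sum standing in for the count exchange (the data and names are illustrative):

```python
from collections import Counter

def build_header_table(partitions, min_support):
    # Each "processor" counts items over its local transactions.
    local = [Counter(i for t in part for i in t) for part in partitions]
    # Summing the local counts plays the role of the global count exchange.
    global_counts = sum(local, Counter())
    # Prune infrequent items; survivors form the header table, ordered by
    # descending global frequency (ties broken alphabetically here).
    frequent = {i: n for i, n in global_counts.items() if n >= min_support}
    return sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0]))

partitions = [
    [{"B", "A"}, {"B", "D"}],       # processor 0's transactions
    [{"B", "A", "D"}, {"D", "G"}],  # processor 1's transactions
]
print(build_header_table(partitions, min_support=2))
# [('B', 3), ('D', 3), ('A', 2)]
```

Every processor then re-reads its transactions, keeps only header-table items, sorts them in header-table order, and inserts the result into its local FP-tree.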

Example with minimum support = 4.

(Figure: Step 1 counts the items; after pruning infrequent ones, the FP-tree for processor P0 is grown transaction by transaction, with node counts such as B:3, A:2, D:2, F:1, G:1.)

Frequent pattern generation:

• All frequent pattern trees are shared by all processors.

• Each processor generates the conditional pattern bases for its respective items in the header table.

• Merging all the conditional pattern bases of the same item yields that item's frequent patterns.

• If an item's support is below the threshold, it is not added to the final frequent patterns.
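A sketch of the merge step, with each local tree flattened into (path, count) branches. This flat encoding is our simplification of the real tree structure, and the data is made up:

```python
from collections import Counter

def conditional_pattern_base(branches, item):
    """Collect (prefix, count) pairs for every branch that contains `item`."""
    base = []
    for path, count in branches:
        if item in path:
            prefix = tuple(path[:path.index(item)])  # items above `item`
            if prefix:
                base.append((prefix, count))
    return base

# Two processors' local trees, flattened into branches.
tree_p0 = [(("B", "A", "D"), 1), (("B", "D"), 1)]
tree_p1 = [(("B", "A", "D"), 1), (("D", "G"), 1)]

# Merge the conditional pattern bases for item D from both trees,
# then count the items appearing in the merged prefixes.
merged = (conditional_pattern_base(tree_p0, "D")
          + conditional_pattern_base(tree_p1, "D"))
counts = Counter()
for prefix, n in merged:
    for i in prefix:
        counts[i] += n

min_support = 2
print({i: n for i, n in counts.items() if n >= min_support})
# {'B': 3, 'A': 2}
```

Items that survive the threshold, here B and A, combine with D to form D's frequent patterns; items below the threshold are dropped, as the last bullet above states.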
