## PowerPoint Slideshow about 'Parallel Association Rule Mining' - crete


Outline

- Background of Association Rule Mining
- Apriori Algorithm
- Parallel Association Rule Mining
- Count Distribution
- Data Distribution
- Candidate Distribution
- FP tree Mining and growth
- Fast Parallel Association Rule mining without candidate generation
- More Readings

Association Rule Mining

- Association rule mining
- Finding interesting patterns in data. (Analysis of past transaction data can provide valuable information on customer buying behavior.)
- A record usually contains the transaction date and the items bought.
- The literature has focused mostly on serial mining.
- Support and confidence are the parameters for association rule mining.

Association Rule Mining Parameters

- The support, supp(X), of an itemset X is the proportion of transactions in the data set that contain the itemset.
- The confidence of a rule X -> Y is the fraction of transactions containing X that also contain Y, i.e. supp(X U Y)/supp(X).

Example: supp({milk, bread, egg}) = 1/5, and the rule {milk, bread} -> {egg} has confidence 0.5.
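The two definitions can be checked with a few lines of Python. This is a minimal sketch: the five transactions below are hypothetical, chosen only so that the numbers match the example on the slide, and the helper names `supp` and `confidence` are illustrative.

```python
from fractions import Fraction

# Hypothetical transactions (not from the slides), picked to reproduce 1/5 and 0.5.
transactions = [
    {"milk", "bread", "egg"},
    {"milk", "bread"},
    {"egg"},
    {"bread"},
    {"milk"},
]

def supp(itemset, transactions):
    """Proportion of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(x, y, transactions):
    """supp(X U Y) / supp(X) for the rule X -> Y."""
    return supp(set(x) | set(y), transactions) / supp(x, transactions)

print(supp({"milk", "bread", "egg"}, transactions))          # 1/5
print(confidence({"milk", "bread"}, {"egg"}, transactions))  # 1/2
```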


Apriori Algorithm

Apriori runs in two steps.

- Generation of candidate itemsets
- Pruning of itemsets which are infrequent

Level-wise generation of frequent itemsets.

Apriori principle:

- If an itemset is frequent, then all of its subsets must also be frequent.

Apriori Algorithm for generating frequent itemsets

- Minimum support=2
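The level-wise procedure can be sketched as follows. This is an illustrative single-machine version, not the presenter's code; the four transactions are hypothetical, and minimum support is taken as an absolute count of 2 as on the slide.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Step 1: generate candidate k-itemsets by joining frequent (k-1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori principle: discard candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Step 2: count support and prune the infrequent candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical data set; items are single letters.
data = [set("ACD"), set("BCE"), set("ABCE"), set("BE")]
result = apriori(data, min_support=2)
print(result[frozenset("BCE")])  # 2
```

With this data, item D is dropped in the first pass, and BCE is the only frequent 3-itemset.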

Parallel Association Rule Mining

- The paper presents parallel algorithms for generating frequent itemsets.
- Each of the N processors has private memory and disk.
- Data is distributed evenly across the disks of the processors.
- The Count Distribution algorithm focuses on minimizing communication.
- Data Distribution makes efficient use of the aggregate memory.
- Candidate Distribution reduces synchronization between processors.

Algorithm 1: Count Distribution

- Each processor generates the complete Ck from the complete frequent itemset Lk-1.
- Each processor traverses its local data partition and develops local support counts.
- Processors exchange local counts to develop the global counts. Synchronization is needed.
- Each processor computes Lk from Ck.
- Each processor decides whether to continue or stop.
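One pass of the steps above can be simulated in a single process. In this sketch the "processors" are just list entries and the count exchange is a plain sum; a real implementation would use message passing. The function names and the data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def local_counts(partition, candidates):
    """One processor's pass over its local data partition."""
    counts = Counter()
    for t in partition:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts

def count_distribution_pass(partitions, candidates, min_support):
    """Pass k: every processor counts the full Ck locally, then counts are summed."""
    per_processor = [local_counts(p, candidates) for p in partitions]
    # Exchange and synchronization step, simulated here by summing the counters.
    global_counts = sum(per_processor, Counter())
    # Every processor would derive the same Lk from the global counts.
    return {c: n for c, n in global_counts.items() if n >= min_support}

# Hypothetical data split across two processors.
partitions = [
    [frozenset("ACD"), frozenset("BCE")],
    [frozenset("ABCE"), frozenset("BE")],
]
C2 = [frozenset(p) for p in combinations("ABCE", 2)]
L2 = count_distribution_pass(partitions, C2, min_support=2)
print(sorted("".join(sorted(s)) for s in L2))  # ['AC', 'BC', 'BE', 'CE']
```

Only the counts cross processor boundaries, never the transactions, which is why this scheme keeps communication low.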

Algorithm 2: Data Distribution

- Partition the dataset into N small chunks.
- Partition the set of candidate k-itemsets into N exclusive subsets.
- Each of the N nodes takes one subset and counts the frequency of its itemsets one chunk at a time until it has counted through all the chunks.
- Aggregate the counts.
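A single-process sketch of one Data Distribution pass follows. The round-robin split of the candidates and the sample data are assumptions made for illustration; in a real system each node would receive the remote chunks over the network instead of looping over a shared list.

```python
from itertools import combinations

def data_distribution_pass(chunks, candidates, min_support):
    """Each node counts only its own 1/N of Ck, but against all N data chunks."""
    n = len(chunks)
    # Split the candidates into N disjoint subsets (round-robin, an assumption).
    subsets = [candidates[i::n] for i in range(n)]
    frequent = {}
    for own_candidates in subsets:  # one iteration per node
        counts = {c: 0 for c in own_candidates}
        for chunk in chunks:  # the local chunk plus every remote chunk
            for t in chunk:
                for c in own_candidates:
                    if c <= t:
                        counts[c] += 1
        # Aggregate: each node contributes the frequent itemsets it owns.
        frequent.update({c: k for c, k in counts.items() if k >= min_support})
    return frequent

# Hypothetical data: two nodes, two chunks.
chunks = [
    [frozenset("ACD"), frozenset("BCE")],
    [frozenset("ABCE"), frozenset("BE")],
]
C2 = [frozenset(p) for p in combinations("ABCE", 2)]
L2 = data_distribution_pass(chunks, C2, min_support=2)
print(sorted("".join(sorted(s)) for s in L2))  # ['AC', 'BC', 'BE', 'CE']
```

Each node holds only part of Ck in memory, at the price of seeing every transaction.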

[Figure: Data Distribution. The data and the candidate set Ck are each split N ways; every node holds 1/N of the data and 1/N of Ck, and the nodes synchronize.]

Algorithm 3: Candidate Distribution

- In Count Distribution and Data Distribution, an unbalanced workload forces all processors to wait in every pass for whichever processor finishes last.
- The Candidate Distribution algorithm tries to do away with this dependency by partitioning both the data and the candidates.

[Figure: Candidate Distribution. The data is split into Data_1 to Data_5 and Lk-1 into Lk-1_1 to Lk-1_5; each partition Lk-1_i produces its own candidate set Ck_i.]

Data Partition and L Partition

- Data
- In each pass, every node grabs the tuples it needs from the dataset.
- L
- Let L3 = {ABC, ABD, ABE, ACD, ACE}.
- The items in each itemset are lexicographically ordered.
- Partition the itemsets by their common prefixes of length k-1 (here AB and AC).
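The prefix-based partitioning of L3 can be reproduced directly. In this small sketch, itemsets are written as strings of lexicographically ordered single-letter items, and `partition_by_prefix` is an illustrative name.

```python
from collections import defaultdict

def partition_by_prefix(itemsets, prefix_len):
    """Group itemsets that share the same first prefix_len items."""
    groups = defaultdict(list)
    for s in itemsets:
        groups[s[:prefix_len]].append(s)
    return dict(groups)

L3 = ["ABC", "ABD", "ABE", "ACD", "ACE"]
print(partition_by_prefix(L3, 2))
# {'AB': ['ABC', 'ABD', 'ABE'], 'AC': ['ACD', 'ACE']}
```

Itemsets in different groups can never be joined into a candidate by the prefix-based join, so each group can be assigned to a different processor.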

Rule Generation

- Example:
- Frequent itemsets: ABCDE and AB
- The rule that can be generated from these is

AB => CDE

Support : Sup(ABCDE)

Confidence : Sup(ABCDE)/Sup(AB)
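As a quick numeric illustration, the rule's support and confidence follow directly from the two itemset supports. The support values below are made up for the example; only the formulas come from the slide.

```python
def make_rule(antecedent, itemset, support):
    """Build antecedent => (itemset minus antecedent) with its support and confidence."""
    consequent = "".join(sorted(set(itemset) - set(antecedent)))
    return {
        "rule": f"{antecedent} => {consequent}",
        "support": support[itemset],
        "confidence": support[itemset] / support[antecedent],
    }

# Hypothetical supports for the two frequent itemsets.
support = {"ABCDE": 0.2, "AB": 0.4}
r = make_rule("AB", "ABCDE", support)
print(r["rule"], r["confidence"])  # AB => CDE 0.5
```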


FP Tree Algorithm

Allows frequent itemset discovery without candidate itemset generation:

- Step 1: Build a compact data structure called the FP-tree, using 2 passes over the data set.
- Step 2: Extract frequent itemsets directly from the FP-tree.

FP-Tree & FP-Growth example

Min supp=3
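The two-pass construction of Step 1 can be sketched as below. The `Node` class, the alphabetical tie-breaking rule, and the five transactions are assumptions made for illustration; the minimum support count of 3 matches the slide.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    # Order by descending count; ties are broken alphabetically (an assumption).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}
    # Pass 2: insert each transaction's frequent items in that order.
    root = Node(None, None)
    header = {item: [] for item in order}  # item -> all tree nodes holding it
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

# Hypothetical transactions; items are single letters.
data = ["AB", "BCD", "ACDE", "ADE", "ABC"]
root, header = build_fp_tree(data, min_support=3)
print(list(header))             # ['A', 'B', 'C', 'D']
print(root.children["A"].count)  # 4
```

Transactions that share a prefix of frequent items share a path in the tree, which is what keeps the structure compact. Step 2 (FP-growth) would then walk the `header` node lists to build conditional pattern bases; that part is not shown here.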

Fast Parallel Association Rule Mining Without Candidate Generation

- Phase 1:
- Each processor is given an equal number of transactions.
- Each processor counts the items locally.
- The local counts are summed to get the global counts.
- Infrequent items are pruned, and frequent items are stored in a header table in descending order of frequency.
- Parallel frequent pattern trees are constructed, one for each processor.
- Phase 2: The FP-trees are mined in a way similar to the FP-growth algorithm, using the global counts in the header table.
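The counting steps of Phase 1 can be simulated in one process. In this sketch each inner list stands for one processor's share of the transactions, the summation stands in for the inter-processor exchange, and the data, the function name, and the alphabetical tie-break are assumptions.

```python
from collections import Counter

def global_header_table(partitions, min_support):
    """Sum local item counts, prune infrequent items, sort by descending count."""
    # Each processor counts the items in its own transactions.
    local = [Counter(item for t in part for item in set(t)) for part in partitions]
    # Local counts are summed to obtain the global counts.
    total = sum(local, Counter())
    frequent = [(item, c) for item, c in total.items() if c >= min_support]
    # Descending frequency; ties broken alphabetically (an assumption).
    return sorted(frequent, key=lambda ic: (-ic[1], ic[0]))

# Hypothetical data: two processors with three transactions each.
partitions = [
    ["AB", "BCD", "ACDE"],
    ["ADE", "ABC", "AD"],
]
table = global_header_table(partitions, min_support=3)
print(table)  # [('A', 5), ('D', 4), ('B', 3), ('C', 3)]
```

Each processor would then build its local FP-tree using this shared item order, so that the trees stay compatible for the mining phase.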

Frequent pattern strings

- All frequent pattern trees are shared by all processors.
- Each processor generates the conditional pattern bases for its respective items in the header table.
- Merging all conditional pattern bases of the same item yields the frequent string.
- If the support of an item is less than the threshold, it is not added to the final frequent string.

FP-Growth on Hadoop

3 Map-Reduce(s)
