

Parallel Association Rule Mining

Presented by: Ramoza Ahsan and Xiao Qin

November 5th, 2013

Outline
  • Background of Association Rule Mining
  • Apriori Algorithm
  • Parallel Association Rule Mining
  • Count Distribution
  • Data Distribution
  • Candidate Distribution
  • FP tree Mining and growth
  • Fast Parallel Association Rule mining without candidate generation
  • More Readings
Association Rule Mining
  • Association rule mining
    • Finding interesting patterns in data: analysis of past transaction data can provide valuable information on customer buying behavior.
  • Records usually contain the transaction date and the items bought.
  • The literature has focused mostly on serial mining.
  • Support and confidence are the key parameters of association rule mining.
Association Rule Mining Parameters
  • The support, supp(X), of an itemset X is the proportion of transactions in the data set that contain X.
  • The confidence of a rule X -> Y is the fraction of transactions containing X that also contain Y, i.e. supp(X ∪ Y)/supp(X).

For the slide's five-transaction example, supp({milk, bread, egg}) = 1/5 and the rule {milk, bread} -> {egg} has confidence 0.5.
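These definitions can be checked with a short sketch; the five transactions below are assumed toy data, chosen to reproduce the slide's numbers:

```python
# Assumed toy dataset: five transactions chosen to match the slide's example.
transactions = [
    {"milk", "bread", "egg"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "egg"},
    {"bread"},
]

def support(itemset, data):
    """supp(X): fraction of transactions containing every item of X."""
    itemset = set(itemset)
    return sum(itemset <= t for t in data) / len(data)

def confidence(x, y, data):
    """Confidence of the rule X -> Y, i.e. supp(X U Y) / supp(X)."""
    return support(set(x) | set(y), data) / support(x, data)

print(support({"milk", "bread", "egg"}, transactions))       # 0.2, i.e. 1/5
print(confidence({"milk", "bread"}, {"egg"}, transactions))  # 0.5
```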

Outline
  • Background of Association Rule Mining
  • Apriori Algorithm
  • Parallel Association Rule Mining
  • Count Distribution
  • Data Distribution
  • Candidate Distribution
  • FP tree Mining and growth
  • Fast Parallel Association Rule mining without candidate generation
  • FP tree over Hadoop
Apriori Algorithm

Apriori runs in two alternating steps:

  • Generation of candidate itemsets
  • Pruning of candidates that are infrequent

Frequent itemsets are generated level-wise, one itemset size per pass.

Apriori principle:

  • If an itemset is frequent, then all of its subsets must also be frequent.
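A minimal level-wise sketch of these two steps (toy code, not the paper's implementation; the `apriori` helper and the dataset are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining: generate candidates, then prune."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # Candidate generation: join L(k-1) with itself, keep size-k unions.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori principle (contrapositive): drop any candidate that has an
        # infrequent (k-1)-subset before counting.
        prev = set(current)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count support and keep the survivors.
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

data = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
result = apriori(data, min_support=0.6)
# {A}, {B}, {C} and all three pairs are frequent; {A,B,C} (support 0.4) is not.
```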
Parallel Association Rule Mining
  • The paper presents parallel algorithms for generating frequent itemsets.
  • Each of the N processors has a private memory and a private disk.
  • Data is distributed evenly across the processors' disks.
  • The Count Distribution algorithm focuses on minimizing communication.
  • Data Distribution exploits the machine's aggregate memory.
  • Candidate Distribution reduces synchronization between processors.
Algorithm 1: Count Distribution
  • Each processor generates the complete candidate set Ck from the complete frequent set Lk-1.
  • Each processor traverses its local data partition and develops local support counts.
  • Processors exchange local counts to develop the global counts; synchronization is needed here.
  • Each processor computes Lk from Ck.
  • Each processor independently makes the (identical) decision to continue or terminate.
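One pass of Count Distribution can be simulated serially; the two "processor" partitions and the candidate pairs below are assumed toy data, and a plain loop stands in for the all-to-all count exchange:

```python
from collections import Counter

def local_counts(candidates, partition):
    """Step 2: a processor counts every candidate over its local partition."""
    return Counter({c: sum(c <= t for t in partition) for c in candidates})

def count_distribution_pass(candidates, partitions, min_count):
    """Steps 3-4: sum local counts into global counts, then derive Lk."""
    global_counts = Counter()
    for part in partitions:                 # stands in for the count exchange
        global_counts += local_counts(candidates, part)
    return {c for c in candidates if global_counts[c] >= min_count}

partitions = [
    [{"A", "B"}, {"A", "C"}],   # processor P0's local data
    [{"A", "B"}, {"B", "C"}],   # processor P1's local data
]
C2 = {frozenset("AB"), frozenset("AC"), frozenset("BC")}
print(count_distribution_pass(C2, partitions, min_count=2))
```

Note that every processor holds the full candidate set Ck; only the data is partitioned, which is why a single exchange of counts per pass suffices.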
Algorithm 2: Data Distribution
  • Partition the dataset into N small chunks.
  • Partition the set of candidate k-itemsets into N exclusive subsets.
  • Each of the N nodes takes one candidate subset and counts its frequencies one chunk at a time, until it has seen every chunk.
  • Aggregate the counts.
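A serial sketch of the same idea (assumed toy data; the node loop that would run in parallel on real hardware is written sequentially here):

```python
def data_distribution_pass(candidates, chunks, min_count):
    """Each node counts only its own 1/N of the candidates over all chunks."""
    ordered = sorted(candidates, key=sorted)
    n_nodes = len(chunks)                      # one node per chunk, for the sketch
    per_node = [ordered[i::n_nodes] for i in range(n_nodes)]
    frequent = set()
    for node_cands in per_node:                # parallel on real hardware
        counts = {c: 0 for c in node_cands}
        for chunk in chunks:                   # the node streams through every chunk
            for t in chunk:
                for c in node_cands:
                    if c <= t:
                        counts[c] += 1
        frequent |= {c for c, n in counts.items() if n >= min_count}
    return frequent

chunks = [[{"A", "B"}, {"A", "C"}], [{"A", "B"}, {"B", "C"}]]
C2 = {frozenset("AB"), frozenset("AC"), frozenset("BC")}
print(data_distribution_pass(C2, chunks, min_count=2))  # only {A, B} reaches 2
```

The contrast with Count Distribution: here the candidates are partitioned (so each node needs memory for only 1/N of Ck) while the data moves, rather than the counts.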
Algorithm 2: Data Distribution

[Diagram slides: the data is split into N chunks of 1/N each and Ck into N disjoint 1/N subsets; each node pairs its candidate subset with successive data chunks, with a synchronization step as chunks are exchanged between passes.]

Algorithm 3: Candidates Distribution
  • If the workload is not balanced, every processor must wait for whichever processor finishes last in each pass.
  • The Candidate Distribution algorithm tries to remove this dependency by partitioning both the data and the candidates.
Algorithm 3: Candidates Distribution

[Diagram slides: both the data and Lk-1 are partitioned; processor i holds Data_i and the Lk-1 partitions from which it generates its own candidate subset Ck_i, so processors can proceed largely independently.]

Data Partition and L Partition
  • Data
    • Each pass, every node grabs the necessary tuples from the dataset.
  • L
    • Let L3 = {ABC, ABD, ABE, ACD, ACE}
    • The items in each itemset are lexicographically ordered.
    • Partition the itemsets by their common (k-1)-item prefixes: {ABC, ABD, ABE} share AB, and {ACD, ACE} share AC.
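The L-partition rule applied to the slide's L3 example can be sketched as follows, writing each itemset as a sorted string for brevity:

```python
from collections import defaultdict

def partition_by_prefix(itemsets, k):
    """Group lexicographically ordered k-itemsets by their (k-1)-item prefix."""
    groups = defaultdict(list)
    for s in itemsets:
        groups[s[:k - 1]].append(s)
    return dict(groups)

L3 = ["ABC", "ABD", "ABE", "ACD", "ACE"]
print(partition_by_prefix(L3, 3))
# {'AB': ['ABC', 'ABD', 'ABE'], 'AC': ['ACD', 'ACE']}
```

Because candidates in Ck are joins of Lk-1 itemsets sharing a (k-1)-prefix, each prefix group can be assigned to one processor and extended there without consulting the others.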
Rule Generation
  • Example:
    • Frequent itemsets: ABCDE and AB
    • One rule that can be generated from them:

AB => CDE

Support : Sup(ABCDE)

Confidence : Sup(ABCDE)/Sup(AB)
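With hypothetical support counts (the numbers below are assumed for illustration), the rule's measures are computed as:

```python
# Assumed support counts, for illustration only.
sup = {"ABCDE": 10, "AB": 40}

rule_support = sup["ABCDE"]                  # support of AB => CDE is sup(ABCDE)
rule_confidence = sup["ABCDE"] / sup["AB"]   # sup(ABCDE) / sup(AB)
print(rule_support, rule_confidence)         # 10 0.25
```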

Outline
  • Background of Association Rule Mining
  • Apriori Algorithm
  • Parallel Association Rule Mining
  • Count Distribution
  • Data Distribution
  • Candidate Distribution
  • FP tree Mining and growth
  • Fast Parallel Association Rule mining without candidate generation
  • FP tree over Hadoop
FP Tree Algorithm

The FP-tree approach allows frequent itemset discovery without candidate itemset generation:

  • Step 1: Build a compact data structure called the FP-tree, using two passes over the data set.
  • Step 2: Extract frequent itemsets directly from the FP-tree.
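A minimal two-pass construction sketch with assumed toy transactions (a full implementation would also maintain header-table node links for the mining step):

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, and children keyed by item."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_count):
    # Pass 1: global item counts; the frequent items, in descending order of
    # frequency, give the header-table order.
    counts = Counter(i for t in transactions for i in t)
    order = [i for i, c in counts.most_common() if c >= min_count]
    rank = {i: r for r, i in enumerate(order)}
    # Pass 2: insert each transaction's frequent items, reordered by rank,
    # into a prefix tree so that common prefixes share nodes.
    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, order

root, header = build_fp_tree(
    [{"B", "A", "D"}, {"B", "D", "F"}, {"B", "A", "G"}], min_count=2)
print(header)                    # B first (count 3), then A and D (count 2 each)
print(root.children["B"].count)  # 3: all three transactions share the B prefix
```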
Fast Parallel Association Rule Mining Without Candidacy Generation
  • Phase 1:
    • Each processor is given an equal number of transactions.
    • Each processor counts the items locally.
    • The local counts are summed to obtain the global counts.
    • Infrequent items are pruned; frequent items are stored in a header table in descending order of frequency.
    • A parallel frequent pattern tree is then constructed at each processor.
  • Phase 2: each FP tree is mined as in the FP-growth algorithm, using the global counts in the header table.
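Phase 1's global counting step can be sketched by summing per-processor Counters; the two partitions below are assumed toy data:

```python
from collections import Counter

# Assumed toy data: two processors, each with an equal share of transactions.
local_parts = [
    [{"B", "A", "D"}, {"B", "D"}],   # processor P0
    [{"B", "A"}, {"A", "G"}],        # processor P1
]

# Each processor counts its items locally ...
local = [Counter(i for t in part for i in t) for part in local_parts]
# ... and the local counts are summed into the global counts.
global_counts = sum(local, Counter())

# Prune infrequent items; the survivors, in descending order of frequency,
# form the header table shared by all processors.
min_count = 2
header = sorted((i for i, c in global_counts.items() if c >= min_count),
                key=lambda i: -global_counts[i])
print(header)   # B and A (count 3) before D (count 2); G (count 1) is pruned
```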
Example with min supp = 4

[Figure slide: a worked example; Step 1 shows the transaction table, and Step 4 shows the header table after pruning infrequent items.]

FP tree for P0

[Figure slide: P0's FP-tree shown growing transaction by transaction, with nodes labeled B, A, D, F, G and their running counts.]

Frequent pattern strings
  • All frequent pattern trees are shared among the processors.
  • Each processor generates conditional pattern bases for its assigned items from the header table.
  • Merging all conditional pattern bases of the same item yields that item's frequent pattern string.
  • If an item's support is below the threshold, it is not added to the final frequent string.
More Readings

[The four references [1]-[4] appeared as images on the slides and are not recoverable from the transcript.]

FP-Growth on Hadoop

Implemented as three chained MapReduce jobs.
