Parallel association rule mining

Parallel Association Rule Mining

Presented by: Ramoza Ahsan and Xiao Qin

November 5th, 2013


Outline

  • Background of Association Rule Mining

  • Apriori Algorithm

  • Parallel Association Rule Mining

  • Count Distribution

  • Data Distribution

  • Candidate Distribution

  • FP tree Mining and growth

  • Fast Parallel Association Rule mining without candidate generation

  • More Readings

  • FP tree over Hadoop


Association Rule Mining

  • Association rule mining

    • Finding interesting patterns in data. (Analysis of past transaction data can provide valuable information on customer buying behavior.)

  • A record usually contains the transaction date and the items bought.

  • Most work in the literature has focused on serial mining.

  • Support and confidence are the parameters of association rule mining.


Association Rule Mining Parameters

  • The support, supp(X), of an itemset X is the proportion of transactions in the data set that contain the itemset.

  • The confidence of a rule X -> Y is the fraction of transactions containing X that also contain Y, i.e. supp(X U Y)/supp(X).

Example: supp({milk, bread, egg}) = 1/5, and the rule {milk, bread} -> {egg} has confidence 0.5.
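Both measures can be computed directly; a minimal sketch in Python, where the five transactions are made up to reproduce the numbers stated above:

```python
transactions = [
    {"milk", "bread", "egg"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"egg", "butter"},
]

def supp(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X -> Y, i.e. supp(X u Y) / supp(X)."""
    return supp(set(x) | set(y), transactions) / supp(x, transactions)

print(supp({"milk", "bread", "egg"}, transactions))          # 0.2
print(confidence({"milk", "bread"}, {"egg"}, transactions))  # 0.5
```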



Apriori Algorithm

Apriori runs in two steps:

  • Generation of candidate itemsets

  • Pruning of infrequent itemsets

    Level-wise generation of frequent itemsets.

    Apriori principle:

  • If an itemset is frequent, then all of its subsets must also be frequent.
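A compact sequential sketch of the level-wise procedure (join candidates from Lk-1, prune by the Apriori principle, then count); the function name and data layout are illustrative, not from the paper:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: generate candidates Ck from Lk-1, then prune."""
    n = len(transactions)
    # L1: frequent single items
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) / n >= min_support}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: unions of Lk-1 itemsets that give size-k candidates
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: every (k-1)-subset must be frequent (Apriori principle)
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Count step: keep candidates that meet the support threshold
        Lk = {c for c in Ck
              if sum(c <= t for t in transactions) / n >= min_support}
        frequent |= Lk
        k += 1
    return frequent
```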



Parallel Association Rule Mining

  • The paper presents parallel algorithms for generating frequent itemsets.

  • Each of the N processors has a private memory and disk.

  • Data is distributed evenly across the disks of the processors.

  • The Count Distribution algorithm focuses on minimizing communication.

  • Data Distribution exploits the aggregate memory of the system.

  • Candidate Distribution reduces synchronization between processors.


Algorithm 1: Count Distribution

  • Each processor generates the complete candidate set Ck using the complete frequent itemset Lk-1.

  • Each processor traverses its local data partition and develops local support counts.

  • Processors exchange their local counts to develop global counts; synchronization is needed here.

  • Each processor computes Lk from Ck.

  • Each processor independently decides whether to continue or stop.
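The steps above amount to local counting followed by a global sum; a sequential simulation, in which plain lists stand in for the per-processor data partitions and a Counter sum stands in for the count exchange:

```python
from collections import Counter

def count_distribution_pass(partitions, Ck, min_count):
    """One pass of Count Distribution: every 'processor' counts the complete
    candidate set Ck over its own data partition, then the local counts are
    exchanged and summed into global counts."""
    local_counts = []
    for part in partitions:                  # one processor's private data
        c = Counter()
        for t in part:
            for cand in Ck:
                if cand <= t:                # candidate occurs in transaction
                    c[cand] += 1
        local_counts.append(c)
    global_counts = sum(local_counts, Counter())   # the count-exchange step
    return {cand for cand in Ck if global_counts[cand] >= min_count}
```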


Algorithm 2: Data Distribution

  • Partition the dataset into N small chunks.

  • Partition the set of candidate k-itemsets into N exclusive subsets.

  • Each of the N nodes takes one candidate subset and counts the frequency of its itemsets in one chunk at a time, until it has counted through all the chunks.

  • Aggregate the counts.
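In contrast to Count Distribution, each node here owns only 1/N of the candidates but must see every data chunk. A sequential sketch; the round-robin candidate split is an assumption, and the paper's exact partitioning may differ:

```python
def data_distribution_pass(chunks, Ck, min_count, n_nodes):
    """Data Distribution: candidates are split across nodes; each node counts
    only its own candidate subset, but over every data chunk in turn."""
    cand_list = sorted(Ck, key=sorted)
    # round-robin split of the candidates into N exclusive subsets (assumed)
    subsets = [cand_list[i::n_nodes] for i in range(n_nodes)]
    Lk = set()
    for subset in subsets:                   # the work done at one node
        counts = {c: 0 for c in subset}
        for chunk in chunks:                 # the node sees every chunk
            for t in chunk:
                for c in subset:
                    if c <= t:
                        counts[c] += 1
        Lk |= {c for c, n in counts.items() if n >= min_count}
    return Lk
```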


[Diagrams: the dataset and the candidate set Ck are each split into N pieces; every node holds 1/N of the data and 1/N of Ck, and the nodes synchronize after each counting pass.]

Algorithm 3: Candidates Distribution

  • If the workload is not balanced, every pass can leave all processors waiting for whichever processor finishes last.

  • The Candidates Distribution algorithm tries to remove this dependency by partitioning both the data and the candidates.


[Diagrams: both the data and the frequent itemsets are partitioned; each of the N processors (here N = 5) receives its own Data_i, Lk-1_i, and Ck_i and proceeds independently.]


Data Partition and L Partition

  • Data

    • Each pass, every node grabs the necessary tuples from the dataset.

  • L

    • Let L3={ABC, ABD, ABE, ACD, ACE}

    • The items in the itemsets are lexicographically ordered.

    • Partition the itemsets based on common prefixes of length k-1: here {ABC, ABD, ABE} share the prefix AB, and {ACD, ACE} share AC.
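The prefix-based split of L can be sketched as follows; with the L3 above, ABC/ABD/ABE land in one partition and ACD/ACE in another:

```python
def partition_by_prefix(Lk):
    """Group lexicographically ordered itemsets by their common (k-1)-item
    prefix; each group can then be assigned to one processor."""
    groups = {}
    for itemset in Lk:
        items = tuple(sorted(itemset))       # lexicographic item order
        groups.setdefault(items[:-1], []).append(items)
    return groups

L3 = [frozenset("ABC"), frozenset("ABD"), frozenset("ABE"),
      frozenset("ACD"), frozenset("ACE")]
groups = partition_by_prefix(L3)
# two partitions: prefix ('A', 'B') and prefix ('A', 'C')
```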


Rule Generation

  • Example:

    • Frequent itemsets: ABCDE and AB

    • A rule that can be generated from this pair is

      AB => CDE

      Support: supp(ABCDE)

      Confidence: supp(ABCDE)/supp(AB)
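Given precomputed supports, the rule's metrics follow directly; the support values below are invented for illustration:

```python
def rule_metrics(supports, antecedent, itemset):
    """Support and confidence of the rule antecedent => (itemset - antecedent),
    looked up in a dict of precomputed itemset supports."""
    return supports[itemset], supports[itemset] / supports[antecedent]

# hypothetical support values for the slide's example
supports = {frozenset("ABCDE"): 0.1, frozenset("AB"): 0.4}
support, conf = rule_metrics(supports, frozenset("AB"), frozenset("ABCDE"))
# support = 0.1, confidence = 0.25
```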



FP Tree Algorithm

Allows frequent itemset discovery without candidate itemset generation:

  • Step 1: Build a compact data structure called the FP-tree, using two passes over the data set.

  • Step 2: Extract frequent itemsets directly from the FP-tree.
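The two passes of Step 1 can be sketched as: pass one counts items and fixes a descending-frequency order, pass two inserts each reordered transaction into a shared-prefix tree. Class and function names here are illustrative:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_count):
    """Pass 1: count items, prune infrequent ones, and fix a descending
    frequency order. Pass 2: insert each reordered transaction."""
    freq = Counter(i for t in transactions for i in t)
    order = [i for i, n in freq.most_common() if n >= min_count]
    rank = {i: r for r, i in enumerate(order)}
    root = FPNode(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1   # shared prefix: bump count
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, order
```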



Fast Parallel Association Rule Mining Without Candidacy Generation

  • Phase 1:

    • Each processor is given an equal number of transactions.

    • Each processor locally counts the items.

    • The local counts are summed to get global counts.

    • Infrequent items are pruned, and frequent items are stored in a header table in descending order of frequency.

    • Each processor then constructs its own parallel frequent pattern tree.

  • Phase 2: Mine each FP tree as in the FP-growth algorithm, using the global counts in the header table.
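Phase 1's summing, pruning, and ordering can be sketched over per-processor item counts; a Counter per processor is an assumed representation:

```python
from collections import Counter

def global_header_table(local_item_counts, min_count):
    """Sum the per-processor item counts, prune infrequent items, and order
    the survivors by descending global frequency (the header table)."""
    total = Counter()
    for counts in local_item_counts:     # one Counter per processor
        total.update(counts)
    return [(item, n) for item, n in total.most_common() if n >= min_count]
```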


Example with min supp = 4

[Slide images: the worked example from Step 1 through Step 4; Step 4 shows the items remaining after pruning the infrequent ones.]

FP tree for P0

[Diagram: the FP-tree at processor P0 growing as transactions are inserted, with node counts such as B:1 -> B:2 -> B:3, A:1 -> A:2, D:1 -> D:2, F:1, G:1.]



Frequent Pattern Strings

  • All frequent pattern trees are shared by all processors.

  • Each processor generates conditional pattern bases from its items in the header table.

  • Merging all conditional pattern bases of the same item yields its frequent pattern string.

  • If the support of an item is below the threshold, it is not added to the final frequent string.
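The merge step amounts to summing the counts of identical prefix paths across processors; representing a conditional pattern base as a dict from prefix-path tuple to count is an assumption:

```python
from collections import Counter

def merge_pattern_bases(local_bases, min_count):
    """Merge per-processor conditional pattern bases for one header item:
    identical prefix paths have their counts summed, and paths whose merged
    count falls below the support threshold are dropped."""
    merged = Counter()
    for base in local_bases:          # base: {prefix_path: count}
        merged.update(base)
    return {path: n for path, n in merged.items() if n >= min_count}
```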


More Readings



FP-Growth on Hadoop

Three MapReduce jobs




Thank You!