
Advanced Topics in Data Mining

  • Association Rules

  • Sequential Patterns

  • Web Mining


Where to Find References?

  • Proceedings

    • Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

    • Proceedings of IEEE International Conference on Data Mining (ICDM)

    • Proceedings of IEEE International Conference on Data Engineering (ICDE)

    • Proceedings of the International Conference on Very Large Data Bases (VLDB)

    • ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery

    • Proceedings of the International Conference on Data Warehousing and Knowledge Discovery


Where to Find References?

  • Proceedings

    • Proceedings of ACM SIGMOD International Conference on Management of Data

    • Pacific-Asia Conference on Knowledge Discovery and Data Mining(PAKDD)

    • European Conference on Principles of Data Mining and Knowledge Discovery (PKDD)

    • Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA)

    • Proceedings of the International Conference on Database and Expert Systems Applications (DEXA)


Where to Find References?

  • Journal

    • IEEE Transactions on Knowledge and Data Engineering (TKDE)

    • Data Mining and Knowledge Discovery

    • Journal of Intelligent Information Systems

    • ACM SIGMOD Record

    • The VLDB Journal (The International Journal on Very Large Data Bases)

    • Knowledge and Information Systems

    • Data & Knowledge Engineering

    • International Journal of Cooperative Information Systems




What Is Association Mining?

  • Association Rule Mining

    • Finding frequent patterns, associations, correlations, or causal structures among item sets in transaction databases, relational databases, and other information repositories

  • Applications

    • Market basket analysis (marketing strategy: items to put on sale at reduced prices), cross-marketing, catalog design, shelf-space layout design, etc.

  • Examples

    • Rule form: Body → Head [Support, Confidence]

    • buys(x, “Computer”) → buys(x, “Software”) [2%, 60%]

    • major(x, “CS”) ∧ takes(x, “DB”) → grade(x, “A”) [1%, 75%]


Market Basket Analysis

Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.


Rule Measures: Support and Confidence

  • Let minimum support = 50% and minimum confidence = 50%; then we have

    • A → C [50%, 66.6%]

    • C → A [50%, 100%]



Association Rule: Basic Concepts

  • Given

    • (1) database of transactions,

    • (2) each transaction is a list of items (purchased by a customer in a visit)

  • Find all rules that correlate the presence of one set of items with that of another set of items

  • Find all the rules A → B with minimum confidence and support

    • support, s = P(A ∪ B)

    • confidence, c = P(B|A) (a small computation sketch follows this slide)

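A minimal sketch (not from the slides) of computing these two measures; the four transactions below are chosen to reproduce the A → C [50%, 66.6%] example used on the neighbouring slides, and the function names are mine.

    # Support and confidence of a rule A -> B over a list of transactions.
    def support(itemset, transactions):
        """Fraction of transactions containing every item of `itemset`."""
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent, transactions):
        """P(B | A) = support(A and B together) / support(A)."""
        return (support(set(antecedent) | set(consequent), transactions)
                / support(antecedent, transactions))

    # four transactions consistent with the A -> C [50%, 66.6%] example
    transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(support({"A", "C"}, transactions))       # 0.5
    print(confidence({"A"}, {"C"}, transactions))  # 0.666...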

Association Rule Mining: A Road Map

  • Boolean vs. quantitative associations (Based on the types of values handled in the rule set)

    • buys(x, “SQLServer”) ∧ buys(x, “DM Book”) → buys(x, “DBMiner”) [0.2%, 60%]

    • age(x, “30..39”) ∧ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]

  • Single-dimensional vs. multi-dimensional associations

  • Single-level vs. multiple-level analysis (Based on the levels of abstraction involved in the rule set)


Terminologies

  • Item

    • I1, I2, I3, …

    • A, B, C, …

  • Itemset

    • {I1}, {I1, I7}, {I2, I3, I5}, …

    • {A}, {A, G}, {B, C, E}, …

  • 1-Itemset

    • {I1}, {I2}, {A}, …

  • 2-Itemset

    • {I1, I7}, {I3, I5}, {A, G}, …


Terminologies

  • K-Itemset

    • An itemset containing K items

  • Frequent K-Itemset

    • A K-itemset that satisfies a minimum support threshold

  • Association Rule

    • A rule that satisfies both a minimum support threshold and a minimum confidence threshold


Analysis

  • The number of candidate itemsets grows exponentially with the number of items


Mining Association Rules: Apriori Principle

Min. support 50%

Min. confidence 50%

  • For the rule A → C:

    • support = support({A, C}) = 50%

    • confidence = support({A, C}) / support({A}) = 66.6%

  • The Apriori principle:

    • Any subset of a frequent itemset must be frequent


Mining Frequent Itemsets: the Key Step

  • Find the frequent itemsets: the sets of items that have minimum support

    • A subset of a frequent itemset must also be a frequent itemset

      • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets (a pruning-check sketch based on this property follows this slide)

    • Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)

  • Use the frequent itemsets to generate association rules

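A minimal sketch (not from the slides) of the subset check implied by this property: a candidate k-itemset can be skipped if any of its (k−1)-subsets is not frequent. Function and variable names are mine.

    from itertools import combinations

    # Pruning check implied by the key step: a candidate k-itemset can only be frequent
    # if every one of its (k-1)-subsets is already known to be frequent.
    def has_infrequent_subset(candidate, frequent_prev):
        """candidate: sorted tuple of items; frequent_prev: set of frequent (k-1)-itemsets."""
        k = len(candidate)
        return any(subset not in frequent_prev for subset in combinations(candidate, k - 1))

    L2 = {("A", "B"), ("A", "C"), ("B", "C")}
    print(has_infrequent_subset(("A", "B", "C"), L2))  # False: all 2-subsets are frequent
    print(has_infrequent_subset(("A", "B", "D"), L2))  # True: ("A", "D") is not in L2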





Example of Generating Candidates

  • L3={abc, abd, acd, ace, bcd}

  • Self-joining: L3*L3

    • abcd from abc and abd

    • acde from acd and ace

  • Pruning:

    • acde is removed because ade is not in L3

  • C4 = {abcd} (the join-and-prune sketch below reproduces this)

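A small sketch of the join-and-prune candidate generation just described; the function name apriori_gen is mine, and the example reproduces the slide's L3 → C4 result.

    from itertools import combinations

    # Apriori candidate generation: self-join on the first k-1 items, then prune.
    # Reproduces the slide's example: L3 = {abc, abd, acd, ace, bcd}  =>  C4 = {abcd}.
    def apriori_gen(Lk, k):
        """Lk: set of frequent k-itemsets as sorted tuples; returns candidate (k+1)-itemsets."""
        joined = {a + (b[k - 1],) for a in Lk for b in Lk
                  if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]}   # join step
        return {c for c in joined                                      # prune step
                if all(s in Lk for s in combinations(c, k))}

    L3 = {tuple(word) for word in ["abc", "abd", "acd", "ace", "bcd"]}
    print(apriori_gen(L3, 3))
    # {('a', 'b', 'c', 'd')} -- 'acde' is generated by the join but pruned: 'ade' is not in L3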

Another Example 1

Database D (four transactions; min. support = 50%, i.e., support count ≥ 2):

  • T1: 1 3 4
  • T2: 2 3 5
  • T3: 1 2 3 5
  • T4: 2 5

Scan D and count C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
Generate L1: {1}, {2}, {3}, {5}

Generate C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
Scan D and count C2: {1,2}: 1, {1,3}: 2, {1,5}: 1, {2,3}: 2, {2,5}: 3, {3,5}: 2
Generate L2: {1,3}, {2,3}, {2,5}, {3,5}

Generate C3: {2,3,5}
Scan D and count C3: {2,3,5}: 2
Generate L3: {2,3,5}

(A runnable sketch that reproduces this trace follows.)


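A compact, self-contained sketch that runs the level-wise Apriori procedure on the four-transaction database above (minimum support count 2) and reproduces the L1, L2 and L3 shown in the trace. Function names are mine.

    from collections import Counter
    from itertools import combinations

    # Compact Apriori run over the four-transaction database above,
    # with minimum support count = 2 (50%); reproduces L1, L2 and L3 of the trace.
    def apriori(transactions, min_count):
        counts = Counter(item for t in transactions for item in t)     # first DB scan
        Lk = {(item,) for item, c in counts.items() if c >= min_count}
        k, result = 1, {}
        while Lk:
            result[k] = sorted(Lk)
            # join frequent k-itemsets that share their first k-1 items
            candidates = {a + (b[-1],) for a in Lk for b in Lk
                          if a[:-1] == b[:-1] and a[-1] < b[-1]}
            # prune candidates that have an infrequent k-subset
            candidates = {c for c in candidates
                          if all(s in Lk for s in combinations(c, k))}
            # scan the database to count the surviving candidates
            counts = Counter(c for t in transactions for c in candidates if set(c) <= t)
            Lk = {c for c, n in counts.items() if n >= min_count}
            k += 1
        return result

    D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    print(apriori(D, 2))
    # {1: [(1,), (2,), (3,), (5,)], 2: [(1, 3), (2, 3), (2, 5), (3, 5)], 3: [(2, 3, 5)]}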

Is Apriori Fast Enough? — Performance Bottlenecks

  • The core of the Apriori algorithm:

    • Use frequent (k–1)-itemsets to generate candidate frequent k-itemsets

    • Use database scan to collect counts for the candidate itemsets

  • The bottleneck of Apriori:

    • Huge candidate sets:

      • 10^4 frequent 1-itemsets will generate roughly 10^7 candidate 2-itemsets

      • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate on the order of 2^100 ≈ 10^30 candidates.

    • Multiple scans of the database:

      • Needs (n + 1) scans, where n is the length of the longest pattern




Methods to Improve Apriori’s Efficiency

  • Hash-based itemset counting: a k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent (see the bucket-counting sketch after this list)

  • Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans

  • Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

  • Sampling: mine a subset of the given data with a lowered support threshold, plus a method to determine completeness

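A minimal sketch of the hash-based counting idea (as used in DHP): while scanning transactions, every 2-itemset is hashed into a small table of buckets; a pair whose bucket count stays below the support threshold cannot be frequent and can be dropped from C2 before the counting scan. The bucket count and hash function here are illustrative, not DHP's actual choices, and the database is the four-transaction example used earlier.

    from itertools import combinations

    # Hash-based 2-itemset filtering (the DHP idea): bucket counts are an upper bound on
    # each pair's support, so pairs whose bucket never reaches min_count can be pruned.
    NUM_BUCKETS = 7                     # illustrative; real implementations use a large table

    def bucket(pair):
        return hash(pair) % NUM_BUCKETS

    def pair_filter(transactions, min_count):
        buckets = [0] * NUM_BUCKETS
        for t in transactions:
            for pair in combinations(sorted(t), 2):
                buckets[bucket(pair)] += 1
        # a pair can only be frequent if its bucket count reached min_count
        return lambda pair: buckets[bucket(pair)] >= min_count

    D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    keep = pair_filter(D, min_count=2)
    print([p for p in combinations([1, 2, 3, 5], 2) if keep(p)])   # surviving candidate pairs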



Compare Apriori & DHP (Direct Hash & Pruning)





DHP (Direct Hash & Pruning)

  • A database has four transactions

  • Let min_sup = 50%






Mining Frequent Patterns Without Candidate Generation

  • Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure

    • highly condensed, but complete for frequent pattern mining

    • avoid costly database scans

  • Develop an efficient, FP-tree-based frequent pattern mining method

    • A divide-and-conquer methodology: decompose mining tasks into smaller ones

    • Avoid candidate generation: sub-database tests only!



Construction Steps

  • Scan DB once, find frequent 1-itemset (single item pattern)

  • Order frequent items in frequency descending order

  • Sort the frequent items in each transaction according to this frequency-descending order

  • Scan the DB again and construct the FP-tree (a construction sketch follows)

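A compact sketch of these construction steps, assuming transactions are given as item sets and the support threshold as a count. The FPNode class and header-table layout are mine, not a reference implementation, and the data is the four-transaction database from the earlier Apriori example (the slides' own I1–I5 database is not reproduced in this transcript).

    from collections import Counter

    # FP-tree construction: one scan to find frequent items, order them by descending
    # frequency, then a second scan inserting each reordered transaction into the tree.
    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent, self.count = item, parent, 1
            self.children = {}

    def build_fp_tree(transactions, min_count):
        freq = Counter(i for t in transactions for i in t)
        freq = {i: c for i, c in freq.items() if c >= min_count}        # frequent 1-itemsets
        rank = {i: r for r, i in enumerate(sorted(freq, key=freq.get, reverse=True))}
        root, header = FPNode(None, None), {i: [] for i in freq}        # header: item -> node-links
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in freq), key=rank.get):  # frequency-descending
                if item in node.children:
                    node.children[item].count += 1
                else:
                    node.children[item] = FPNode(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
        return root, header

    D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    root, header = build_fp_tree(D, min_count=2)
    print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
    # each frequent item's total count in the tree equals its support count in D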

Benefits of the FP-Tree Structure

  • Completeness

    • never breaks a long pattern of any transaction

    • preserves complete information for frequent pattern mining

  • Compactness

    • reduces irrelevant information: infrequent items are gone

    • frequency-descending ordering: more frequent items are more likely to be shared

    • never larger than the original database (not counting node-links and counts)

    • Compression ratio could be over 100


Frequent Pattern Growth

Order the frequent items in frequency-descending order. The frequent patterns generated for each suffix item (see the mining-result slides below) are:

  • For I5: {I1, I5}, {I2, I5}, {I2, I1, I5}

  • For I4: {I2, I4}

  • For I3: {I1, I3}, {I2, I3}, {I2, I1, I3}

  • For I1: {I2, I1}


Frequent Pattern Growth

(Figure: trimming the database into sub-databases, one conditional pattern base per suffix item (I5, I4, I3, I1), building the FP-tree, and deriving a conditional FP-tree from each conditional pattern base.)


Conditional FP-tree

(Figure: the conditional FP-tree built from the conditional pattern base for I3.)

Mining Results Using FP-tree

For I5 (the NULL root itself generates no pattern):

  • Conditional pattern base: {(I2 I1: 1), (I2 I1 I3: 1)}

  • Conditional FP-tree: the single path ⟨NULL: 2, I2: 2, I1: 2⟩

  • Generated frequent itemsets:

    • I2: 2  ⇒  {I2, I5}: 2

    • I1: 2  ⇒  {I1, I5}: 2

    • I2 I1: 2  ⇒  {I2, I1, I5}: 2
Mining Results Using FP-tree

For I4:

  • Conditional pattern base: {(I2 I1: 1), (I2: 1)}

  • Conditional FP-tree: ⟨NULL: 2, I2: 2⟩

  • Generated frequent itemsets:

    • I2: 2  ⇒  {I2, I4}: 2


Mining Results Using FP-tree

For I3:

  • Conditional pattern base: {(I2 I1: 2), (I2: 2), (I1: 2)}

  • Conditional FP-tree: root ⟨NULL: 4⟩ with branches ⟨I2: 4, I1: 2⟩ and ⟨I1: 2⟩ (mined via the suffixes I1/I3 and I2/I3 on the next slides)


Mining Results Using FP-tree

For I1/I3:

  • Conditional pattern base: {(NULL: 2), (I2: 2)}

  • Conditional FP-tree: ⟨NULL: 4, I2: 2⟩

  • Generated frequent itemsets:

    • NULL: 4  ⇒  {I1, I3}: 4

    • I2: 2  ⇒  {I2, I1, I3}: 2


Mining Results Using FP-tree

For I2/I3:

  • Conditional pattern base: {(NULL: 4)}

  • Conditional FP-tree: ⟨NULL: 4⟩

  • Generated frequent itemsets:

    • NULL: 4  ⇒  {I2, I3}: 4


Mining Results Using FP-tree

For I1:

  • Conditional pattern base: {(NULL: 2), (I2: 4)}

  • Conditional FP-tree: ⟨NULL: 6, I2: 4⟩

  • Generated frequent itemsets:

    • I2: 4  ⇒  {I2, I1}: 4



Mining Frequent Patterns Using FP-tree

  • General idea (divide-and-conquer)

    • Recursively grow frequent pattern path using the FP-tree

  • Method

    • For each item, construct its conditional pattern-base, and then its conditional FP-tree

    • Repeat the process on each newly created conditional FP-tree

    • Until the resulting FP-tree is empty, or it contains only one path (a single path generates all combinations of its sub-paths, each of which is a frequent pattern)


Major Steps to Mine FP-tree

  • Construct conditional pattern base for each item in the FP-tree

  • Construct conditional FP-tree from each conditional pattern-base

  • Recursively mine conditional FP-trees and grow frequent patterns obtained so far

    • If the conditional FP-tree contains a single path, simply enumerate all the patterns (see the enumeration sketch after this list)

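A minimal sketch of the single-path case: every combination of the items on the path, together with the suffix being mined, is a frequent pattern whose count is the smallest count over the chosen nodes. The example uses the single-path conditional FP-tree for I5 shown earlier; function names are mine.

    from itertools import combinations

    # Single-path case: every combination of the path's nodes, plus the suffix item,
    # is a frequent pattern; its count is the minimum count over the chosen nodes.
    def patterns_from_single_path(path, suffix):
        """path: list of (item, count) from the root down; suffix: the item being mined."""
        for k in range(1, len(path) + 1):
            for combo in combinations(path, k):
                yield tuple(i for i, _ in combo) + (suffix,), min(c for _, c in combo)

    # single-path conditional FP-tree for I5 from the earlier slide: <I2: 2, I1: 2>
    for pattern, count in patterns_from_single_path([("I2", 2), ("I1", 2)], "I5"):
        print(pattern, count)
    # ('I2', 'I5') 2, ('I1', 'I5') 2, ('I2', 'I1', 'I5') 2 -- matching the mining-result slide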

Virtual Items in Association Mining

  • Different regions exhibit different selling patterns. Including the location or type of store (existing or new) where the purchase was made as a virtual item therefore enables comparisons between locations or store types within a single chain.

  • A virtual item may record whether the purchase was made with cash, a credit card, or a check. Including such a virtual item allows the association between the payment method and the items purchased to be analyzed.

  • A virtual item may record the day of the week or the time of day the transaction occurred. Including such a virtual item allows the association between the transaction time and the items purchased to be analyzed (a small sketch follows).

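A small illustrative sketch of adding virtual items: each transaction is extended with tokens describing the store, payment method, and day of week before ordinary association mining is run. The record fields and token format are hypothetical.

    # Illustrative sketch: extend each market basket with "virtual items" describing the
    # transaction (store location, payment method, weekday) so that ordinary association
    # mining can relate them to the purchased items. Field names are hypothetical.
    def add_virtual_items(transaction):
        items = set(transaction["items"])
        items.add("store=" + transaction["store"])
        items.add("payment=" + transaction["payment"])
        items.add("day=" + transaction["day"])
        return items

    t = {"items": {"milk", "bread"}, "store": "downtown", "payment": "credit card", "day": "Sat"}
    print(add_virtual_items(t))
    # contains: milk, bread, store=downtown, payment=credit card, day=Sat (set order may vary)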


Dissociation Rules

  • A dissociation rule is similar to an association rule except that it can have “not item-name” in the condition or the result of the rule

    • A and not B → C

    • A and D → not E

  • Dissociation rules can be generated by a simple adaptation of association rule analysis (see the inverted-items sketch below)

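A minimal sketch of that adaptation, following the advice on the next slide to invert only a chosen set of items: a pseudo-item "not X" is added to every transaction that lacks X, and ordinary association mining is then applied. Names and data are illustrative.

    # Sketch of preparing data for dissociation rules: add a "not X" pseudo-item for a
    # selected set of items (inverting every item would double the item count).
    def add_inverted_items(transactions, items_to_invert):
        extended = []
        for t in transactions:
            extra = {"not " + item for item in items_to_invert if item not in t}
            extended.append(set(t) | extra)
        return extended

    D = [{"A", "C"}, {"B", "C"}, {"A", "B"}]
    print(add_inverted_items(D, items_to_invert={"B"}))
    # [{'A', 'C', 'not B'}, {'B', 'C'}, {'A', 'B'}] (set order may vary)
    # -- rules like "A and not B -> C" can now be found by ordinary association mining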

Discussions

  • The size of a typical transaction grows because it now includes inverted items

  • The total number of items used in the analysis doubles

    • Since the amount of computation grows exponentially with the number of items, doubling the number of items seriously degrades performance

  • The frequency of the inverted items tends to be much larger than the frequency of the original items, so the analysis tends to produce rules in which all items are inverted. Such rules are less likely to be actionable.

    • not A and not B → not C

  • It is useful to invert only the most frequent items in the set used for analysis. It is also useful to invert some items whose inverses are of interest.


Interestingness Measurements

  • Subjective Measures

    • A rule (pattern) is interesting if

      • it is unexpected (surprising to the user)

      • actionable (the user can do something with it)

  • Objective Measures

    • Two popular measurements:

      • Support

      • Confidence


Criticism to Support and Confidence

  • Example 1

    • Among 5000 students

      • 3000 play basketball

      • 3750 eat cereal

      • 2000 both play basketball and eat cereal

    • play basketball → eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.


Criticism to Support and Confidence

  • Example 2

    • X and Y: positively correlated

    • X and Z: negatively correlated

    • yet the rule X ⇒ Z dominates in support and confidence

  • We need a measure of dependent or correlated events


Criticism to Support and Confidence

  • Improvement (Correlation)

    • Takes both P(A) and P(B) into consideration

    • P(A ∧ B) = P(A) × P(B) if A and B are independent events

    • The improvement of a rule is P(A ∧ B) / (P(A) × P(B)); A and B are negatively correlated if this value is less than 1, and positively correlated otherwise

    • When the improvement is less than 1, negating the result produces a better rule (see the worked numbers below)

      • X ⇒ not Z

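A worked check of the improvement measure on the basketball/cereal numbers from the previous slide: the value below 1 indicates negative correlation, and the negated rule scores above 1.

    # Improvement (lift) = P(A and B) / (P(A) * P(B)) on the Example 1 numbers:
    # 5000 students, 3000 play basketball, 3750 eat cereal, 2000 do both.
    n, n_a, n_b, n_ab = 5000, 3000, 3750, 2000

    lift = (n_ab / n) / ((n_a / n) * (n_b / n))
    print(lift)                                            # 0.888... < 1: negatively correlated

    # negating the consequent: basketball => not cereal
    lift_negated = ((n_a - n_ab) / n) / ((n_a / n) * ((n - n_b) / n))
    print(lift_negated)                                    # 1.333... > 1: the better rule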

Multiple-Level Association Rules

  • Items often form a hierarchy

  • Items at the lower level are expected to have lower support

  • Rules regarding itemsets at appropriate levels could be quite useful

  • Transaction database can be encoded based on dimensions and levels

  • We can explore multi-level mining




Mining Multi-Level Associations

  • A top-down, progressively deepening approach

    • First find high-level strong rules: milk → bread [20%, 60%]

    • Then find their lower-level “weaker” rules: 2% milk → wheat bread [6%, 50%]

  • Variations on mining multiple-level association rules

    • Cross-level association rules (generalized association rules): 2% milk → Wonder wheat bread

    • Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread


Multi-level Association: Uniform Support vs. Reduced Support

  • Uniform Support: the same minimum support for all levels

    • One minimum support threshold is needed.

    • Lower-level items do not occur as frequently. If the support threshold is

      • too high → low-level associations are missed

      • too low → too many high-level associations are generated

  • Reduced Support: reduced minimum support at lower levels

    • There are 4 search strategies:

      • Level-by-level independent

      • Level-cross filtering by k-itemset

      • Level-cross filtering by single item

      • Controlled level-cross filtering by single item


Uniform Support

  • Optimization Technique

    • The search avoids examining itemsets containing any item whose ancestors do not have minimum support.


Uniform Support : An Example

Level 1 (category-level items):

  • min_sup = 50%:  L1: 2, 3, 4, 5, 6;  L2: 23, 24, 25, 26, 34, 45, 46, 56;  L3: 234, 245, 246, 256, 456;  L4: 2456

  • min_sup = 60%:  L1: 2, 3, 4;  L2: 23, 24, 34;  L3: 234

Level 2 (brand-level items), the same uniform min_sup:

  • min_sup = 50%:  L1: 2, 3, 6, 8, 9;  L2: 23, 68, 69, 89;  L3: 689

  • min_sup = 60%:  L1: 9


Uniform Support : An Example

(Figure: item concept hierarchy rooted at “All”, with level-1 categories Cheese, Crab, Milk, Bread, Apple, Pie (numbered 1–6) and level-2 brand items King’s Crab, Sunset Milk, Dairyland Milk, Dairyland Cheese, Best Cheese, Best Bread, Wonder Bread, Goldenfarm Apple, Westcoast Bread, Tasty Pie (numbered 1–10).)


Uniform Support : An Example

(1) Level 1, min_sup = 50% (Apriori/DHP or FP-growth):

    L1: 2, 3, 4, 5, 6;  L2: 23, 24, 25, 26, 34, 45, 46, 56;  L3: 234, 245, 246, 256, 456;  L4: 2456

(2) Level 2, min_sup = 50%: scan the DB to get L1, then generate candidates:

    L1: 2, 3, 6, 8, 9;  C2: 23, 36, 29, 69, 28, 68, 39, 89;  C3: 239, 369, 289, 689

(3) Level 2, min_sup = 50%: scan the DB to count the candidates:

    L2: 23, 68, 69, 89;  L3: 689




Reduced Support

  • Each level of abstraction has its own minimum support threshold.


Search Strategies for Reduced Support

  • There are 4 search strategies:

    • Level-by-level independent

      • Full-Breadth Search

      • No pruning

        • No background knowledge of frequent itemsets is used for pruning

    • Level-cross filtering by single item

      • An item at the ith level is examined if and only if its parent node at the (i-1)th level is frequent.

    • Level-cross filtering by k-itemset

      • A k-itemset at the ith level is examined if and only if its corresponding parent k-itemset at the (i-1)th level is frequent.

    • Controlled level-cross filtering by single item



Reduced Support : An Example

(1) Level 1, min_sup = 60% (Apriori/DHP or FP-growth):

    L1: 2, 3, 4;  L2: 23, 24, 34;  L3: 234

(2) Level 2, min_sup = 50%: scan the DB to get

    L1: 2, 3, 6, 9

(3) Level 2, min_sup = 50% (Apriori/DHP or FP-growth):

    L2: 23, 69




Level-Cross Filtering by K-Itemset


Reduced Support : An Example

(1) Level 1, min_sup = 60% (Apriori/DHP or FP-growth):

    L1: 2, 3, 4;  L2: 23, 24, 34;  L3: 234

(2) Level 2, min_sup = 50%: scan the DB to get L1, then generate candidates (level-cross filtering by k-itemset):

    L1: 2, 3, 6, 9;  C2: 23, 36, 29, 69, 39;  C3: 239, 369

(3) Level 2, min_sup = 50%: scan the DB to count the candidates:

    L2: 23, 69




Reduced Support

  • Level-by-level independent

    • It is very relaxed in that it may lead to examining numerous infrequent items at low levels, finding associations between items of little importance.

  • Level-cross filtering by k-itemset

    • It allows the mining system to examine only the children of frequent k-itemsets.

    • This restriction is very strong in that there usually are not many k-itemsets.

    • Many valuable patterns may be filtered out.

  • Level-cross filtering by single item

    • A compromise between the above two approaches

    • This method may miss associations between low level items that are frequent based on a reduced minimum support, but whose ancestors do not satisfy minimum support.



Reduced Support : An Example

(1) Level 1, min_sup = 60%, level_passage_sup = 50% (Apriori/DHP or FP-growth):

    L1: 2, 3, 4;  L2: 23, 24, 34;  L3: 234

    Items passed down to level 2 (satisfying level_passage_sup = 50%): L1: 2, 3, 4, 5, 6

(2) Level 2, min_sup = 50%: scan the DB to get

    L1: 2, 3, 6, 8, 9

(3) Level 2, min_sup = 50% (Apriori/DHP or FP-growth):

    L2: 23, 68, 69, 89;  L3: 689




Multi-Dimensional Association

  • Single-Dimensional (Intra-Dimension) Rules: Single Dimension (Predicate) with Multiple Occurrences.

    • buys(X, “milk”) → buys(X, “bread”)

  • Multi-Dimensional Rules: ≥ 2 Dimensions

    • Inter-dimension association rules (no repeated predicates)

      • age(X, “19-25”) ∧ occupation(X, “student”) → buys(X, “coke”)

    • Hybrid-dimension association rules (repeated predicates)

      • age(X, “19-25”) ∧ buys(X, “popcorn”) → buys(X, “coke”)


Summary

  • Association rule mining

    • Probably the most significant contribution from the database community in KDD

    • A large number of papers have been published

  • Some Important Issues

    • Generalized Association Rules

    • Multiple-Level Association Rules

    • Association Analysis in Other Types of Data

      • Spatial Data, Multimedia Data, Time Series Data, etc.

    • Weighted Association Rules

    • Quantitative Association Rules


Weighted Association Rules

  • Why Weighted Association Analysis?

    • In previous work, all items in a transactional database are treated uniformly

    • Items are given weights to reflect their importance to the user

    • The weights may correspond to special promotions on some products, or the profitability of different items

      • Some products may be under promotion and hence more interesting, or some products are more profitable and hence rules concerning them are of greater value


Weighted Association Rules

  • A simple attempt to solve this problem is to eliminate the items with small weights

    • However, a rule for a heavy weighted item may also consist of low weighted items

  • Is Apriori algorithm feasible?

    • The Apriori algorithm depends on the downward closure property, which guarantees that all subsets of a frequent itemset are also frequent

    • However, this property does not hold in the weighted case


Weighted Association Rules: An Example

  • Total Benefits: 500

    • Benefits for the First Transaction: (40+30+30+20+20+10+10) = 160

    • Benefits for the Second Transaction: (40+30+20+ 20+10+10)=130

    • Benefits for the Third Transaction: (40+30+20+10+10)= 110

    • Benefits for the Fourth Transaction: (30+30+20+10+10)=100

  • Suppose Weighted_Min_Sup = 40%

    • Minimum Benefits = 500 * 40% = 200


An Example

  • Minimum Benefits = 500 * 40% = 200

  • Itemset {3,5,6,7}

    • Benefits: 70

    • Support Count (Frequency): 3

    • 70 × 3 = 210 ≥ 200, so {3,5,6,7} is a frequent itemset

  • Itemset {3,5,6}

    • Benefits: 60

    • Support Count (Frequency): 3

    • 60 × 3 = 180 < 200, so {3,5,6} is not a frequent itemset

The Apriori principle cannot be applied! (A weighted-support check is sketched below.)

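A minimal sketch of the weighted-support test used in this example: an itemset qualifies when benefits(itemset) × support count reaches Weighted_Min_Sup × Total_Benefits. The numbers are the ones on this slide; the function name is mine.

    # Weighted-support test from this example: an itemset is frequent when
    #   benefits(itemset) * support_count >= Weighted_Min_Sup * Total_Benefits.
    TOTAL_BENEFITS, W_MIN_SUP = 500, 0.40                       # threshold = 200

    def is_weighted_frequent(benefits, support_count):
        return benefits * support_count >= W_MIN_SUP * TOTAL_BENEFITS

    print(is_weighted_frequent(benefits=70, support_count=3))   # {3,5,6,7}: 210 >= 200 -> True
    print(is_weighted_frequent(benefits=60, support_count=3))   # {3,5,6}:   180 <  200 -> False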

K-Support Bound

  • If Y is a frequent q-itemset

    • Support_Count(Y) ≥ (Weighted_Min_Sup × Total_Benefits) / Benefits(Y)

  • Example

    • {3,5,6,7} is a frequent 4-itemset

      • Support_Count({3,5,6,7}) = 3 ≥ (40% × 500) / Benefits({3,5,6,7}) = (40% × 500) / 70 = 2.857

  • If X is a frequent k-itemset containing the q-itemset Y

    • Minimum_Support_Count(X) ≥ (Weighted_Min_Sup × Total_Benefits) / (Benefits(Y) + (k − q) × Maximum Remaining Weights)

  • Example

    • X is a frequent 5-itemset containing {3,5,6,7}

      • Minimum_Support_Count(X) ≥ (40% × 500) / (70 + 40) = 1.81 (a worked computation follows)

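A worked sketch of the two bounds above, using the slide's numbers (Benefits({3,5,6,7}) = 70, maximum remaining weight 40); the helper name k_support_bound is mine.

    import math

    # k-support bound: the minimum support count an itemset needs to be weighted-frequent.
    TOTAL_BENEFITS, W_MIN_SUP = 500, 0.40

    def k_support_bound(benefits):
        """Minimum support count for an itemset whose item benefits sum to `benefits`."""
        return (W_MIN_SUP * TOTAL_BENEFITS) / benefits

    print(k_support_bound(70))                  # 2.857... -> {3,5,6,7} needs support count >= 3
    # bound for a 5-itemset containing {3,5,6,7}: add the largest remaining item weight (40)
    print(k_support_bound(70 + 40))             # 1.818... -> such a superset needs count >= 2
    print(math.ceil(k_support_bound(70 + 40)))  # 2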

K-Support Bound

  • Itemset {1,2}

    • Benefits: 70

    • Support_Count({1,2}) = 1 < (40% × 500) / Benefits({1,2}) = (40% × 500) / 70 = 2.857

    • {1,2} is not a Frequent Itemset

  • If X is a frequent 3-itemset containing {1,2}

    • Minimum_Support_Count(X) ≥ (40% × 500) / (70 + 30) = 2

    • But, Maximum_Support_Count(X) = 1

    • No frequent 3-itemsets containing {1,2}

  • If X is a frequent 4-itemset containing {1,2}

    • Minimum_Support_Count(X) ≥ (40% × 500) / (70 + 30 + 20) = 1.667

    • But, Maximum_Support_Count(X) = 1

    • No frequent 4-itemsets containing {1,2}

  • Similarly, no frequent 5, 6, 7-itemsets containing {1,2}

  • The algorithm is designed based on this k-support bound



Step by Step

  • Input: Product & Transactional Databases

Weighted_Min_Sup = 50%

Total Profits: 1380


Step 2, 7

  • Search(D)

    • This subroutine finds the maximum transaction size in the transactional database D

      • Size = 4 in this case

  • Counting(D, w)

    • This subroutine accumulates the support counts of the 1-itemsets

    • The k-support bounds of each 1-itemset will be calculated, and the 1-itemsets with support counts greater than any of the k-support bounds will be kept in C1


Step 7

K-support bound = (50% × 1380) / (10 + 90) = 690 / 100 = 6.9


Step 11

  • Join(Ck-1)

    • The Join step generates Ck from Ck-1, as in the Apriori algorithm

      • If we have {1, 2, 3} and {1, 2, 4} in Ck-1, then {1, 2, 3, 4} will be generated in Ck

    • In this case,

      • C1 = {1 (4), 2 (5), 4 (6), 5 (7)}

      • C2 = Join(C1) = {12, 14, 15, 24, 25, 45}

(The numbers in parentheses are support counts.)


Step 12

  • Prune(Ck)

    • The itemset will be pruned in either of the following cases

      • A subset of the candidate itemset in Ck does not exist in Ck-1

      • Estimate an upper bound on the support count (SC) of the joined itemset X: the minimum support count among the k different (k-1)-subsets of X in Ck-1. If this estimated upper bound shows that X cannot be a subset of any large itemset in the coming passes (from the calculation of the k-support bounds for all itemsets), the itemset is pruned

    • In this case,

      • C2 = Prune(C2) = {12 (4), 14 (4), 15 (4), 24 (5), 25 (5), 45 (6)}

      • Pruning uses the k-support bound (the numbers in parentheses are estimated support counts)


Step 12

  • Prune(Ck)

    • Using the k-support bound (no itemset is pruned)


Step 13

  • Checking(Ck, D)

    • Scan DB & Generate Lk

(Tables: the candidate set C2 with its support counts, and the resulting L2.)


Step 11, 12

  • Join(C2)

    • C2 = {15 (4), 24 (5), 25 (5), 45 (6)}

    • C3 = Join(C2) = {245}

  • Prune(C3)

    • C3 = Prune(C3) = {245 (5)}

    • Using the k-support bound (no itemset is pruned)


Step 13

  • Checking(C3, D): Scan DB

  • C3 = L3 = {245}

  • Finally, L = {45, 245}


Step 15

  • Generate Rules for L = {45, 245}

    • 4 → 5 (confidence = 100%)

    • 5 → 4 (confidence = 85.7%)

    • 24 → 5 (confidence = 100%)

    • 25 → 4 (confidence = 100%)

    • 45 → 2 (confidence = 83.3%)

    • 2 → 45 (confidence = 100%)

    • 4 → 25 (confidence = 83.3%)

    • 5 → 24 (confidence = 71.4%)

Min_Conf = 90%: only the rules with 100% confidence are kept (a rule-generation sketch follows).

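A sketch of this final step: for each frequent itemset F, every rule X → (F − X) is generated and kept if its confidence count(F) / count(X) meets Min_Conf. The support counts below are the ones implied by the confidences listed above; names are mine.

    from itertools import combinations

    # Rule generation for L = {45, 245}: for each frequent itemset F and each non-empty
    # proper subset X, emit X -> (F - X) with confidence count(F) / count(X).
    # Support counts implied by the confidences listed on the slide:
    counts = {(2,): 5, (4,): 6, (5,): 7, (2, 4): 5, (2, 5): 5, (4, 5): 6, (2, 4, 5): 5}

    def rules(frequent_itemsets, min_conf):
        for F in frequent_itemsets:
            for r in range(1, len(F)):
                for X in combinations(F, r):
                    conf = counts[F] / counts[X]
                    if conf >= min_conf:
                        yield X, tuple(i for i in F if i not in X), conf

    for lhs, rhs, conf in rules([(4, 5), (2, 4, 5)], min_conf=0.9):
        print(lhs, "->", rhs, f"(confidence = {conf:.1%})")
    # kept: 4 -> 5, 2 -> 45, 24 -> 5, 25 -> 4 (all 100%); the other four rules fall below 90%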


Quantitative Association Rules

  • Let min_sup = 50%; we have

    • {A, B} 60%

    • {B, D} 70%

    • {A, B, D} 50%

  • {A(1..2), B(3)} 50%

  • {A(1..2), B(3..5)} 60%

  • {A(1..2), B(1..5)} 60%

  • {B(1..5), D(1..2)} 60%

  • {B(3..5), D(1..3)} 60%

  • {B(1..5), D(1..3)} 70%

  • {B(1..3), D(1..2)} 50%

  • {B(3..5), D(1..2)} 50%

  • {A(1..2), B(3..5), D(1..2)} 50%

