
Online Mining (Recently) Maximal Frequent Itemsets over Data Streams

Hua-Fu Li, Suh-Yin Lee, Man Kwan Shan. RIDE-SDMA '05. Speaker: 董原賓. Advisor: 柯佳伶.


Presentation Transcript


  1. Online Mining (Recently) Maximal Frequent Itemsets over Data Streams
     • Hua-Fu Li, Suh-Yin Lee, Man Kwan Shan, RIDE-SDMA '05
     • Speaker: 董原賓
     • Advisor: 柯佳伶

  2. Introduction
     • Difficulties of data stream mining
       • Huge
       • High speed
       • Continuous
     • Solution: one-pass algorithm
       • Summary data structure
       • Mines the maximal frequent itemsets

  3. Definition
     • Ψ = {i1, i2, …, in}: a set of items
     • Wi: basic window i
     • Data stream = [W1, W2, …, WN): an infinite sequence of basic windows
     • N: the window identifier of the latest basic window
     • Current length of data stream: CL = |W1| + |W2| + … + |WN|
     • Example (timeline figure): W1 = {abc, bcd, acd}, W2 = {cd, abd, bc}, …, WN = {a, b, cd}; every basic window holds 3 transactions, so CL = 3 × N

  4. Definition
     • X.tsup: true support of itemset X
     • X.esup: estimated support of itemset X, 1 ≤ X.esup ≤ X.tsup
     • X.CL = |Wj| + |Wj+1| + … + |WN|
     • Wj: the first window containing X in the summary data structure
     • s: minimum support
     • ε: maximum support error threshold
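The two length definitions (CL from slide 3 and X.CL from slide 4) can be checked on the example stream. This is only an illustrative sketch: the helper names `current_length` and `itemset_length` are hypothetical, window identifiers are assumed to start at 1, and only the two windows spelled out on slide 3 are used.

```python
# Basic windows spelled out in the slide 3 example, in arrival order.
windows = [
    ["abc", "bcd", "acd"],  # W1
    ["cd", "abd", "bc"],    # W2
]

def current_length(windows):
    """CL = |W1| + |W2| + ... + |WN| (total number of transactions seen)."""
    return sum(len(w) for w in windows)

def itemset_length(windows, j):
    """X.CL = |Wj| + ... + |WN|, where Wj (1-based) is the window in which
    X first entered the summary data structure."""
    return sum(len(w) for w in windows[j - 1:])

print(current_length(windows))     # 6, i.e. 3 x N with N = 2
print(itemset_length(windows, 2))  # 3
```

An itemset that first appears in a later window is therefore measured against a shorter stream prefix than one that has been tracked since W1.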

  5. Data Stream Mining for Maximal Frequent Itemsets (DSM-MFI)
     • Step 1: read a window of transactions
     • Step 2: construct and maintain the summary data structure
     • Step 3: prune the infrequent information
     • Step 4: search for the maximal frequent itemsets

  6. Summary Frequent Itemsets forest (SFI-forest)
     • Composed of an FI-list and a set of SFI-trees
     • Each SFI-tree node stores:
       • item-id: the item identifier
       • esup: the number of transactions reaching the node with that item-id
       • window-id: the identifier of the current basic window, assigned when the node is created
       • node-link: a link to the next node with the same item-id in the same SFI-tree

  7. Summary Frequent Itemsets forest (SFI-forest)
     • Each FI-list entry stores:
       • item-id: the item identifier
       • esup: the number of transactions containing the item
       • window-id: the identifier of the current basic window, assigned when the entry is created
       • head link: a link to the root node of item-id.SFI-tree

  8. Summary Frequent Itemsets forest (SFI-forest)
     • Each SFI-tree has a specific opposite frequent item list (OFI-list)
     • Each OFI-list entry stores (item-id, esup, window-id, head link)
     • The head link points to the first node carrying the item-id in the SFI-tree
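The fields listed on slides 6 to 8 map naturally onto a few record types. The sketch below is one possible layout, not the authors' implementation: the class names (`SFINode`, `ListEntry`, `SFITree`, `SFIForest`) are hypothetical, and the `children` map is an assumption added so that a tree can be walked from its root.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class SFINode:
    item_id: str
    esup: int        # transactions reaching this node
    window_id: int   # basic window in which the node was created
    node_link: Optional["SFINode"] = None  # next node with the same item-id
    # Assumption: child pointers keyed by item-id (not listed on the slides).
    children: Dict[str, "SFINode"] = field(default_factory=dict)

@dataclass
class ListEntry:
    """Entry shape shared by the FI-list and the OFI-lists."""
    item_id: str
    esup: int
    window_id: int
    head_link: Optional[SFINode] = None

@dataclass
class SFITree:
    root: SFINode
    ofi_list: Dict[str, ListEntry] = field(default_factory=dict)

@dataclass
class SFIForest:
    fi_list: Dict[str, ListEntry] = field(default_factory=dict)
    trees: Dict[str, SFITree] = field(default_factory=dict)

# Minimal usage: register item "a" seen once in basic window 1.
forest = SFIForest()
root = SFINode("a", esup=1, window_id=1)
forest.trees["a"] = SFITree(root=root)
forest.fi_list["a"] = ListEntry("a", esup=1, window_id=1, head_link=root)
```

Here the FI-list head link and the tree root refer to the same node object, which mirrors the "head link links to the root node" description on slide 7.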

  9. Example: W1 = {abc, bcd, acd}, T = abc
     • Transaction projection (T): abc → abc, bc, c
     • SFI-tree-maintenance is called on abc, bc, and c
     • [Figure: FI-list, the SFI-trees of a, b, and c, and their OFI-lists after inserting T; entries are labelled (item-id, esup, window-id, node link)]
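The projection step in this example turns one transaction into its item-suffixes. A minimal sketch, assuming items are single characters already in a fixed order (as on the slides); the function name `item_suffix_projections` is hypothetical:

```python
def item_suffix_projections(transaction):
    """Return every suffix of the transaction, longest first."""
    items = list(transaction)
    return ["".join(items[i:]) for i in range(len(items))]

print(item_suffix_projections("abc"))  # ['abc', 'bc', 'c']
```

Each suffix is then handed to the SFI-tree of its first item, which is why the slide shows three SFI-tree-maintenance calls for a three-item transaction.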

  10. Example: W1 = {abc, bcd, acd}, T = bcd
      • Transaction projection (T): bcd → bcd, cd, d
      • SFI-tree-maintenance is called on bcd, cd, and d
      • [Figure: FI-list, the SFI-trees of a, b, c, and d, and their OFI-lists after inserting T; entries are labelled (item-id, esup, window-id, node link)]

  11. Example: W1 = {abc, bcd, acd}, T = acd
      • Transaction projection (T): acd → acd, cd, d
      • The slide shows the SFI-tree-maintenance call on acd
      • [Figure: FI-list, the SFI-trees of a, b, c, and d, and the a.OFI-list after inserting T; entries are labelled (item-id, esup, window-id, node link)]
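The three example slides can be replayed with a much-simplified stand-in for SFI-tree-maintenance. This sketch keeps only the esup counters: it omits window-ids, node-links, and OFI-lists, and it represents each tree as nested dictionaries. The names `new_node` and `insert_transaction` are hypothetical, and the real maintenance procedure may differ in detail.

```python
from collections import Counter

def new_node():
    # A node holds an estimated support and its children keyed by item-id.
    return {"esup": 0, "children": {}}

def insert_transaction(fi_list, trees, transaction):
    """Project the transaction into item-suffixes and insert each suffix
    into the tree of its first item, incrementing esup along the path."""
    items = list(transaction)
    for i in range(len(items)):
        suffix = items[i:]
        head = suffix[0]
        fi_list[head] += 1                      # one more transaction contains head
        node = trees.setdefault(head, new_node())
        node["esup"] += 1
        for item in suffix[1:]:
            node = node["children"].setdefault(item, new_node())
            node["esup"] += 1

fi_list = Counter()
trees = {}
for t in ["abc", "bcd", "acd"]:                 # basic window W1 from the slides
    insert_transaction(fi_list, trees, t)

print(dict(fi_list))  # {'a': 2, 'b': 2, 'c': 3, 'd': 2}
```

After the whole window is processed, each FI-list count equals the number of transactions in W1 that contain the item.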

  12. Pruning infrequent items from SFI-forest
      • X: a 1-itemset in the FI-list
      • If X.esup < X.CL × ε, then X and its supersets are deleted from the SFI-forest
      • Steps:
        • 1. Delete item-id.OFI-list, item-id.SFI-tree, and the entry with item-id from the FI-list
        • 2. Remove the infrequent item from the other OFI-lists by traversing the FI-list

  13. Pruning infrequent items from SFI-forest
      • Steps:
        • 3. Delete the infrequent item from the other SFI-trees
        • 4. Reconstruct the SFI-trees by reinserting the modified item-suffix transactions, or by joining the remaining subtrees into the SFI-tree
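The pruning rule and its four steps can be sketched on a flattened representation. Everything here is an assumption made for illustration: each SFI-tree is modelled as a `Counter` of item-suffix paths rather than a linked tree, OFI-lists are left out, the function name `prune` is hypothetical, and the 12-transaction stream is invented (abc, bd, ac twice, cd three times, d five times).

```python
from collections import Counter

def prune(fi_list, trees, cl, eps):
    """Drop every item X with X.esup < X.CL * eps and rebuild the
    remaining trees without it. Returns the set of pruned items."""
    infrequent = {x for x, esup in fi_list.items() if esup < cl[x] * eps}
    for x in infrequent:                 # step 1: drop the entry and its own tree
        del fi_list[x]
        trees.pop(x, None)
    for head in list(trees):             # steps 3-4: strip the item, reinsert paths
        rebuilt = Counter()
        for path, count in trees[head].items():
            kept = tuple(i for i in path if i not in infrequent)
            rebuilt[kept] += count       # paths that became identical are merged
        trees[head] = rebuilt
    return infrequent

# Hypothetical state after 12 transactions: abc, bd, ac x2, cd x3, d x5.
fi_list = {"a": 3, "b": 2, "c": 6, "d": 9}
cl = {x: 12 for x in fi_list}
trees = {
    "a": Counter({("a", "b", "c"): 1, ("a", "c"): 2}),
    "b": Counter({("b", "c"): 1, ("b", "d"): 1}),
    "c": Counter({("c",): 3, ("c", "d"): 3}),
    "d": Counter({("d",): 9}),
}

pruned = prune(fi_list, trees, cl, eps=0.2)
print(pruned)      # {'b'}
print(trees["a"])  # Counter({('a', 'c'): 3})
```

With CL = 12 and ε = 0.2 the threshold is 2.4, so item b (esup 2) is removed and the path a-b-c collapses into a-c, raising that path's count to 3.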

  14. Example: a.CL = b.CL = c.CL = d.CL = 12, s = 0.3, ε = 0.2
      • Pruning threshold: X.CL × ε = 12 × 0.2 = 2.4
      • [Figure: FI-list with entries (1,1,3), (2,1,2), (3,1,3), (4,1,3), plus the SFI-trees and OFI-lists of a, b, c, and d]

  15. Determining maximal frequent itemsets
      • There are k frequent 1-itemsets, e1, e2, …, ek, in the FI-list
      • o1, o2, …, oj are the items in the ei.OFI-list
      • Generate a candidate maximal frequent (j+1)-itemset, E = (ei, o1, o2, …, oj)
      • Start from the frequent item with the smallest estimated support
      • Traverse the path via the node-links to count E's estimated support

  16. Determining maximal frequent itemsets
      • If E.esup ≥ s × ei.CL, then E is a maximal frequent itemset (MFI)
      • Otherwise, enumerate the subsets of E with size |E| − 1 as new candidates
      • Repeat until the set of all maximal frequent itemsets with respect to entry ei is found
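The top-down search on slides 15 and 16 can be approximated as follows. This is a simplified variant under stated assumptions: support is counted by scanning a plain list of transactions instead of following node-links, candidates are visited level by level from the largest size down, and the five transactions are invented. The names `support` and `maximal_for_entry` are hypothetical.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))

def maximal_for_entry(e, others, transactions, min_count):
    """Maximal frequent itemsets that contain entry e, searching from the
    full candidate (e, o1, ..., oj) down to smaller subsets."""
    found = []
    for size in range(len(others), -1, -1):
        for rest in combinations(others, size):
            cand = frozenset((e,) + rest)
            if any(cand <= m for m in found):
                continue                  # covered by a larger frequent itemset
            if support(cand, transactions) >= min_count:
                found.append(cand)
    return found

# Hypothetical stream with CL = 5; s = 0.3 gives a threshold of 1.5.
transactions = ["abc", "bcd", "acd", "cd", "bc"]
threshold = 0.3 * len(transactions)

print(maximal_for_entry("b", ["c", "d"], transactions, threshold))
```

For entry b the full candidate bcd has support 1, which is below 1.5, so the search falls back to the 2-itemsets and keeps bc. The result is maximal only with respect to the given entry; comparing results across entries is not covered here.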

  17. Example: a.CL = b.CL = c.CL = d.CL = 5, s = 0.3, ε = 0.2
      • Frequency threshold: s × CL = 5 × 0.3 = 1.5
      • Calculate support (bc); calculate support (bcd) = 1
      • [Figure: FI-list with entries (1,1,3), (2,1,2), (3,1,3), (4,1,3), plus the SFI-trees and OFI-lists of a, b, c, and d]

  18. Sliding Window Mining over Data Streams
      • Modifications:
        • Use the DSM-MFI algorithm to construct an SFI-forest i for each basic window Wi
        • Find the local maximal frequent itemsets (local MFI i); all local MFIs are stored in a queue
        • The global MFI-list stores all local MFIs from W1 to WN

  19. Sliding Window Mining over Data Streams
      • When basic window N+1 arrives:
        • Remove local MFI 1 from the queue
        • Subtract the support of local MFI 1 from the global MFI-list
        • Use the DSM-MFI algorithm to mine all local maximal frequent itemsets of WN+1
        • Increase the support of the matching entries in the global MFI-list, or insert local MFI N+1 into it
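The queue-and-subtract bookkeeping described on slides 18 and 19 can be sketched with a `deque` and a `Counter`. The class name `SlidingMFI`, its method, and the rule of dropping entries whose support falls to zero are assumptions for illustration; mining the local MFIs of each window is taken as given and passed in as a dictionary.

```python
from collections import Counter, deque

class SlidingMFI:
    def __init__(self, max_windows):
        self.max_windows = max_windows  # number of basic windows kept
        self.queue = deque()            # local MFI sets, oldest first
        self.global_mfi = Counter()     # itemset -> summed support

    def add_window(self, local_mfi):
        """local_mfi maps each local maximal frequent itemset (a frozenset)
        to its support in the newly arrived basic window."""
        if len(self.queue) == self.max_windows:
            oldest = self.queue.popleft()
            for itemset, sup in oldest.items():
                self.global_mfi[itemset] -= sup
                if self.global_mfi[itemset] <= 0:
                    del self.global_mfi[itemset]   # assumption: drop empty entries
        self.queue.append(local_mfi)
        for itemset, sup in local_mfi.items():
            self.global_mfi[itemset] += sup        # increase or insert

bc, ac = frozenset("bc"), frozenset("ac")
miner = SlidingMFI(max_windows=2)
miner.add_window({bc: 3})
miner.add_window({bc: 1, ac: 2})
miner.add_window({ac: 1})          # the first window slides out here

print(dict(miner.global_mfi))
```

After the third call the contribution of the first window (bc: 3) has been subtracted, leaving bc with support 1 and ac with support 3.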

  20. Experiment
      • 1 GHz IBM x24, 384 MB, Visual C++ 6.0
      • s = 0.1%, ε = 0.01%
      • IBM synthetic datasets: T10.I5.D1000K and T30.I20.D1000K
      • The data is broken into 20 basic windows to simulate streaming data
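Breaking a finite dataset into basic windows, as in this setup, might look like the sketch below. The slide only states the number of windows; splitting into consecutive, near-equal parts is an assumption, the function name `basic_windows` is hypothetical, and a list of integers stands in for the 1000K transactions.

```python
def basic_windows(transactions, n_windows):
    """Yield n_windows consecutive slices whose sizes differ by at most one."""
    size, extra = divmod(len(transactions), n_windows)
    start = 0
    for i in range(n_windows):
        end = start + size + (1 if i < extra else 0)
        yield transactions[start:end]
        start = end

data = list(range(1_000_000))          # stand-in for a D1000K dataset
wins = list(basic_windows(data, 20))
print(len(wins), len(wins[0]))         # 20 50000
```

Feeding the slices to the miner one at a time reproduces the arrival order of a stream without needing a live data source.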

  21. Experiment
