
Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Chih-Hsiang Lin, Ding-Ying Chiu, Yi-Hung Wu, Arbee L. P. Chen. 2005/9/23. Presenter: 董原賓.



  1. Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window. Chih-Hsiang Lin, Ding-Ying Chiu, Yi-Hung Wu, Arbee L. P. Chen. 2005/9/23. Presenter: 董原賓

  2. The Characteristics of data streams • Continuity: Data continuously arrive at a high rate • Expiration: Data can be read only once • Infinity: The total amount of data is unbounded

  3. The requirements of data streams • Time-sensitivity: the model must adapt to the passage of time in a continuous data stream • Approximation: past data cannot be stored, so results can only be approximate • Adjustability: because the amount of data is unbounded, a mechanism that adapts to the available resources is needed

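  4. Definition • t: time point • p: time period • Basic block B: the transactions arriving in [t-p+1, t]; the basic block numbered i is denoted B_i • |W|: length of the window, in basic blocks • θ: support threshold • (Figure: a basic block B holding the transactions abc, ac, acd over the period p, from t-p+1 to t on the time axis)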

  5. Definition • TS: time-sensitive sliding window • TS_i: the TS that consists of the |W| consecutive basic blocks from B_{i-|W|+1} to B_i • Σ_i: the number of transactions in TS_i • (Figure: time axis with basic blocks B_{i-3} to B_{i+1}; with |W| = 3, TS_i holds Σ_i = 5 transactions)
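
A minimal Python sketch of these definitions, assuming each basic block is simply a list of transactions. The class name `TimeSensitiveWindow` and its methods are illustrative and not taken from the slides; the block contents are borrowed from the worked example later in the transcript.

```python
from collections import deque


class TimeSensitiveWindow:
    """Keeps the |W| most recent basic blocks (hypothetical helper)."""

    def __init__(self, window_len):
        # A bounded deque drops the oldest block when a new one arrives.
        self.blocks = deque(maxlen=window_len)

    def push_block(self, transactions):
        self.blocks.append(list(transactions))

    def size(self):
        """Sigma_i: number of transactions currently in the window."""
        return sum(len(block) for block in self.blocks)


ts = TimeSensitiveWindow(window_len=3)
ts.push_block(["ab", "bcd", "ac", "abd", "bd", "acd"])  # B1
ts.push_block(["a", "bc", "ad", "a", "bc"])             # B2
ts.push_block(["b", "bd", "c"])                         # B3
ts.push_block(["ab", "abc", "ab", "ab"])                # B4 arrives, B1 expires
print(ts.size())  # 5 + 3 + 4 = 12
```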

  6. Time-sensitive sliding window • The buffer continuously consumes transactions and pours them block by block into the system • An accuracy guarantee is provided in one of two modes: no false dismissal (NFD, recall-oriented) or no false alarm (NFA, precision-oriented)

  7. New itemset insertion • Each frequent itemset is inserted into PFP (the potentially frequent-itemset pool) in the form (ID, Items, Acount, Pcount): a unique identifier, the items it contains, the accumulated count, and the potential count • Acount accumulates the itemset's exact support counts in the subsequent basic blocks, while Pcount estimates the maximum possible sum of its support counts in the past basic blocks

  8. New itemset insertion • Check whether each frequent itemset discovered in B_i is already kept in PFP. If it is, increase its Acount by its support count in B_i • Otherwise, create a new entry in PFP and estimate its Pcount as the largest integer less than θ×Σ_{i-1}
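
The insertion step can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: PFP is modeled as a dict keyed by the itemset (the ID field is omitted), the function names are hypothetical, and clamping Pcount at 0 for an empty previous window is my own choice. The numbers reproduce the step from B1 to B2 in the worked example, where Σ_1 = 6.

```python
import math
from fractions import Fraction


def estimate_pcount(theta, prev_window_size):
    """Largest integer strictly less than theta * Sigma_{i-1}.

    Clamping at 0 (empty previous window) is an assumption of this sketch.
    """
    return max(0, math.ceil(theta * prev_window_size) - 1)


def insert_frequent(pfp, frequent, theta, prev_window_size):
    """pfp: itemset -> [Acount, Pcount]; frequent: itemset -> count in B_i."""
    for items, count in frequent.items():
        if items in pfp:
            pfp[items][0] += count  # already kept: accumulate the exact count
        else:
            pfp[items] = [count, estimate_pcount(theta, prev_window_size)]
    return pfp


theta = Fraction(2, 5)  # 0.4, kept exact to avoid float rounding
pfp = {"a": [4, 0], "b": [4, 0], "c": [3, 0], "d": [4, 0], "bd": [3, 0]}
insert_frequent(pfp, {"a": 3, "b": 2, "c": 2, "bc": 2}, theta, prev_window_size=6)
print(pfp)
```

Using `Fraction` rather than a float matters here: "largest integer less than" is sensitive to rounding when θ×Σ is itself an integer.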

  9. Old itemset update • For each itemset that is in PFP (the potentially frequent-itemset pool) but not frequent in B_i, compute its support count in B_i by scanning the buffer and add it to its Acount • An itemset in PFP is deleted if the sum of its Acount and Pcount is less than θ×Σ_i
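
A hedged Python sketch of the update and deletion rule, assuming PFP is a dict from itemset string to `[Acount, Pcount]` and transactions are strings of single-character items. The function names are hypothetical. The input is the PFP state after B2's frequent itemsets have been inserted in the worked example, with Σ_2 = 11.

```python
from fractions import Fraction


def support_count(itemset, block):
    """Number of transactions in the block that contain every item of itemset."""
    return sum(1 for t in block if set(itemset) <= set(t))


def update_old_and_prune(pfp, frequent_in_block, block, theta, window_size):
    """Update itemsets that were not frequent in B_i, then drop weak entries."""
    for items in list(pfp):  # copy the keys so entries can be deleted safely
        if items not in frequent_in_block:
            pfp[items][0] += support_count(items, block)
        if sum(pfp[items]) < theta * window_size:  # Acount + Pcount too small
            del pfp[items]
    return pfp


theta = Fraction(2, 5)
b2 = ["a", "bc", "ad", "a", "bc"]
pfp = {"a": [7, 0], "b": [6, 0], "c": [5, 0], "d": [4, 0], "bd": [3, 0], "bc": [2, 2]}
update_old_and_prune(pfp, {"a", "b", "c", "bc"}, b2, theta, window_size=11)
print(pfp)  # bd (3) and bc (2 + 2) fall below 0.4 * 11 = 4.4
```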

  10. DT maintenance • Each itemset in PFP is inserted into DT (the discounting table) in the form (B_ID, ID, Bcount): the serial number of the current basic block, the itemset's identifier in PFP, and its support count in the current basic block
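
As an illustration, DT maintenance could look like the sketch below, assuming DT is a plain list of `(B_ID, ID, Bcount)` tuples and `pfp_ids` maps each kept itemset to its PFP identifier. Both names are hypothetical. The data are the PFP itemsets and block B3 from the worked example.

```python
def support_count(itemset, block):
    """Number of transactions in the block that contain every item of itemset."""
    return sum(1 for t in block if set(itemset) <= set(t))


def dt_maintenance(dt, pfp_ids, block_id, block):
    """Append one (B_ID, ID, Bcount) row per itemset currently kept in PFP."""
    for items, item_id in pfp_ids.items():
        dt.append((block_id, item_id, support_count(items, block)))
    return dt


b3 = ["b", "bd", "c"]
dt = dt_maintenance([], {"a": 1, "b": 2, "c": 3, "d": 4}, block_id=3, block=b3)
print(dt)  # itemset a gets a row with Bcount 0 because it does not occur in B3
```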

  11. Itemset discounting • Since the transactions in B_{i-|W|} expire, the support counts of the itemsets kept in PFP are discounted accordingly • If the itemset's Pcount is nonzero, subtract the support count thresholds of the expired basic blocks from Pcount • If Pcount is already 0, subtract the Bcount of the corresponding entry in DT from Acount • Each entry in DT where B_ID = i-|W| is removed
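
A rough Python sketch of the discounting rules, under several assumptions: PFP maps an identifier to `[Acount, Pcount]`, DT is a list of `(B_ID, ID, Bcount)` tuples, the amount to subtract from a nonzero Pcount is looked up by the caller and passed in as `pcount_discount`, and Pcount is clamped at 0. None of these choices is confirmed by the slides. The data correspond to the moment B1 expires in the worked example.

```python
from fractions import Fraction


def discount_expired(pfp, dt, expired_block, pcount_discount):
    """Discount PFP counts for one expired block and drop its DT rows."""
    bcounts = {item_id: bc for b_id, item_id, bc in dt if b_id == expired_block}
    for item_id, entry in pfp.items():
        if entry[1] > 0:
            # Nonzero Pcount: subtract the threshold amount (clamp is an assumption).
            entry[1] = max(0, entry[1] - pcount_discount)
        else:
            # Pcount already 0: subtract the exact count recorded in DT.
            entry[0] -= bcounts.get(item_id, 0)
    dt[:] = [row for row in dt if row[0] != expired_block]


pfp = {1: [7, 0], 2: [8, 0], 3: [6, 0], 4: [6, 0]}
dt = [(1, 1, 4), (1, 2, 4), (1, 3, 3), (1, 4, 4), (1, 5, 3),
      (2, 1, 3), (2, 2, 2), (2, 3, 2), (2, 4, 1),
      (3, 1, 0), (3, 2, 2), (3, 3, 1), (3, 4, 1)]
discount_expired(pfp, dt, expired_block=1, pcount_discount=Fraction(12, 5))
print(pfp)
```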

  12. TA update • TA (threshold array): dynamically compute the support count threshold θ×|B_i| for each basic block B_i and store it as an entry in the threshold array • Only |W|+1 entries are maintained in TA
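
One way to sketch the threshold array in Python is a bounded deque of |W|+1 entries. The class name `ThresholdArray` is hypothetical, and θ is held as an exact `Fraction` by choice. The block sizes 6, 5, 3, 4, 5 are those of B1 to B5 in the worked example.

```python
from collections import deque
from fractions import Fraction


class ThresholdArray:
    """Stores theta * |B_i| for the |W| + 1 most recent basic blocks."""

    def __init__(self, window_len, theta):
        self.theta = theta
        self.entries = deque(maxlen=window_len + 1)

    def update(self, block_size):
        self.entries.append(self.theta * block_size)


ta = ThresholdArray(window_len=3, theta=Fraction(2, 5))
for block_size in (6, 5, 3, 4, 5):
    ta.update(block_size)
print([float(x) for x in ta.entries])  # the entry for B1 has been dropped
```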

  13. Algorithm

  14. Example: t = 1, mining B1 (|W| = 3, block size = 1, θ = 0.4)
      • B1 = {ab, bcd, ac, abd, bd, acd}
      • Frequent: a(4) b(4) c(3) d(4) bd(3)
      • Infrequent: ab(2) ac(2) ad(2) bc(1) cd(2) bcd(1) abd(1) acd(1)
      • Steps: new itemset insertion, DT maintenance, TA update
      • PFP (ID, Items, Acount, Pcount): (1, a, 4, 0) (2, b, 4, 0) (3, c, 3, 0) (4, d, 4, 0) (5, bd, 3, 0)
      • DT (B_ID, ID, Bcount): (1, 1, 4) (1, 2, 4) (1, 3, 3) (1, 4, 4) (1, 5, 3)
      • TA: 2.4
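
The support counts for B1 can be checked with a brute-force Python sketch that enumerates every subset of every transaction. This is for illustration only and is not the mining method used in the paper; `mine_block` is a hypothetical name, and an itemset is treated as frequent when its count is at least θ×|B|, which matches the counts listed for the example blocks.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def mine_block(block, theta):
    """Return (frequent, all_counts) for one basic block by enumeration."""
    counts = Counter()
    for transaction in block:
        items = sorted(set(transaction))
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts["".join(combo)] += 1
    threshold = theta * len(block)
    frequent = {s: c for s, c in counts.items() if c >= threshold}
    return frequent, counts


b1 = ["ab", "bcd", "ac", "abd", "bd", "acd"]
frequent, counts = mine_block(b1, Fraction(2, 5))
print(sorted(frequent.items()))
```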

  15. Example: t = 2, mining B2 (|W| = 3, block size = 1, θ = 0.4)
      • B2 = {a, bc, ad, a, bc}
      • Frequent: a(3) b(2) c(2) bc(2)
      • Infrequent: d(1) ad(1)
      • Steps: new itemset insertion, old itemset update, DT maintenance, TA update
      • PFP before: (1, a, 4, 0) (2, b, 4, 0) (3, c, 3, 0) (4, d, 4, 0) (5, bd, 3, 0)
      • PFP after new itemset insertion: (1, a, 7, 0) (2, b, 6, 0) (3, c, 5, 0) (4, d, 4, 0) (5, bd, 3, 0) (6, bc, 2, 2)
      • PFP after old itemset update: (1, a, 7, 0) (2, b, 6, 0) (3, c, 5, 0) (4, d, 5, 0) (5, bd, 3, 0) (6, bc, 2, 2)
      • DT: (1, 1, 4) (1, 2, 4) (1, 3, 3) (1, 4, 4) (1, 5, 3) (2, 1, 3) (2, 2, 2) (2, 3, 2) (2, 4, 1)
      • TA: 2.4, 2
      • Pcount = (the rest of this formula is not captured in the transcript)

  16. Example: t = 3, mining B3 (|W| = 3, block size = 1, θ = 0.4)
      • B3 = {b, bd, c}
      • Frequent: b(2)
      • Infrequent: c(1) d(1) bd(1)
      • Steps: new itemset insertion, old itemset update, DT maintenance, TA update
      • PFP before: (1, a, 7, 0) (2, b, 6, 0) (3, c, 5, 0) (4, d, 5, 0)
      • PFP after new itemset insertion: (1, a, 7, 0) (2, b, 8, 0) (3, c, 5, 0) (4, d, 5, 0)
      • PFP after old itemset update: (1, a, 7, 0) (2, b, 8, 0) (3, c, 6, 0) (4, d, 6, 0)
      • DT: (1, 1, 4) (1, 2, 4) (1, 3, 3) (1, 4, 4) (1, 5, 3) (2, 1, 3) (2, 2, 2) (2, 3, 2) (2, 4, 1) (3, 1, 0) (3, 2, 2) (3, 3, 1) (3, 4, 1)
      • TA: 2.4, 2, 1.2

  17. Example: t = 4, mining B4 (|W| = 3, block size = 1, θ = 0.4); B1 leaves the sliding window
      • B4 = {ab, abc, ab, ab}
      • Frequent: a(4) b(4) ab(4)
      • Infrequent: c(1) abc(1)
      • Steps: itemset discounting, new itemset insertion, old itemset update, DT maintenance, TA update
      • PFP before: (1, a, 7, 0) (2, b, 8, 0) (3, c, 6, 0) (4, d, 6, 0)
      • PFP after itemset discounting: (1, a, 3, 0) (2, b, 4, 0) (3, c, 3, 0) (4, d, 2, 0)
      • PFP after new itemset insertion: (1, a, 7, 0) (2, b, 8, 0) (3, c, 3, 0) (4, d, 2, 0) (5, ab, 4, 5)
      • PFP after old itemset update: (1, a, 7, 0) (2, b, 8, 0) (3, c, 4, 0) (4, d, 2, 0) (5, ab, 4, 5)
      • DT: (2, 1, 3) (2, 2, 2) (2, 3, 2) (2, 4, 1) (3, 1, 0) (3, 2, 2) (3, 3, 1) (3, 4, 1) (4, 1, 4) (4, 2, 4) (4, 5, 4)
      • TA: 2.4, 2, 1.2, 1.6

  18. Example: t = 5, mining B5 (|W| = 3, block size = 1, θ = 0.4); B2 leaves the sliding window
      • B5 = {bc, bc, bc, bc, abc}
      • Frequent: b(5) c(5) bc(5)
      • Infrequent: a(1) ab(1) abc(1)
      • Steps: itemset discounting, new itemset insertion, old itemset update, DT maintenance, TA update
      • PFP before: (1, a, 7, 0) (2, b, 8, 0) (5, ab, 4, 5)
      • PFP after itemset discounting: (1, a, 4, 0) (2, b, 6, 0) (5, ab, 4, 2.6)
      • PFP after new itemset insertion: (1, a, 4, 0) (2, b, 11, 0) (5, ab, 4, 2.6) (6, c, 5, 4) (7, bc, 5, 4)
      • PFP after old itemset update: (1, a, 5, 0) (2, b, 11, 0) (5, ab, 5, 2.6) (6, c, 5, 4) (7, bc, 5, 4)
      • DT: (3, 1, 0) (3, 2, 2) (3, 3, 1) (3, 4, 1) (4, 1, 4) (4, 2, 4) (4, 5, 4) (5, 1, 1) (5, 2, 5) (5, 5, 1) (5, 6, 5) (5, 7, 5)
      • TA: 2, 1.2, 1.6, 2

  19. Self-adjusting discounting table • In this approach, DT often consumes most of the memory space. When the space limit is reached, an efficient way to reduce the DT size without losing too much accuracy is required

  20. Selective adjustment • Each entry DT_k takes the new form (B_ID, ID, Bcount, AVG, NUM, Loss) • DT_k.AVG keeps the average of the support counts of all the itemsets merged into DT_k, DT_k.NUM is the number of itemsets in DT_k, and DT_k.Loss records the merging loss of merging DT_k with DT_{k-1}

  21. Selective adjustment • The main idea is to select the entry with the smallest merging loss, called the victim, and merge it into the entry above it

  22. Merging loss and new Bcount
      • For k > 1 and DT_k.B_ID = DT_{k-1}.B_ID:
      • Under NFD (no false dismissal) mode: Bcount = min{DT_k.Bcount, DT_{k-1}.Bcount} and DT_k.Loss = (DT_k.NUM × DT_k.AVG + DT_{k-1}.NUM × DT_{k-1}.AVG) - min{DT_k.Bcount, DT_{k-1}.Bcount} × (DT_k.NUM + DT_{k-1}.NUM)
      • Under NFA (no false alarm) mode: Bcount = max{DT_k.Bcount, DT_{k-1}.Bcount} and DT_k.Loss = max{DT_k.Bcount, DT_{k-1}.Bcount} × (DT_k.NUM + DT_{k-1}.NUM) - (DT_k.NUM × DT_k.AVG + DT_{k-1}.NUM × DT_{k-1}.AVG)
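
The two formulas translate directly into a small Python function. The `Entry` dataclass and the name `merging_loss` are illustrative, and the B_ID check is assumed to have been done by the caller.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    bcount: float
    avg: float
    num: int


def merging_loss(lower, upper, mode="NFD"):
    """Return (new Bcount, loss) for merging DT_k (lower) into DT_{k-1} (upper)."""
    total = lower.num * lower.avg + upper.num * upper.avg
    size = lower.num + upper.num
    if mode == "NFD":
        bcount = min(lower.bcount, upper.bcount)  # never overstate a count
        return bcount, total - bcount * size
    if mode == "NFA":
        bcount = max(lower.bcount, upper.bcount)  # never understate a count
        return bcount, bcount * size - total
    raise ValueError("mode must be 'NFD' or 'NFA'")


print(merging_loss(Entry(13, 13, 1), Entry(12, 12, 1), mode="NFD"))
print(merging_loss(Entry(13, 13, 1), Entry(12, 12, 1), mode="NFA"))
```

In both modes the loss is non-negative, since the merged Bcount is the minimum (NFD) or maximum (NFA) of the counts being averaged.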

  23. Example (DT_limit = 4, NFD mode; entries are (B_ID, ID, Bcount, AVG, NUM, Loss))
      • Loss formula: (DT_k.NUM × DT_k.AVG + DT_{k-1}.NUM × DT_{k-1}.AVG) - min{DT_k.Bcount, DT_{k-1}.Bcount} × (DT_k.NUM + DT_{k-1}.NUM)
      • DT before: (1, 1, 12, 12, 1, ∞) (1, 3, 13, 13, 1, 1) (1, 4, 2, 2, 1, 11) (1, 5, 10, 10, 1, 8)
      • New entry to insert: (1, 6, 10)
      • Entry 3: Loss = (1×13 + 1×12) - min{13, 12} × (1+1) = 25 - 24 = 1, the smallest, so entry 3 is the victim and is merged into entry 1 with AVG = (12×1 + 13×1) / (1+1) = 12.5
      • Entry 4 after the merge: Loss = (1×2 + 2×12.5) - min{2, 12} × (1+2) = 27 - 6 = 21
      • Entry 6: Loss = (1×10 + 1×10) - min{10, 10} × (1+1) = 20 - 20 = 0
      • DT after: (1, {1,3}, 12, 12.5, 2, ∞) (1, 4, 2, 2, 1, 21) (1, 5, 10, 10, 1, 8) (1, 6, 10, 10, 1, 0)
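
The example can be replayed with the Python sketch below. It assumes NFD mode, that all entries share the same B_ID (so the B_ID check is skipped), and that each DT entry is a dict; the function names are hypothetical.

```python
import math


def nfd_loss(upper, lower):
    """NFD merging loss of merging `lower` into the entry above it."""
    total = upper["num"] * upper["avg"] + lower["num"] * lower["avg"]
    merged = min(upper["bcount"], lower["bcount"])
    return total - merged * (upper["num"] + lower["num"])


def refresh_losses(dt):
    for k, entry in enumerate(dt):
        entry["loss"] = math.inf if k == 0 else nfd_loss(dt[k - 1], entry)


def merge_victim(dt):
    """Merge the entry with the smallest loss into the entry above it."""
    k = min(range(1, len(dt)), key=lambda j: dt[j]["loss"])
    upper, victim = dt[k - 1], dt.pop(k)
    num = upper["num"] + victim["num"]
    upper["avg"] = (upper["num"] * upper["avg"] + victim["num"] * victim["avg"]) / num
    upper["num"] = num
    upper["bcount"] = min(upper["bcount"], victim["bcount"])
    upper["ids"] = upper["ids"] + victim["ids"]
    refresh_losses(dt)


def insert_entry(dt, item_id, bcount, limit):
    if len(dt) >= limit:  # table is full: make room first
        merge_victim(dt)
    dt.append({"ids": [item_id], "bcount": bcount, "avg": bcount, "num": 1})
    refresh_losses(dt)


dt = [{"ids": [i], "bcount": b, "avg": b, "num": 1}
      for i, b in [(1, 12), (3, 13), (4, 2), (5, 10)]]
refresh_losses(dt)
insert_entry(dt, item_id=6, bcount=10, limit=4)
for entry in dt:
    print(entry)
```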

  24. Experiment • Intel Pentium-M 1.3GHz CPU • 256 MB main memory • Microsoft Windows XP Professional • The datasets streaming into this system are synthesized via the IBM data generator

  25. Experiment

  26. Experiment
