
Model Maintenance in Dynamic Environments

Venkatesh Ganti (joint work with Raghu Ramakrishnan, Johannes Gehrke, Mong Li Lee)


Presentation Transcript


  1. Model Maintenance in Dynamic Environments
     Venkatesh Ganti (joint work with Raghu Ramakrishnan, Johannes Gehrke, Mong Li Lee)

  2. Mining Environment
     • Data repository for analysis
       • Data mining models
         • Frequent itemsets
         • Decision trees
         • Clusters
         • …
       • OLAP
         • Aggregate queries
     • Repository updated regularly
     • Query workloads change
     [Figure labels: Data Warehouse, Data Mining, OLAP]
     DEMON

  3. Two Parts of this Talk
     • Model maintenance: maintaining models under systematic data evolution [ICDE 00]
     • Tuning samples: maintaining samples for approximate query answering with respect to changing query workloads [VLDB 00]

  4. Systematic Block Evolution
     • Data warehouses are updated with blocks of new data
     • Block: a set of tuples appended simultaneously to the data warehouse
     • Result: a sequence of database snapshots
     [Figure: block d appended to database D, giving D+d]

  5. Model Maintenance: Objective
     • Allow selection of interesting time-varying subsets to be modeled
     • Low response time to get the updated model
     • Interesting classes of models
       • Frequent itemsets (LITS)
       • Clusters
       • Decision trees (DT)

  6. Subset selection: Data Span
     • Span of interest
       • Everything until now: unrestricted window
       • Recently collected: most recent window
     • Unrestricted Window (UW)
       • Model the entire database
     [Figure: M(D1+D2+D3) over blocks D1, D2, D3; once D4 arrives, M(D1+D2+D3+D4) over D1, D2, D3, D4]

  7. Data Span (contd.)
     • Most Recent Window (MRW) of size w
       • E.g., model data collected in the last 3 days
     [Figure: sliding-window models M(D1+D2+D3), M(D2+D3+D4), M(D3+D4+D5) as blocks D4 and D5 arrive]

  8. Block Selection Sequence
     • Maintain models on data collected on alternate days within the last 4 weeks
     • Requires fine-grained selection
     • Block selection sequence (BSS)
       • A 0/1 sequence: a bit for each block in the data span
       • 1: the block is selected for modeling
       • 0: the block is not selected for modeling

  9. BSS: UW
     • A sequence of 0/1 bits, one for each block in the entire database
     • E.g., select all blocks collected on alternate days
     [Figure: BSS 1 0 1 0 1 over blocks D1, D2, D3, D4, D5]

  10. BSS: MRW
      • Two types of BSS w.r.t. MRW
        • Window-independent
        • Window-relative
      • Model data collected on Mondays within the last 4 weeks
        • BSS: (1000000)*
      [Figure: BSS 1 0…0 1 0…0 1 ... over D1, D2-D7, D8, D9-D14, D15; model M(D1+D8), then M(D1+D8+D15) once D15 arrives]

  11. BSS: MRW (contd.)
      • Window-relative BSS
        • Model all data collected on alternate days from the start in a window of size 3
        • BSS: 101
      • Here, each successive subset is disjoint from its predecessor
      [Figure: BSS [1 0 1] applied in turn to the windows D1-D3, D2-D4, and D3-D5]
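
The two kinds of BSS over a most recent window can be compared with a small sketch. The function names are hypothetical, blocks are identified by their 1-based index, and the window-independent BSS is assumed to be a repeating pattern in the spirit of (1000000)*; none of this is taken from the authors' code.

```python
def window(t, w):
    """Indices of the blocks in the most recent window of size w at time t."""
    return list(range(max(1, t - w + 1), t + 1))


def select_window_independent(t, w, pattern):
    """Bit for block Di is fixed by its absolute index (pattern repeats)."""
    return [i for i in window(t, w) if pattern[(i - 1) % len(pattern)] == 1]


def select_window_relative(t, w, bss):
    """Bit for a block is fixed by its position inside the current window."""
    return [i for pos, i in enumerate(window(t, w)) if bss[pos] == 1]


if __name__ == "__main__":
    for t in (3, 4, 5):
        print(t,
              select_window_independent(t, 3, [1, 0]),
              select_window_relative(t, 3, [1, 0, 1]))
```

With the window-relative BSS 101 and w=3, the selected blocks move from D1 and D3 to D2 and D4 to D3 and D5, so each subset is disjoint from the one before it. An unrestricted window corresponds to calling `select_window_independent` with w equal to t.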

  12. Model Maintenance: Enumeration
      [Table: columns LITS, Clustering, DT; rows UW:BSS, MRW:BSS]
      Note on MRW:BSS: includes both window-independent and window-relative block selection sequences.

  13. Model Maintenance: Algorithms
      [Table: columns LITS, Clustering, DT; rows UW:BSS, MRW:BSS; cell content GEMM(A)]
      GEMM: GEneric Model Maintenance
      • Algorithm for any class of models that has an incremental maintenance algorithm A under tuple insertions

  14. Maintenance under Insertions
      • Algorithm A
        • Input: old dataset D, old model M(D), a block of tuples d appended to D
        • Output: M(D+d) = A(D, d, M(D))
      • Such algorithms exist for
        • Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
        • Clusters (BIRCH)
        • Decision trees (BOAT)
      • Note: We do NOT require A to handle deletions!

  15. GEMM
      • Input
        • Data span (and window size for MRW)
        • BSS
        • A model-update algorithm A under tuple insertions (deletions not required)
      • Output
        • An efficient model maintenance algorithm

  16. GEMM: MRW
      • Assume BSS is a sequence of 1’s and w=3
      • We already know parts of future windows
      [Figure: at T3, M(D1+D2+D3); at T4, M(D2+D3+D4); at T5, M(D3+D4+D5)]

  17. GEMM: MRW (contd.)
      Idea: Start building models for future windows
      E.g., at T3 we maintain models on
      • <D1+D2+D3> (model required for window at T3)
      • <D2+D3> (partial model for window at T4)
      • <D3> (partial model for window at T5)
      Models at T4
      • M<D2+D3+D4> (model required for window at T4; updated immediately)
      • M<D3+D4> (partial model for window at T5; updated offline)
      • M<D4> (partial model for window at T6; updated offline)

  18. GEMM: Arbitrary BSS
      BSS: 1 0 1 0 1 ...
      • T3: model on <1.D1 + 0.D2 + 1.D3>
      • T4: D4 is appended; model on <0.D2 + 1.D3 + 0.D4>
      • T5: D5 is appended; model on <1.D3 + 0.D4 + 1.D5>
      Idea: We still know parts of future windows and the corresponding BSS for each of them
      E.g., at T3 we maintain models on
      • <1.D1+0.D2+1.D3> (model required at T3)
      • <0.D2+1.D3> and <1.D3> (identical)
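
The bookkeeping behind GEMM for a most recent window can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `Gemm` and the function `update` are hypothetical, the BSS is window-independent and given as a function `bit(i)` of the block index, and a `Counter` of items stands in for a real model with a real insert-only algorithm A.

```python
from collections import Counter


def update(model, block):
    """Stand-in for algorithm A: build M(D+d) from M(D) and block d.

    Only insertions are needed; nothing is ever deleted from a model.
    """
    new_model = Counter(model)
    new_model.update(block)
    return new_model


class Gemm:
    """Keeps w models: one for the current window, w-1 for future windows."""

    def __init__(self, w, bit):
        self.w = w
        self.bit = bit  # bit(i) -> 0/1 for block Di (assumed window-independent)
        self.t = 0
        self.models = [Counter() for _ in range(w)]

    def append(self, block):
        """Append block D(t+1) and return the model for the new window."""
        self.t += 1
        # The old model for the next window becomes the current one; a fresh
        # empty model is started for the window that ends w-1 steps ahead.
        shifted = self.models[1:] + [Counter()]
        if self.bit(self.t):
            # Only shifted[0] is needed right away; a real system could
            # defer the remaining updates and run them offline.
            shifted = [update(m, block) for m in shifted]
        self.models = shifted
        return self.models[0]


if __name__ == "__main__":
    gemm = Gemm(3, lambda i: i % 2)  # BSS 1 0 1 0 1 ...
    for block in (["a"], ["b"], ["c"], ["d"], ["e"]):
        print(dict(gemm.append(block)))
```

With a BSS of all 1's and w=3, the three models after the third block cover D1+D2+D3, D2+D3, and D3, matching the T3 example. Because a block that leaves the window simply belongs to a model that is discarded, A never has to handle deletions.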

  19. GEMM: Resource Requirements
      • Response time to new model
        • Updating one model with the new block
        • Other updates offline
        • Depends on the incremental algorithm
      • Space requirements
        • At most w models
        • Space required for a model is orders of magnitude less than that for data!

  20. Maintenance under Insertions
      • Algorithm A
        • Input: old dataset D, old model M(D), a block of tuples d appended to D
        • Output: M(D+d) = A(D, d, M(D))
      • Such algorithms exist for
        • Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
        • Clusters (BIRCH)
        • Decision trees (BOAT)
      • Note: We do NOT require A to handle deletions!

  21. Frequent Itemset Models
      • Set of customer transactions
      • Frequent itemset: a set of items purchased together by “many” customers
      Example: minimum frequency threshold = 50%; {b}, {c}, {a,c} are frequent itemsets

  22. Incremental Algorithm [FAAM97, TBAR97]
      • Input
        • Old dataset
        • Old set of frequent itemsets
        • New block D4
      • Steps
        • Detect if new itemsets become frequent
        • Count frequencies of a small number of itemsets
          • Current algorithms scan (D1+D2+D3) completely
        • Update model…
      [Figure: new block D4 appended to D1, D2, D3]

  23. ECUT: New Counting Algorithm
      • Transformed data representation
      • Within each block Di
        • item x: sorted list of transaction identifiers containing “x”, written TID-list(x)
      Example
      • TID-list(a) = {1}
      • TID-list(b) = {2,3}
      • TID-list(c) = {1,2}
      • Count({a,b}) = |TID-list(a) ∩ TID-list(b)|
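
The counting step on this slide amounts to intersecting sorted TID-lists. The sketch below illustrates that idea with the three TID-lists from the slide; the function names are hypothetical and the code is not the authors' ECUT implementation.

```python
def intersect_sorted(xs, ys):
    """Intersect two sorted TID-lists with a linear two-pointer merge."""
    out = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


def count_in_block(tid_lists, itemset):
    """Count the transactions in one block that contain every item.

    tid_lists maps an item to its sorted TID-list; itemset must be non-empty.
    """
    items = sorted(itemset)
    tids = tid_lists.get(items[0], [])
    for item in items[1:]:
        tids = intersect_sorted(tids, tid_lists.get(item, []))
    return len(tids)


def count_itemset(blocks, itemset):
    """Total count over all selected blocks (one TID-list table per block)."""
    return sum(count_in_block(b, itemset) for b in blocks)


# TID-lists from the slide, for a single block.
block = {"a": [1], "b": [2, 3], "c": [1, 2]}

if __name__ == "__main__":
    print(count_in_block(block, {"a", "b"}))
    print(count_in_block(block, {"a", "c"}))
```

Only the TID-lists of the items in the itemset are read, which is why this representation avoids a complete scan of the old blocks.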

  24. Experimental Comparison

  25. Comparing Count Times

  26. Summary of the first part
      [Table: columns LITS, Clustering, DT; rows UW:BSS, MRW:BSS; cell content GEMM]
      • Maintenance algorithms under tuple insertions
        • Frequent itemsets: ECUT, ECUT+
        • Clusters: BIRCH
        • Decision trees: BOAT

  27. Second Part of this Talk
      • Model maintenance: maintaining models under systematic data evolution [ICDE 00]
      • Maintaining samples with respect to changing query workloads [VLDB 00]

  28. Random Samples for AQUA
      • All tuples in R are assumed to be equally important while drawing S(R)
      • In practice, queries exhibit locality
      • Consequence: S(R) wastes precious real estate
      [Figure: typical AQUA approach; an aggregate query Q gets an exact answer from R or an approximate answer from a uniform random sample S(R)]

  29. Problem
      • Given
        • Relation R
        • Workload W: Q1,…,Qn
      • Goal: Dynamically tune “random sample of R” w.r.t. W
      • Model to be maintained: a simple random sample

  30. ICICLES
      • R(Q): set of tuples in R required to answer Q
      • Random sample of R ∪ R(Q1) ∪ … ∪ R(Qn)
      • Tuples required often are more likely to be in SW(R)
      [Figure: a uniform random sample drawn over R, R(Q1), …, R(Qn) gives the icicle SW(R)]
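
Why frequently required tuples end up over-represented can be seen in a small sketch. It assumes the union keeps duplicates, so a tuple touched by many queries occupies many slots, and it materializes that union in memory, which a real system would avoid. The function name `icicle_sample` is hypothetical and this is not the authors' maintenance algorithm.

```python
import random


def icicle_sample(relation, query_results, k, rng):
    """Draw k slots uniformly from R plus one copy of R(Qi) per query.

    relation: list of tuples in R.
    query_results: list of lists, the tuples each query in the workload needed.
    The same tuple may appear more than once in the returned sample.
    """
    extended = list(relation)
    for result in query_results:
        extended.extend(result)
    return rng.sample(extended, k)


if __name__ == "__main__":
    rng = random.Random(0)
    R = list(range(20))
    workload = [[0, 1, 2]] * 10  # ten queries that all touch tuples 0, 1, 2
    print(icicle_sample(R, workload, 5, rng))
```

With an empty workload this reduces to a uniform random sample of R; as queries keep touching the same tuples, those tuples take up a growing share of the slots being sampled.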

  31. Mail Order Dataset

  32. Conclusions and Future Work
      [Figure labels: static dataset, dynamic dataset; workload indifferent, workload sensitive]

  33. Questions?
