Online Mining of Frequent Query Trees over XML Data Streams


  1. Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science National Chiao-Tung University Hsinchu, Taiwan 300, R.O.C. http://www.csie.nctu.edu.tw/~hfli/ *: corresponding author hfli@csie.nctu.edu.tw

  2. Outline • Introduction • Mining of Data Streams, Tree Mining • Problem Definition • Online Mining of Frequent Query Trees over XML Data Streams • The Proposed Algorithm • FQT-Stream (Frequent Query Trees of Streams) • Conclusions and Future Work

  3. Mining of Data Streams: Motivations • Many applications generate data streams • Day-to-day business (credit card, ATM transactions, etc.) • Hot Web services (XML data, record and click streams) • Telecommunication (call records) • Financial market (stock exchange) • Surveillance (sensor network, audio/video) • System management (network events) • Application characteristics • Massive volumes of data (several terabytes) • Records arrive at a rapid rate • Data distribution changes on the fly • What do we want to get from data streams? • Real-time query answering, statistics, and pattern discovery

  4. Mining of Data Streams: Computation Model • [Figure: Data Streams, Buffer, Stream Mining Processor, Synopsis in Memory, (Approximate) Results] • Requirements of Mining Data Streams • Single pass: each record is examined at most once • Bounded storage: limited memory for storing the synopsis • Real-time: per-record processing time (to maintain the synopsis) must be low

  5. Problem Definition of Frequent Query Tree Mining (1/2) • XML Query Tree Stream (XQTS) • A sequence of query trees (QTs) • QT1, QT2, …, QTN • N is the tree id of the latest incoming query tree • Support of a Query Tree QTi • sup(QTi): the number of QTs in XQTS containing QTi as a subtree

  6. Problem Definition of Frequent Query Tree Mining (2/2) • A QTi is a Frequent Query Tree (FQT) • if and only if sup(QTi) ≥ sN • s is a user-defined minimum support threshold in the range of [0, 1] • Our Task • To mine the set of all frequent query trees (FQTs) by one scan of the XQTS • Using as little memory as possible
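
The frequency condition on this slide is a single comparison against s times N. A minimal sketch, assuming the support has already been counted elsewhere; the function name `is_frequent` and the sample numbers are hypothetical and do not come from the slides:

```python
def is_frequent(support: int, s: float, n: int) -> bool:
    """Return True when sup(QT) >= s * N, the FQT condition from the slide."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    return support >= s * n


# Hypothetical numbers: with s = 0.5 and N = 10 trees seen so far,
# a query tree needs a support of at least 5 to be frequent.
print(is_frequent(5, 0.5, 10))
print(is_frequent(4, 0.5, 10))
```

Because N grows with every incoming query tree, a tree that is frequent now may fall below the threshold later unless its support keeps growing.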

  7. Proposed Algorithm FQT-Stream (Frequent Query Trees of Streams) • FQT-Stream consists of 5 phases • 1. read a QT (Query Tree) from the buffer in the main memory • 2. transform the QT into a new NQTS (Normalized Query Tree Sequence) representation • 3. construct an in-memory summary data structure called FQT-forest (a forest of Frequent Query Trees) by projecting the NQTSs • 4. prune the infrequent query trees from the FQT-forest • 5. find the set of all FQTs (Frequent Query Trees) from the current FQT-forest • Since phase 1 is straightforward, we focus on phases 2-5

  8. Phase 2 of FQT-Stream: NQTS Transformation • NQTS Transformation of QT • Using DFS on the QT • A sequence of triples (node-id, level, order) • level: the level of the node in the QT • order: the position of the node in the NQTS • For example (5-NQTS in Figure 1)
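
The transformation can be sketched as a pre-order DFS that numbers nodes as it visits them. This is an illustration under assumptions, not the authors' code: the adjacency-dict tree format, the function name `to_nqts`, and the tree shape (inferred from the levels in the 5-NQTS quoted on slide 10, since Figure 1 is not in the transcript) are all hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, int, int]  # (node-id, level, order)


def to_nqts(tree: Dict[str, List[str]], root: str) -> List[Triple]:
    """Pre-order DFS that emits one (node-id, level, order) triple per node.

    Assumes node-ids are unique within the tree (an illustration-only
    simplification); the root has level 0 and order starts at 1.
    """
    nqts: List[Triple] = []
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        nqts.append((node, level, len(nqts) + 1))
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(tree.get(node, [])):
            stack.append((child, level + 1))
    return nqts


# Assumed tree: A has children B and C; B has children D and E.
example_tree = {"A": ["B", "C"], "B": ["D", "E"]}
print(to_nqts(example_tree, "A"))
```

For the assumed tree this yields the sequence <(A, 0, 1), (B, 1, 2), (D, 2, 3), (E, 2, 4), (C, 1, 5)>.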

  9. Phase 3 of FQT-Stream: FQT-forest Construction (1/4) • For each NQTS, two steps are performed to construct the FQT-forest • Step 1: enumerate each NQTS into a set of sub-sequences using the Order-Break (OB) technique • OB is a level-wise method

  10. Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (2/4) • For example, a 5-NQTS = <(A, 0, 1), (B, 1, 2), (D, 2, 3), (E, 2, 4), (C, 1, 5)> • First, the 5-NQTS is broken into three 4-NQTSs • <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> • <(A, 0, 1), (B, 1, 2), (E, 2, 4), (C, 1, 5)> • <(A, 0, 1), (B, 1, 2), (D, 2, 3), (C, 1, 5)> • These sequences are 1-OB (One Order Break) • 1-OB sequences have one order break in the sequence order • The original 5-NQTS is called 0-OB
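
One way to read the example is that an order break is a gap in the order field between neighbouring triples. The sketch below counts such gaps and reproduces the three 4-NQTSs by deleting one interior node at a time. Both helper names are hypothetical, and the rule that the first and last triples are kept is read off the example, not stated on the slide.

```python
from typing import List, Tuple

Triple = Tuple[str, int, int]  # (node-id, level, order)


def count_order_breaks(seq: List[Triple]) -> int:
    """Number of gaps in the order field between neighbouring triples."""
    return sum(1 for a, b in zip(seq, seq[1:]) if b[2] - a[2] > 1)


def drop_one_interior(seq: List[Triple]) -> List[List[Triple]]:
    """Delete each interior node in turn, keeping the first and last triple."""
    return [seq[:i] + seq[i + 1:] for i in range(1, len(seq) - 1)]


five_nqts = [("A", 0, 1), ("B", 1, 2), ("D", 2, 3), ("E", 2, 4), ("C", 1, 5)]
for sub in drop_one_interior(five_nqts):
    print(count_order_breaks(sub), sub)
```

Under this reading the original 5-NQTS has zero breaks and each of the three 4-NQTSs has exactly one.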

  11. Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (3/4) • After deleting the duplicates • Three 4-NQTSs → Two 3-NQTSs with One Order Break • Two 3-NQTSs → One 2-NQTS • <(A, 0, 1), (E, 2, 4), (C, 1, 5)>, <(A, 0, 1), (B, 1, 2), (C, 1, 5)> → <(A, 0, 1), (C, 1, 5)> • Finally, the set of 1-OB contains 8 NQTSs

  12. Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (4/4) • The set of 2-OB is generated from the set of 1-OB • For example • 2-OB <(A, 0, 1), (D, 2, 3), (C, 1, 5)> is generated from 1-OB <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> • Repeat this process until no candidate k-OB remains • Property 1 • The maximum number of order breaks is k-3, i.e., (k-3)-OB, if the query tree has k nodes

  13. Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (1/3) • The OBs (0-OB, 1-OB, 2-OB) are projected and inserted into an FQT-forest using the Incremental Projection (IP) technique • An NQTS, <X1X2…Xi>, with i nodes is projected into i sub-NQTSs (also called node-suffix NQTSs) • <Xi>, <Xi-1Xi>, …, <X2…Xi>, <X1X2…Xi> • We use one field node-id to represent the fields (node-id, level, order) for simplicity

  14. Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (2/3) • Example of IP • 1-OB: <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> is projected into 4 node-suffix NQTSs as follows • <(C, 1, 5)> • <(E, 2, 4), (C, 1, 5)> • <(D, 2, 3), (E, 2, 4), (C, 1, 5)> • <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> • After projection, a tree structure check is performed • If the level of the first node in a node-suffix NQTS is not the smallest level • the node-suffix NQTS is deleted
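
The projection and the check on this slide can be sketched in a few lines. The names `node_suffixes` and `passes_tree_check` are hypothetical, and the check is implemented literally as worded on the slide (the first node must carry the smallest level in the suffix), which is an assumption about the intended rule.

```python
from typing import List, Tuple

Triple = Tuple[str, int, int]  # (node-id, level, order)


def node_suffixes(seq: List[Triple]) -> List[List[Triple]]:
    """All suffixes of the sequence, shortest first, as listed on the slide."""
    return [seq[i:] for i in range(len(seq) - 1, -1, -1)]


def passes_tree_check(suffix: List[Triple]) -> bool:
    """Keep a suffix only if its first node has the smallest level in it."""
    return suffix[0][1] == min(level for _, level, _ in suffix)


one_ob = [("A", 0, 1), ("D", 2, 3), ("E", 2, 4), ("C", 1, 5)]
kept = [sfx for sfx in node_suffixes(one_ob) if passes_tree_check(sfx)]
print(kept)
```

For the 1-OB sequence from the slide, the suffixes starting at E and at D fail the check because C sits at a smaller level, so only <(C, 1, 5)> and the full sequence remain.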

  15. Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (3/3) • After tree structure checking • The node-suffix NQTSs are inserted into the FQT-forest • Update the corresponding nodes' supports • FQT-forest consists of 2 parts • FN-list • A list of Frequent Nodes • Each node Xi in FN-list has an NQTS-tree (Xi.NQTS-tree) • NQTS-trees (trees of Normalized Query Tree Sequences) • A sequence (NQTS) is represented by a path • And its appearance frequency is maintained in the last node of the path
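
A prefix-tree layout is one plausible way to realise the two parts described here. The sketch below is an assumption, not the authors' data structure: `PathNode`, `FQTForest`, and `insert` are hypothetical names, each sequence is reduced to its node-ids (as slide 13 does for simplicity), and the count of a sequence is stored in the last node of its path.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PathNode:
    """One node of an NQTS-tree; count is the frequency of the path ending here."""
    count: int = 0
    children: Dict[str, "PathNode"] = field(default_factory=dict)


class FQTForest:
    """FN-list mapping each head node-id to the root of its NQTS-tree."""

    def __init__(self) -> None:
        self.fn_list: Dict[str, PathNode] = {}

    def insert(self, node_ids: List[str]) -> None:
        """Insert one node-suffix NQTS (non-empty) and bump the count at its last node."""
        head, *rest = node_ids
        node = self.fn_list.setdefault(head, PathNode())
        for node_id in rest:
            node = node.children.setdefault(node_id, PathNode())
        node.count += 1


forest = FQTForest()
forest.insert(["C"])
forest.insert(["A", "D", "E", "C"])
forest.insert(["A", "D", "E", "C"])
print(forest.fn_list["A"].children["D"].children["E"].children["C"].count)
```

Sequences that share a prefix share a path, so repeated arrivals of the same sequence only increment a counter.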

  16. Phase 4 of FQT-Stream: Infrequent Information Pruning • In order to guarantee the limited space requirement • Pruning Infrequent Information • Pruning steps • Check each node Xi in the FN-list of FQT-forest • If sup(Xi) < sN, delete Xi and its NQTS-tree • Check the other NQTS-trees to prune these infrequent nodes
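
The pruning steps can be sketched over a simplified layout. Everything here is an assumption for illustration: `supports` maps node-id to its support, `trees` maps node-id to a nested dict standing in for its NQTS-tree, and the sketch removes the whole subtree under an infrequent node, which the slide does not spell out.

```python
from typing import Dict


def _strip(tree: dict, infrequent: set) -> None:
    """Remove every subtree rooted at an infrequent node-id."""
    for child in list(tree):
        if child in infrequent:
            del tree[child]
        else:
            _strip(tree[child], infrequent)


def prune(supports: Dict[str, int], trees: Dict[str, dict], s: float, n: int) -> None:
    """Drop nodes with sup < s * N, their trees, and their occurrences elsewhere."""
    threshold = s * n
    infrequent = {x for x, sup in supports.items() if sup < threshold}
    for x in infrequent:
        del supports[x]
        trees.pop(x, None)
    for tree in trees.values():
        _strip(tree, infrequent)


# Hypothetical state after N = 6 query trees with s = 0.5 (threshold 3).
supports = {"A": 5, "B": 1, "C": 4}
trees = {"A": {"B": {"C": {}}, "C": {}}, "B": {"C": {}}, "C": {}}
prune(supports, trees, 0.5, 6)
print(supports, trees)
```

Removing the subtree under an infrequent node relies on the usual observation that a sequence containing an infrequent node cannot itself be frequent.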

  17. Phase 5 of FQT-Stream: Frequent Query Tree Mining • Assume that there are k frequent nodes, <X1, X2, …, Xk>, in the FN-list • FQT-Stream traverses each Xi.NQTS-tree (for i = 1, 2, …, k) in a DFS manner to find the sequences with prefix Xi whose estimated support is greater than or equal to sN • These frequent query trees are stored in a temporary list, called FQT-List
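
The traversal described on this slide can be sketched as an iterative DFS over each NQTS-tree. The nested-dict node layout (`count`, `children`) and the function name `mine` are hypothetical; the stored count is treated as the estimated support of the path ending at that node.

```python
from typing import Dict, List, Tuple


def mine(fn_list: Dict[str, dict], s: float, n: int) -> List[Tuple[Tuple[str, ...], int]]:
    """Collect every path whose count is at least s * N, in DFS order."""
    threshold = s * n
    fqt_list = []
    for head, root in fn_list.items():
        stack = [([head], root)]
        while stack:
            path, node = stack.pop()
            if node["count"] >= threshold:
                fqt_list.append((tuple(path), node["count"]))
            for child_id, child in node["children"].items():
                stack.append((path + [child_id], child))
    return fqt_list


# Hypothetical forest after N = 6 query trees with s = 0.5 (threshold 3).
fn_list = {
    "A": {"count": 4, "children": {
        "B": {"count": 3, "children": {
            "C": {"count": 1, "children": {}}}}}},
}
print(mine(fn_list, 0.5, 6))
```

The returned list plays the role of the FQT-List: it is rebuilt on demand from the current forest and does not need to be kept between requests.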

  18. Conclusions and Future Work • We propose an efficient one-pass algorithm FQT-Stream (Frequent Query Trees of Streams) • To find the set of all frequent query trees over the entire history of online XML data streams • Future Work • Online Mining of Frequent Query Trees over Sliding Windows