
Top-k Queries on Temporal Data

Top-k Queries on Temporal Data. Feifei Li (1), Ke Yi (2), Wangchao Le (1). (1) Florida State University; (2) Hong Kong University of Science & Technology. Problem definition: temporal data are data that change over time. Typical examples are stock traces and objects' trajectories.


Presentation Transcript


  1. Top-k Queries on Temporal Data • Feifei Li (1), Ke Yi (2), Wangchao Le (1) • (1) Florida State University • (2) Hong Kong University of Science & Technology

  2. Problem Def. • Temporal data are data that change over time. • Typical examples: - stock traces - objects' trajectories. [Figure: score plotted against time.]

  3. Problem Def. • For efficient storage, indexing, and querying, time series are often represented as piecewise linear functions; each such representation is called a Piecewise Linear Approximation (PLA). • Each PLA is called an object. [Figure: a PLA object with 4 line segments; axes are score versus time.]

  4. Problem Def. (cont.) • Ranking queries on temporal data: top-k queries at time instants. • Given a set of PLA objects {o_i | i = 1, …, n}, a time instant t, and an integer k, a top-k/t query retrieves the k objects that have the highest scores at time instant t.
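
The top-k/t definition above can be illustrated with a brute-force baseline. This is only a sketch, not the method proposed in the slides: it assumes each PLA object is stored as a list of (time, score) breakpoints with at least two entries and strictly increasing times, and the names `score_at` and `top_k_at` are hypothetical.

```python
import heapq
from bisect import bisect_right

def score_at(pla, t):
    """Score of one PLA object at time t, by linear interpolation.

    Assumption: pla is a list of (time, score) breakpoints, at least two,
    with strictly increasing times. Returns None outside the time span.
    """
    times = [p[0] for p in pla]
    if t < times[0] or t > times[-1]:
        return None
    # Index of the breakpoint that closes the piece containing t.
    j = min(bisect_right(times, t), len(pla) - 1)
    (t0, s0), (t1, s1) = pla[j - 1], pla[j]
    return s0 + (s1 - s0) * (t - t0) / (t1 - t0)

def top_k_at(objects, t, k):
    """Brute-force top-k/t: ids of the k objects with the highest score at t."""
    scored = []
    for oid, pla in objects.items():
        s = score_at(pla, t)
        if s is not None:
            scored.append((s, oid))
    return [oid for _, oid in heapq.nlargest(k, scored)]

objects = {
    "a": [(0, 1), (10, 11)],
    "b": [(0, 5), (5, 0), (10, 5)],
    "c": [(0, 3), (10, 3)],
}
print(top_k_at(objects, t=4, k=2))
```

This scan touches every object, which is the linear cost an index is meant to avoid.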

  5. State of the Art • Use an R-tree. R-tree recap: - Indexes multi-dimensional information - Linear space - Branch and bound with a priority queue - Does NOT have a worst-case query cost guarantee (linear scan in the worst case). • Treat each object as a trajectory - Break up each trajectory into segments - Build the R-tree on these segments • Use a kNN query at time t - Add an artificial query point that is high enough (example on the next slide).

  6. State of the Art (cont.) • kNN query at time t using the R-tree - Use the minimum snapshot distance (MinSTDist), the distance from q measured along time instant t. • Branch and bound with MinSTDist - Stop when the priority queue holds k objects whose MinSTDist values are smaller than those of all unseen objects.
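
The reduction from a top-k/t query to a kNN query can be sketched without any R-tree. The snippet assumes the scores at time t are already known and that the artificial query point q sits at a score `q_score` at least as high as every object's score; the function name and the dictionary layout are hypothetical, and the branch-and-bound search itself is not modelled.

```python
import heapq

def knn_from_high_point(scores_at_t, k, q_score):
    """k nearest objects to an artificial point q = (t, q_score).

    scores_at_t maps object id -> score at the query time t (assumed layout).
    Distance is measured along the time instant t, so it is q_score - score.
    """
    if any(s > q_score for s in scores_at_t.values()):
        raise ValueError("q_score must be at least as high as every score")
    dist = {oid: q_score - s for oid, s in scores_at_t.items()}
    return heapq.nsmallest(k, dist, key=dist.get)

scores = {"a": 5.0, "b": 1.0, "c": 3.0}
print(knn_from_high_point(scores, k=2, q_score=100.0))
```

Because q lies above every object, a smaller distance means a higher score, so the k nearest neighbours are exactly the top-k objects at t.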

  7. State of the Art (cont.) • Efficiency of the R-tree based approach - Linear space consumption - Handles queries on higher-dimensional problems • Deficiency of the R-tree based approach - No worst-case performance guarantee (build, query) - Current commercial DBMSs have limited support for R-trees

  8. Our contribution • We propose the seb-tree, the Sampled Envelope B-tree. • Simplicity - The B-tree is the only building block, so it is easy to integrate into commercial DBMSs • Optimal query performance - Answers a top-k/t query in a logarithmic number of I/Os in expectation • Handles updates - 99.5% of updates end up as simple insertions/deletions - Only 0.5% of updates need to lock and modify a larger portion of the B-tree • Size & construction - Occupies near-linear space - Requires near-linear time to build.

  9. Seb-tree (rand. sampling) • Let S be a set of N line segments in the plane • Build a series of random samples of S - Define l+1 independent sampling ratios p_i (0 ≤ i ≤ l) - Sample S with ratio p_i - This gives a sampled set S_i and an unsampled set US_i - In total, l+1 pairs of S_i and US_i • How to decide l and p_i? - l depends on kmax, the highest possible k [formula for l missing] - p_i is a geometrically decreasing series: p_i = 1/(2^i B), i = 0, 1, …, l, where B is the number of segments that can be held in a disk block
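
The sampling step can be sketched as follows, assuming each segment is kept independently with probability p_i = 1/(2^i B) at level i. The function name `sample_levels` and the choice to pass l in directly are assumptions made for illustration, since the slide's formula for l is not available here.

```python
import random

def sample_levels(segments, B, l, seed=None):
    """Return [(p_i, S_i, US_i)] for i = 0..l.

    Each level is drawn independently from the full set S: a segment goes
    into S_i with probability p_i = 1 / (2**i * B), otherwise into US_i.
    """
    rng = random.Random(seed)
    levels = []
    for i in range(l + 1):
        p = 1.0 / (2 ** i * B)
        sampled, unsampled = [], []
        for s in segments:
            (sampled if rng.random() < p else unsampled).append(s)
        levels.append((p, sampled, unsampled))
    return levels

levels = sample_levels(list(range(1000)), B=4, l=3, seed=1)
for p, S_i, US_i in levels:
    print(p, len(S_i), len(US_i))
```

The expected size of S_i is N/(2^i B), so higher levels hold fewer sampled segments.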

  10. Seb-tree (the upper envelope) • For each sample S_i, compute its upper envelope env_i - What is an upper envelope? • The upper envelope can be computed in near-linear time (1989). [Figures: a randomly sampled set S_i; S_i and its upper envelope env_i.]
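
To make the notion of an upper envelope concrete, here is a deliberately simple quadratic-time sketch. It is not the near-linear algorithm the slide refers to. It assumes segments are tuples (x1, y1, x2, y2) with x1 < x2, and all function names are hypothetical.

```python
from itertools import combinations

def y_at(seg, x):
    """Height of segment seg = (x1, y1, x2, y2) at abscissa x."""
    x1, y1, x2, y2 = seg
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)

def crossing_x(a, b):
    """x where segments a and b cross strictly inside their common x-range."""
    lo, hi = max(a[0], b[0]), min(a[2], b[2])
    if lo >= hi:
        return None
    sa = (a[3] - a[1]) / (a[2] - a[0])
    sb = (b[3] - b[1]) / (b[2] - b[0])
    if sa == sb:
        return None  # parallel segments never cross
    x = (b[1] - a[1] + sa * a[0] - sb * b[0]) / (sa - sb)
    return x if lo < x < hi else None

def upper_envelope(segs):
    """List of (x_left, x_right, index of the top segment, or None for a gap)."""
    xs = set()
    for s in segs:
        xs.update((s[0], s[2]))
    for a, b in combinations(segs, 2):
        x = crossing_x(a, b)
        if x is not None:
            xs.add(x)
    xs = sorted(xs)
    pieces = []
    for left, right in zip(xs, xs[1:]):
        # No endpoint or crossing lies strictly inside (left, right),
        # so the top segment at the midpoint is on top over the whole piece.
        mid = (left + right) / 2
        alive = [i for i, s in enumerate(segs) if s[0] <= mid <= s[2]]
        top = max(alive, key=lambda i: y_at(segs[i], mid)) if alive else None
        if pieces and pieces[-1][2] == top:
            pieces[-1] = (pieces[-1][0], right, top)  # merge equal neighbours
        else:
            pieces.append((left, right, top))
    return pieces

segs = [(0, 0, 4, 4), (0, 4, 4, 0), (5, 1, 6, 1)]
print(upper_envelope(segs))
```

The breakpoints between consecutive pieces are the envelope vertices that the next slide uses for the trapezoidal decomposition.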

  11. Seb-tree (the trapezoidal decomp.) • For each vertex on env_i - shoot a vertical line upward - if the vertex is an endpoint of a segment, also shoot a line downward until it hits another segment or score = 0. • This yields the trapezoidal decomposition of S_i: D(S_i). [Figures: S_i and its upper envelope env_i; S_i and its decomposition.]

  12. Seb-tree (the conflict list) • Conflict - consider a trapezoid ∆ from some D(S_i) and a segment s ∈ US_i - we say s conflicts with ∆ if s intersects ∆ • Conflict list - for each ∆, find all s ∈ US_i that conflict with it (do we need to consider s ∈ S_i?) - collect all such segments into a list, called the conflict list C(∆). [Figure: a trapezoid ∆ with C(∆) = {Sa, Sb, Sc, Sd, Se}.]
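
A conflict test can be sketched under a simplifying assumption that is not taken from the slides: each trapezoid is modelled as an x-span with a floor segment and no ceiling, so a segment conflicts with it exactly when it rises above the floor somewhere inside the span. The tuple layouts and the names `conflicts` and `conflict_lists` are hypothetical, and this naive version tests every segment against every trapezoid.

```python
def y_at(seg, x):
    """Height of segment seg = (x1, y1, x2, y2) at abscissa x (x1 < x2)."""
    x1, y1, x2, y2 = seg
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)

def conflicts(seg, trap):
    """True if seg rises above the floor of trap within trap's x-span.

    Assumed layout: trap = (x_left, x_right, floor_segment), unbounded above.
    """
    x_left, x_right, floor = trap
    lo, hi = max(seg[0], x_left), min(seg[2], x_right)
    if lo > hi:
        return False  # no overlap in x
    # seg - floor is linear on [lo, hi], so its maximum is at an endpoint.
    return max(y_at(seg, x) - y_at(floor, x) for x in (lo, hi)) > 0

def conflict_lists(traps, unsampled):
    """C(trap) for every trapezoid, by testing all unsampled segments."""
    return [[s for s in unsampled if conflicts(s, trap)] for trap in traps]

floor = (0, 2, 10, 2)
trap = (0, 10, floor)
below, crossing, outside = (1, 0, 3, 1), (2, 1, 6, 5), (11, 9, 12, 9)
print(conflict_lists([trap], [below, crossing, outside]))
```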

  13. Seb-tree (the index) • Let ∆_1, ∆_2, …, ∆_t be the trapezoids of D(S_i) from left to right - sorted by the starting x value of each ∆ • Build a B-tree T_i on C(∆_1), C(∆_2), …, C(∆_t) in that order • Build one B-tree for each level of sampling - in total we have l+1 B-trees
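
The role of the per-level B-tree can be mimicked in memory with a sorted array and binary search. This stand-in is an assumption for illustration only: the class name `LevelIndex` and the (x_start, x_end, conflict_list) layout are hypothetical, and a real B-tree would page the conflict lists to disk.

```python
from bisect import bisect_right

class LevelIndex:
    """In-memory stand-in for one B-tree T_i, keyed by trapezoid start x."""

    def __init__(self, trapezoids):
        # trapezoids: iterable of (x_start, x_end, conflict_list)
        self.traps = sorted(trapezoids, key=lambda tr: tr[0])
        self.starts = [tr[0] for tr in self.traps]

    def lookup(self, t):
        """Conflict list of the trapezoid whose x-span contains t, else None."""
        j = bisect_right(self.starts, t) - 1  # last trapezoid starting at or before t
        if j < 0:
            return None
        _, x_end, conflict = self.traps[j]
        return conflict if t <= x_end else None

idx = LevelIndex([(5, 9, ["c"]), (0, 2, ["a"]), (2, 5, ["b"])])
print(idx.lookup(1), idx.lookup(7), idx.lookup(10))
```

The binary search plays the part of the B-tree point search: one descent locates the trapezoid, after which its conflict list is read sequentially.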

  14. Size of seb-tree • Lemma 1 (1989): E(|C(∆)|) = O(1/p). By Lemma 1, for a ∆ on level i, E(|C(∆)|) = O(2^i B). • Lemma 2 (1986): There are O(n·α(n)) vertices on the upper envelope of n line segments in the plane, where α(n) is the inverse Ackermann function, which can be treated as a constant for all imaginable input sizes. - S_i has an expected O(N/(2^i B)·α(N/B)) trapezoids - B-tree T_i occupies O(N·α(N/B)) blocks. • Size of seb-tree: summed over the B-trees, the size of the seb-tree is [formula missing].

  15. More on seb-tree • Each line segment might intersect multiple trapezoids • How can the conflict lists be built efficiently? • Hierarchical decomposition • Conflict lists can be built in near-linear time.

  16. The hierarchical decomposition • Let L_0 be the set of segments in S_i; we then build a gradation L_0, L_1, …, L_λ, where L_j is a ½ sample of L_{j-1} and λ = O(log|L_0|). [Figure: levels L0, L1, L2.]
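
The gradation can be sketched as repeated coin-flip sampling. The stopping rule used here (stop once a level has at most one segment) and the name `build_gradation` are assumptions; the slides only state that each level is a ½ sample of the previous one.

```python
import random

def build_gradation(L0, seed=None):
    """Return [L_0, L_1, ...] where each L_j keeps every element of L_{j-1}
    independently with probability 1/2.

    Assumed stopping rule: stop once a level holds at most one element
    (the last level may be empty).
    """
    rng = random.Random(seed)
    levels = [list(L0)]
    while len(levels[-1]) > 1:
        levels.append([s for s in levels[-1] if rng.random() < 0.5])
    return levels

levels = build_gradation(range(64), seed=0)
print([len(L) for L in levels])
```

Each level roughly halves in size, which is why the number of levels is logarithmic in |L_0| in expectation.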

  17. The hierarchical decomposition • For each L_j, we build its trapezoidal decomposition D(L_j). [Figure: levels L0, L1, L2.]

  18. The hierarchical decomposition • For each L_j, we build its trapezoidal decomposition D(L_j) • We further partition D(L_j) with the vertical dividing lines from the higher levels D(L_{j+1}), …, D(L_λ). [Figure: levels L0, L1, L2.]

  19. The hierarchical decomposition • For each L_j, we build its trapezoidal decomposition D(L_j) • We further partition D(L_j) with the vertical dividing lines from the higher levels D(L_{j+1}), …, D(L_λ) • Store all trapezoids in this hierarchy in a tree (HDT). [Figure: levels L0, L1, L2.]

  20. The hierarchical decomposition • To determine which C(∆) a line segment belongs to at L_0, we search top-down from L_λ, visiting a ∆ if and only if the segment intersects it. [Figure: segments seg1 and seg2 traced through trapezoids a, b, d, e, f, g on levels L2, L1, L0.]

  21. Cost of building conflict lists • For a particular level S_i, the decomposition has a height of λ = O(log|S_i|) • For a segment s, the time spent visiting the HDT is proportional to the size of the HDT, which is [formula missing] • At L_j, a conflict list has an expected size E(|C(∆)|) = O(2^(i+j) B) • |L_j| = O(N/(2^(i+j) B)), and there are O(|L_j|·α(|L_j|)) trapezoids in D(L_j), so D(L_j) has an expected size of O(N·α(N/B)·log(N/B)) • The total time spent on all l+1 HDTs is [formula missing]

  22. Query on seb-tree • Querying the seb-tree is simple (a single for-loop) - Given k and a time instant t, initialize i = 0 1. Using B-tree T_i, do a point search to find the ∆ whose x-span contains t, and read its conflict list C(∆) 2. If at least k segments in C(∆) intersect time t, return the top-k segments; else if i < l, set i = i+1 and repeat step 1 3. Otherwise, scan the entire S to find the top-k segments - An improvement: instead of starting with i = 0, we can start directly at level i = log(k/B) (because 2^i B needs to be larger than k).
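
The query loop above can be sketched in memory as follows. Several things are assumptions rather than details from the slides: each level is a pair (sorted trapezoid start values, conflict list per trapezoid), the trapezoids are taken to tile the time axis so only start values are stored, each stored list is assumed to already contain every segment that can rank at the top within its span, and the starting level is rounded up with `ceil`. All names are hypothetical.

```python
import heapq
from bisect import bisect_right
from math import ceil, log2

def y_at(seg, t):
    """Score of segment seg = (t1, s1, t2, s2) at time t (t1 < t2)."""
    t1, s1, t2, s2 = seg
    return s1 + (s2 - s1) * (t - t1) / (t2 - t1)

def top_k_scan(segments, t, k):
    """Top-k segments at time t among those whose span contains t."""
    hits = [(y_at(s, t), s) for s in segments if s[0] <= t <= s[2]]
    return [s for _, s in heapq.nlargest(k, hits)]

def query(levels, all_segments, t, k, B):
    """Top-k/t query over levels[i] = (starts, conflict_lists)."""
    # Start where 2**i * B is at least k; lower levels cannot hold k hits.
    start = max(0, ceil(log2(k / B)))
    for i in range(start, len(levels)):
        starts, conflict_lists = levels[i]
        j = bisect_right(starts, t) - 1  # stand-in for the B-tree point search
        if j < 0:
            continue
        hits = top_k_scan(conflict_lists[j], t, k)
        if len(hits) >= k:
            return hits
    return top_k_scan(all_segments, t, k)  # step 3: brute-force fallback

a, b, c = (0, 10, 10, 10), (0, 8, 10, 8), (0, 1, 10, 1)
levels = [([0], [[a]]), ([0], [[a, b]])]
print(query(levels, [a, b, c], t=5, k=2, B=2))
```

In the example, k = 2 is not satisfied at level 0, is answered at level 1, and a request for k = 3 would fall through to the full scan.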

  23. Query cost • The query performance guarantee comes from the B-tree • For any query, the seb-tree index finds the top-k/t segments in expected O(log_B N + k/B) I/Os • The probability that the seb-tree needs to trigger a brute-force scan is less than B/N, and scanning the whole data set needs O(N/B) I/Os, so the scan adds only O(1) to the expected query cost.
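
The O(1) contribution of the fallback scan follows from multiplying its probability by its cost. A short worked version, using only the bounds stated on this slide:

```latex
\begin{align*}
\mathbb{E}[\text{query I/Os}]
  &\le O\!\left(\log_B N + \frac{k}{B}\right)
     + \Pr[\text{scan}] \cdot O\!\left(\frac{N}{B}\right) \\
  &\le O\!\left(\log_B N + \frac{k}{B}\right)
     + \frac{B}{N} \cdot O\!\left(\frac{N}{B}\right) \\
  &= O\!\left(\log_B N + \frac{k}{B}\right) + O(1).
\end{align*}
```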

  24. Updating the seb-tree • Recall that to build the B-tree at level i, we need to - take a sample of S with ratio 1/(2^i B) to get S_i - build the trapezoidal decomposition D(S_i) - store the conflict lists in the level-i B-tree • Given a new segment s - if it changes none of D(L_0), …, D(L_λ), simply follow the HDT to find where s belongs. - if it does change one of D(L_0), …, D(L_λ), we need to rebuild a larger portion of the seb-tree. • Deletion can be handled similarly.

  25. Space-query tradeoff • Based on Lemma 1: one expects O(1/p) conflicting segments for any trapezoid on level S_i, where p is the sampling rate 1/(2^i B) • To avoid expensive I/O, we define a threshold λ: when |C(∆)| > λ·O(1/p), simply do not store it (at query time, skip it) • In practice, λ = 3 or 4. [Figure: a trapezoid ∆ with conflicting segments Sa, Sb, Sc, Sd, Se; |C(∆)| = O(1/p).]
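
The threshold rule can be sketched as a simple size check. The constant hidden in O(1/p) is assumed to be 1 here, so the cut-off is λ·2^i·B; that choice, the use of None to mark a dropped list, and the function names are all assumptions for illustration.

```python
def keep_conflict_list(size, i, B, lam=3):
    """True if a conflict list of this size on level i should be stored.

    Assumption: the expected size 1/p_i is taken as exactly 2**i * B.
    """
    expected = 2 ** i * B
    return size <= lam * expected

def prune(conflict_lists, i, B, lam=3):
    """Replace oversized conflict lists by None so that queries skip them."""
    return [c if keep_conflict_list(len(c), i, B, lam) else None
            for c in conflict_lists]

lists = [[0] * 5, [0] * 13]
print([None if c is None else len(c) for c in prune(lists, i=0, B=4)])
```

A query that lands on a dropped list would move on to the next level, trading a little query time for less space.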

  26. Experiment • How does the seb-tree behave when … 1) the number of time series changes 2) the deviation of the time series changes 3) the threshold λ changes 4) kmax changes • Compare to the R-tree

  27. Experiment (1) • Index size & construction time

  28. Experiment (2) • Query cost

  29. Experiment (3) • Effect of Kmax

  30. Conclusion • Studied ranking queries on temporal data • Proposed the seb-tree • Takes near-linear time to construct • Occupies near-linear space • Supports dynamic updates efficiently • Employs the B-tree as its only building block.
