Deterministic Error Guarantees for Queries on Compressed Time Series

Chunbin Lin, joint with Etienne Boursier, Jacque Brito, Korhan Demirkaya, Joshua Lapacik, and Yannis Papakonstantinou. Motivation: fast analytic query processing over historical time series is necessary.


Presentation Transcript


  1. Deterministic Error Guarantees for Queries on Compressed Time Series • Chunbin Lin • Joint with Etienne Boursier, Jacque Brito, Korhan Demirkaya, Joshua Lapacik, Yannis Papakonstantinou

  2. Motivation • Fast analytic query processing over historical time series is necessary • Future prediction • Anomaly detection • Similarity matching • Example: compute the correlation of the foreign-exchange rates CAD/JPY and AUD/JPY • [Figure: CAD/JPY and AUD/JPY series; public health analyst]

  3. Challenge • Historical time series are big • 1 billion data points for each forex*1 • 8 TB operational data per day for each oil drilling rig*2 • *1 https://pepperstone.com/en/client-resources/historical-tick-data • *2 https://wasabi.com/storage-solutions/internet-of-things/

  4. Solutions • Distributed query processing on many machines • Approximate query processing on a single machine • Sampling methods: probabilistic error guarantees. E.g., the actual answer is within a stated bound with 95% confidence • Our goal: deterministic error guarantees. E.g., the actual answer is within a stated bound with 100% confidence

  5. Data • Time series: a sequence of (timestamp, value) pairs • Assume queries involve time series with the same resolution • Omit timestamps • With timestamps: [(20170103931, 115.80), (20170103932, 115.90), (20170103933, 116.25), (20170103934, 116.30), (20170103935, 116.11), (20170103936, 116.15), (20170103937, 116.16), (20170103938, 116.06), (20170103939, 115.72), ...] • Without timestamps: [115.80, 115.90, 116.25, 116.30, 116.11, 116.15, 116.16, 116.06, 115.72, ...] • Example data: Apple stock price

  6. Query • Time subseries operators • Arithmetic operators (+, −, ×, ÷, √) • E.g., 100+20, 100-20, 100*20, 100/20, ...

  7. Query • Statistical queries • Covariance, correlation, cross-correlation, ...

  8. Query • Statistical queries • Covariance, correlation, cross-correlation, ... • Inputs are base time series or time series produced by time series operators

  9. Offline precomputation phase – building indexes

  10. Segment list index • Index: a list of compressed time series segments • Example: one segment of Forex CAD/JPY (the Canadian Dollar and the Japanese Yen) estimated by f(x) = ax + b • For each segment, we store: • Estimation function (minimizes Euclidean distance), e.g., the coefficients (a, b) • Error measures • L2-norm of errors • Reconstruction error • L2-norm of estimated values

  11. Segment list index • Estimation function families (no limitation on estimation functions) • Polynomial function family • Exponential function family • Logarithmic function family • Logistic function family • Gaussian function family • Sin/cos function family

  12. Segment list index • Error guarantees • L2-norm of errors • Reconstruction error • L2-norm of estimated values (depends on the data values) • Example: data values 3.0, 4.8, 5.4 with estimation function f(x) = 1.2x + 2.0
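The three error measures can be checked on the slide's small example. This is a minimal sketch under assumptions that are not in the transcript: the three data values sit at x = 1, 2, 3, "reconstruction error" is taken to mean the maximum absolute error, and the helper names `fit_line` and `error_measures` are hypothetical.

```python
import math

def fit_line(values):
    """Least-squares line f(x) = a*x + b over x = 1..n (needs n >= 2)."""
    n = len(values)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def error_measures(values, a, b):
    """Return the three per-segment measures named on the slide."""
    estimates = [a * x + b for x in range(1, len(values) + 1)]
    errors = [y - f for y, f in zip(values, estimates)]
    return {
        "l2_errors": math.sqrt(sum(e * e for e in errors)),
        # Assumption: reconstruction error = largest absolute error.
        "reconstruction_error": max(abs(e) for e in errors),
        "l2_estimates": math.sqrt(sum(f * f for f in estimates)),
    }

values = [3.0, 4.8, 5.4]          # assumed to sit at x = 1, 2, 3
a, b = fit_line(values)           # recovers the slide's f(x) = 1.2x + 2.0
measures = error_measures(values, a, b)
print(round(a, 3), round(b, 3), measures)
```

Under these assumptions the least-squares fit reproduces the slide's coefficients, which suggests the example was built this way.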

  13. Segment list index • Existing index building algorithms • Fixed-length segmentation (FL): controls segment size • Sliding-window segmentation (SW): controls reconstruction error • ... • [Figure: segmentations of CAD/JPY and AUD/JPY] • Reference: E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In ICDM, pages 289–296, 2001.
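The two segmentation schemes can be sketched as follows. This is an illustration only, assuming linear estimation functions, half-open index ranges `(start, end)`, and a maximum-absolute-error threshold for SW; the function names are hypothetical and the cited algorithm may differ in its details.

```python
def fit_line(values):
    """Least-squares line over x = 1..n; a single point is fit exactly."""
    n = len(values)
    if n == 1:
        return 0.0, values[0]
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def max_abs_error(values):
    """Largest absolute deviation of the points from their fitted line."""
    a, b = fit_line(values)
    return max(abs(y - (a * x + b)) for x, y in enumerate(values, start=1))

def fixed_length(series, size):
    """FL: every segment has `size` points (the last one may be shorter)."""
    return [(i, min(i + size, len(series))) for i in range(0, len(series), size)]

def sliding_window(series, max_error):
    """SW: grow a segment until its error would exceed `max_error`."""
    if not series:
        return []
    segments, start = [], 0
    for end in range(1, len(series) + 1):
        if max_abs_error(series[start:end]) > max_error:
            segments.append((start, end - 1))  # close before the new point
            start = end - 1                    # new segment starts at that point
    segments.append((start, len(series)))
    return segments

series = [0, 1, 2, 3, 10, 9, 8, 7]
fl_segments = fixed_length(series, 3)
sw_segments = sliding_window(series, 0.01)
print(fl_segments, sw_segments)
```

FL gives segments whose boundaries line up across series of equal length, while SW places boundaries where the data changes shape, which is what later leads to misaligned segments.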

  14. Offline precomputation phase – building indexes • We build a segment list index for each time series • We store an estimation function and error measures for each segment

  15. Online query processing – providing deterministic error guarantees

  16. Error guarantees • Actual error: the absolute difference between the true answer $R$ and the estimated answer $\hat{R}$, i.e., $\varepsilon = |R - \hat{R}|$ • Error guarantee: an upper bound $\hat{\varepsilon}$ of the actual error, i.e., $\varepsilon \le \hat{\varepsilon}$

  17. Error guarantees • Providing the error guarantee for each Sum(T) is the key • T is a base time series or a time series produced by time series operators • If we can provide an error guarantee for each Sum(Ti), then we can give the error guarantee for general queries
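To see why Sum(T) is the building block, note that a correlation can be written entirely in terms of sums over base series and series produced by operators. The sketch below uses the standard Pearson formula; the function name and the choice of which sums to pass are illustrative assumptions, not taken from the slides.

```python
import math

def correlation_from_sums(n, s1, s2, s11, s22, s12):
    """Pearson correlation from n and five sums.

    s1 = Sum(T1), s2 = Sum(T2), s11 = Sum(T1 x T1),
    s22 = Sum(T2 x T2), s12 = Sum(T1 x T2).
    """
    numerator = n * s12 - s1 * s2
    denominator = math.sqrt(n * s11 - s1 ** 2) * math.sqrt(n * s22 - s2 ** 2)
    return numerator / denominator

t1 = [1.0, 2.0, 3.0, 4.0]
t2 = [3.0, 5.0, 7.0, 9.0]          # t2 = 2 * t1 + 1, so correlation is 1
corr = correlation_from_sums(
    len(t1),
    sum(t1),
    sum(t2),
    sum(x * x for x in t1),
    sum(y * y for y in t2),
    sum(x * y for x, y in zip(t1, t2)),
)
print(corr)
```

If each of the five sums comes with an error guarantee, those guarantees can be propagated through the arithmetic to bound the error of the final correlation.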

  18. Query over single segment • Error guarantees for time series operators T1 T2

  19. Query over single segment • Error guarantee of Sum(T1 x T2)

  20. Query over single segment • Error guarantee of Sum(T1 x T2): the cross terms are 0 if the estimation function family forms a vector space (VS) • Vector space: a set that is closed under finite vector addition and scalar multiplication • The polynomial function family is a vector space

  21. Orthogonal projection property in VS

  22. Orthogonal projection property in VS

  23. Query over single segment • Two terms are = 0 by the orthogonal projection property in VS
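The vanishing terms can be checked numerically. The sketch below assumes linear estimation functions fitted by least squares on one shared segment; the sample values and helper names are made up for illustration.

```python
import math

def fit_line(values):
    """Least-squares line f(x) = a*x + b over x = 1..n (needs n >= 2)."""
    n = len(values)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def split(values):
    """Split a series into estimated values f and errors e (data = f + e)."""
    a, b = fit_line(values)
    estimates = [a * x + b for x in range(1, len(values) + 1)]
    errors = [y - f for y, f in zip(values, estimates)]
    return estimates, errors

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

t1 = [3.0, 4.8, 5.4, 6.1]   # illustrative data, same segment for both series
t2 = [1.0, 0.7, 1.9, 2.2]
f1, e1 = split(t1)
f2, e2 = split(t2)

# Least-squares errors are orthogonal to every function in the family,
# so both cross terms are zero up to rounding.
cross_terms = (dot(f1, e2), dot(e1, f2))
actual_error = abs(dot(t1, t2) - dot(f1, f2))
guarantee = math.sqrt(dot(e1, e1)) * math.sqrt(dot(e2, e2))
print(cross_terms, actual_error, guarantee)
```

With the cross terms gone, the actual error equals |Sum(e1 x e2)|, which the product of the two stored L2-norms of errors bounds by the Cauchy-Schwarz inequality.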

  24. Query over single segment • Error guarantee of Sum(T1 x T2) • Case 1: estimation function family is not VS • Case 2: estimation function family is VS
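The formulas for the two cases are not in the transcript. A worked derivation that fits the stored error measures is given below as a sketch. The notation is an assumption: on one segment, series $T_k$ has data $d_{k,i} = f_{k,i} + e_{k,i}$, where $f_k$ holds the estimated values and $e_k$ the errors, and $\|\cdot\|$ is the L2-norm.

```latex
\begin{align*}
\mathrm{Sum}(T_1 \times T_2) - \mathrm{Sum}(\hat{T}_1 \times \hat{T}_2)
  &= \sum_i (f_{1,i} + e_{1,i})(f_{2,i} + e_{2,i}) - \sum_i f_{1,i} f_{2,i} \\
  &= \sum_i f_{1,i} e_{2,i} + \sum_i e_{1,i} f_{2,i} + \sum_i e_{1,i} e_{2,i}.
\end{align*}

% Family is not VS: bound each sum with the Cauchy-Schwarz inequality.
\[
\varepsilon \le \|f_1\| \, \|e_2\| + \|e_1\| \, \|f_2\| + \|e_1\| \, \|e_2\|
\]

% Family is VS: the first two sums are 0 by orthogonal projection.
\[
\varepsilon = \Bigl| \sum_i e_{1,i} e_{2,i} \Bigr| \le \|e_1\| \, \|e_2\|
\]
```

Only the non-VS bound uses the L2-norms of estimated values, which is consistent with the later observation that VS needs less space than ANY.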

  25. Query over aligned segments • Aligned segments • All the segments are perfectly aligned • Error guarantees • Sum of the error guarantees of each segment pair • [Figure: aligned segments of CAD/JPY and AUD/JPY]
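Summing per-pair guarantees can be sketched as below. The per-pair bound used here is the generic Cauchy-Schwarz bound on Sum(T1 x T2), which is an assumption about the form of the slides' formula; each segment is represented by a hypothetical tuple `(l2_errors, l2_estimates)`.

```python
def pair_guarantee(seg1, seg2, vector_space):
    """Bound on the error of Sum(T1 x T2) over one aligned segment pair."""
    e1, f1 = seg1
    e2, f2 = seg2
    bound = e1 * e2
    if not vector_space:
        # Without the orthogonal projection property the cross terms remain.
        bound += f1 * e2 + e1 * f2
    return bound

def aligned_guarantee(list1, list2, vector_space):
    """Total guarantee: the sum of the guarantees of each segment pair."""
    if len(list1) != len(list2):
        raise ValueError("segment lists are not aligned")
    return sum(pair_guarantee(s1, s2, vector_space)
               for s1, s2 in zip(list1, list2))

# Illustrative (l2_errors, l2_estimates) values, not measured data.
cad_jpy = [(0.5, 10.0), (0.2, 8.0)]
aud_jpy = [(0.4, 5.0), (0.1, 6.0)]
vs_bound = aligned_guarantee(cad_jpy, aud_jpy, vector_space=True)
any_bound = aligned_guarantee(cad_jpy, aud_jpy, vector_space=False)
print(vs_bound, any_bound)
```

The VS bound is never larger than the ANY bound for the same stored measures, because it drops two non-negative terms per pair.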

  26. Query over aligned segments CAD/JPY AUD/JPY

  27. Query over misaligned segments • Misaligned segments • One segment overlaps with more than one segment CAD/JPY AUD/JPY

  28. Query over misaligned segments • Sum(T1 x T2) • Segment combination selection becomes an optimization problem • Minimize the resulting error guarantee • [Figure: misaligned segments of CAD/JPY and AUD/JPY]

  29. Query over misaligned segments • Segment combination selection • Intersection Strategy (IS) • Maximal number of segments • Optimal Strategy (OS) • Minimal error combination CAD/JPY AUD/JPY
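One reading of the Intersection Strategy is to cut both segment lists at every boundary of either list, which yields the largest number of pieces. The sketch below follows that reading; it is an assumption about IS, and the half-open `(start, end)` representation and function name are hypothetical.

```python
def intersect_segments(segs1, segs2):
    """Split two segment lists at the union of their boundaries.

    Each list holds half-open (start, end) ranges in increasing order.
    Returns (start, end, i, j): the overlap of segs1[i] and segs2[j].
    """
    pieces = []
    i = j = 0
    while i < len(segs1) and j < len(segs2):
        start = max(segs1[i][0], segs2[j][0])
        end = min(segs1[i][1], segs2[j][1])
        if start < end:
            pieces.append((start, end, i, j))
        # Advance whichever segment finishes at `end` (both if they tie).
        advance_first = segs1[i][1] == end
        advance_second = segs2[j][1] == end
        if advance_first:
            i += 1
        if advance_second:
            j += 1
    return pieces

cad_jpy = [(0, 4), (4, 10)]
aud_jpy = [(0, 6), (6, 10)]
pieces = intersect_segments(cad_jpy, aud_jpy)
print(pieces)
```

Each piece is covered by exactly one segment from each series, so the aligned-segment reasoning can be applied piece by piece. The Optimal Strategy would instead search over combinations for the one with the minimal error.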

  30. Query over misaligned segments • Orthogonal projection property • Cannot be applied, because the segments are not aligned • Estimation function for a subsegment may not be in the family • Linear scalable family (LSF): the restriction of any function in LSF to a smaller domain is still a function in LSF • LSF is a superset of the polynomial function family (PF) • [Figure: nested function families PF, LSF, VS, ANY, shown with CAD/JPY and AUD/JPY segments]

  31. Query over misaligned segments • Sum(T1 x T2) • If estimation functions are in LSF CAD/JPY AUD/JPY

  32. Error guarantee properties • Tightness • Given the same error measures, no other error guarantee that holds for queries on all the data is smaller • Amplitude-independence (AI) • The error guarantee does not use the amplitudes of the data. E.g., changing from Celsius to Kelvin will not change the error guarantees
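The Celsius-to-Kelvin example can be reproduced with a small sketch. It assumes linear estimation functions fitted by least squares; the temperature readings and helper names are made up for illustration.

```python
import math

def fit_line(values):
    """Least-squares line f(x) = a*x + b over x = 1..n (needs n >= 2)."""
    n = len(values)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def norms(values):
    """L2-norm of errors and L2-norm of estimated values for one segment."""
    a, b = fit_line(values)
    estimates = [a * x + b for x in range(1, len(values) + 1)]
    errors = [y - f for y, f in zip(values, estimates)]
    return {
        "l2_errors": math.sqrt(sum(e * e for e in errors)),
        "l2_estimates": math.sqrt(sum(f * f for f in estimates)),
    }

celsius = [20.0, 21.5, 19.0, 22.0]        # illustrative readings
kelvin = [c + 273.15 for c in celsius]
c_norms = norms(celsius)
k_norms = norms(kelvin)
print(c_norms, k_norms)
```

The shift only moves the intercept of the fitted line, so the errors and their L2-norm stay the same, while the L2-norm of estimated values grows. A guarantee built only from error norms is therefore unaffected by the unit change.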

  33. Error guarantee properties • Table columns: function family; queries on aligned segments (AI, Tight); queries on misaligned segments (AI, Tight) • Rows for Sum(T1 x T2): ANY\VS, VS\LSF, LSF • Rows for Sum(T1 + T2) and Sum(T1 - T2): ANY

  34. Error guarantee properties • Dichotomies of function families • Queries on aligned segments: VS is AI; ANY\VS is non-AI • Queries on misaligned segments: LSF is AI; ANY\LSF is non-AI

  35. Experiments • Dataset

  36. Experiments • Estimation functions • [1] E. Keogh. Fast similarity search in the presence of longitudinal scaling in time series databases. In ICTAI, pages 578–584, 1997. • [2] M. Tobita. Combined logarithmic and exponential function model for fitting postseismic GNSS time series after 2011 Tohoku-Oki earthquake. Earth, Planets and Space, 68(1):41, 2016. • [3] Z. Pan, Y. Hu, and B. Cao. Construction of smooth daily remote sensing time series data: a higher spatiotemporal resolution perspective. Open Geospatial Data, Software and Standards, 2(1):25, 2017.

  37. Experiments • Segment list building algorithms • Fixed-length segmentation (FL) • Sliding-window segmentation (SW) • Queries • Correlation query • Cross-correlation query

  38. Experiments • Error guarantees for queries on aligned time series • 20 correlation queries • Segment lists built with FL • Power of the orthogonal projection property • VS uses less space than ANY • VS uses 0.035% while ANY uses 0.06%

  39. Experiments • Error guarantees for queries on misaligned time series • 20 correlation queries • Segment lists built with SW • (1) Effect of LSF (~100x) • (2) Effect of optimal segment combination selection (~10x)

  40. Experiments • Aligned vs. misaligned • Fix the space, then compare error guarantees for aligned and misaligned • With K segments in the misaligned case and N data points involved in the query, set the segment size in FL to N/K • (1) Misaligned produces smaller true errors • (2) Misaligned produces smaller error guarantees (~3x for ANY)

  41. Experiments • Aligned vs. misaligned • Fix the space, then compare error guarantees for aligned and misaligned • With K segments in the misaligned case and N data points involved in the query, set the segment size in FL to N/K • (1) Misaligned produces smaller error guarantees (~8.2x for LSF)

  42. Experiments • Index building time • Query processing time

  43. Experiments • Comparison with a sampling method • Uniform random sampling scheme with a global seed • [Figure: sampling size needed to provide the same error guarantees as VS and as ANY, by confidence]

  44. Conclusion • Provide deterministic error guarantees for statistical queries over aligned segments and misaligned segments • Provide optimizations to reduce the error guarantees in both scenarios • Study the properties (AI and tightness) of the proposed error guarantees • Conduct experiments to evaluate the error guarantees

  45. Future work Deterministic error guarantees for interactive analytic queries over compressed time series

  46. Architecture • Build a segment tree index for each time series (offline) • A node refers to a compressed segment • For each segment, we store the estimation function and error measures • The tree may not be balanced • Navigate the trees to access the minimal number of nodes that gives answers with error guarantees below a given threshold (online)

  47. Segment tree index • One tree structure for each time series • A node refers to a compressed segment • For each segment, we store the estimation function and error measures • The tree may not be balanced • Segment tree building algorithms:* • Top-down algorithm • Bottom-up method • Sliding-window approach • *Fu, Tak-chung. "A review on time series data mining." Engineering Applications of Artificial Intelligence 24.1 (2011): 164-181.

  48. Query processing algorithm • Given a query and an error budget, access the minimal number of nodes to get approximate answers with error guarantees below the error budget • Consider query = (Agg(Times(T1, T2)), 10%) • [Figure: segment trees of time series T1 and time series T2]
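Tree navigation under an error budget can be illustrated with a greedy refinement heuristic. This is not the authors' algorithm and does not guarantee a minimal number of nodes. It assumes a single tree, a hypothetical `Node` class carrying a per-node error guarantee, and guarantees that add up across the selected nodes.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Node:
    error: float                          # error guarantee if this node is used
    children: list = field(default_factory=list)

def select_nodes(root, budget):
    """Refine the worst refinable node until the total error fits the budget."""
    expandable = []                       # max-heap (by error) of nodes with children
    chosen = []                           # selected nodes that cannot be refined
    tie = 0                               # tie-breaker so nodes are never compared

    def add(node):
        nonlocal tie
        if node.children:
            heapq.heappush(expandable, (-node.error, tie, node))
            tie += 1
        else:
            chosen.append(node)

    add(root)
    total = root.error
    while total > budget and expandable:
        _, _, node = heapq.heappop(expandable)
        total -= node.error               # replace the node by its children
        for child in node.children:
            total += child.error
            add(child)
    chosen.extend(node for _, _, node in expandable)
    return chosen, total

left = Node(3.0, [Node(0.5), Node(0.5)])
right = Node(2.0)
root = Node(10.0, [left, right])
nodes, total_error = select_nodes(root, budget=4.0)
print(len(nodes), total_error)
```

If the budget cannot be met even at the leaves, the function returns the leaf-level selection together with its total, so the caller can see that the guarantee exceeds the budget.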

  49. Query processing algorithm • Performance-wise optimization: an incremental update segmentation algorithm that gives a ratio compared with the optimal one • Space-wise optimization: avoid storing the estimation functions for the right nodes. The estimation function can be deduced from the parent node and the left sibling node via an inverse basis matrix • [Figure: only red nodes store estimation functions]

  50. Thank you Q&A
