
Building Efficient Time Series Similarity Search Operator

Mijung Kim, Summer Internship 2013 at HP Labs. Overview: the internship project is part of a project that builds a scalable analytics framework and constructs a set of analytic operators within the framework.


Presentation Transcript


  1. Building Efficient Time Series Similarity Search Operator Mijung Kim Summer Internship 2013 at HP Labs

  2. Overview • The internship project is part of a project that: • builds a scalable analytics framework and • constructs a set of analytic operators within the framework • Trade off performance against available resources • Multiple implementations with different trade-offs for each operator • A mechanism to choose an implementation given constraints • My goal is to build a time series similarity search operator • Parallel data processing • Alternative implementations for the time series similarity search

  3. What is a Time Series? • Time series data is a sequence of data points measured repeatedly over time • Example: image from Wikipedia, http://en.wikipedia.org/wiki/Time_series

  4. Time Series Similarity Search • Given a time series database (T) and a query pattern (P), find the k nearest neighbors of the query among the time series segments (T_i(j), …, T_i(j+m-1)) of the database, where m is the query length • Use cases: targeted marketing, anomaly detection, and many more • Cost of the point-by-point distance computation: O(N_t * n * m), where N_t is the number of time series, n the time series length, and m the query length • The cost is linear in the query length, which is inefficient for large query lengths!
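
The point-by-point search described on this slide can be sketched as follows. This is a minimal illustration, not the operator built during the internship: the slide only says "Distance", so Euclidean distance is an assumption, and the function name `naive_knn` and the toy data are hypothetical.

```python
import heapq
import math


def naive_knn(database, query, k):
    """Return the k closest segments as (distance, series_index, offset)."""
    m = len(query)
    candidates = []
    for i, series in enumerate(database):        # N_t time series
        for j in range(len(series) - m + 1):     # about n segment offsets
            # m point-by-point comparisons per segment
            # (Euclidean distance is an assumption).
            sq = sum((series[j + t] - query[t]) ** 2 for t in range(m))
            candidates.append((math.sqrt(sq), i, j))
    return heapq.nsmallest(k, candidates)


# Hypothetical toy database with two short time series.
database = [
    [0.0, 1.0, 2.0, 3.0, 2.0, 1.0],
    [5.0, 5.0, 1.0, 2.0, 3.0, 9.0],
]
query = [1.0, 2.0, 3.0]
result = naive_knn(database, query, k=2)
```

The three nested loops (series, offsets, query points) are where the O(N_t * n * m) cost comes from.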

  5. FFT (Fast Fourier Transform) based Search • Time series data in the time domain can be transformed to the frequency domain • We can compute the distance without a point-by-point comparison against each time series segment in the time domain • The FFT of each time series can be precomputed and reused for every time series segment! • Image from Wikipedia: http://en.wikipedia.org/wiki/Convolution • Cost: O(N_t * n * log n), where N_t is the number of time series and n the time series length; independent of the query length

  6. Time Series Search with MapReduce • The time series database is horizontally partitioned (Time Series Partition_1, Partition_2, …, Partition_n) • Each mapper (Map_1, Map_2, …, Map_n) takes the query pattern and one partition, computes the distance between each time series segment in the partition and the query, and emits its Top-K • A Top-K reducer merges the mappers' Top-K lists into the query result
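
The dataflow on this slide can be imitated in plain Python, without Hadoop, to show why a local Top-K per mapper is enough. This is a sketch under assumptions: squared Euclidean distance, in-memory partitions, and the hypothetical function names `map_partition` and `reduce_top_k`.

```python
import heapq


def map_partition(partition, query, k):
    """Mapper: local top-k (squared distance, series_id, offset) of one partition."""
    m = len(query)
    scored = []
    for series_id, series in partition:
        for j in range(len(series) - m + 1):
            d = sum((series[j + t] - query[t]) ** 2 for t in range(m))
            scored.append((d, series_id, j))
    return heapq.nsmallest(k, scored)


def reduce_top_k(local_results, k):
    """Reducer: merge the mappers' local top-k lists into the global top-k."""
    return heapq.nsmallest(k, (hit for hits in local_results for hit in hits))


# Hypothetical horizontally partitioned database: two partitions.
partitions = [
    [("a", [0.0, 1.0, 2.0, 3.0]), ("b", [9.0, 9.0, 9.0, 9.0])],
    [("c", [1.0, 2.0, 4.0, 0.0])],
]
query = [1.0, 2.0, 3.0]

local = [map_partition(p, query, k=2) for p in partitions]  # one call per mapper
top = reduce_top_k(local, k=2)
```

Any segment in the global Top-K is also in the Top-K of its own partition, so each mapper only needs to ship k candidates to the reducer.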

  7. FFT-based vs. Naïve Search • Single machine vs. cluster (e.g., >15X gain in cluster mode) • FFT-based search cost is independent of the query length: it is efficient for larger query lengths, but naïve search is better for smaller query lengths • We can develop query plans based on the query length!
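
A query planner of the kind suggested here could compare the two cost estimates from the earlier slides, N_t * n * m for naïve search and N_t * n * log n for FFT-based search. The sketch below is hypothetical: the function `choose_plan` and the tuning factor `fft_constant` are illustrative assumptions, and a real planner would calibrate such constants from measurements.

```python
import math


def choose_plan(num_series, series_len, query_len, fft_constant=1.0):
    """Pick the cheaper plan from the asymptotic cost estimates."""
    naive_cost = num_series * series_len * query_len
    # fft_constant is a hypothetical factor for the FFT's per-operation overhead.
    fft_cost = fft_constant * num_series * series_len * math.log2(series_len)
    return "naive" if naive_cost <= fft_cost else "fft"


short_plan = choose_plan(num_series=1000, series_len=1024, query_len=4)
long_plan = choose_plan(num_series=1000, series_len=1024, query_len=64)
```

With these toy numbers the crossover sits at a query length of log2(1024) = 10: shorter queries go to the naïve plan and longer ones to the FFT plan.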

  8. Lessons so far • FFT proved to be efficient for the time series similarity search operation, but • there are other, (theoretically) more efficient techniques for the time series similarity search operator, e.g., LSH (locality-sensitive hashing) • Parallel data processing with MapReduce in a cluster environment helps, but • it lacks the rich data analytic algorithms commonly supported by statistical software such as MATLAB and R • We investigate frameworks that support R with MapReduce as a general analytic operation framework

  9. Why R + MapReduce? • R is free software and a widely used programming language and environment for statistical computing, data analysis, and graphics • R provides a wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible • In-memory computation in R is impractical for large-scale data analysis! • R + MapReduce combines parallel processing in a cluster environment with rich data analytics algorithms and graphics

  10. Parallel R (Split-apply-combine) • split: the input is divided into partitions • apply: R functions run on each partition • combine: an aggregate function merges the per-partition results
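
The split-apply-combine pattern can be sketched in a few lines. The example below uses Python rather than R, and a thread pool stands in for the cluster; the function names and the choice of a global mean as the aggregate are illustrative assumptions, not part of the original slides.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_partitions):
    """Split: deal the input into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def apply_partition(partition):
    """Apply: per-partition work; here a partial (sum, count)."""
    return (sum(partition), len(partition))


def combine(partials):
    """Combine: aggregate the partial results into a global mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


data = list(range(1, 101))
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(apply_partition, split(data, 4)))
mean = combine(partials)
```

The apply step returns a (sum, count) pair instead of a per-partition mean so that the combine step stays exact even when partitions differ in size.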

  11. Examples (R+MapReduce) • Per-customer models: input (movie ratings of each customer) → R function (ARIMA) → ARIMA (Autoregressive Integrated Moving Average) model of each customer • Forecast evaluation: inputs with different training periods → R instance (forecast) → measure error • References: [IBM Ricardo, Das et al. SIGMOD ‘10] [Google parallelism, Stokely et al. JSM ‘11]

  12. Time Series Search on RHIPE • RHIPE (www.rhipe.org) is an open-source R package that provides an abstraction layer allowing users to formulate MapReduce jobs in R scripts • Components shown in the diagram: FFT R function, R array, R code, Protocol buffer, rJava (R <-> Java), Java code, Java BytesWritable • Dataflow: the time series database is split into Time Series Partition_1, …, Partition_n; each mapper (Map_1, …, Map_n) takes the query pattern and emits its Top-K; a Top-K reducer produces the query result

  13. Summary • Built a time series similarity operator for a scalable data analytic framework • Worked with mentors Jun Li (System) and Krishnamurthy Viswanathan (Data scientist) • Served as a bridge between the parallel system and data analysis: designing parallel processing for data analytic algorithms and implementing the algorithms in a cluster environment • My role: between Parallel Processing on Cluster Environment (Hadoop) and Data Analysis (R, Matlab, C/C++)

  14. Conclusion (What I gain…) - Parallel data processing - Relational database - Java, MATLAB, C/C++, R, … - Machine learning algorithms Internship work Research work (+ industry experience) - Time series data analysis - Mathematical techniques (FFT/LSH) - Hadoop, JNI, … What’s more… - An invention disclosure regarding the time series similarity search filed in HP - Network with leading researchers in my research area
