High Performance Algorithms for Multiple Streaming Time Series Xiaojian Zhao Advisor: Dennis Shasha Department of Computer Science Courant Institute of Mathematical Sciences New York University Jan. 10 2006
Roadmap: • Motivation • Incremental Uncooperative Time Series Correlation • Incremental Matching Pursuit (MP) (optional) • Future Work and Conclusion
Motivation (1) • Financial time series streams are watched closely by millions of traders. • “Which pairs of stocks were correlated with a value of over 0.9 for the last three hours? Report this information every half hour” (Incremental pairwise correlation) • “How to form a portfolio consisting of a small set of stocks which replicates the market? Update it every hour” (Incremental matching pursuit)
Motivation (2) • As processors speed up, algorithmic efficiency no longer matters … one might think. • True if problem sizes stay same. But they don’t. • As processors speed up, sensors improve • Satellites spewing out more data a day • Magnetic resonance imagers give higher resolution images, etc.
High performance incremental algorithms • Incremental Uncooperative Time Series Correlation • Monitor and report the correlation information among all time series incrementally (e.g. every half hour) • Improve the efficiency from quadratic to super-linear • Incremental Matching Pursuit (MP) • Monitor and report the approximation vectors of matching pursuit incrementally (e.g. every hour) • Improve the efficiency significantly
Problem statement: • Detect and report the correlation incrementally and rapidly • Extend the algorithm into a general engine • Apply it in practical application domains
Correlated! Correlated! Online detection of high correlation
Pearson correlation and Euclidean distance • Normalized Euclidean distance Pearson correlation • Normalization • dist2=2(1- correlation) • From now on, we will not differentiate between correlation and Euclidean distance
Naïve approach: pairwise correlation • Given a group of time series, compute the pairwise correlation • Time O(WN2),where • N : number of streams • W: window size (e.g. in the past one hour) Let’s see high performance algorithms!
Technical review Framework: GEMINI Tools: Data Reduction Techniques • Deterministic Orthogonal vs. Randomized • Fourier Transform, Wavelet Transform, and Random Projection Target: Various Data • Cooperative vs. Uncooperative
GEMINI Framework* Data reduction, e.g. DFT, DWT, SVD * Faloutsos, C., Ranganathan, M. & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases,. SIGMOD, 1994
GEMINI: an example • Objective: find the nearest neighborhood (L2-norm) of each time series. • Compute the Fourier Transform over each of them, e.g. X and Y; yield two coefficient vectors Xf and Yf • Xf=(a1, a2, …ak) and Yf=(b1, b2, …bk) • Original distance vs. coefficient distance (Parseval's Theorem) • Because, for some data types, energy concentrates on first a few frequency components, coefficient distance can work as a very good filter and at the same time guarantee no false negatives • They may be stored in a tree or grid structure
Review: DFT/DWT vs. Random Projection Fourier Transform, Wavelet Transform and SVD • A set of orthogonal base (deterministic) • Based on Parseval's Theorem Random Projection • A set of random base (non-deterministic) • Based on Johnson-Lindenstrauss (JL) Lemma Orthogonal Base Random Base
Review: Random Projection: Intuition • You are walking in a sparse forest and you are lost. • You have an outdated cell phone without a GPS (w/o latitude&altitude). • You want to know if you are close to your friend. • You identify yourself at 100 meters from Bestbuy and 200 meters from a silver building etc. • If your friend is at similar distances from several of these landmarks, you might be close to one another. • Random projections are analogous to these distances to landmarks.
Random Projection inner product sketches time series random vector Sketch: A vector of output returned by random projection
Review: Sketch Guarantees * Johnson-Lindenstrauss ( JL) Lemma: • For any and any integer n, let k be a positive integer such that • Then for any set V of n points in , there is a map such that for all • Further this map can be found in randomized polynomial time • W.B.Johnson and J.Lindenstrauss. “Extensions of Lipshitz mapping into hilbert space”. Contemp. Math.,26:189-206,1984
Empirical study : sketch approximation Time series length=256 and sketch size=30
Empirical study: sketch distance/real distance Sketch=30 Sketch=1000 Sketch=80
Data classification • Cooperative • Time series exhibiting a fundamental degree of regularity, allowing them to be represented by the first few coefficients in the spectral space with little loss of information • Example: Stock Price (random walk) • Tools: Fourier Transform, Wavelet Transform, SVD • Uncooperative • Time series whose energy is not concentrated in only a few frequency components, e.g. • Example: Stock Return (= ) • Tool: Random Projection
DFT on random walk and white noise Cooperative Uncooperative
Approximation Power: SVD Distance vs. Sketch Distance • Note: SVD is superior to DFT and DWT in approximation power. • But all of them are all bad for uncooperative data. • Here sketch size = 32 and SVD coefficient number =30
Our new algorithm* • The big picture of the system • Structured random vector (New) • Compute sketch by structured convolution (New) • Optimize in the parameter space (New) • Empirical study • Richard Cole, Dennis Shasha and Xiaojian Zhao. “Fast Window Correlations Over Uncooperative Time Series”. SIGKDD 2005
Big Picture time series 1 time series 2 time series 3 … time series n … sketch 1 sketch 2 … sketch n … Correlatedpairs Random Projection Grid structure Data Reduction Filtering
Our objective reminded • Monitor and report the correlation periodically e.g. “every half hour” • We chose Random Projection as a means to reduce the data dimension • The time series needs to be looked at in a time window. • This time window should slide forward as time goes on.
Definitions: Sliding window and Basic window Basic window (bw) Time point Stock 1 Stock 2 Stock 3 …… Stock n Sliding window size=8 Basic window size=2 Sliding window (sw) Time axis Example: Every half hour (bw) report the correlation of the last three hours (sw)
Random vector and naïve random projection • Choose randomly sw random numbers to form a random vector R=(r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12) • Inner product starts from each data point Xsk1=(x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12)*R Xsk2=(x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13)*R Xsk3=(x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14)*R …… • We improve it in two ways • Partition a random vector of length sw into several basic windows • Use convolution instead of inner product
How to construct a random vector Construct a random vector of 1/-1 of length sw. Suppose sliding window size=12, and basic window size=4 The random vector within a basic window is A control vector A final complete random vector for a sliding window may look like: (1 1 -1 1; -1 -1 1 -1; 1 1 -1 1) Here Rbw=(1 1 -1 1) b=(1 -1 1) Rbw -Rbw Rbw
Naive algorithm and hope for improvement r=( 1 1 -1 1 ; -1 -1 1 -1; 1 1 -1 1 ) x=(x1 x2 x3 x4; x5 x6 x7 x8; x9 x10 x11 x12) • There is redundancy in the second dot product given the first one. • We will eliminate the repeated computation to save time dot product xsk=r*x= x1+x2-x3+x4-x5-x6+x7-x8+x9+x10-x11+x12 With new data point arrival, this operation will be done again r= ( 1 1 -1 1 ; -1 -1 1 -1; 1 1 -1 1 ) x’=(x5 x6 x7 x8 ; x9 x10 x11 x12; x13 x14 x15 x16) * xsk=r*x’= x5+x6-x7+x8-x9-x10+x11+x12+x13+x14+x15- x16
Our algorithm • All the operations are over the basic window; • Pad with |bw-1| zeros, then convolve with Xbw conv1:(1 1 -1 1 0 0 0) (x1,x2,x3,x4) conv2:(1 1 -1 1 0 0 0) (x5,x6,x7,x8) conv3:(1 1 -1 1 0 0 0) (x9,x10,x11,x12) x4 x4+x3 Animation shows convolution in action: -x4+x3+x2 1 1 -1 1 0 0 0 x4-x3+x2+x1 x1 x2 x3 x4 x1 x2 x3 x4 x1 x2 x3 x4 x1 x2 x3 x4 x1 x2 x3 x4 x1 x2 x3 x4 x1 x2 x3 x4 x3-x2+x1 x2-x1 x1
Our algorithm: example First Convolution Second Convolution Third Convolution x8 x8+x7 x6+x7-x8 x5+x6-x7+x8 x5-x6+x7 x6-x5 x5 x12 x12-x11 x10+x11-x12 x9+x10-x11+x12 x9-x10+x11 x10-x9 x9 x4 x4+x3 x2+x3-x4 x1+x2-x3+x4 x1-x2+x3 x2-x1 x1 + + xsk1= (x1+x2-x3+x4)-(x5+x6-x7+x8)+(x9+x10-x11+x12) xsk2=(x2+x3-x4+x5)-(x6+x7-x8+x9)+(x10+x11-x12+x13)
Our algorithm: example sk1=(x1+x2-x3+x4) sk5=(x5+x6-x7+x8) sk9=(x9+x10-x11+x12) xsk1= (x1+x2-x3+x4)-(x5+x6-x7+x8)+(x9+x10-x11+x12)b= ( 1 -1 1) First sliding window sk2=(x2+x3-x4) + (x5)sk6=(x6+x7-x8) + (x9)sk10=(x10+x11-x12) + (x13)Then sum up and we have xsk2=(x2+x3-x4+x5)-(x6+x7-x8+x9)+(x10+x11-x12+x13) b=( 1 -1 1) Second sliding window (Sk1 Sk5 Sk9)*(b1 b2 b3) * is inner product
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 1 1 –1 1 1 1 –1 1 1 1 –1 1 x5+x6-x7+x8 x9+x10-x11+x12 x1+x2-x3+x4 Basic window version • Or if time series are highly correlated between two consecutive data points, we may compute the sketch every basic window. • That is, we update the sketch for each time series only when data of a complete basic window arrive. No convolution, only inner product.
Overview of our new algorithm • The projection of a sliding window is decomposed into operations over basic windows • Each basic window is convolved/inner product with each random vector only once • We may provide the sketches starting from each data point or starts from the beginning of each basic window. • There is no redundancy.
Performance comparison • Naïve algorithm For each datum and random vector (1) O(|sw|) integer additions • Pointwise version Asymptotically for each datum and random vector (1) O(|sw|/|bw|) integer additions (2) O(log |bw|) floating point operations (use FFT in computing convolutions) • Basic window version Asymptotically for each datum and random vector O(|sw|/|bw|2) integer additions
Big picture revisited time series 1 time series 2 time series 3 … time series n … sketch 1 sketch 2 … sketch n … Correlatedpairs Random Projection Grid structure Filtering So far we reduce the data dimension efficiently. Next, how can it be used as a filter?
How to use the sketch distance as a filter • Naive method: compute the sketch distance: • Being close by sketch distance are likely to be close by originaldistance (JL Lemma) • Finally any close data pair will be double checked with the original data.
Use the sketch distance as a filter • But we do not use it, why? Expensive. • Since we still have to do the pairwise comparison between each pair of stocks which is , k is the size of the sketches, e.g. typically 30, 40, etc • Let’s see our new strategy
Our method: sketch unit distance Given sketches: We have If f distance chunks have we may say where: f: 30%, 40%, 50%, 60% …c: 0.8, 0.9, 1.1…
Further: sketch groups We may compute the sketch group: Remind us of a grid structure For example If f sketch groups have we may say
Grid structure • To avoid checking all pairs, we can use a grid structure and look in the neighborhood, this will return a super set of highly correlated pairs. • The data labeled as “potential” will be double checked using the raw data vectors.
Optimization in parameter space • How to choose the parameters g, c, f, N? • We will choose the best one to be applied to the practical data. But how? --- an engineering problem • Combinatorial Design (CD) • Bootstrapping N: total number of the sketches g: group size c: the factor of distance f: the fraction of groups which are necessary to claim that two time series are close enough Now, Let’s put all together.
X Y Z Inner product with random vectors r1,r2,r3,r4,r5,r6
Empirical study: various data sources • Cstr: Continuous stirred tank reactor • Fortal_ecg: Cutaneous potential recordings of a pregnant woman • Steamgen: Model of a steam generator at Abbott Power Plant in Champaign IL • Winding: Data from a test setup of an industrial winding process • Evaporator: Data from an industrial evaporator • Wind: Daily average wind speeds for 1961-1978 at 12 synoptic meteorological stations in the Republic of Ireland • Spot_exrates: The spot foreign currency exchange rates • EEG: Electroencepholgram
Empirical study:performance comparison Sliding window=3616, basic window=32 and sketch size=60
Section conclusion • How to perform data reduction over uncooperative time series efficiently in contrast to well-established methods for cooperative data • How to cope with middle-size sketch vectors systematically. • Sketch vector partition, grid structure • Parameter space optimization by combinatorial design and bootstrapping • Many ideas can be extended to other applications