
Approximate Query Processing (AQP) in Data Streams

Zahid Irfan & Dr. Asim Karim (Advisor) (zahidi, akarim @lums.edu.pk). CS-509 Masters of Science (CS) Project, Lahore University of Management Sciences, Lahore, Pakistan. 8 May 2004.


Presentation Transcript


  1. Approximate Query Processing (AQP) in Data Streams • Zahid Irfan & Dr. Asim Karim (Advisor) (zahidi, akarim @lums.edu.pk) • CS-509 Masters of Science (CS) Project • Lahore University of Management Sciences, Lahore, Pakistan • 8 May 2004

  2. Acknowledgement • This work is primarily based on the research paper “One-Pass Wavelet Decompositions of Data Streams” by Gilbert, Kotidis, Muthukrishnan and Strauss, IEEE Trans. Knowledge and Data Engineering, May/June 2003. • It also draws on work by Muthukrishnan, Piotr Indyk and, of course, Johnson and Lindenstrauss.

  3. AQP in Data Streams • Introduction • Streams and Streaming Models • Wavelet Transform & Embedded Vectors • Pseudo-Random Number Generator • Implementation Details • Test Results • Conclusions and Future Work

  4. Introduction • Let's solve a puzzle: a stream presents the numbers 1…N in random order, without repetition, but one number is missing. Guess the missing number. • Space requirement: O(1). • Time complexity: O(N). • What about two missing numbers, three missing numbers, and so on?
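One well-known way to solve the single-missing-number puzzle is to keep a running sum and compare it with N(N+1)/2. The sketch below assumes that approach; the function name `findMissing` is illustrative and does not come from the slides.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the single value missing from a stream that contains every
// integer in [1, n] exactly once, except for one. Only one running
// counter is kept, so the working space does not grow with n.
std::uint64_t findMissing(const std::vector<std::uint64_t>& stream, std::uint64_t n) {
    assert(stream.size() + 1 == n);             // exactly one value is absent
    std::uint64_t remaining = n * (n + 1) / 2;  // sum of 1..n
    for (std::uint64_t x : stream) {
        remaining -= x;                         // one pass, item by item
    }
    return remaining;                           // what is left is the missing value
}
```

For two missing numbers, a second running quantity (for example the sum of squares) would be needed, which is the direction the last bullet of the slide hints at.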

  5. Data Streams • Data Stream • “A sequence of digitally encoded signals used to represent information in transmission”. • The input stream is a sequence a[i] that arrives sequentially, item by item.

  6. Data Streams Applications • Applications • Network data monitoring (applied to traffic flow analysis) • World Wide Web (website hits, statistics, etc.) • Online transaction processing systems • Query processing on large databases

  7. Stream Models • Time Series • Comprises values of the same quantity over successive time intervals. • Typical examples • Daily closing values of a stock exchange • Traffic on an IP link at time intervals.

  8. Stream Models • Cash Register Model • Positive updates arrive over a period of time. • Typical examples • well … Cash Register • Cricket Scores • Internet website hits or other statistics.

  9. Stream Models • Turnstile Model • Fully dynamic model • Updates are both negative & positive • e.g. Passengers in an airport • Relative Hardness • Turnstile > Cash Register > Time Series • “Depends and varies from application to application”.

  10. Wavelet Transform • Wavelets • A mathematical hierarchical tool for decomposition of signals/ functions. • Types of Wavelets • Haar Wavelets • Daubechies Wavelets • Many more…

  11. Haar Wavelet Example
      Resolution | Averages                     | Detail Coefficients
      3          | D = [2, 2, 0, 2, 3, 5, 4, 4] | ----
      2          | [2, 1, 4, 4]                 | [0, -1, -1, 0]
      1          | [1.5, 4]                     | [0.5, 0]
      0          | [2.75]                       | [-1.25]
      Haar Wavelet Decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
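The table in this slide follows from repeated pairwise averaging and differencing. Below is a minimal sketch of that procedure, assuming the unnormalized convention used in the example (average = (a + b) / 2, detail = (a - b) / 2); the function name `haarDecompose` is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unnormalized Haar decomposition by repeated pairwise averaging and
// differencing. The input length must be a power of two.
// Output layout: [overall average, coarsest detail, ..., finest details].
std::vector<double> haarDecompose(std::vector<double> data) {
    const std::size_t n = data.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    std::vector<double> out(n);
    for (std::size_t len = n; len > 1; len /= 2) {
        const std::size_t half = len / 2;
        for (std::size_t i = 0; i < half; ++i) {
            const double a = data[2 * i];
            const double b = data[2 * i + 1];
            out[half + i] = (a - b) / 2.0;  // detail coefficient at this resolution
            data[i] = (a + b) / 2.0;        // average, consumed by the next level
        }
    }
    out[0] = data[0];                       // overall average
    return out;
}
```

Calling `haarDecompose({2, 2, 0, 2, 3, 5, 4, 4})` reproduces the decomposition listed on the slide, `[2.75, -1.25, 0.5, 0, 0, -1, -1, 0]`.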

  12. Wavelet in <,> Space • Haar wavelet coefficients can be written as inner products with basis vectors. • Example: vector A of N=4, giving 4 coefficients. • W1 = 1/4*[1 1 1 1], W2 = 1/4*[1 1 -1 -1], W3 = 1/2*[1 -1 0 0], W4 = 1/2*[0 0 1 -1] • 1st Coefficient = <A,W1>. Average Coefficient • 2nd Coefficient = <A,W2>. Detail Coefficient • 3rd Coefficient = <A,W3>. Detail Coefficient • 4th Coefficient = <A,W4>. Detail Coefficient

  13. Embedding Vectors • Embedding Vectors • Any n-point metric space can be embedded into an O(log2 n) dimensional Euclidean space and L1 metric with 1+ε distortion • f(v) = embedding of vector v = < <v, r1>, <v, r2>, … <v, rk> >

  14. Johnson-Lindenstrauss Lemma • Johnson-Lindenstrauss (JL) Lemma • Simply stated <a,b>~<a,rj>*<b,rj> • Where j=1…k, k<<N • rj is a random vector whose entries are 1 or -1 with equal probability • Implications • A vector in R^N can be represented in a k-dimensional space. • Benefits: Approximate Queries… ??
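The approximation on this slide can be checked with a short expectation calculation. It assumes the entries of r are independent and take the values +1 and -1 with equal probability, so that E[r_i r_l] is 1 when i = l and 0 otherwise; the averaging over k vectors in the second formula is the usual way to turn the single-vector identity into an estimate.

```latex
\mathbb{E}\big[\langle a, r\rangle \langle b, r\rangle\big]
  = \sum_{i=1}^{N}\sum_{l=1}^{N} a_i b_l \,\mathbb{E}[r_i r_l]
  = \sum_{i=1}^{N} a_i b_i
  = \langle a, b\rangle,
\qquad
\langle a, b\rangle \approx \frac{1}{k}\sum_{j=1}^{k} \langle a, r_j\rangle \langle b, r_j\rangle .
```

Averaging over k independent vectors leaves the mean unchanged and divides the variance of the estimate by k.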

  15. AQP & JL-Lemma • <a,b>~<a,rj>*<b,rj> • Approximate queries can be answered by choosing a special b. • To query the ith value, choose b=[ 0..010…0] • For a range query (i,j), choose b=[ 0..01..10…0], where b[x]=1 for i<=x<=j. • What's the catch?? … Each rj also has size N. So where do we store the random vectors??

  16. Pseudo-Random Generator • The solution to the large space overhead is to generate the random vectors on the fly!! • Such as: for (i = 0; i < k; i++) { srand(i); for (j = 0; j < N; j++) { rand(); } } • This solution works, but there is a more elegant solution to this problem: the Reed-Muller Codes Extractor.
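The seeded loop on this slide can be sketched as follows. This is an illustration under assumptions, not the project's code: it uses `std::mt19937` in place of `srand`/`rand`, maps the low bit of each draw to +1 or -1, and the names `randomVector` and `dotWithRandom` are hypothetical.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Regenerates random vector r_j from its seed instead of storing it.
// Seeding with j makes the sequence reproducible, like srand(j) on the slide.
std::vector<int> randomVector(unsigned j, std::size_t n) {
    std::mt19937 gen(j);
    std::vector<int> r(n);
    for (std::size_t i = 0; i < n; ++i) {
        r[i] = (gen() & 1u) ? 1 : -1;  // low bit of each draw picks the sign
    }
    return r;
}

// Computes <v, r_j> on the fly: only the generator state is held in
// memory, never the full length-N random vector.
double dotWithRandom(const std::vector<double>& v, unsigned j) {
    std::mt19937 gen(j);
    double sum = 0.0;
    for (double x : v) {
        sum += x * ((gen() & 1u) ? 1 : -1);
    }
    return sum;
}
```

A drawback of this scheme is that reaching entry i of r_j means stepping the generator i times from the seed, which is one reason a generator that computes each entry directly is attractive.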

  17. Reed-Muller Generator • The matrix values represent RM codes. • RM (x,y)= • Replacing 0 → 1 and 1 → -1, we get wavelet basis vectors.

  18. Reed-Muller PR Generator • Benefits of the Reed-Muller pseudo-random generator • Values are generated on the fly. • Every value is computed independently of the previous values. • The generated vectors closely imitate wavelet basis vectors. • Hence the sketch contains most of the energy of the signal.

  19. Lessons so far !! • Things learnt so far • There is a way to embed N data values into k<<N sketch values • JL-Lemma : <a,b>~<a,r><b,r> • Reed-Muller codes are excellent imitators of both wavelet basis vectors and random vectors. • Query processing is possible thanks to the JL-Lemma.

  20. Implementation Details • Implementation Trivia • Implemented in Visual C++ 6.0 • Design follows Classes and Objects paradigm • Test Results and graphs from MS Excel

  21. Data Flow Diagram

  22. Dataset Generator • A synthetic data set was generated using random distributions. • Normal Distribution: Calling Telephone Number, 9497000~9497999 (1000 lines); Receiving Telephone Number • Exponential Distribution: Call Time, 0~512 minutes

  23. Data Streamer • The data streaming class offers methods that imitate a real-time data stream by continuously presenting the program with data. • Type DataStreamer::getData();

  24. Pseudo Random Generator • This class calculates the Reed-Muller based Pseudo-random Numbers. • type PseudoRandomGenerator::getRandom (int X,int Y); • Uses the formula

  25. Data Decomposition • The data is decomposed into a sketch by calculating the dot product of the data stream with O (log N) random vectors. • The sketch is stored in main memory to be used by the query processing engine. • Sketch [j]+=Data [i]*Random (i, j); • Here i = 1…N and j = 1…k;
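The update rule `Sketch[j] += Data[i] * Random(i, j)` can be applied one stream item at a time. The sketch below illustrates this under assumptions: `randomSign` is a hypothetical stand-in for `Random(i, j)` built from a 64-bit mixing hash (it is not the Reed-Muller generator used in the project), and indices start at 0 rather than 1.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Random(i, j): a +1/-1 value computed directly
// from (i, j) with a 64-bit mixing hash, so no random vector is stored.
// This is NOT the Reed-Muller generator; it only plays the same role here.
int randomSign(std::uint64_t i, std::uint64_t j) {
    std::uint64_t z = ((i << 32) | j) + 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    z ^= z >> 31;
    return (z & 1) ? 1 : -1;
}

// Folds one stream item (index i, change delta) into the sketch:
// Sketch[j] += delta * Random(i, j) for every sketch entry j.
void updateSketch(std::vector<double>& sketch, std::uint64_t i, double delta) {
    for (std::size_t j = 0; j < sketch.size(); ++j) {
        sketch[j] += delta * randomSign(i, j);
    }
}
```

Because the update is linear, a negative delta undoes an earlier positive one, which is what makes the same rule usable in the turnstile model as well as the cash register model.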

  26. Query Processing Engine • The Query Processing Engine uses the sketch and a new vector b. • Uses the same old JL-Lemma • <a,b>~<a,rj>*<b,rj> • Setting b appropriately yields, in theory, any sort of query.

  27. Point Query Processing • Point Query • A point query asks for any single value in the whole data stream. • Point Query Algorithm • For queried index q, prepare b[i]={0 for i != q, 1 for i = q} and generate <b,r> • QuerySketch[j] +=B[i] * Random (i,j); • Result = (DataSketch * Query Sketch)/ N

  28. Range Query Processing • Range Query • Range queries specify the low and high indices between which the query is to be processed. • Even multiple ranges can be specified • Query Algorithm • Prepare b[i]={1 for low <= i <= high, 0 otherwise} and generate <b,r> • QuerySketch[j] +=B[i] * Random (i,j); • Result = (DataSketch * Query Sketch)/ N
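Point and range queries differ only in the indicator vector b, so both can be sketched with the same few functions. The code below is an illustration under stated assumptions: `randomSign` is a hypothetical hash-based stand-in for `Random(i, j)` (not the Reed-Muller generator), indices start at 0, and the estimate is normalized by the number of sketch entries k, which is the normalization that fits +1/-1 random vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Random(i, j): +1/-1 from a 64-bit mixing hash.
int randomSign(std::uint64_t i, std::uint64_t j) {
    std::uint64_t z = ((i << 32) | j) + 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    z ^= z >> 31;
    return (z & 1) ? 1 : -1;
}

// Sketch[j] = <v, r_j> for j = 0..k-1.
std::vector<double> buildSketch(const std::vector<double>& v, std::size_t k) {
    std::vector<double> sketch(k, 0.0);
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] == 0.0) continue;  // zero entries contribute nothing
        for (std::size_t j = 0; j < k; ++j) sketch[j] += v[i] * randomSign(i, j);
    }
    return sketch;
}

// Indicator vector b with b[x] = 1 for low <= x <= high and 0 elsewhere.
// A point query is the special case low == high.
std::vector<double> rangeIndicator(std::size_t n, std::size_t low, std::size_t high) {
    assert(low <= high && high < n);
    std::vector<double> b(n, 0.0);
    for (std::size_t x = low; x <= high; ++x) b[x] = 1.0;
    return b;
}

// Estimates <a, b> as the average of <a, r_j> * <b, r_j> over all j.
double estimate(const std::vector<double>& dataSketch,
                const std::vector<double>& querySketch) {
    assert(!dataSketch.empty() && dataSketch.size() == querySketch.size());
    double sum = 0.0;
    for (std::size_t j = 0; j < dataSketch.size(); ++j) {
        sum += dataSketch[j] * querySketch[j];
    }
    return sum / static_cast<double>(dataSketch.size());
}
```

A range sum over indices 8 to 23 would then be estimated as `estimate(buildSketch(a, k), buildSketch(rangeIndicator(a.size(), 8, 23), k))`. The answer is approximate, and its error shrinks as k grows.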

  29. AQP Test • Time Complexity Analysis • Query Processing Accuracy with Data Size • Query Processing Accuracy with Sketch Size

  30. Time Complexity • Time Complexity • The following Time complexities were found to be linear in size of data. • Sketching Time • Query Processing Time

  31. Time Complexity (Sketching)

  32. Time Complexity (Query)

  33. Accuracy versus Data Size • Data Size versus Accuracy of Query • PSNR (dB) versus Data Size • Data size is increased in powers of 2 • Sketch size assumed to be log N
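The slides do not say how PSNR was computed. The sketch below assumes the common definition, PSNR = 10 log10(peak² / MSE), with the peak taken as the largest absolute exact value; both that choice and the name `psnrDb` are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// PSNR in dB between exact answers and their approximations.
// Assumption: peak is the largest absolute exact value.
double psnrDb(const std::vector<double>& exact, const std::vector<double>& approx) {
    assert(!exact.empty() && exact.size() == approx.size());
    double peak = 0.0;
    double squaredError = 0.0;
    for (std::size_t i = 0; i < exact.size(); ++i) {
        peak = std::max(peak, std::fabs(exact[i]));
        const double diff = exact[i] - approx[i];
        squaredError += diff * diff;
    }
    const double mse = squaredError / static_cast<double>(exact.size());
    // A perfect approximation (mse == 0, peak > 0) yields +infinity.
    return 10.0 * std::log10(peak * peak / mse);
}
```

Higher values mean the approximate answers are closer to the exact ones.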

  34. PSNR (dB) versus Data Size

  35. Accuracy versus Sketch Size • Accuracy of Query against the Sketch Size. • PSNR (dB) versus Sketch Size • Data Size is assumed to be constant = 32768 • Sketch Size is varied

  36. PSNR (dB) versus Sketch Size

  37. Conclusions • Space Complexity Reduction • Prohibitively large data streams can be summarized in sub-linear space. • Time Complexity Reduction • One-pass data stream algorithm. • Scalability to multi-dimensions

  38. Applications and Future Work • Data Mining Streams • Multimedia & Databases • Trying it with video coding might be fun or a disaster • Graph Theory Problems • MST, Matching etc. need to be solved in the streaming model. • Computational Geometry • Earth observation data streams or weather data streams • Solve any problem that can be modeled as a data stream

  39. References • S. Acharya, P. B. Gibbons, V. Poosala and S. Ramaswamy, “Join Synopses for Approximate Query Answering”, In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999. • J. M. Hellerstein, P. J. Haas and H. J. Wang, “Online Aggregation”, In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, 1997. • Y. E. Ioannidis and V. Poosala, “Histogram-Based Approximation of Set-Valued Query Answers”, In Proceedings of the 25th International Conference on Very Large Databases, 1999. • K. Chakrabarti, M. Garofalakis, R. Rastogi and K. Shim, “Approximate Query Processing Using Wavelets”, In Proceedings of the 26th International Conference on Very Large Databases, Egypt, 2000. • F. Olken, “Random Sampling from Databases”, PhD Thesis, University of California at Berkeley, 1993. • A. C. Gilbert, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, “One-Pass Wavelet Decompositions of Data Streams”, IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 3, May/June 2003. • A. Ta-Shma, D. Zuckerman and S. Safra, “Extractors from Reed-Muller Codes”, In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science, 2001.

  40. Questions & Answers • Thanks to the following for their sincere help in this project: Dr. Asim Karim, Dr. Sarmad Abbasi, Dr. Asim Loan, Dr. Sohaib A. Khan and all my friends, especially Laeeq Aslam and Aimal Tariq Rextin.
