
PIRS: Query Verification on Data Streams

PIRS: Query Verification on Data Streams. Ke Yi, Hong Kong University of Science and Technology Feifei Li, Florida State University Marios Hadjieleftheriou, AT&T Labs George Kollios, Boston University Divesh Srivastava, AT&T Labs.





  1. PIRS: Query Verification on Data Streams. Ke Yi (Hong Kong University of Science and Technology), Feifei Li (Florida State University), Marios Hadjieleftheriou (AT&T Labs), George Kollios (Boston University), Divesh Srivastava (AT&T Labs). Work done while the first and second authors were working at AT&T Labs.

  2. Publishing Data and Outsourcing Query Service [Figure: an IP traffic stream (0 1 1 0 0 1 …) coming from the network is processed by Gigascope, an analysis tool, which produces results (statistics).]

  3. Revisiting the CISCO – AT&T Example [Figure: the network's IP traffic stream (0 1 1 0 0 1 …) flows into Gigascope, which reports statistics.] Lawyers: sign the trust agreement. Computer scientists: could we help?

  4. Concrete Example IP stream: p1, p2, p3, …, pm, where each packet carries (srcIP, destIP, packet_size). Continuous query: SELECT SUM(packet_size) FROM IP_trace GROUP BY srcIP, destIP. Answer: [Figure: the per-group sums, shown as groups over time.]
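The continuous query on this slide can be illustrated with a minimal sketch of the exact answer that a server would have to maintain. The function name `running_group_sums` and the sample addresses and packet sizes are made up for illustration and do not come from the talk.

```python
from collections import defaultdict


def running_group_sums(packets):
    """Yield the current query answer after each packet arrives.

    Each packet is a (src_ip, dest_ip, packet_size) triple, and the answer is
    SUM(packet_size) grouped by (src_ip, dest_ip).
    """
    sums = defaultdict(int)
    for src_ip, dest_ip, packet_size in packets:
        sums[(src_ip, dest_ip)] += packet_size
        yield dict(sums)  # snapshot of the answer at this point in time


# Hypothetical packets, chosen only to exercise the query.
packets = [
    ("10.0.0.1", "10.0.0.2", 9),
    ("10.0.0.3", "10.0.0.4", 7),
    ("10.0.0.1", "10.0.0.2", 1),
]
answers = list(running_group_sums(packets))
final_answer = answers[-1]
```

The dictionary grows with the number of distinct groups, which is the linear memory cost that the client-side synopsis is meant to avoid.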

  5. Continuous Query Verification (CQV) on Data Streams • Client registers a query: SELECT SUM(packet_size) FROM IP_Trace GROUP BY src_ip, dest_ip • Server maintains the exact answer (Group 1, Group 2, Group 3, …) and reports it upon request • Client maintains a synopsis X • Both client and server monitor the same stream from the source of streams

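  6. The Model for the Stream [Figure: stream S of tuples in the form agg_attribute | group_id, with 9|1 arriving at T=1, 7|i at T=2, and 1|1 at T=3. The vector V^T holds one aggregate per group (V1, V2, V3, …, Vi, …, Vn); after T=3 it is V1=10, Vi=7, and 0 elsewhere.]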
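The stream model can be traced with a short sketch that replays the slide's three tuples against a vector of per-group aggregates. The helper `apply_stream`, the choice n = 8, and the concrete value 5 for the slide's symbolic group "i" are assumptions made only for this example.

```python
def apply_stream(n, stream):
    """Replay (agg_value, group_id) tuples and return the vector after each step.

    Groups are numbered 1..n, matching V1..Vn on the slide.
    """
    v = [0] * (n + 1)  # index 0 is unused so that v[g] is group g
    history = []
    for agg_value, group_id in stream:
        v[group_id] += agg_value
        history.append(v[1:])  # slicing copies the current state
    return history


n = 8  # assumed number of groups
i = 5  # assumed concrete value for the slide's group "i"
stream = [(9, 1), (7, i), (1, 1)]  # tuples at T=1, T=2, T=3
history = apply_stream(n, stream)
```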

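  7. Continuous Query Verification: CQV [Figure: as tuples 9|1 (T=1), 7|i (T=2), and 1|1 (T=3) arrive from stream S, the vector V is updated and the client updates its synopsis X. A reported vector is checked against the synopsis X^T: the correct vector (V1=10, Vi=7, 0 elsewhere) raises no alarm, while a vector with different values (9 0 10 0 2 … 5 0 0) raises an alarm.]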

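  8. PIRS: Polynomial Identity Random Synopsis • Choose a prime p • Choose a random number a • Raise an alarm if the synopsis values are not equal; otherwise, no alarm [the slide's formulas for p, a, and the synopsis were not preserved in this transcript]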
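Since the slide's formulas are missing, the following is only a sketch under a stated assumption: that the synopsis has the product form X(V) = (a − 1)^v1 · (a − 2)^v2 · … · (a − n)^vn mod p. That form is consistent with later slides (a monic polynomial determined by its zeroes, O(log u) updates by exponentiation, and X(v1+v2) = X(v1)·X(v2)), but it is not quoted from this transcript. The fixed prime `P` and the example vectors are also illustrative; the talk picks p relative to m/δ.

```python
import random


def pirs_synopsis(v, a, p):
    """Assumed synopsis: product over groups i = 1..n of (a - i)^v_i, mod p."""
    x = 1
    for i, v_i in enumerate(v, start=1):
        x = (x * pow(a - i, v_i, p)) % p
    return x


P = 2_147_483_647        # the prime 2^31 - 1, an illustrative choice of p
a = random.randrange(P)  # the client's secret random number

V = [10, 0, 0, 0, 7, 0, 0, 0]            # true answer
W_honest = list(V)                       # server reports the truth
W_tampered = [10, 0, 0, 0, 5, 2, 0, 0]   # server moves 2 units to another group

alarm_honest = pirs_synopsis(V, a, P) != pirs_synopsis(W_honest, a, P)
alarm_tampered = pirs_synopsis(V, a, P) != pirs_synopsis(W_tampered, a, P)
```

With this form, `alarm_honest` is always false, and `alarm_tampered` is true except for the few unlucky values of `a` at which the two polynomials happen to agree.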

  9. Incremental Update to PIRS [Figure: stream S with tuples 9|1 (T=1), 7|i (T=2), and 1|1, triggering an update to v1, an update to vi, and another update to v1.] An update to group i with value u can be done in O(log u) time (exponentiation by squaring).
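The O(log u) update can be sketched as follows, again assuming the product form X(V) = ∏ (a − i)^v_i mod p, which the transcript does not spell out. `power_mod` is written out to show square-and-multiply explicitly; Python's built-in three-argument `pow` does the same job. The constants `P`, `A`, and `I` are illustrative.

```python
def power_mod(base, exponent, modulus):
    """Compute base**exponent mod modulus using O(log exponent) multiplications."""
    result = 1 % modulus
    base %= modulus
    while exponent > 0:
        if exponent & 1:                   # current low bit set: multiply it in
            result = (result * base) % modulus
        base = (base * base) % modulus     # square for the next bit
        exponent >>= 1
    return result


def pirs_update(x, group_id, u, a, p):
    """Fold an update of value u for group group_id into the synopsis x."""
    return (x * power_mod(a - group_id, u, p)) % p


P = 2_147_483_647  # illustrative prime (2^31 - 1)
A = 100            # fixed here for reproducibility; it is random in PIRS
I = 5              # assumed concrete value for the slide's group "i"

x = 1  # synopsis of the all-zero vector
for u, group_id in [(9, 1), (7, I), (1, 1)]:
    x = pirs_update(x, group_id, u, A, P)
```

After the three updates, `x` equals the synopsis computed in one pass over the final vector (v1 = 10, vi = 7), so the client never needs to store the vector itself.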

  10. It Solves the CQV Problem! Theorem: Given any W ≠ V, PIRS raises an alarm with probability at least 1−δ. Proof idea: a polynomial with 1 as the leading coefficient is completely determined by its zeroes, so X(V) = X(W) happens at no more than m values of x, due to the fundamental theorem of algebra. Since we have p > m/δ choices for a, the probability that X(V) = X(W) is at most δ.
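The counting argument can be written out step by step. This is a sketch under assumptions the transcript does not state explicitly: the synopsis has the product form below, all entries are non-negative with total at most m, and the prime p exceeds both n and m/δ.

```latex
% Assumed: X(V) = f_V(a) \bmod p with a uniform in \mathbb{Z}_p,
% v_i, w_i \ge 0, \sum_i v_i \le m, \sum_i w_i \le m, p > \max\{n, m/\delta\} prime.
\begin{align*}
f_V(x) &= \prod_{i=1}^{n} (x - i)^{v_i}, \qquad
f_W(x) = \prod_{i=1}^{n} (x - i)^{w_i}
  && \text{(monic, degree at most } m\text{)} \\
V \neq W &\;\Longrightarrow\; f_V \neq f_W \text{ over } \mathbb{Z}_p
  && \text{(different zero multisets; } 1,\dots,n \text{ distinct mod } p\text{)} \\
\bigl|\{\, x \in \mathbb{Z}_p : f_V(x) = f_W(x) \,\}\bigr|
  &\le \deg\,(f_V - f_W) \le m \\
\Pr_{a}\bigl[\, X(V) = X(W) \,\bigr] &\le \frac{m}{p} < \delta
  && \text{(since } p > m/\delta\text{)}
\end{align*}
```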

  11. Optimality of PIRS Theorem: PIRS occupies O(log(m/δ) + log n) bits of space (at most 3 words: p, a, X(V)), and spends O(1) time to process a tuple for a count query or O(log u) time to process a tuple for a sum query. Theorem: Any synopsis that solves the CQV problem with error probability at most δ has to keep Ω(log(min{n, m}/δ)) bits.

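  12. Multiple Queries [Figure: instead of keeping separate synopses X1 and X2 for the vectors V1..n1 of Q1 and V1..n2 of Q2, a single synopsis X is kept over the combined vector V1..(n1+n2). A tuple 9|1,8 in stream S triggers an update to v1 and an update to v8.] Theorem: Our synopses use constant space for multiple queries.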
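One way to read the figure is that each query's group ids are shifted into a disjoint range of a single combined vector, so that one synopsis covers all queries. The sketch below follows that reading. The class `CombinedSynopsis`, the offset scheme, the group counts, and the product form of the synopsis are all assumptions, not details taken from the talk.

```python
class CombinedSynopsis:
    """One PIRS-style synopsis over the concatenated group spaces of several queries."""

    def __init__(self, group_counts, a, p):
        self.a, self.p = a, p
        self.offsets = []  # offsets[q] shifts query q's group ids into the combined vector
        total = 0
        for n_q in group_counts:
            self.offsets.append(total)
            total += n_q
        self.x = 1  # synopsis of the all-zero combined vector

    def update(self, value, groups):
        """groups[q] is the 1-based group id of this tuple under query q."""
        for q, group_id in enumerate(groups):
            index = self.offsets[q] + group_id
            self.x = (self.x * pow(self.a - index, value, self.p)) % self.p


P = 2_147_483_647  # illustrative prime; it must exceed n1 + n2
A = 100            # fixed for reproducibility; it is random in PIRS
synopsis = CombinedSynopsis(group_counts=[5, 5], a=A, p=P)
# A tuple of value 9 in group 1 of Q1 and group 3 of Q2 touches v1 and v8.
synopsis.update(9, (1, 3))
```

The state stays at a handful of integers however many queries are registered, which is the constant-space behaviour the theorem describes.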

  13. Handling Load Shedding • Semantic load shedding: drop tuples from certain groups, so that a small number of groups have errors • Random load shedding: all groups have a small amount of error

  14. CQV with Semantic Load Shedding The server randomly drops certain tuples according to their groups (example stream: 9|1 7|i 2|j 1|1 4|k 5|1 …) and claims that at most γ groups have errors. Goal: detect whether more than γ groups have errors. We have designed synopses that use O(γ log(1/δ) log n) bits of space and achieve an error probability of at most δ.

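  15. PIRSγ: An Exact Solution [Figure: log 1/δ layers, each with k buckets, each bucket holding a PIRS synopsis; group v8 is hashed to bucket b(8)=2. A layer raises an alarm if at least a certain number of its buckets raise alarms (the number was not preserved in this transcript); the synopsis raises an alarm if at least one layer raises an alarm.]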

  16. PIRSγ: An Exact Solution Theorem: PIRSγ requires O(γ² log(1/δ) log n) bits, spends O(log(1/δ)) time to process a tuple, and solves CQV with semantic load shedding.
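The layered structure can be sketched roughly as below. Several details are assumptions rather than facts from the talk: the per-layer hash function, the per-bucket seeds, the product form of each bucket's PIRS, and especially the layer threshold "more than γ differing buckets", since the slide's actual threshold is missing from the transcript. The parameter values are arbitrary.

```python
import random


class PirsGamma:
    """Layers of k buckets; each bucket keeps a PIRS-style synopsis of its groups."""

    def __init__(self, gamma, k, layers, p, rng):
        self.gamma, self.k, self.p = gamma, k, p
        # Assumed hash per layer: group_id -> ((c * group_id + d) mod p) mod k.
        self.hash_params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(layers)]
        self.seeds = [[rng.randrange(p) for _ in range(k)] for _ in range(layers)]
        self.x = [[1] * k for _ in range(layers)]

    def _bucket(self, layer, group_id):
        c, d = self.hash_params[layer]
        return ((c * group_id + d) % self.p) % self.k

    def _fold(self, table, group_id, u):
        # Multiply (seed - group_id)^u into the group's bucket in every layer.
        for layer, row in enumerate(table):
            b = self._bucket(layer, group_id)
            row[b] = (row[b] * pow(self.seeds[layer][b] - group_id, u, self.p)) % self.p

    def update(self, group_id, u):
        self._fold(self.x, group_id, u)

    def alarm(self, w):
        """Check a reported vector w (groups 1..n) against the synopsis."""
        claimed = [[1] * self.k for _ in self.x]
        for group_id, w_i in enumerate(w, start=1):
            if w_i:
                self._fold(claimed, group_id, w_i)
        for mine, theirs in zip(self.x, claimed):
            differing = sum(1 for m, t in zip(mine, theirs) if m != t)
            if differing > self.gamma:  # assumed threshold
                return True
        return False


rng = random.Random(1)
synopsis = PirsGamma(gamma=2, k=16, layers=4, p=2_147_483_647, rng=rng)
v = [0] * 200
for group_id in range(1, 201):
    u = group_id % 5 + 1
    v[group_id - 1] += u
    synopsis.update(group_id, u)

one_error = list(v)
one_error[7] += 3  # a single group is wrong, within the claimed gamma
```

An error in one group can change at most one bucket per layer, so answers with at most γ wrong groups never trip the assumed threshold, while answers with many wrong groups almost certainly do.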

  17. Intuition on Approximation [Figure: probability of raising an alarm (y-axis) versus number of errors (x-axis, with marks at γ−, γ, and γ+), comparing the ideal synopsis with the approximation.]

  18. PIRS±γ: An Approximate Solution Theorem: PIRS±γ requires O(γ log(1/δ) log n) bits and spends O(γ log(1/δ)) time to process a tuple.

  19. CQV with Random Load Shedding The server randomly drops tuples, so all groups have small errors. Goal: detect whether any group has an error greater than a claimed threshold. Theorem: Any synopsis that solves this problem with error probability at most δ requires Ω(n) bits (by a reduction from the problem of estimating the infinite frequency moment, i.e., the number of occurrences of the most frequent item).

  20. Sliding Window and Other Queries • It is easy to extend PIRS to work with the sliding window model since it is decomposable, i.e., X(v1+v2)=X(v1)*X(v2). • Other queries can be handled if they can be transformed into group-by aggregation queries. • Details in the paper.
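The decomposability property can be checked numerically with a small sketch, again assuming the product form X(V) = ∏ (a − i)^v_i mod p that the transcript implies but does not state. The vectors and constants are illustrative.

```python
def pirs_synopsis(v, a, p):
    """Assumed synopsis: product over groups i = 1..n of (a - i)^v_i, mod p."""
    x = 1
    for i, v_i in enumerate(v, start=1):
        x = (x * pow(a - i, v_i, p)) % p
    return x


P = 2_147_483_647  # illustrative prime (2^31 - 1)
A = 123_456        # fixed for reproducibility; it is random in PIRS

v1 = [3, 0, 2, 1]  # aggregates from one part of the stream
v2 = [1, 4, 0, 2]  # aggregates from another part
combined = [x + y for x, y in zip(v1, v2)]

lhs = pirs_synopsis(combined, A, P)
rhs = (pirs_synopsis(v1, A, P) * pirs_synopsis(v2, A, P)) % P
```

Because `lhs` equals `rhs`, synopses of separate stream segments can be multiplied together, which is what makes a window-by-window scheme conceivable; how the paper organizes the windows is not shown here.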

  21. Some Experiments • We use real streams: • World Cup Data (WC) • IP traces from the AT&T network (IP) • We perform the following query: • WC: Aggregate on response size and group by client id/object id (50M groups) • IP: Aggregate on packet size and group by source IP/destination IP (7M groups) • Hardware for the client: • 2.8GHz Intel Pentium 4 CPU • 512 MB memory • Linux Machine

  22. Detection Accuracy Over 100,000 random attacks, PIRS identifies all of them.

  23. Memory Usage of Exact Exact's memory usage is linear and expensive. PIRS uses only a constant 3 words (27 bytes) at all times.

  24. Update Time (per tuple) of Exact Cache misses and memory swap • Exact is fast when memory usage is small. • It becomes extremely slow due to cache misses and memory swap operations.

  25. Running Time Analysis Average Update Time IP exhibits a smaller update cost for the sum query, as its average value of u is smaller than that of WC.

  26. Multiple Queries: Exact Memory Usage Exact's memory usage is linear w.r.t. the number of queries and increases over time. PIRS always uses only a constant 3 words (27 bytes).

  27. Multiple Queries: Exact Update Time Per Tuple

  28. Multiple Queries: PIRS Update Time Per Tuple

  29. The Library Download PIRS and other synopses at: http://www.cs.fsu.edu/~lifeifei/pirs/

  30. Conclusion • Space- and update-efficient synopsis for verifying continuous group-by aggregation queries on streaming data; • Could be generalized to handle selection queries and sliding-window semantics; • How about more complicated queries?

  31. Thanks! • Questions

  32. Problem and Goals • Assumption: • Client and DSMS observe the same stream • Problem: • Client needs to verify the results • Goals: • Be memory- and update-efficient • Tolerate a limited number of errors • Tolerate small errors • Support multiple queries

  33. Related Techniques to PIRS • Incremental Cryptography • Block operation (insert, delete), cannot support arithmetic operation • Program Verification • Server may pass the program execution but simply return random outputs • Fingerprinting Technique • PIRS is a fingerprinting technique

  34. CQV with Semantic Load Shedding

  35. PIRS±γ: An Approximate Solution Theorem: PIRS±γ (1) raises no alarm with probability at least 1−δ on any [condition not preserved in this transcript], and (2) raises an alarm with probability at least 1−δ on any [condition not preserved in this transcript], for any c > −ln ln 2 ≈ 0.367. The proof uses the intuition of the coupon collector problem and the Chernoff bound.

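  36. PIRS±γ: An Approximate Solution [Figure: log 1/δ layers, each with k buckets, each bucket holding a PIRS synopsis; group vi is hashed to bucket bi=2. A layer raises an alarm if all k buckets raise alarms; the synopsis raises an alarm if a majority of layers raise alarms.]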

  37. Information Disclosure on Multiple Attacks PIRS: X(V) on a seed r ∈ R. Insight: the server could potentially rule out a δ portion of the seeds after each failed attack that it is notified of! Learns nothing about r.

  38. Information Disclosure on Multiple Attacks Theorem: For a total of k attacks made by Bob on PIRS, the probability that none of them succeeds is at least 1−kδ.
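One way to see why a bound of this shape is plausible is a union bound over the seeds that each attack would fool. This is only a sketch and the paper's own proof may proceed differently. The sets S_j are introduced here for illustration, and the adversary is assumed to be deterministic, so that the sequence of attacks along the "all previous attacks failed" path is fixed; a randomized adversary can be handled by averaging over its coin flips.

```latex
% Assumed notation: R is the set of seeds, r is uniform in R, and
% S_j \subseteq R is the set of seeds under which Bob's j-th attack goes undetected.
\begin{align*}
|S_j| &\le \delta\,|R| \quad (j = 1,\dots,k)
  && \text{(each single attack succeeds with probability at most } \delta\text{)} \\
\Pr\bigl[\text{some attack succeeds}\bigr]
  &= \Pr\Bigl[\, r \in \textstyle\bigcup_{j=1}^{k} S_j \Bigr]
   \le \sum_{j=1}^{k} \frac{|S_j|}{|R|} \le k\delta \\
\Pr\bigl[\text{no attack succeeds}\bigr] &\ge 1 - k\delta
\end{align*}
```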

  39. Proof of the Optimality

  40. Proof of the Optimality
