1 / 28

Yaxiong Zhao and Jie Wu The speaker will be graduating next summer

Achieving fast (approximate) event matching in large-scale content-based publish/subscribe networks. Yaxiong Zhao and Jie Wu The speaker will be graduating next summer. Outline. Background Content-based pub/sub networks Counting algorithm The design Problem formulation

kolton
Download Presentation

Yaxiong Zhao and Jie Wu The speaker will be graduating next summer

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Achieving fast (approximate) event matching in large-scale content-based publish/subscribe networks Yaxiong Zhao and Jie Wu The speaker will be graduating next summer

  2. Outline • Background • Content-based pub/sub networks • Counting algorithm • The design • Problem formulation • Index tree based discretization • Potential improvements • Performance evaluation

  3. Outline • Background • Content-based pub/sub networks • Counting algorithm • The design • Problem formulation • Index tree based discretization • Potential improvements • Performance evaluation

  4. Content-based pub/sub networks Pub/sub provides hustle-free messaging between users Title = {xyz} ᴧconference = ICDCS ᴧ keywords = {content} Content-based pub/sub (CBPS) provides yet more expressiveness Message • Content representation model: how to describe contents? • Boolean expression model • Attribute constraints: a primitive description of the constraints on a attribute’s value • Asubscription/filter is defined as a conjunction of multiple attribute constraints • Each message has multiple attribute assignments Pub/sub routing Subscription conference = ICDCS ᴧ keywords ∈{content, pub, sub}

  5. Preliminary: counting algorithm • Since each subscription is a conjunctive form of multiple attribute constraints • A message matches a subscription i.f.f. it matches all attribute constraints of the subscription • Counting algorithm works by counting the number of matched attribute constraints • Matches if the number is equal to the number of all the attribute constraints of the subscription Title = {xyz} ᴧconference = ICDCS ᴧ keywords = {content} Matches 2 attribute constraints so it matches the subscription conference = ICDCS ᴧ keywords ∈{content, pub, sub}

  6. Outline • Background • Content-based pub/sub networks • Counting algorithm • The design • Problem formulation • Index tree based discretization • Potential improvements • Performance evaluation

  7. Dissecting the counting algorithm • The algorithm goes through two steps • Retrieving matched attribute constraints • The most time consuming stage • The focus of this paper • Comparing the number of matched constraints with the number of constraints for each subscription and return matched subscription • It’s possible to shortcut this process

  8. Optimizing the retrieval stage • The naïve approach, i.e. examining all attribute constraints, only works for very small amount • A few thousand • More intelligent techniques • Binary search to eliminate unmatched constraints • SIENA (Sigcomm’03) • Clustering possibly-matched constraints • Faster matching (Sigmod’01)

  9. Range-based attribute constraints • Range-based attribute constraints represent an cell that an attribute’s value should be in • Can be seen as a conjunction of two primitive attribute constraints • General enough to replace primitive attribute constraints • A primitive constraint can be translated to a range-based constraint • Range-based constraints are highly desirable Height > 100 & Height < 200  Height ∈ (100, 200) Height < 200 => Height > 0 & Height < 200

  10. Basic idea • Reverse indexing of subscriptions • Instead of check each subscription to see what assignments match it • Directly retrieve the matched attribute constraints for each attribute assignment We need a data structure to mapping between attribute value and attribute constraints

  11. Outline • Background • Content-based pub/sub networks • Counting algorithm • The design • Problem formulation • Index tree based discretization • Potential improvements • Performance evaluation

  12. Reverse indexing • Indexing subscriptions through their represented ranges for each attribute • [100 < height < 200] • [100, 200] <-> subscription1 • Given an assignment “height = 150” we can find all its matched subscriptions by find the ranges containing this value

  13. Discretization • Represent an arbitrary range by evenly separated cells • Mapping between value and its corresponding cell is fast • Results in false positive/negative (approximate matching) • We only accept false positives • Guarantee user satisfaction [0, 1, 2] is used to represent this attribute constraint

  14. Index tree and subscription discretization • The naïve discretization has a scalability issue • The worst-case false positive is • 2 * cell-length / range-length • To achieve a false positive of p • The number of cell is 1/p • An analog is counting numbers • Need exactly “n” tokens to represent the number “n”

  15. A very simple remedy • Just like counting in positional notation • Binary separating the attribute value space • Each cell is evenly divided into two in the next level • Log(1/p) cells • Much better scalability {level 2 : 1; level 3: 1, 4; level 4 : 1, 10}

  16. Working with counting algorithm • Each range attribute constraint is discretized on multiple levels • Each cell ID associates with a subscription ID • Retrieval stage • Table lookup • Counting stage • Incremental the counters for subscription IDs • Comparing counters’ values to the number of attribute constraints of subscription cell IDs Attributes Subscription IDs levels

  17. Implementation • Data structure organization • Matching table: for discretized cell ID and subscription ID mapping • Subscription ID and attribute constraint counts mapping • Linear table • More details are in the paper

  18. Dynamic matching for shortcutting matching process • In many situations we do not need to find all matched subscriptions • Interface matching • Stop matching once any one of the associated subscription matches • Just examine a fraction of the discretization levels

  19. Optimal binary separation • The above analysis assumes uniform distribution of the attribute values of events • The analysis holds for non-uniform distribution only when • Event values are evenly distributed on each cell • I.e. the number of events fall into each cell is the same • Optimal binary separation does this • Bisecting a range at its median • Ensure that each cell contains the same amount of events If 90% event’s attribute values fall into here

  20. Outline • Background • Content-based pub/sub networks • Counting algorithm • The design • Problem formulation • Index tree based discretization • Potential improvements • Performance evaluation

  21. Eliminating false positives • Two types: interface false and subscription • Matches a wrong interface • Matches a wrong subscription • Situations that no false positive occurs • Interface/subscriptions matched before the last discretization level • Can be used to short-cutting interface matching • To eliminate false positives • Double check the subscriptions that are matched at the last discretization level

  22. Outline • Background • Content-based pub/sub networks • Counting algorithm • The design • Problem formulation • Index tree based discretization • Potential improvements • Performance evaluation

  23. Experiment settings • A working prototype written in C++ • As a forwarding component in Siena • Total number of attributes is 1,000 • Number of attributes per event 100 – 1,000 • Relative width of attribute constraints 0.01 – 0.1

  24. Subscription matching time (degenerate case) • With of attribute constraints: 0.01 (relative to the entire value space) • 10 range constraints per subscription • Each event has 100 attribute assignments • 3.23ms to return all matched subscription for an event with 20 million attribute constraints • Orders of magnitude faster than Siena

  25. Interface matching speeding w/ vs w/o shortcutting • Fix the total number of subscriptions to 20,000 • Vary the number of subscriptions (filters) per interface • Present the changes of interface matching with two different widths of attribute constraints Relative value of the width of attribute constraints

  26. Subscription matching FPR • Stores 20,000 subscriptions • Vary the width of attribute constraints and the number attributes per subscription • 108 events • Feed 10,000 events to the matching table

  27. Interface matching FPR • 20,000 subscriptions • 2,000 subscriptions per interface • Change the width of range constraints and the number of attributes per subscription

  28. Q&A Thank you for listening! yaxiong.zhao@temple.edu Drop me an Email if your questions are not answered here

More Related