1 / 30

Using the Crowd for Top-K or Group-By Queries

Using the Crowd for Top-K or Group-By Queries. Susan Davidson U. of Pennsylvania Sanjeev Khanna U. of Pennsylvania Tova Milo Tel Aviv U. Sudeepa Roy U. of Washington. Example: Using Wisdom of Crowd in DB Queries. Database of football players pictures

major
Download Presentation

Using the Crowd for Top-K or Group-By Queries

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Using the Crowd for Top-K or Group-By Queries Susan Davidson U. of Pennsylvania SanjeevKhanna U. of Pennsylvania Tova Milo Tel Aviv U. Sudeepa Roy U. of Washington ICDT 2013

  2. Example: Using Wisdom of Crowd in DB Queries Database of football players pictures • Multiple photos at different ages of each player Q. Group the photos of individual players Q. Find their most recent photos ICDT 2013

  3. How a DBMS Thinks Q. Group the photos of individual players Group-By Queries! Use “Name” attribute Q. Find their most recent photos Max/Top-k Queries! Use “Date” attribute • What if name/date is missing • Image processing? Photo forensics? ICDT 2013

  4. Ask the “Crowd”! ICDT 2013

  5. How the Crowd Thinks - 1 Q. Group the photos of individual players ICDT 2013

  6. How the Crowd Thinks - 2 Q. Find their most recent photos ICDT 2013

  7. Crowd Sourcing • Using human intelligence to do tasks which are harder to automate • A recent topic of interest in database and other research communities • Many crowdsourcing platforms ICDT 2013

  8. This talk: Using Wisdom of Crowd in Top-K and Group-By Queries Top-k/max A possible query • SELECTTOP 1 R.picture • FROMSoccerPlayerPhotoTableAS R • GROUP BY R.player • ORDER BY R.dateDESC Group By / Clustering Fixed but unknown attributes: R.player, R.date ICDT 2013

  9. Outline of this talk • Our Model • Max and Top-K Queries • Group-By Queries (Clustering) Combination is easy.. ICDT 2013

  10. Outline of this talk • Our Model • Max and Top-K Queries • Group-By Queries (Clustering) ICDT 2013

  11. Elements: Type and Value • #Elements = n • (n = 16) • Two attributes: Type and Value • Type • e.g. Name = “Maradona” • used in Group-By • #Types = J (J clusters, J = 4) • Value • e.g. Date when photo was taken • used in Top-K • Unknown, but “ground-truth’’ exists for Types and Values ICDT 2013

  12. Type and Value Comparisons Is the first photo older? Same Person? • DB Queries vs. Comparisons(questions to the crowd) • Type Comparisons: • Type(x) = Type(y)? • Value Comparisons: • Value(x) > Value(y)? • Assumes same type • Answer is Boolean • We cannot ask • “What is Type(x)/Value(x)?” • But, answers are not always correct • Crowd makes mistakes • Next, error model ICDT 2013

  13. Constant Error Model Type comparisons: Type(x) = Type(y)? Value comparisons: Value(x) > Value(y)? Constant error model: • Standard model • Probability of wrong answer ≤ ½ - ,  >0 Is the first photo older? Wrong: w.p. ≤ ½ -  Correct: w.p. ≥ ½ +  Same person? Wrong: w.p. ≤ ½ -  Correct: w.p. ≥ ½ +  ICDT 2013

  14. Variable Error Model (this paper) Is the first photo older? - Harder For value comparisons only : Value(x) > Value(y)? • Error probability ≤ 1/ f() -  •  = Distance of x, y in sorted order Is the first photo older? – Easier • f is • strictly monotone • x1 > x2 f(x1) > f(x2) • f(x) ≥ 2 e.g. f() = e f() =  + 1 f() = log  + 2 ≤ ½ -   = 2 x1 x2 x3 x4 • e.g. Error probability when f() = e • ≤ 1/e • ≤ 1/e2 ICDT 2013

  15. Framework at a Glance Value(a) > Value(b) Internal Computation (Top-K/ Group-By) Yes/No Value(b) > Value(c) Yes/No Crowd Crowd-sourced DBMS Comparison Error Crowd’s answer may be wrong with some prob. • Cost Model • Asking questions costs money • Additive • Count #comparisons We still want the correct answer w.h.p. ICDT 2013

  16. Our Goal Minimize the total #comparisons while outputting the correct answer(exact top-k or clusters) w.p.  1 – δ (given constant δ > 0) ICDT 2013

  17. Outline of this talk • Our Model • Max and Top-K Queries • Group-By Queries (Clustering) ICDT 2013

  18. Problem Statement: Max/Top-k • Input: n elements, k • Same type • Only value comparisons are used • Output: All top-k elements (w.p.  1 – δ) • (k = 2) • We focus on Max: Algorithm for top-k builds on the algorithm for max x3 x1 x2 x4 ICDT 2013

  19. Max: Our Results Required confidence: 1 – δ (δ = constant),  = distance in sorted order Asymptotically smaller than all c.n, c>0 • Further, • f() = exp.  n + O(1) • f() = linear  n + O(log log n) • f() = log.  n + O(n1/1+ δ) ICDT 2013

  20. Review: Tournament Tree for Max 10 Max #comparisons = c . 5 #comparisons = c . 3 Comparison at internal nodes 10 #comparisons = c . 1 9 5 9 10 2 n elements as leaves • Exact comparisons: Binary tree structure is not necessary • Noisy comparisons: Repeat comparison + majority vote • Constant error model: θ(n) algorithm (Feige et. al. ’94) • Our goal:Total no. of comparisons = n + o(n) • cannot repeat even twice in most of the internal nodes ICDT 2013

  21. Main Steps for Max Goal: n + o(n) • Upper levels • Use Feige’s algorithm • Y nodes  Ө(Y) cost Y = o(n) for any f Y = O(1) for f = e Upper Levels Y nodes Lower levels Just 1 comparison at each internal node No majority vote Binary tournament tree Lower Levels n nodes Key idea: Start with a random permutation of elements at the leaves • Max does not lose in the lower levels w.h.p. • Intuition: Max does not meet 2nd Max in the lower levelsw.h.p. • Total no. of comparisons = n + Ө(Y) = n + o(n) ICDT 2013

  22. Extension to Top-K • Corollary • If k = o(n), • n + o(n) comparisons suffice to find top-k w.h.p. • x1, …, xkdo not meet in the lower levels ICDT 2013

  23. Outline of this talk • Our Model • Max and Top-K Queries • Group-By Queries (Clustering) • Simple Clustering (using types only) • Clustering with Correlated Values and Types ICDT 2013

  24. Simple Clustering • Group n elements into J clusters • Only type comparisons • Constant error model • O(nJ) comparisons are sufficient • Log factor for comparison error • J can be unknown • Not surprising • We show Ω(nJ) lower bound • Even if no comparison error • Even for randomized algorithms • Even when J is known ~ ICDT 2013

  25. Clustering withCorrelated Types and Values 3-br 2-br Apts. In a building Type = #bedrooms Value = rent 1-br Type 1 Type 3 Type 2 Low values High values • Full correlation: • Elements of same type form contiguous blocks in the sorted order on values • For no correlation, we gave a lower bound of Ω(nJ) Assume no error ~ • O(n logJ) value and typecomparisons suffice to find the clusters w.h.p. ICDT 2013

  26. Algorithm for Full Correlation Type 1 Type 2 Type 3 J = 3 s = 13 - Find median and partition RepeatUntil each block has one type - If ≥ 2 types in a block, Repeat - Same type in a block, keep only one element • Linear cost in total in each inner iteration • #Inner iterations = O(log J) • Cost: C(n) ≤ C(n/2) + O(n log J) • C(n) = O(n log J) - Until #elements ≤ s/2 4J blocks  at most J blocks contain ≥ 2 types,  ≤ s/2 elements ICDT 2013

  27. Extension to Partial Correlation ≤ α changes in the same type Hotels in a city Type = Star-rating Value = Avg. price Type 1 Type 3 Type 1 Type 2 • Partial correlation: • Elements of same type form almost contiguous blocks • O(n log (α J) + αJ) value and typecomparisons suffice w.h.p. ~ ICDT 2013

  28. Related Work Max/Top-K: • [Guo et. al. ’12]: Which element has the maximum likelihood of being max, which future queries are most effective • [Venetis et. al. ’12]: Efficient heuristics that tune latency, cost, quality to find max • [Feige et. al. ’94]: Tight bounds for Max/Top-K/Sorting in constant error model • Other error model and objectives exist in theory literature Clustering • [Gomes et. al. ’11]: Machine learning approach to define clusters • [Parameswaran et. al. ’12, Wang et. al. ’12]: Filtering data, entity resolution Other Work: • Crowd sourced DBs: CrowdDB, Deco, Qurk, MoDaS… • Crowd sourcing for data cleansing, data integration, entity resolution, data analytics, … ICDT 2013

  29. Conclusion • We studied Max/Top-k and Group-By queries in crowd-sourced setting • Top-K: Proposed a variable error model, allows fewer comparisons • Group by: Obvious simple algorithm is the best possible, correlation reduces #comparisons • Future Work: • Other objectives: latency, #rounds, or to reduce error when a budget on #comparisons is given • Other cost functions: e.g. more #similar queries in a round, less cost ICDT 2013

  30. Thank You Questions? ICDT 2013

More Related