Crowd Algorithms. Scoop: The Stanford – Santa Cruz Project for Cooperative Computing with Algorithms, Data, and People.
Crowd Algorithms
Scoop: The Stanford – Santa Cruz Project for Cooperative Computing with Algorithms, Data, and People
Hector Garcia-Molina, Stephen Guo, Aditya Parameswaran, Hyunjung Park, Alkis Polyzotis, Petros Venetis, Jennifer Widom
Stanford and UC Santa Cruz
The Goal
• Design fundamental algorithms for human computation
  • Which questions do I ask?
  • When do I ask the questions?
  • When do I stop?
  • How do I combine the answers?
[Diagram labels: Latency, Uncertainty, Cost]
The Problems
• Crowd-Sort / Max: Difficult!
• Crowd-GraphSearch: Difficult!
• Crowd-Categorize: Difficult!
• Crowd-Filter: Difficult!
• [VLDB 2011]
• Progress! The focus of this talk.
• Summaries of the rest
[Diagram labels: Latency, Uncertainty, Cost]
Filters
• Given:
  • Error probability (FP/FN) and selectivity for each predicate
  • Desired overall error probability
• To: Compose a filtering strategy
  • Minimize overall cost (# of questions)
• Which questions do I ask?
• When do I ask the questions?
• When do I stop?
• How do I combine the answers?
[Diagram: Dataset of Items → Predicate 1, Predicate 2, ..., Predicate k → Filtered Dataset. Example predicates: "Is this image that of Bytes Café?", "Is the image blurry?", "Does it show people's faces?"]
Single Filter
• Surprisingly difficult!
• Need to meet an overall error threshold
  • Say, up to 10% of my images may be wrongly filtered
• Minimize overall expected number of questions
• Boils down to the following:
  • Take one item
  • Ask some questions
  • Results in a certain number of (Y, N) answers for the given item
  • Do I stop (and if so, what do I return), or do I continue asking?
[Diagram: Dataset of Items → Predicate 1 → Filtered Dataset]
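The stop-or-continue decision depends on how likely the item is to truly satisfy the predicate after the answers seen so far. A minimal sketch of that posterior, assuming independent worker answers and a per-predicate selectivity and FP/FN rate as on the Filters slide (the function and parameter names are illustrative, not from the talk):

```python
def posterior_pass(yes: int, no: int, selectivity: float,
                   fp: float, fn: float) -> float:
    """P(item truly passes | `yes` YES answers and `no` NO answers).

    Assumptions (not stated in the talk): answers are independent,
    `selectivity` is the prior probability that an item passes,
    `fp` = P(worker says YES | item fails),
    `fn` = P(worker says NO  | item passes).
    """
    # Likelihood of the observed answers under each ground truth.
    # The binomial coefficient is the same in both and cancels out.
    like_pass = (1 - fn) ** yes * fn ** no
    like_fail = fp ** yes * (1 - fp) ** no
    weighted_pass = selectivity * like_pass
    return weighted_pass / (weighted_pass + (1 - selectivity) * like_fail)
```

With no answers the function returns the prior (the selectivity); each YES pushes the value up and each NO pushes it down, which is what a strategy trades off against the cost of another question.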
Hasn't this been done before?
• Solutions from statistics guarantee the same error per item
  • Important in contexts like:
    • Automobile testing
    • Diagnosis
• We're worried about aggregate error over all items: a uniquely data-oriented problem
  • I don't care if every image is perfect as long as the overall error threshold is met.
  • As we will see, this results in $$$ savings
Strategies
• Reformulated task:
  • For each point in the grid: return Pass/Fail/Continue
• Equivalently:
  • Find the best shape and color it!
[Diagram: grid with axes "YES Answers" and "NO Answers". Start at the origin, with no questions asked. Example points: YES = 5, NO = 6: return "Passed"; YES = 3, NO = 5: continue; YES = 3, NO = 7: return "Failed"]
Common Strategies
• Always ask X questions, return the most likely answer
  • The triangle shape
• If you get X YES, return "Pass"; if you get Y NO, return "Fail"; else keep asking
  • Rectangular shape
• Ask until |#YES - #NO| > X, or at most Y questions
  • Chopped-off rectangle
• Anhai's work on MOBS
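The shapes above can be compared by walking the (YES, NO) grid layer by layer and accumulating the expected number of questions and the error probability per item. The following is a sketch under stated assumptions: independent answers, a single predicate, and a strategy given as a function from a grid point to "pass", "fail", or "cont". All names (`evaluate`, `fixed_majority`, `rectangle`) are hypothetical and not taken from the talk.

```python
from typing import Callable, Dict, Tuple

Strategy = Callable[[int, int], str]  # (yes, no) -> "pass" | "fail" | "cont"


def fixed_majority(m: int) -> Strategy:
    """Triangle shape: always ask m questions (m odd), return the majority."""
    def decide(yes: int, no: int) -> str:
        if yes + no < m:
            return "cont"
        return "pass" if yes > no else "fail"
    return decide


def rectangle(x: int, y: int) -> Strategy:
    """Rectangular shape: pass after x YES answers, fail after y NO answers."""
    def decide(yes: int, no: int) -> str:
        if yes >= x:
            return "pass"
        if no >= y:
            return "fail"
        return "cont"
    return decide


def evaluate(strategy: Strategy, selectivity: float, fp: float, fn: float,
             max_q: int) -> Tuple[float, float]:
    """Return (expected questions per item, error probability per item)."""
    cost = err = 0.0
    # Each grid point maps to (P(reach point and item passes),
    #                          P(reach point and item fails)).
    layer: Dict[Tuple[int, int], Tuple[float, float]] = {
        (0, 0): (selectivity, 1.0 - selectivity)
    }
    for _ in range(max_q + 1):
        nxt: Dict[Tuple[int, int], Tuple[float, float]] = {}
        for (yes, no), (p_true, p_false) in layer.items():
            decision = strategy(yes, no)
            if decision == "pass":
                err += p_false          # a failing item was let through
            elif decision == "fail":
                err += p_true           # a passing item was rejected
            else:
                cost += p_true + p_false  # one more question is asked here
                a = nxt.get((yes + 1, no), (0.0, 0.0))
                nxt[(yes + 1, no)] = (a[0] + p_true * (1 - fn),
                                      a[1] + p_false * fp)
                b = nxt.get((yes, no + 1), (0.0, 0.0))
                nxt[(yes, no + 1)] = (b[0] + p_true * fn,
                                      b[1] + p_false * (1 - fp))
        layer = nxt
    if layer:
        raise ValueError("strategy did not stop within max_q questions")
    return cost, err
```

For example, with selectivity 0.5 and FP = FN = 0.1, `evaluate(fixed_majority(3), 0.5, 0.1, 0.1, 3)` and `evaluate(rectangle(2, 2), 0.5, 0.1, 0.1, 3)` give the same error (about 0.028), but the rectangle needs about 2.18 questions per item instead of 3, because it stops as soon as the outcome is decided.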
Summary of Results
• A characterization of which "shapes" are optimal
• An optimal PTIME "probabilistic" approach
  • LP leveraging the inherent DP structure
• Optimal: strategy with minimum overall cost
  • for given parameters and requirements
• Probabilistic: each grid point gets a probability of "Pass", "Fail", and "Continue"
Empirical Results
• Evaluation on 10,000 synthetic scenarios
• Tested:
  • Optimal, Brute Force, Statistical, 5 Heuristic Algorithms
• Optimal Probabilistic issues fewer questions overall
  • 15% savings on average compared to brute force
    • 32% savings when optimal wins
  • 22% savings on average compared to the statistics approach
    • 49% savings when optimal wins
• Translates to $$$ for many items!
[Diagram labels: Generate Parameters; Brute Force Deterministic, Optimal Probabilistic, Other Algorithms; COST1 >> COST2 >> COST3]
Crowd-Max/Sort
• The problem(s):
  • Find the strategy for sorting n items
    • Given: probability of error for a comparison
    • Given: desired threshold on error, #questions, #rounds
  • Sorting automatically given evidence
    • NP-Hard even for a simple probability-of-error model
    • Related work in the area of voting theory, economics
  • Which r questions do we ask next?
    • One question in each round
    • Ask all pairs a total of 2k/n times
    • Tournament, with k repetitions at each level
[Diagram labels: Decreasing Parallelism, More Accuracy]
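The tournament option can be illustrated with a small simulation. This is a sketch only, assuming distinct numeric items, a fixed and independent error probability per comparison, and a majority vote over k answers per match (k odd); the names `noisy_compare` and `tournament_max` are hypothetical, not from the talk.

```python
import random
from typing import List, Sequence, Tuple


def noisy_compare(a: int, b: int, error: float, rng: random.Random) -> int:
    """Simulated worker: returns the larger item, or the smaller one
    with probability `error`."""
    larger, smaller = (a, b) if a > b else (b, a)
    return smaller if rng.random() < error else larger


def tournament_max(items: Sequence[int], k: int, error: float,
                   rng: random.Random) -> Tuple[int, int, int]:
    """Single-elimination tournament for the max of distinct items.

    Each match is asked k times and decided by majority.
    Returns (winner, total questions asked, number of rounds).
    """
    current: List[int] = list(items)
    questions = rounds = 0
    while len(current) > 1:
        winners: List[int] = []
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            votes_a = sum(noisy_compare(a, b, error, rng) == a
                          for _ in range(k))
            questions += k
            winners.append(a if 2 * votes_a > k else b)
        if len(current) % 2:
            winners.append(current[-1])  # unpaired item advances with a bye
        current = winners
        rounds += 1  # all matches of a level can be posted in parallel
    return current[0], questions, rounds
```

In this sketch every match of a level is asked in the same round, so n items need about log2(n) rounds and k(n - 1) questions in total; raising k spends more questions per match in exchange for a lower chance that the true maximum is knocked out.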
Crowd-GraphSearch
• Image categorization example
  • To attach: image of a Honda car
  • Is the image one of vehicle? YES!
  • Is the image one of toyota? NO!
  • Is the image one of honda? YES!
• Target node = intended category
• "Is the image one of X?" = "Is the target node reachable from X?"
• Find the target node by asking the minimum number of search questions.
[Diagram: category graph with nodes vehicle, car, nissan, honda, toyota, maxima, sentra]
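The question model can be sketched as a simple top-down descent through the category graph. This is a naive baseline for illustration, not the minimum-question algorithm the talk refers to. It assumes a tree-shaped taxonomy, error-free answers, and a target somewhere below the root; the edges in `TAXONOMY` are guessed from the node names on the slide, and `top_down_search` is a hypothetical name.

```python
from typing import Callable, Dict, List, Tuple

# Assumed edges; the slide only lists the node names.
TAXONOMY: Dict[str, List[str]] = {
    "vehicle": ["car"],
    "car": ["nissan", "honda", "toyota"],
    "nissan": ["maxima", "sentra"],
    "honda": [],
    "toyota": [],
    "maxima": [],
    "sentra": [],
}


def reachable(graph: Dict[str, List[str]], src: str, dst: str) -> bool:
    """True if dst can be reached from src (src == dst counts)."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False


def top_down_search(graph: Dict[str, List[str]], root: str,
                    ask: Callable[[str], bool]) -> Tuple[str, List[str]]:
    """Descend from root, asking "is the target reachable from X?" for each
    child until one says YES; stop when no child does.
    Returns (found node, list of nodes asked about)."""
    asked: List[str] = []
    current = root
    while True:
        for child in graph[current]:
            asked.append(child)
            if ask(child):
                current = child
                break
        else:
            # No child contains the target, so the target is `current`.
            return current, asked


# Simulated crowd that always answers correctly for the target "honda".
found, asked = top_down_search(
    TAXONOMY, "vehicle", lambda x: reachable(TAXONOMY, x, "honda"))
```

Here the descent asks about car, nissan, and honda before stopping at honda. A strategy that picks questions more carefully (or asks several in parallel) can do better on larger graphs, which is the optimization problem the slide states.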
Crowd-Categorize
• k buckets, n items
• Categorize every item, overall error < threshold
• For k = 1, same as filters problem
• Two versions:
  • Discrete
    • Independent (like in the filters case)
    • Dependent buckets (e.g., colors, GraphSearch)
  • Continuous (e.g., age)
[Diagram: Dataset of Items ...]