This presentation explores the application of crowdsourcing methods to collect and label data effectively and efficiently. The authors, Qiang Liu (UC Irvine), Jian Peng (MIT CSAIL), and Alexander Ihler (UC Irvine), discuss the challenge of recovering accurate estimates from noisy labels, using baseline methods such as Majority Voting and the Two-coin and One-coin models for comparison. The presentation also addresses model selection and describes two competition datasets, the Google Fact Judgment and CrowdFlower Sentiment Judgment datasets, with a focus on the evaluation metric and the significance of minority classes.
Crowdscale Shared Task Challenge 2013 Qiang Liu (UC Irvine), Jian Peng (MIT CSAIL), Alexander Ihler (UC Irvine)
Crowdsourcing • Collect data and knowledge at large scale • Experts: time-consuming & expensive • Crowdsourcing: combine many non-experts
Crowdsourcing for Labeling • Goal: estimate the true labels z_i from the noisy labels {L_ij} • [Figure: bipartite graph connecting tasks to the workers who labeled them]
Baseline Methods • Majority Voting: all the workers are assumed to have the same performance • Two-coin model (Dawid & Skene, 1979): each worker is characterized by a confusion matrix between the true answer and worker j's answer; learned by expectation maximization (EM) • One-coin model: each worker is characterized by a single accuracy parameter (a minimal EM sketch follows below) • Other methods: GLAD [Whitehill et al. 09], belief propagation [Liu et al. 12], minimax entropy [Zhou et al. 12], …
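The one-coin model is simple enough to sketch. Below is a minimal, illustrative EM implementation, not the authors' code: the input layout (one dict of worker -> answer per task), the initial accuracy of 0.8, and the fixed iteration count are assumptions made for the example.

```python
import numpy as np

def one_coin_em(labels, n_classes, n_iters=50):
    """One-coin model via EM (sketch). `labels` is a list with one entry per
    task: a dict {worker_id: answered_class}. Each worker j has a single
    accuracy parameter acc[j]; wrong answers are spread uniformly over the
    other classes."""
    n_tasks = len(labels)
    workers = sorted({j for row in labels for j in row})
    acc = {j: 0.8 for j in workers}                         # assumed initialization
    post = np.full((n_tasks, n_classes), 1.0 / n_classes)   # posteriors over true labels

    for _ in range(n_iters):
        # E-step: posterior of each task's true label given current accuracies
        for i, row in enumerate(labels):
            logp = np.zeros(n_classes)
            for j, ans in row.items():
                for k in range(n_classes):
                    p = acc[j] if ans == k else (1.0 - acc[j]) / (n_classes - 1)
                    logp[k] += np.log(max(p, 1e-12))
            logp -= logp.max()
            post[i] = np.exp(logp) / np.exp(logp).sum()
        # M-step: a worker's accuracy is the expected fraction of answers
        # that match the (soft) true label
        num = {j: 0.0 for j in workers}
        den = {j: 0.0 for j in workers}
        for i, row in enumerate(labels):
            for j, ans in row.items():
                num[j] += post[i, ans]
                den[j] += 1.0
        acc = {j: num[j] / den[j] for j in workers}
    return post.argmax(axis=1), acc
```

Majority voting is the special case in which every worker gets equal weight; the two-coin model of Dawid & Skene replaces the single accuracy parameter with a full per-worker confusion matrix updated in the same E/M fashion.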
In Practice … • Model Selection • Standard models may not work • Special structures on the classes • Unbalanced labels
Two Datasets • Google Fact Judgment Dataset • 42,624 queries; 57 trained raters; 576 gold queries • Answers: {No, Yes, Skip} • CrowdFlower Sentiment Judgment Dataset • 98,980 questions; 1,960 workers; 300 gold queries • Answers: 0 (Negative), 1 (Neutral), 2 (Positive), 3 (Not related), 4 (I can't tell) • Special classes: "skip", "I can't tell" • Ambiguity of queries
Evaluation Metric • Averaged Recall: recall averaged over the classes • Special classes "skip" and "I can't tell": are they included in the evaluation?
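The formula on the original slide did not survive extraction; the standard definition of averaged (macro-averaged) recall, which matches the minority-class weighting discussed on the following slides, is

```latex
\mathrm{AvgRecall}
  \;=\; \frac{1}{K}\sum_{k=1}^{K} \mathrm{Recall}_k
  \;=\; \frac{1}{K}\sum_{k=1}^{K}
        \frac{\bigl|\{\, i : z_i = k,\ \hat{z}_i = k \,\}\bigr|}
             {\bigl|\{\, i : z_i = k \,\}\bigr|},
```

where z_i is the gold label, \hat{z}_i the submitted label, and K the number of classes.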
Important Properties • Unbalanced labels (on the gold data) • [Bar charts of gold-label counts per class. Google data: 531, 26, and 19 gold queries across {No, Yes, Skip}. CrowdFlower data: 92, 72, 70, 57, and 9 gold questions across the five classes; only 9 instances of "I can't tell" in the reference data.]
Evaluation Metric • The importance of minority classes is up-weighted • E.g., on the Google gold data, Class "Skip" is 531/26 ≈ 20 times more important than Class "Yes" • Minority classes are difficult to predict • E.g., with only 9 "I can't tell" instances in the gold data it is difficult to generalize: overfitting!
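A one-line derivation of the up-weighting (my gloss, not on the original slide): because averaged recall gives every class the same total weight, each instance of class k carries weight proportional to 1/n_k, where n_k is the number of gold instances of that class. Assuming the counts 531 and 26 correspond to "Yes" and "Skip" as the slide's ratio suggests,

```latex
\mathrm{AvgRecall}
  = \sum_{i=1}^{n} \frac{1}{K\, n_{z_i}} \,\mathbb{1}[\hat{z}_i = z_i],
\qquad
\frac{1/n_{\text{Skip}}}{1/n_{\text{Yes}}} = \frac{531}{26} \approx 20,
```

so a single "Skip" query moves the score about 20 times as much as a single "Yes" query.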
Google Fact Judgment Dataset • Model selection (MV, one-/two-coin EM): majority vote is the best • 57 "trained" workers • High and uniform accuracies • But not good enough … • [Histogram: workers' accuracies (x-axis) vs. # of workers (y-axis)]
Google Fact Judgment Dataset • Our algorithm: for each query i, compute the fraction of each answer among the submitted ratings, c_i(yes), c_i(no), c_i(skip), then set:
  if c_i(yes) > 0.4:  label_i = yes
  else if c_i(no) > 0.8:  label_i = no
  otherwise:  label_i = skip
Return {label_i}
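A minimal sketch of this rule in Python; the input layout and function name are assumptions, while the thresholds are the ones on the slide:

```python
from collections import Counter

def label_google_query(ratings):
    """Threshold rule for one Google Fact Judgment query (sketch).
    `ratings` is the list of worker answers, each in {"yes", "no", "skip"}."""
    counts = Counter(ratings)
    n = len(ratings)
    c = {a: counts[a] / n for a in ("yes", "no", "skip")}
    if c["yes"] > 0.4:
        return "yes"
    if c["no"] > 0.8:
        return "no"
    return "skip"

# Example: 2 of 4 raters said "yes", so c(yes) = 0.5 > 0.4 and the rule outputs "yes".
print(label_google_query(["yes", "no", "skip", "yes"]))
```

The low threshold for "yes" and the strict threshold for "no" make "skip" the default output for ambiguous queries.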
CrowdFlower Sentiment Judgment Dataset • Model selection: one-coin EM is best • [Histogram: workers' accuracies (x-axis) vs. # of workers (y-axis)] • Overall confusion matrix (classes 0–4):

        0     1     2     3     4
  0   256    47    14    24    27
  1    22   280    26    35    22
  2    11    43   308    30     9
  3     9    22     6   456    14
  4     7    16    13     6    17

• Removing Class 4 may improve performance
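To see why dropping class 4 can help, row-normalize the matrix above (rows are assumed here to index the reference class): the diagonal comes out at roughly 0.70, 0.73, 0.77, and 0.90 for classes 0–3, but only about 0.29 for class 4, so "I can't tell" is rarely agreed upon.

```python
import numpy as np

# Confusion matrix from the slide (row/column orientation assumed).
C = np.array([
    [256,  47,  14,  24,  27],
    [ 22, 280,  26,  35,  22],
    [ 11,  43, 308,  30,   9],
    [  9,  22,   6, 456,  14],
    [  7,  16,  13,   6,  17],
])

per_class = np.diag(C) / C.sum(axis=1)   # fraction on the diagonal per row
print(per_class.round(2))                # [0.7  0.73 0.77 0.9  0.29]
```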
CrowdFlower Sentiment Judgment Dataset • Our algorithm:
  1. Remove all class-4 labels from the data and run one-coin EM to get a posterior distribution p_i over the remaining classes {0, 1, 2, 3} for each question i.
  2. If c_i(4) > 0.5 or entropy(p_i) > log(4) − 0.27, set label_i = 4 ("I can't tell"); otherwise set label_i = argmax_k p_i(k).
Return {label_i}
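A sketch of step 2's decision rule in Python; the posterior layout, the parameter names, and the helper itself are assumptions, while the 0.5 and log(4) − 0.27 thresholds are the ones on the slide:

```python
import numpy as np

def crowdflower_labels(c4_fraction, post, margin=0.27):
    """Decision rule from the slide (sketch).
    c4_fraction[i] = fraction of workers who answered class 4 on question i.
    post[i]        = one-coin-EM posterior over classes {0, 1, 2, 3}
                     (computed after removing all class-4 labels)."""
    out = []
    for ci4, p in zip(c4_fraction, post):
        entropy = -np.sum(p * np.log(np.clip(p, 1e-12, None)))
        if ci4 > 0.5 or entropy > np.log(4) - margin:
            out.append(4)                   # ambiguous question -> "I can't tell"
        else:
            out.append(int(np.argmax(p)))   # most probable of classes 0-3
    return out
```

The entropy test flags questions whose posterior is nearly uniform, i.e., questions the workers genuinely disagree on, and maps them to "I can't tell" rather than forcing a sentiment label.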