Crowdsourcing using Mechanical Turk

Presentation Transcript



Crowdsourcing using Mechanical Turk: Quality Management and Scalability
Panos Ipeirotis – New York University



Panos Ipeirotis - Introduction

  • New York University, Stern School of Business

“A Computer Scientist in a Business School”

http://behind-the-enemy-lines.blogspot.com/

Email: [email protected]



Example: Build an “Adult Web Site” Classifier

  • Need a large number of hand-labeled sites

  • Get people to look at sites and classify them as:

    G (general audience), PG (parental guidance), R (restricted), X (porn)

  • Cost/Speed Statistics

  • Undergrad intern: 200 websites/hr, cost: $15/hr



Amazon Mechanical Turk: Paid Crowdsourcing



Example: Build an “Adult Web Site” Classifier

  • Need a large number of hand-labeled sites

  • Get people to look at sites and classify them as:

    G (general audience), PG (parental guidance), R (restricted), X (porn)

  • Cost/Speed Statistics

  • Undergrad intern: 200 websites/hr, cost: $15/hr

  • MTurk: 2500 websites/hr, cost: $12/hr



Bad news: Spammers!

  • Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience)



Improve Data Quality through Repeated Labeling

  • Get multiple, redundant labels using multiple workers

  • Pick the correct label based on majority vote

  • 1 worker: 70% correct; 11 workers (majority vote): 93% correct

  • Probability of correctness increases with number of workers

  • Probability of correctness increases with quality of workers
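
A minimal sketch of why majority voting helps, assuming each worker independently gives the correct label with probability p (binary decision, odd number of workers so there are no ties); the numbers roughly match the figures above:

    from math import comb

    def majority_vote_accuracy(p, n):
        """Probability that a majority of n independent workers, each correct
        with probability p, picks the right label (odd n, so no ties)."""
        assert n % 2 == 1
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(n // 2 + 1, n + 1))

    print(majority_vote_accuracy(0.70, 1))   # 0.70: a single 70%-accurate worker
    print(majority_vote_accuracy(0.70, 11))  # ~0.92 (the slide reports 93% on real data)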



But Majority Voting is Expensive

  • Single Vote Statistics

  • MTurk: 2500 websites/hr, cost: $12/hr

  • Undergrad: 200 websites/hr, cost: $15/hr

  • 11-vote Statistics

  • MTurk: 227 websites/hr, cost: $12/hr

  • Undergrad: 200 websites/hr, cost: $15/hr



Using redundant votes, we can infer worker quality

  • Look at our spammer friend ATAMRO447HWJQ together with the other 9 workers

  • We can compute error rates for each worker

  • Error rates for ATAMRO447HWJQ

  • P[X → X] = 9.847%,  P[X → G] = 90.153%

  • P[G → X] = 0.053%,  P[G → G] = 99.947%

Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…
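
A minimal sketch of how a worker's error rates can be estimated: compare each of the worker's answers against a reference label for the same item (e.g., the majority vote of the other workers). The worker data below is made up for illustration:

    from collections import Counter, defaultdict

    CLASSES = ["G", "PG", "R", "X"]

    def error_rates(worker_labels, reference_labels):
        """Estimate P[true label -> label assigned by this worker].
        worker_labels / reference_labels: {item_id: class}."""
        counts = defaultdict(Counter)
        for item, assigned in worker_labels.items():
            counts[reference_labels[item]][assigned] += 1
        return {t: {c: counts[t][c] / (sum(counts[t].values()) or 1)
                    for c in CLASSES}
                for t in CLASSES}

    # Hypothetical worker who marks porn sites as G, like ATAMRO447HWJQ
    worker = {"site1": "G", "site2": "G", "site3": "G", "site4": "G"}
    truth  = {"site1": "X", "site2": "X", "site3": "X", "site4": "G"}
    print(error_rates(worker, truth)["X"])  # P[X -> G] is high: spammer behavior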



Rejecting Spammers, and the Benefits

Random answers: error rate = 50%

Average error rate for ATAMRO447HWJQ: 45.2%

  • P[X → X] = 9.847%,  P[X → G] = 90.153%

  • P[G → X] = 0.053%,  P[G → G] = 99.947%

    Action: REJECT and BLOCK

    Results:

  • Over time you block all spammers

  • Spammers learn to avoid your HITs

  • You can decrease redundancy, as quality of workers is higher
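
One simple way to turn those error rates into a reject/block decision (a sketch, not necessarily the exact scoring rule from the talk): average the per-class error rates and compare against what random guessing would score:

    def average_error_rate(rates):
        """rates: {true_class: {assigned_class: probability}}, as above."""
        return sum(1.0 - rates[c][c] for c in rates) / len(rates)

    # Binary (X vs. G) view of ATAMRO447HWJQ's numbers from the slide
    rates = {"X": {"X": 0.09847, "G": 0.90153},
             "G": {"G": 0.99947, "X": 0.00053}}
    score = average_error_rate(rates)
    print(score)          # ~0.45, essentially the 45.2% on the slide
    print(score > 0.40)   # True -> close to random guessing (50%): reject and block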



After rejecting spammers, quality goes up

  • Spam keeps quality down

  • Without spam, workers are of higher quality

  • Need less redundancy for same quality

  • Same quality of results for lower cost

  • With spam: 1 worker → 70% correct; 11 workers → 93% correct

  • Without spam: 1 worker → 80% correct; 5 workers → 94% correct



Correcting biases

  • Classifying sites as G, PG, R, X

  • Sometimes workers are careful but biased

  • Classifies G → P and P → R

  • Average error rate for ATLJIK76YH1TF: too high

  • Error Rates for CEO of AdSafe

    • P[G → G] = 20.0%,  P[G → P] = 80.0%,  P[G → R] = 0.0%,  P[G → X] = 0.0%

    • P[P → G] = 0.0%,  P[P → P] = 0.0%,  P[P → R] = 100.0%,  P[P → X] = 0.0%

    • P[R → G] = 0.0%,  P[R → P] = 0.0%,  P[R → R] = 100.0%,  P[R → X] = 0.0%

    • P[X → G] = 0.0%,  P[X → P] = 0.0%,  P[X → R] = 0.0%,  P[X → X] = 100.0%

Is she a spammer?



Correcting biases

  • Error Rates for Worker: ATLJIK76YH1TF

    • P[G → G] = 20.0%,  P[G → P] = 80.0%,  P[G → R] = 0.0%,  P[G → X] = 0.0%

    • P[P → G] = 0.0%,  P[P → P] = 0.0%,  P[P → R] = 100.0%,  P[P → X] = 0.0%

    • P[R → G] = 0.0%,  P[R → P] = 0.0%,  P[R → R] = 100.0%,  P[R → X] = 0.0%

    • P[X → G] = 0.0%,  P[X → P] = 0.0%,  P[X → R] = 0.0%,  P[X → X] = 100.0%

  • For ATLJIK76YH1TF, we simply need to “reverse the errors” (technical details omitted) and separate error and bias

  • True error-rate ~ 9%
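
A minimal sketch of what “reversing the errors” can look like: given the worker's error rates and a prior over the true classes, Bayes' rule turns each assigned label into a posterior over the true label. The uniform prior below is an assumption for illustration; the error rates are the ones from the slide:

    CLASSES = ["G", "P", "R", "X"]

    # P[true -> assigned] for worker ATLJIK76YH1TF, from the slide
    ERROR_RATES = {
        "G": {"G": 0.20, "P": 0.80, "R": 0.00, "X": 0.00},
        "P": {"G": 0.00, "P": 0.00, "R": 1.00, "X": 0.00},
        "R": {"G": 0.00, "P": 0.00, "R": 1.00, "X": 0.00},
        "X": {"G": 0.00, "P": 0.00, "R": 0.00, "X": 1.00},
    }

    def posterior_true_label(assigned, prior):
        """P[true class | this worker assigned `assigned`], via Bayes' rule."""
        unnorm = {t: prior[t] * ERROR_RATES[t][assigned] for t in CLASSES}
        z = sum(unnorm.values()) or 1.0
        return {t: p / z for t, p in unnorm.items()}

    prior = {c: 0.25 for c in CLASSES}       # assumed uniform prior
    # When this careful-but-biased worker says "P", the true label is almost surely G
    print(posterior_true_label("P", prior))  # {'G': 1.0, 'P': 0.0, 'R': 0.0, 'X': 0.0}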



Too much theory?

Demo and Open source implementation available at:

http://qmturk.appspot.com

  • Input:

    • Labels from Mechanical Turk

    • Cost of incorrect labelings (e.g., X → G costlier than G → X; see the sketch after this slide)

  • Output:

    • Corrected labels

    • Worker error rates

    • Ranking of workers according to their quality

  • Beta version, more improvements to come!

  • Suggestions and collaborations welcomed!
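
A minimal sketch of how the cost of incorrect labelings can be used: given a posterior over the true class for an item and a cost matrix, report the label with the lowest expected cost. The cost values and posterior below are hypothetical:

    CLASSES = ["G", "P", "R", "X"]

    # Hypothetical cost matrix: COST[true][reported]. Marking an X site as G
    # (showing porn to a general audience) is much worse than the reverse.
    COST = {t: {r: 0.0 if t == r else 1.0 for r in CLASSES} for t in CLASSES}
    COST["X"]["G"] = 10.0

    def min_cost_label(posterior):
        """Pick the reported label with the lowest expected misclassification cost."""
        expected = lambda r: sum(posterior[t] * COST[t][r] for t in CLASSES)
        return min(CLASSES, key=expected)

    # An item that is probably G but has a 10% chance of being X
    posterior = {"G": 0.80, "P": 0.05, "R": 0.05, "X": 0.10}
    print(min_cost_label(posterior))  # "X": the 10x cost of an X -> G mistake dominates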



Scaling Crowdsourcing: Use Machine Learning

  • Human labor is expensive, even when paying cents

  • Need to scale crowdsourcing

  • Basic idea: Build a machine learning model and use it instead of humans

[Diagram: data from existing crowdsourced answers trains an automatic model (through machine learning); a new case goes into the model, which returns an automatic answer]
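
A minimal sketch of the basic idea, assuming scikit-learn is available and that each site is represented by its text; the pages and labels below are placeholders standing in for the quality-corrected crowdsourced data:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Placeholder training data: page text plus crowdsourced (corrected) labels
    pages  = ["family recipes and cooking tips", "explicit adult content and videos"]
    labels = ["G", "X"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(pages)

    model = LogisticRegression(max_iter=1000).fit(X, labels)

    # The model now answers new cases instead of humans
    new_case = vectorizer.transform(["adult videos for members only"])
    print(model.predict(new_case))        # automatic answer
    print(model.predict_proba(new_case))  # per-class confidence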



Tradeoffs for Automatic Models: Effect of Noise

Get more data → Improve model accuracy

Improve data quality → Improve classification

[Chart: "Porn or not?" example – model accuracy as a function of the amount of training data, shown for data quality levels of 100%, 80%, 60%, and 50%]



Scaling Crowdsourcing: Iterative training

  • Use machine when confident, humans otherwise

  • Retrain with new human input → improve model → reduce need for humans

[Diagram: a new case goes into the automatic model (trained on data from existing crowdsourced answers); if the model is confident, it returns an automatic answer; if not confident, get human(s) to answer and feed those answers back into the training data]
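
A minimal sketch of the routing step, assuming a trained model with predict_proba (as in the earlier sketch) and some ask_humans() function that posts the case as a HIT; the 0.90 threshold is an assumption:

    CONFIDENCE_THRESHOLD = 0.90  # assumed; tune against the target quality level

    def answer(case, model, ask_humans, training_pool):
        """Use the machine when confident, humans otherwise; keep human
        answers so the model can be retrained and improve over time."""
        probabilities = model.predict_proba(case)[0]
        best = probabilities.argmax()
        if probabilities[best] >= CONFIDENCE_THRESHOLD:
            return model.classes_[best]          # confident: automatic answer
        label = ask_humans(case)                 # not confident: ask the crowd
        training_pool.append((case, label))      # feed back into training data
        return label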



Tradeoffs for Automatic Models: Effect of Noise

Get more data → Improve model accuracy

Improve data quality → Improve classification

[Chart: "Porn or not?" example – model accuracy as a function of the amount of training data, shown for data quality levels of 100%, 80%, 60%, and 50%]



Scaling Crowdsourcing: Iterative training, with noise

  • Use machine when confident, humans otherwise

  • Ask as many humans as necessary to ensure quality

[Diagram: a new case goes into the automatic model; if confident, it returns an automatic answer; if not confident, get human(s) to answer, adding workers until the crowd answer is confident for quality, and feed the result back into the data from existing crowdsourced answers]
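
A rough sketch of the "ask as many humans as necessary" step, using a plain vote-fraction as the confidence measure (a quality-weighted version would weight each vote by the worker's estimated accuracy); get_one_label() is a hypothetical function that asks one more worker:

    from collections import Counter

    def label_until_confident(get_one_label, target_confidence=0.90, max_workers=11):
        """Keep asking workers until the leading label holds target_confidence
        of the votes (checked after at least 3 votes), or the budget runs out."""
        votes = Counter()
        for n in range(1, max_workers + 1):
            votes[get_one_label()] += 1
            label, count = votes.most_common(1)[0]
            if n >= 3 and count / n >= target_confidence:
                return label, n  # confident for quality: stop paying for more labels
        return votes.most_common(1)[0][0], max_workers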



Thank you! Questions?

“A Computer Scientist in a Business School”

http://behind-the-enemy-lines.blogspot.com/

Email: [email protected]

