Generative Models for Crowdsourced Data

Generative Models for Crowdsourced Data


Outline

  • What is Crowdsourcing?

  • Modeling the labeling process

  • Example with real data

  • Extensions

  • Future Directions


What is Crowdsourcing?

  • Human-based computation.

  • Outsourcing certain steps of a computation to humans.

  • “Artificial artificial intelligence.”

  • Data science:

    • Making an immediate decision.

    • Creating a labeled data set for learning.


Immediate Decision Workflow


Labeled Data Set Workflow


An Example HIT




Funny enough …

  • Not everybody agrees on the gender of a Twitter profile.

  • Difficult Instances

  • Worker Ability / Motivation

  • Worker Bias

  • Adversarial Behaviour


Difficult Instance


Worker Ability


Worker Motivation


Worker Bias


Disagreements

  • When some workers say “male” and some say “female”, what should we do?


Majority Rules Heuristic

  • Assign label l to item x if a majority of workers agree.

  • Otherwise item x remains unlabeled.

  • Ignores prior worker data.

  • Introduces bias in the labeled data.
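
The heuristic can be sketched in a few lines of Python (a hypothetical `majority_label` helper; the slides do not prescribe an implementation):

```python
from collections import Counter

def majority_label(worker_labels):
    """Return the plurality label for one item, or None on a tie
    (in which case the item remains unlabeled)."""
    counts = Counter(worker_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no strict majority: leave the item unlabeled
    return counts[0][0]
```

For example, `majority_label(["male", "male", "female"])` returns `"male"`, while `majority_label(["male", "female"])` returns `None`.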


Train on all labels

  • For the labeled data set workflow.

  • Add all item-label pairs to the data set.

  • Equivalent to a cost vector of:

    • P(l | {l_w}) = (1/n_w) Σ_w 1{l = l_w}

  • Ignores prior worker data.

  • Models the crowd, not the “ground truth.”
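
In code, that cost vector is just the empirical label frequency per item (a sketch; `crowd_label_distribution` is an illustrative name):

```python
from collections import Counter

def crowd_label_distribution(worker_labels, label_set):
    """P(l | {l_w}) = (1/n_w) * sum_w 1{l = l_w}: the fraction of
    workers that chose each label.  Training on all labels is
    equivalent to training each item on this soft label."""
    counts = Counter(worker_labels)
    n_w = len(worker_labels)
    return {l: counts[l] / n_w for l in label_set}
```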


What is ground truth

  • Different theoretical approaches.

    • PAC learning with noisy labels.

    • Fully-adversarial active learning.

  • Bayesians have been very active.

    • “Easy” to posit a functional form and quickly develop inference algorithms.

    • Issue of model correctness is ultimately empirical.


Bayesian Literature

  • (2009) Whitehill et al. GLAD framework.

    • (1979) Dawid and Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.

  • (2010) Welinder et al. The Multidimensional Wisdom of Crowds.

  • (2010) Raykar et al. Learning from Crowds.


Bayesian Approach

  • Define ground truth via a generative model that describes how “ground truth” relates to the observed output of crowdsourced workers.

  • Fit to observed data.

  • Extract posterior over ground truth.

  • Make decision or train classifier.


Generative Model


Example: Binary Classification

  • Each worker has a confusion matrix:

        α = ( -1    α01 )
            ( α10   -1  )

  • Each item has a scalar difficulty β > 0.

  • P(lw = j | z = i) = e^(-β αij) / Σk e^(-β αik)

  • αij ~ N(μij, 1) ; μij ~ N(0, 1)

  • log β ~ N(ρ, 1) ; ρ ~ N(0, 1)
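
The label likelihood above is a temperature-scaled softmax; a minimal binary sketch (diagonal fixed at -1 as in the matrix above):

```python
import math

def worker_label_prob(alpha, beta, z, j):
    """P(l_w = j | z) = exp(-beta * alpha[z][j]) / sum_k exp(-beta * alpha[z][k]).
    alpha is the worker's 2x2 matrix with -1 on the diagonal;
    beta > 0 is the item difficulty: as beta -> 0 the worker's label
    approaches a coin flip, so hard items explain away disagreement."""
    weights = [math.exp(-beta * a) for a in alpha[z]]
    return weights[j] / sum(weights)
```

With `alpha = [[-1.0, 0.0], [0.0, -1.0]]` and `beta = 1.0`, the probability of a correct label is e / (e + 1) ≈ 0.73.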


Other Problems

  • Multiclass classification:

    • Same as binary with larger confusion matrix.

  • Ordinal classification: (“Hot or not”)

    • Confusion matrix has special form.

    • O(L) parameters instead of O(L²).

  • Multilabel classification:

    • Reduce to multiclass on power set.

    • Assume low-rank confusion matrix.


EM

  • Initially all workers are assumed moderately accurate and without bias.

    • Implies the initial estimate of the ground truth distribution favors consensus.

    • Disagreeing with the majority is a likely error.

  • Workers consistently in the minority have their confusion probabilities increase.

  • Workers with higher confusion probabilities contribute less to the distribution of ground truth.
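
The iteration can be sketched as a simplified Dawid-Skene-style batch EM (item difficulty β omitted, uniform class prior, add-one smoothing; all names here are illustrative):

```python
import math

def em_crowd(labels, n_classes, iters=20, smooth=1.0):
    """labels: list of (item, worker, observed_label) triples.
    Returns q: item -> posterior distribution over the true class."""
    items = sorted({i for i, _, _ in labels})
    workers = sorted({w for _, w, _ in labels})
    # Initialization: all workers equally accurate, so q is just the
    # (smoothed) vote fraction -- the initial estimate favors consensus.
    q = {}
    for i in items:
        counts = [smooth] * n_classes
        for ii, _, l in labels:
            if ii == i:
                counts[l] += 1.0
        tot = sum(counts)
        q[i] = [c / tot for c in counts]
    for _ in range(iters):
        # M-step: confusion[w][z][l], weighted by the current posterior.
        # Workers consistently in the minority accumulate mass off the
        # diagonal, i.e. their confusion probabilities increase.
        conf = {w: [[smooth] * n_classes for _ in range(n_classes)]
                for w in workers}
        for i, w, l in labels:
            for z in range(n_classes):
                conf[w][z][l] += q[i][z]
        for w in workers:
            for z in range(n_classes):
                tot = sum(conf[w][z])
                conf[w][z] = [c / tot for c in conf[w][z]]
        # E-step: confused workers are nearly flat across z, so they
        # move the posterior over the true class less than accurate ones.
        logq = {i: [0.0] * n_classes for i in items}
        for i, w, l in labels:
            for z in range(n_classes):
                logq[i][z] += math.log(conf[w][z][l])
        for i in items:
            m = max(logq[i])
            exps = [math.exp(v - m) for v in logq[i]]
            tot = sum(exps)
            q[i] = [e / tot for e in exps]
    return q
```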


“Different” workers are marginalized

  • Workers that are consistently in the minority will not contribute strongly to the posterior distribution over ground truth.

    • Even if they are actually more accurate.

  • The model can correct for this when accurate workers are paired with some inaccurate workers.

  • Good for breaking ties.

  • Raykar et al.


Example with real data


Online EM

  • Given a set of worker-label pairs for a single item:

  • (Inference) Using the current α, find the most likely β* and distribution q* over ground truth.

  • (Training) Do an SGD update of α with respect to the EM auxiliary function evaluated at β* and q*.
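
A binary sketch of one such step, under stated simplifications: β* is chosen from a small grid rather than by continuous optimization, the class prior is uniform, and α's diagonal is fixed at -1 as in the earlier slide (all function names are illustrative):

```python
import math

def label_probs(alpha_row, beta):
    """P(l = j | z): softmax of -beta * alpha over the row for true class z."""
    w = [math.exp(-beta * a) for a in alpha_row]
    s = sum(w)
    return [x / s for x in w]

def online_em_step(alpha, item_labels, lr=0.1, beta_grid=(0.3, 1.0, 3.0)):
    """One online-EM step on a single item's (worker, label) pairs.
    alpha: dict worker -> 2x2 matrix, diagonal fixed at -1."""
    n = 2
    best = None
    for beta in beta_grid:  # (Inference) most likely beta* and posterior q*
        logq = [0.0] * n
        for w, l in item_labels:
            for z in range(n):
                logq[z] += math.log(label_probs(alpha[w][z], beta)[l])
        m = max(logq)
        exps = [math.exp(v - m) for v in logq]
        loglik = m + math.log(sum(exps))  # log marginal likelihood (uniform prior)
        if best is None or loglik > best[0]:
            best = (loglik, beta, [e / sum(exps) for e in exps])
    _, beta_star, q_star = best
    # (Training) one SGD ascent step on the EM auxiliary function:
    # d/d alpha[z][j] = q*(z) * beta* * (P(j | z) - 1{j = l_w})
    for w, l in item_labels:
        for z in range(n):
            p = label_probs(alpha[w][z], beta_star)
            for j in range(n):
                if j != z:  # diagonal entries stay fixed at -1
                    alpha[w][z][j] += lr * q_star[z] * beta_star * (p[j] - (j == l))
    return q_star, beta_star
```

Repeatedly calling `online_em_step` over a stream of items plays the role of the batch E and M steps, one item at a time.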


Things to do with q*

  • Take an immediate cost-sensitive decision

    • d* = argmin_d E_{z ~ q*}[ f(z, d) ]

  • Train an (importance-weighted) classifier

    • cost vector c_d = E_{z ~ q*}[ f(z, d) ]

    • e.g. 0/1 loss: c_d = (1 - q*_d)

    • e.g. binary 0/1 loss: |c_1 - c_0| = |1 - 2 q*_1|

    • No need to decide what the true label is!

  • Raykar et al.: why not jointly estimate the classifier and worker confusion?
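
A sketch of the first use, making the immediate cost-sensitive decision from q* (the cost matrix f(z, d) is supplied by the application):

```python
def cost_sensitive_decision(q, cost):
    """d* = argmin_d E_{z ~ q}[ f(z, d) ], where cost[z][d] = f(z, d).
    Returns the best decision and the full expected-cost vector,
    which is exactly the cost vector c_d used to train a classifier."""
    n_decisions = len(cost[0])
    c = [sum(q[z] * cost[z][d] for z in range(len(q)))
         for d in range(n_decisions)]
    return min(range(n_decisions), key=c.__getitem__), c
```

With 0/1 loss the expected-cost vector reduces to c_d = 1 - q*_d, matching the bullet above.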


Raykar et al. insight

  • Cost vector is constructed by estimating worker confusion matrices.

  • Subsequently, the classifier is trained; it will sometimes disagree with the workers.

  • Would be nice to use that disagreement to inform the worker confusion matrices.

  • Circular dependency suggests joint estimation.


Generative Model


Online Joint Estimation

  • Initially the classifier outputs an uninformative prior, so it is trained to follow the consensus of the workers.

  • Eventually, workers that disagree with the classifier will have their confusion probabilities increase.

  • Workers consistently in the minority can contribute strongly to the posterior if they tend to agree with the classifier.
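
The effect described above can be seen in a small sketch where the classifier supplies the prior over the true class (illustrative names; worker confusion matrices assumed already estimated):

```python
import math

def joint_posterior(classifier_probs, item_labels, conf):
    """q(z) proportional to p_classifier(z | x) * prod_w P(l_w | z),
    with conf[w][z][l] = worker w's probability of reporting l when
    the truth is z.  A minority worker who agrees with a confident
    classifier can dominate the posterior."""
    n = len(classifier_probs)
    logq = [math.log(p) for p in classifier_probs]
    for w, l in item_labels:
        for z in range(n):
            logq[z] += math.log(conf[w][z][l])
    m = max(logq)
    exps = [math.exp(v - m) for v in logq]
    s = sum(exps)
    return [e / s for e in exps]
```

For example, with two workers voting class 1, one voting class 0, and all three 70% accurate, a classifier prior of (0.9, 0.1) still pulls the posterior to class 0 despite the majority.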


Additional Resources

  • Software

    • http://code.google.com/p/nincompoop

  • Blog

    • http://machinedlearnings.com/

