Generative models for crowdsourced data

Outline

  • What is Crowdsourcing?

  • Modeling the labeling process

  • Example with real data

  • Extensions

  • Future Directions


What is Crowdsourcing?

  • Human based computation.

  • Outsourcing certain steps of a computation to humans.

  • "Artificial artificial intelligence."

  • Data science:

    • Making an immediate decision.

    • Creating a labeled data set for learning.


Immediate Decision Workflow


Labeled Data Set Workflow


An Example HIT




Funny enough …

  • Not everybody agrees on the gender of a Twitter profile.

  • Difficult Instances

  • Worker Ability / Motivation

  • Worker Bias

  • Adversarial Behaviour


Difficult Instance




Worker Ability




Worker Motivation




Worker Bias




Disagreements

  • When some workers say “male” and some workers say “female”, what to do?


Majority Rules Heuristic

  • Assign label l to item x if a majority of workers agree.

  • Otherwise item x remains unlabeled.

  • Ignores prior worker data.

  • Introduces bias in the labeled data.
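The heuristic above can be sketched in a few lines; the strict-majority threshold and the `None` convention for unlabeled items are assumptions, not from the slides:

```python
from collections import Counter

def majority_label(labels):
    """Return the strict-majority label, or None if no majority exists."""
    if not labels:
        return None
    # most_common(1) yields the single most frequent (label, count) pair
    (label, count), = Counter(labels).most_common(1)
    if count * 2 > len(labels):
        return label
    return None  # no strict majority: the item remains unlabeled
```

For example, `majority_label(["male", "male", "female"])` yields `"male"`, while a 1-1 split yields `None`.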


Train on all labels

  • For labeled data set workflow.

  • Add all item-label pairs to the data set.

  • Equivalent to a cost vector of:

    • P(l | {l_w}) = (1/n_w) Σ_w 1{l = l_w}

  • Ignores prior worker data.

  • Models the crowd, not the “ground truth.”
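The equivalent cost vector is just the empirical distribution of worker labels; a minimal sketch (function names are illustrative):

```python
from collections import Counter

def crowd_distribution(worker_labels, label_set):
    """Empirical label distribution P(l | {l_w}) = (1/n_w) * sum_w 1{l = l_w}."""
    counts = Counter(worker_labels)
    n_w = len(worker_labels)
    return {l: counts[l] / n_w for l in label_set}
```

With workers voting ["male", "male", "female"], this yields P(male) = 2/3 and P(female) = 1/3: the distribution of the crowd, not an estimate of the true label.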


What is ground truth?

  • Different theoretical approaches.

    • PAC learning with noisy labels.

    • Fully-adversarial active learning.

  • Bayesians have been very active.

    • “Easy” to posit a functional form and quickly develop inference algorithms.

    • Issue of model correctness is ultimately empirical.


Bayesian Literature

  • (2009) Whitehill et al. GLAD framework.

    • (1979) Dawid and Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.

  • (2010) Welinder et al. The Multidimensional Wisdom of Crowds.

  • (2010) Raykar et al. Learning from Crowds.


Bayesian Approach

  • Define ground truth via a generative model which describes how “ground truth” is related to the observed output of crowdsource workers.

  • Fit to observed data.

  • Extract posterior over ground truth.

  • Make decision or train classifier.


Generative Model


Example: Binary Classification

  • Each worker has a confusion-score matrix:

    α = ( -1   α01 )
        ( α10  -1  )

  • Each item has a scalar difficulty β > 0.

  • P(l_w = j | z = i) = e^(-β α_ij) / Σ_k e^(-β α_ik)

  • α_ij ~ N(μ_ij, 1) ; μ_ij ~ N(0, 1)

  • log β ~ N(ρ, 1) ; ρ ~ N(0, 1)
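The per-worker label likelihood is a softmax over a row of the confusion-score matrix, scaled by the item difficulty; a direct transcription:

```python
import math

def label_probs(alpha_row, beta):
    """P(l_w = j | z = i) = exp(-beta * alpha[i][j]) / sum_k exp(-beta * alpha[i][k]).

    alpha_row: row i of a worker's confusion-score matrix (diagonal entries are -1).
    beta: scalar item difficulty (> 0); larger beta sharpens the distribution.
    """
    scores = [math.exp(-beta * a) for a in alpha_row]
    z = sum(scores)
    return [s / z for s in scores]
```

For a worker with the prior-mean row [-1, 0] and β = 1, the correct label gets probability e / (e + 1) ≈ 0.73.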


Other Problems

  • Multiclass classification:

    • Same as binary with larger confusion matrix.

  • Ordinal classification: (“Hot or not”)

    • Confusion matrix has special form.

    • O(L) parameters instead of O(L²).

  • Multilabel classification:

    • Reduce to multiclass on power set.

    • Assume low-rank confusion matrix.
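The slides do not spell out the special form of the ordinal confusion matrix; one common choice, shown here as an assumption, is to let the score for confusing label i with label j depend only on the distance |i − j|, which gives O(L) parameters:

```python
def ordinal_alpha(distance_scores):
    """Build an L x L confusion-score matrix where entry (i, j) depends only on |i - j|.

    distance_scores[d] is the score for confusing labels d apart;
    distance_scores[0] plays the role of the -1 diagonal.
    """
    L = len(distance_scores)
    return [[distance_scores[abs(i - j)] for j in range(L)] for i in range(L)]
```

For example, `ordinal_alpha([-1.0, 0.5, 1.0])` produces a 3×3 matrix whose entries grow with distance from the diagonal, so nearby ratings are more likely confusions than distant ones.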


EM

  • Initially all workers are assumed moderately accurate and without bias.

    • Implies initial estimate of ground truth distribution favors consensus.

    • Disagreeing with the majority is a likely error.

  • Workers consistently in the minority have their confusion probabilities increase.

  • Workers with higher confusion probabilities contribute less to the distribution of ground truth.
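The fixed point described above can be illustrated with a minimal Dawid-Skene-style EM loop; the uniform class prior and add-one smoothing are assumptions of this sketch:

```python
import math
from collections import defaultdict

def dawid_skene(item_worker_labels, n_labels=2, n_iters=20):
    """Minimal Dawid-Skene-style EM sketch.

    item_worker_labels: dict item -> list of (worker, label) pairs.
    Returns dict item -> posterior distribution over the n_labels classes.
    """
    # Initialise posteriors with per-item vote shares, i.e. consensus.
    q = {}
    for item, pairs in item_worker_labels.items():
        counts = [0.0] * n_labels
        for _, label in pairs:
            counts[label] += 1.0
        total = sum(counts)
        q[item] = [c / total for c in counts]

    for _ in range(n_iters):
        # M-step: re-estimate each worker's confusion matrix (add-one smoothing).
        conf = defaultdict(lambda: [[1.0] * n_labels for _ in range(n_labels)])
        for item, pairs in item_worker_labels.items():
            for worker, label in pairs:
                for z in range(n_labels):
                    conf[worker][z][label] += q[item][z]
        for worker in conf:
            for z in range(n_labels):
                row_sum = sum(conf[worker][z])
                conf[worker][z] = [v / row_sum for v in conf[worker][z]]

        # E-step: recompute item posteriors from the confusion matrices.
        for item, pairs in item_worker_labels.items():
            log_post = [0.0] * n_labels
            for worker, label in pairs:
                for z in range(n_labels):
                    log_post[z] += math.log(conf[worker][z][label])
            m = max(log_post)
            unnorm = [math.exp(v - m) for v in log_post]
            s = sum(unnorm)
            q[item] = [u / s for u in unnorm]
    return q
```

Running this on a few items where one worker always opposes the other two shows the dynamics from the slide: the dissenting worker's confusion probabilities grow, and the posteriors sharpen toward the consensus.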


“Different” workers are marginalized

  • Workers that are consistently in the minority will not contribute strongly to the posterior distribution over ground truth.

    • Even if they are actually more accurate.

  • This can be corrected when accurate workers are paired with some inaccurate workers.

  • Good for breaking ties.

  • Raykar et al.


Example with real data


Online EM

  • Given a set of worker-label pairs for a single item:

  • (Inference) Using current α, find most likely β* and distribution q* over ground truth.

  • (Training) Do SGD update of α with respect to EM auxiliary function evaluated at β* and q*.


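The two phases above can be sketched for the binary model; the β grid, learning rate, and uniform class prior are simplifying assumptions of this sketch, not from the slides:

```python
import math

def softmax_probs(alpha_row, beta):
    """P(l_w = j | z = i) = softmax over -beta * alpha_row."""
    scores = [math.exp(-beta * a) for a in alpha_row]
    z = sum(scores)
    return [s / z for s in scores]

def online_em_step(alpha, worker_labels, betas=(0.25, 0.5, 1.0, 2.0, 4.0), lr=0.1):
    """One online-EM step for a single item.

    alpha: dict worker -> 2x2 confusion-score matrix (diagonals fixed at -1).
    worker_labels: list of (worker, observed_label) pairs for this item.
    """
    # Inference: grid-search beta for the best marginal likelihood, yielding q*.
    best = None
    for beta in betas:
        post = [1.0, 1.0]  # uniform prior over the two ground-truth classes
        for w, l in worker_labels:
            for z in (0, 1):
                post[z] *= softmax_probs(alpha[w][z], beta)[l]
        marginal = sum(post)
        if best is None or marginal > best[0]:
            best = (marginal, beta, [p / marginal for p in post])
    _, beta_star, q_star = best

    # Training: gradient ascent on the EM auxiliary at (beta*, q*),
    # holding the diagonal entries fixed at -1.
    for w, l in worker_labels:
        for z in (0, 1):
            probs = softmax_probs(alpha[w][z], beta_star)
            for k in (0, 1):
                if k == z:
                    continue
                grad = q_star[z] * beta_star * (probs[k] - (1.0 if k == l else 0.0))
                alpha[w][z][k] += lr * grad
    return beta_star, q_star
```

Note the update direction: when a worker's label disagrees with where q* puts its mass, the corresponding off-diagonal score is pushed down, which raises that confusion probability, exactly the dynamic described for batch EM.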


Things to do with q*

  • Take an immediate cost-sensitive decision:

    • d* = argmin_d E_{z~q*}[f(z, d)]

  • Train an (importance-weighted) classifier:

    • cost vector c_d = E_{z~q*}[f(z, d)]

    • e.g. 0/1 loss: c_d = (1 - q*_d)

    • e.g. binary 0/1 loss: |c_1 - c_0| = |1 - 2 q*_1|

    • No need to decide what the true label is!

  • Raykar et al.: why not jointly estimate the classifier and worker confusion?
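Both uses of q* reduce to the same expectation; a minimal sketch, where `cost[z][d]` is the loss f(z, d) of deciding d when the truth is z:

```python
def cost_vector(q, cost):
    """Per-decision costs c_d = E_{z~q}[cost(z, d)], usable to train a classifier."""
    n_decisions = len(cost[0])
    return [sum(q[z] * cost[z][d] for z in range(len(q)))
            for d in range(n_decisions)]

def best_decision(q, cost):
    """Immediate cost-sensitive decision d* = argmin_d E_{z~q}[cost(z, d)]."""
    expected = cost_vector(q, cost)
    return min(range(len(expected)), key=expected.__getitem__)
```

Under 0/1 loss with q* = [0.3, 0.7], the cost vector is [0.7, 0.3] = (1 − q*_d) and the decision is class 1, without ever committing to a hard "true" label.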


Raykar et al. insight

  • The cost vector is constructed by estimating worker confusion matrices.

  • Subsequently, a classifier is trained; it will sometimes disagree with workers.

  • It would be nice to use that disagreement to inform the worker confusion matrices.

  • The circular dependency suggests joint estimation.


Generative Model




Online Joint Estimation

  • Initially the classifier will output an uninformative prior and therefore will be trained to follow consensus of workers.

  • Eventually, workers who disagree with the classifier will have their confusion probabilities increase.

  • Workers consistently in the minority can contribute strongly to the posterior if they tend to agree with the classifier.


Additional Resources

  • Software

    • http://code.google.com/p/nincompoop

  • Blog

    • http://machinedlearnings.com/

