An Overview on Semi-Supervised Learning Methods

Matthias Seeger, MPI for Biological Cybernetics

Tuebingen, Germany

Overview

  • The SSL Problem
  • Paradigms for SSL: Examples
  • The Importance of Input-dependent Regularization

Note: Citations omitted here (given in my literature review)

Semi-Supervised Learning

SSL is Supervised Learning...

Goal: Estimate P(y|x) from Labeled Data Dl = {(xi, yi)}

But: Additional Source tells about P(x) (e.g., Unlabeled Data Du = {xj})

The Interesting Case: far fewer labeled than unlabeled points (n ≪ m)
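The setting can be made concrete with a toy split; this is a minimal sketch with a hypothetical two-class 2-D dataset, and all names (X_l, X_u, m, n) are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class 2-D data: each class is a Gaussian blob.
m = 200                                   # total number of points
n = 6                                     # labeled points -- the interesting case: n << m
X = np.vstack([rng.normal(-2.0, 1.0, (m // 2, 2)),
               rng.normal(+2.0, 1.0, (m // 2, 2))])
y = np.repeat([0, 1], m // 2)

idx = rng.permutation(m)
X_l, y_l = X[idx[:n]], y[idx[:n]]         # D_l = {(x_i, y_i)}: informs P(y|x)
X_u = X[idx[n:]]                          # D_u = {x_j}: informs P(x) only
```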

Obvious Baseline Methods

The Goal of SSL is To Do Better

Not: Uniformly and always (No Free Lunch; and yes, of course: unlabeled data can hurt)

But (as always): If our modelling and algorithmic efforts reflect true problem characteristics

  • Do not use info about P(x) → Supervised Learning
  • Fit a Mixture Model using Unsupervised Learning, then “label up” components using {yi}
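The second baseline can be sketched in a few lines, assuming a toy 1-D dataset with two well-separated clusters; the EM loop, the seed indices, and all variable names below are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data: two well-separated Gaussian clusters; two labeled points.
X = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
y_known = {0: 0, 150: 1}                  # index -> label for the labeled points

# --- Unsupervised step: EM for a 2-component Gaussian mixture ---
mu = np.array([-1.0, 1.0]); sig = np.array([1.0, 1.0]); pi = np.array([0.5, 0.5])
for _ in range(50):
    # E-step: responsibilities r[i, k] proportional to pi_k * N(x_i | mu_k, sig_k)
    dens = pi * np.exp(-0.5 * ((X[:, None] - mu) / sig) ** 2) / sig
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixture parameters from the responsibilities
    Nk = r.sum(axis=0)
    pi = Nk / len(X)
    mu = (r * X[:, None]).sum(axis=0) / Nk
    sig = np.sqrt((r * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)

# --- "Label up": give each component the label of its labeled point(s) ---
comp = r.argmax(axis=1)                   # hard component assignment per point
comp_label = {comp[i]: lab for i, lab in y_known.items()}
y_pred = np.array([comp_label[c] for c in comp])
```

Note the failure mode the slides warn about: if clusters and classes disagree, the "label up" step confidently mislabels whole components.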
The Generative Paradigm
  • Model Class Distributions P(x|y) and Class Priors P(y)
  • Implies model for P(y|x) and for P(x)
The Joint Likelihood

Natural Criterion in this context: the joint log-likelihood Σi log P(xi, yi) + λ Σj log P(xj), with source weighting λ for the unlabeled part

  • Maximize using EM (idea as old as EM)
  • Early and recent theoretical work on asymptotic variance
  • Advantage: Easy to implement for standard mixture model setups
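The weighted criterion can be written down directly. Here is a minimal sketch for a Gaussian mixture with one component per class; `joint_loglik` and `lam` (the source weighting λ) are names I am introducing for illustration, not from the talk:

```python
import numpy as np

def gauss_logpdf(x, mu, sig):
    """Log-density of N(mu, sig^2) evaluated at x (broadcasts)."""
    return -0.5 * ((x - mu) / sig) ** 2 - np.log(sig * np.sqrt(2 * np.pi))

def joint_loglik(Xl, yl, Xu, pi, mu, sig, lam=1.0):
    """Weighted joint log-likelihood: sum_i log P(x_i, y_i) + lam * sum_j log P(x_j),
    for a 1-D Gaussian mixture with one component per class."""
    # Labeled term: log P(x_i, y_i) = log pi_{y_i} + log P(x_i | y_i)
    ll_l = np.sum(np.log(pi[yl]) + gauss_logpdf(Xl, mu[yl], sig[yl]))
    # Unlabeled term: log P(x_j) = log sum_k pi_k P(x_j | k), done stably
    comp = np.log(pi) + gauss_logpdf(Xu[:, None], mu, sig)
    ll_u = np.sum(np.logaddexp.reduce(comp, axis=1))
    return ll_l + lam * ll_u
```

With lam = 0 this reduces to the ordinary supervised likelihood; as the next slide notes, choosing the weighting well is crucial.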
Drawbacks of Generative SSL
  • Choice of source weighting λ crucial
    • Cross-Validation fails for small n
    • Homotopy Continuation (Corduneanu et al.)
  • Just like in Supervised Learning:
    • Model for P(y|x) specified indirectly
    • Fitting not primarily concerned with P(y|x). Also: Have to represent P(x) generally well, not just the aspects which help with P(y|x).
The Diagnostic Paradigm
  • Model P(y|x,θ) and P(x|μ) directly
  • But: Since θ, μ are independent a priori, θ does not depend on μ given the data ⇒ Knowledge of μ does not influence the P(y|x) prediction in a probabilistic setup!
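Spelled out (my notation, not on the slides): with an independent prior P(θ)P(μ) and a likelihood that splits into a label part and an input part, the posterior factorizes, so the θ-marginal never sees the unlabeled data:

```latex
P(\theta,\mu \mid D_l, D_u)
  \;\propto\;
  \underbrace{P(y_l \mid X_l,\theta)\,P(\theta)}_{\text{only }\theta}
  \times
  \underbrace{P(X_l, X_u \mid \mu)\,P(\mu)}_{\text{only }\mu}
\quad\Longrightarrow\quad
P(\theta \mid D_l, D_u) \;\propto\; P(y_l \mid X_l,\theta)\,P(\theta).
```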
What To Do About It
  • Non-probabilistic diagnostic techniques
    • Replace the expected loss by a modified criterion (Tong, Koller; Chapelle et al.). Very limited effect if n small
    • Some old work (e.g., Anderson)
  • Drop the prior independence of θ, μ → Input-dependent Regularization
Input-Dependent Regularization


  • Conditional priors P(θ|μ) make P(y|x) estimation dependent on P(x)
  • Now, unlabeled data can really help...
  • And can hurt for the same reason!
The Cluster Assumption (CA)
  • Empirical Observation: Clustering of data {xj} w.r.t. “sensible” distance / features often fairly compatible with class regions
  • Weaker: Class regions do not tend to cut high-volume regions of P(x)
  • Why? Ask Philosophers! My guess: Selection bias for features/distance

No Matter Why:

Many SSL Methods implement the CA and work fine in practice

Examples For IDR Using CA
  • Label Propagation, Gaussian Random Fields: Regularization depends on a graph structure built from all {xj} ⇒ more smoothness in regions of high connectivity / affinity flows
  • Cluster kernels for SVM (Chapelle et al.)
  • Information Regularization(Corduneanu, Jaakkola)
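A minimal sketch of the first item, graph-based label propagation: an RBF affinity graph over all points (this is where knowledge of P(x) enters), with seed labels clamped each iteration. The toy blobs, bandwidth, and iteration count are arbitrary illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two toy blobs; one seed label per blob, the rest unlabeled (-1).
X = np.vstack([rng.normal(-2.0, 0.5, (50, 2)), rng.normal(2.0, 0.5, (50, 2))])
y = np.full(100, -1)
y[0], y[50] = 0, 1

# RBF affinity graph over ALL points, labeled and unlabeled alike.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / (2 * 0.5 ** 2))
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)      # row-stochastic propagation matrix

# Propagate label distributions, clamping the labeled points each step.
F = np.zeros((100, 2))
F[0, 0] = F[50, 1] = 1.0
for _ in range(200):
    F = P @ F
    F[0], F[50] = [1.0, 0.0], [0.0, 1.0]  # re-clamp the seeds

y_pred = F.argmax(axis=1)
```

Because affinity barely flows between the blobs, each seed's label spreads only through its own high-density region, which is exactly the cluster assumption at work.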
More Examples for IDR

Some methods do IDR, but implement the CA only in special cases:

  • Fisher Kernels (Jaakkola et al.): Kernel from Fisher features ⇒ automatic feature induction from a P(x) model
  • Co-Training (Blum, Mitchell): Consistency across different views (features)
Is SSL Always Generative?

Wait: We have to model P(x) somehow. Is this not always generative then? ... No!

  • Generative: Model P(x|y) fairly directly; the P(y|x) model and the effect of P(x) are implicit
  • Diagnostic IDR:
    • Direct model for P(y|x), more flexibility
    • Influence of P(x) knowledge on the P(y|x) prediction is directly controlled, e.g. through the CA ⇒ Model for P(x) can be much less elaborate
Conclusions

  • Gave a taxonomy for probabilistic approaches to SSL
  • Illustrated the paradigms with examples from the literature
  • Tried to clarify some points which have led to confusion in the past