1. 1 Learning from Partially Labeled Data Martin Szummer
MIT AI lab & CBCL
szummer@ai.mit.edu
http://www.ai.mit.edu/people/szummer/
Notes:
above divider: notes to say
----
below divider: possible changes to slide
TODO:
Summary slides to show location; recap slides
Design: MS powerpoint help has nice formatting
Checklist:
. Use L+U instead of N (explicit notation)
Tony Ezzat comments: . most questions concerned manifold learning (he predicted it)
must reference and talk about Sam Roweis & Tenenbaum – that’s the work people now
2. 2 Detecting cars
---
Sequences from F:\szummer\data\cars\cd1\Hpn1\04 and 22
3. 3 Outline The partially labeled data problem
Data representations
Markov random walk
Classification criteria
Information Regularization
data representations: modeling assumptions
---
Video sequences before this
4. 4 Learning from partially labeled data - semi-supervised learning
Big question: how can unlabeled data help
Want: to improve classification accuracy – learn with fewer examples
“learning” algorithm = clustering OR classification
------------
5. 5 Semi-supervised learning from an unsupervised perspective labels constrain and repair clusters
really 4 clusters
Example: biologist with a task in mind:
Let’s cluster gene expression data
I already know genes TN3X and TN4L have similar function; cluster so that they fall into the same cluster!
---
TODO: repair examples
example a bit broken since x axis is not scaled; only distances between clusters are increased
(would need to use Matlab to generate something better)
Less important:
Constraints have form: A & B should belong to same/different cluster (pairwise)
6. 6 Semi-supervised learning from a supervised perspective
7. 7 Benefits of semi-supervised learning Labeled data can be
expensive
may require human labor, and additional experiments / measurements
impossible to obtain
labels unavailable at the present time; e.g. for prediction
Unlabeled data can be
abundant and cheap!
e.g. image sequences from video cameras, text documents from the web
Humans can learn with limited feedback
---------
How?
example: novel words in text can be understood using context
8. 8 Can we always benefit from partially labeled data? Not always!
Assumptions required
Labeled and unlabeled data drawn IID from same distribution
Ignorable missingness mechanism
and…
might seem impossible!
word statistics, new words and contexts of words
ignorable missingness mechanism
---
Draw graphical representation of text example
9. 9 Key assumption The structure in the unlabeled data must relate to the desired classification; specifically:
A link between the marginal P(x) and the conditional P(y|x), which our classifier is equipped to exploit
Marginal distribution P(x):
describes the input domain
Conditional distribution P(y|x):
describes the classification
Example assumption: points in the same cluster should have the same label
Speculate: assumptions made for supervised learning – same as for semi-supervised learning, but just used in a stronger way
--
Old: used to explain the joint
10. 10 The learning task Transduction: not for real-time systems, but only way to fully exploit unlabeled data
11. 11 The learning task: notation tilde over y: denotes observed label
here presented as transduction, but may want to learn function when some test points are not yet available
and yet avoid retraining later
Task: transduction – only need to predict values of function at particular points [Vapnik]
---
Greedy approach to semi-supervised learning; label the most confident at each stage
- no confidence information
(could have its own slide)
12. 12 Previous approach: missing data with EM Maximize likelihood of a generative model that accounts for P(x) and P(x,y)
Models P(x) and P(x,y) can be mixtures of Gaussians [Miller & Uyar], or Naïve Bayes [Nigam et al]
Issues: what model? How to weight unlabeled vs. labeled?
[Kowalski – extend the representation]
13. 13 Previous approach: Large margin on unlabeled data Transduction with SVM or MED (max entropy discrimination)
Issues: computational cost
Link between P(x) and P(y|x)
Large margin methods (SVM, boosting)
Decision boundary preferentially lies in low-density regions of P(x)
Optional:
Semi-supervised boosting
--
labels constrain and repair clusters
unlabeled points regularize
Constraints: this point should belong to a given class (pointwise)
14. 14 Outline The partially labeled data problem
Data representations
Markov random walk
Classification criteria
Information Regularization
15. 15 Unsupervised – uses x of all data points
Supervised – uses y of all labeled data points; needs representation of only labeled data points
Fisher kernel – uses a similar approach: train a generative model; then apply it in a classifier [Hofmann]
Theorem: will get a good discriminative classifier
Example:
1) Representation: as given, but normalize each data point
2a) Clustering: spectral method
b) Classification: linear classifier
-----
describe how linear classifier uses similarity of representation
16. 16 Clusters and low-dimensional structures Partially labeled learning can work by:
unlabeled points uncover structure of the data, e.g. clusters, assumed to have generally homogeneous but unknown labels
labeled points suggest class of the clusters
---
add kernel expansion output
Focus more on clusters, not only on manifolds!
17. 17 Representation desiderata Conditional should follow the data manifold – data may lie in a low-dimensional subspace
Example: neighborhood graph
Robustly measure similarity between points. Consider volume of all paths, not just shortest path.
Example: Markov random walk
Variable resolution: adjustable cluster size or number (differentiate points at coarser scales, not at finer scales). Example: number of time steps t of Markov random walk determines whether two points appear indistinguishable
Construct a representation P(i|x_k) that satisfies these goals.
----
“Follow” data manifold (= Respect, Represent, Capture)
Find structure of data
Explicit / implicit clustering
----
18. 18 Example: Markov random walk representation
Local metrics are easier to define.
How do we go from a local metric to a global one?
RHS:
d – we use Euclidean metric
d not d^2 – additive metric
Global representation:
mixture model: each component “generates” / “causes” another point
P_{0|t}(i|k) = P_{t|0}(k|i) P(i) / P(k); with a uniform starting distribution P(i) = 1/N, the LHS is just the forward probability renormalized over i
represent a point as the probability of being generated by a set of components
here: one component for each point; uniform starting probability
given that the random walk reaches a point k, what is the probability of having started at i?
P_{0|t}(i|k) is normalized so that it sums to 1 over i (starting point) instead of k (end point); this is what we'll need for classification
Note: all points must be available at training time
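To make the construction in these notes concrete, here is a minimal numpy sketch of the representation (the function name and the defaults K, sigma, t are illustrative, not values from the talk): build a symmetrized K-nearest-neighbor graph with weights exp(-d/sigma), run the Markov chain for t steps, and invert with a uniform starting distribution to obtain P_{0|t}(i|k).

```python
import numpy as np

def random_walk_representation(X, K=5, sigma=1.0, t=8):
    """Sketch of the Markov random walk representation.
    Returns P0t with P0t[i, k] = P_{0|t}(start = i | end = k)."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # Euclidean distances
    # Symmetrized K-nearest-neighbor graph with self-transitions
    W = np.zeros((N, N))
    nn = np.argsort(D, axis=1)[:, :K + 1]            # includes the point itself
    for i in range(N):
        W[i, nn[i]] = np.exp(-D[i, nn[i]] / sigma)   # d, not d^2 (additive metric)
    W = np.maximum(W, W.T)                           # undirected neighbor relation
    # One-step transitions A[i, k] = P(k | i), then the t-step walk
    A = W / W.sum(axis=1, keepdims=True)
    At = np.linalg.matrix_power(A, t)                # P_{t|0}(k | i)
    # Bayes' rule with a uniform starting distribution:
    # P_{0|t}(i | k) is the forward probability renormalized over starting points i
    return At / At.sum(axis=0, keepdims=True)
```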
19. 19 Representation
Each point k is represented as a vector of (conditional) probabilities over the possible starting states i of a t step random walk ending up in k.
Two points are similar ⇔ their random walks have indistinguishable starting points
If you cannot tell points apart, they will have the same coordinate vector -> low distance.
20. 20 Parameter: length of random walk t Higher t → coarser representation; fewer clusters
Limits: t = 0, ∞ (degenerate)
Choosing t – based on unlabeled data alone
diameter of graph
mixing time of graph (2nd eigenvalue of transition matrix)
Choosing t – based on both labeled + unlabeled data
when labels are consistent over large regions → t is high
criteria: maximize likelihood, or margin, or cross-validation
t regulates scale of clusters (indirectly their number)
t=1 (just 1 time step transition)
diameter of graph: can also use distance time to nearest labeled point
guarantees we can transition from any point to any other point (w/i each connected component)
Limits: t = 0, 1 (look at the formula for the matrix; notice A^0 = I; A^1 = A; A^t → stationary as t → ∞)
but recall we must condition to get the formula
t does not change spectral decomposition of graph
Mixing time in graph: topology dependent
level of mixing: L1 dist or rel L1 dist from stationary dist (TODO Check)
Q: but how set parameters
t >= max 1/(1-lambda_2) * (ln 1/p_i^infty + ln 1/epsilon)
-----
display formula for mixing time?
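As a rough illustration of the mixing-time heuristic quoted above, the following sketch evaluates the bound t ≥ 1/(1−λ₂)·(ln 1/π_min + ln 1/ε) directly from the transition matrix. It assumes a connected graph (so λ₂ < 1); the function name and ε default are illustrative.

```python
import numpy as np

def mixing_time_lower_bound(A, eps=0.01):
    """Rough sketch of t >= 1/(1 - lambda_2) * (ln 1/pi_min + ln 1/eps).
    Assumes A is row-stochastic and its graph is connected."""
    evals, evecs = np.linalg.eig(A.T)
    # Stationary distribution: left eigenvector of A for eigenvalue 1
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = np.abs(pi) / np.abs(pi).sum()
    # Second-largest eigenvalue magnitude
    lam = np.sort(np.abs(evals))[::-1]
    lam2 = lam[1]
    return (1.0 / (1.0 - lam2)) * (np.log(1.0 / pi.min()) + np.log(1.0 / eps))
```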
21. 21 Parameter: local neighborhood size K, kernel width σ
K = number of nearest neighbors
K too low: disconnected components; distorted sense of distances
K too high: local neighborhood relation becomes inaccurate
neighbor relation is made symmetric; self-transitions allowed
σ = local distance decay (random walk)
Influences local smoothness of representation
K too high: leaks in manifold
symmetric neighborhood – want undirected graph
Cross-validation for parameter settings – NO!
----
---
DO insert pictures instead
sigma - kernel width (kernel expansion)
Select kernel widths
a) based on distance to K nearest neighbor
global widths: use median distance [to opposite class]
adaptive widths: shrink in high density regions, expand in low-density ones
b) by theoretical analysis of smoothness of P(i|x)
---
old 1/beta notation - removed
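A small sketch of option (a) above: pick per-point kernel widths from the distance to the K-th nearest neighbor, so widths shrink in high-density regions and expand in low-density ones (the function name and default K are illustrative; the global/median variant on the slide would additionally use the labels).

```python
import numpy as np

def adaptive_kernel_widths(X, K=5):
    """sigma_i = distance from x_i to its K-th nearest neighbor (sketch)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    D_sorted = np.sort(D, axis=1)      # column 0 is the point itself (distance 0)
    return D_sorted[:, K]
```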
22. 22 A Generative Model for the Labels Given: nodes i (corresponding to points x_i)
Given: label distributions Q(y|i) at each node i
Model generates a node identity and a label
1. Draw a node identity i uniformly and a label y ~ Q(y|i)
2. Add t rounds of identity noise: node i is confused with node k according to P(k|i). Label y is intact.
3. Output final identity k, and the label y
During classification: only the noisy node identity is observed, and we want to determine the label y.
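A tiny sketch of the generative process above, assuming a one-step transition matrix A with A[i, k] = P(k|i) and node label distributions Q (all names are illustrative).

```python
import numpy as np

def sample_node_and_label(A, Q, t, rng=None):
    """Generate one (noisy node identity, label) pair as described above."""
    rng = np.random.default_rng() if rng is None else rng
    N = A.shape[0]
    i = rng.integers(N)                    # 1. draw a node identity uniformly
    y = rng.choice(Q.shape[1], p=Q[i])     #    ...and a label y ~ Q(y|i)
    k = i
    for _ in range(t):                     # 2. t rounds of identity noise P(k|i)
        k = rng.choice(N, p=A[k])
    return k, y                            # 3. output the final identity and the label
```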
23. 23 Given the noisy node identity k, infer possible starting node identities i,
and weight their label distributions
Question: how do we obtain Q(y|i)?
Classification model
Consider all random walks from other points that could have ended up in k in t steps.
Assign a label to k based on the conditional probabilities over the starting points.
How do we obtain parameter distributions:
for labeled points, we have the label – but for unlabeled points, we know nothing
24. 24 Classification model (2) Unlike a linear classifier
parameters Q(y|i) are bounded, limiting the effects of outliers
classifier is directly applicable to multiple classes
Link between P(x) and P(y|x): smoothness of the representation
Q(y|i) estimated for labeled points too (previously just assumed a model for it)
Benefits (from hidden slide)
. uses the unlabeled examples
. when kernels are chosen correctly, the estimate is consistent, Bayes optimal classifier
###################
Other estimation criteria: * max joint likelihood * Bayesian estimation * any alg that maintains a probabilistic interpretation of P(y|i)
---
Recall kernel density motivation:
So far: labeled data only in kernel density estimate
Now: relax kernel density assumption of labels for all points
Unlike linear classifier:
averaging behavior
25. 25 Maximize conditional log-likelihood EM algorithm
Much easier than EM with Gaussian mixtures – here we only estimate labels for component i
Conditional / Discriminative model : since we only affect y part of the model
i -> p(i|x_l) fixed
-> p(y|i) learned
Similar to estimating mixture weights
Solution properties: P(y|i) become hard 0, 1 (or if cannot reach labels from a point – soft at 0.5,0.5 – initial condition)
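A minimal sketch of these EM updates, assuming the classifier P(y|x_k) = Σ_i P_{0|t}(i|x_k) Q(y|i) and hard observed labels (labels[k] = class index, or -1 if unlabeled; names and the iteration count are illustrative). Only Q(y|i) is re-estimated; P_{0|t}(i|k) stays fixed, and nodes that no labeled point can reach stay near the uniform initial condition.

```python
import numpy as np

def em_label_distributions(P0t, labels, n_classes=2, n_iter=100):
    """EM over Q(y|i) only; the representation P_{0|t}(i|k) is fixed (sketch)."""
    N = P0t.shape[0]
    labeled = np.where(labels >= 0)[0]
    Q = np.full((N, n_classes), 1.0 / n_classes)        # uniform initial Q(y|i)
    for _ in range(n_iter):
        # E-step: responsibility of start node i for each labeled end point k
        R = np.zeros((N, len(labeled)))
        for j, k in enumerate(labeled):
            num = P0t[:, k] * Q[:, labels[k]]
            R[:, j] = num / (num.sum() + 1e-12)
        # M-step: Q(y|i) proportional to total responsibility from points labeled y
        Qnew = np.zeros((N, n_classes))
        for j, k in enumerate(labeled):
            Qnew[:, labels[k]] += R[:, j]
        Q = (Qnew + 1e-12) / (Qnew + 1e-12).sum(axis=1, keepdims=True)
    post = P0t.T @ Q                                     # P(y|x_k) = sum_i P(i|x_k) Q(y|i)
    return Q, post
```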
26. 26 Swiss roll problem
27. 27 Swiss roll problem K=5 (symmetrized)
28. 28 t=20
29. 29 t=10
30. 30 t=3
31. 31 Summary: Markov Random Walk representation Points are expressed as vectors of probabilities of having been generated by every other point
Related work:
Clustering
Markovian relaxation [Tishby & Slonim 00]
Spectral clustering [Shi & Malik 97; Meila & Shi 00; ++]
Visualization:
Isomap [Tenenbaum 99]
Locally linear embedding [Roweis & Saul 00]
32. 32 Outline The partially labeled data problem
Data representations
Kernel expansion
Markov random walk
Classification criteria
conditional maximum likelihood with EM
maximize average margin
…
Information Regularization
How to train the classifier: how to train the Q(y|i)
have already talked about maximum likelihood estimates
33. 33 Discriminative boundaries Focus on classification decisions more directly than maximum likelihood does
Classify labeled points with a margin
Margin at point xk : confidence of the classifier
ML – objective is not related to classification task
---
When EM slides present:
(even more discriminative than
34. 34 Margin based estimation
maximize average margin
margin definition – as in boosting (Schapire)
margin – confidence measure – few functions f achieve high margin
correct classification with margin gamma
Unbalanced classes – must fix esp for average margin
Closed form not surprising: linear programming has optima at extreme points
---
Margin vs likelihood (see my notes)
Q: How can we get away with a linear program when SVM needs QP?
35. 35 Average margin solution has a closed form Closed form: assign weight 1 to the class with largest total “flow” to point m.
Two rounds of a weighted neighbor classifier
Classify all points based on the labeled points
Classify all points based on the previous classification
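A sketch of this closed form with the same P_{0|t} matrix and label vector as before (names are illustrative): round 1 assigns each start node the class with the largest total flow from the labeled points, round 2 classifies every point from those hard node labels.

```python
import numpy as np

def average_margin_solution(P0t, labels, n_classes=2):
    """Closed-form maximizer of the average margin (sketch)."""
    N = P0t.shape[0]
    labeled = np.where(labels >= 0)[0]
    # Round 1: total flow into node i from labeled points of each class
    flow = np.zeros((N, n_classes))
    for k in labeled:
        flow[:, labels[k]] += P0t[:, k]          # P_{0|t}(i | k), labeled end point k
    Q = np.zeros((N, n_classes))
    Q[np.arange(N), flow.argmax(axis=1)] = 1.0   # hard 0/1 weight for the winning class
    # Round 2: classify every point from the hard node label distributions
    post = P0t.T @ Q                             # P(y | x_k) = sum_i P(i | x_k) Q(y | i)
    return post.argmax(axis=1), post
```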
36. 36 Text classification with Markov random walks ---
no EM on this figure
37. 37 Choosing t based on margin
38. 38 Gene splice sites classification (t=1) Gene splice sites (500 examples, 100 dimensions)
Leukemia (38 training, 34 test examples, 7000 dimensions)
Procedure:
vary # labeled examples; averaged over 20 test runs
Respond to objections: very many labeled points required; representation too flexible
---
SVM reaches 8% error level after ____ examples
MED unlabeled error bars
NN
Fix so that ps previewer works on Postscript (copy directly from Matlab fig instead of inserting Postscript?)
39. 39 Leukemia classification with kernel expansion Promising!
40. 40 Gene splice site (2) ---
Q: Exponentially fast: Cover & Castelli
41. 41 Car Detection
42. 42 Haar wavelet features
43. 43
44. 44
45. 45
46. 46
47. 47
48. 48
49. 49 Adaptive time scales Set time scale to maximize mutual information between label y and node identity k
for unlabeled points only --------
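One plausible sketch of this criterion (all names are illustrative, and `classify` stands for any routine returning posteriors P(y|x_k) from a P_{0|t} matrix, e.g. the posterior output of the average-margin solution sketched earlier): sweep candidate time scales and keep the t whose posteriors, over the unlabeled points only, carry the most mutual information between label y and node identity k.

```python
import numpy as np

def choose_t_by_mutual_information(A, labels, t_candidates, classify):
    """Pick t maximizing I(y; k) over unlabeled points (sketch).
    `classify(P0t, labels)` must return an (N, n_classes) array of posteriors."""
    unlabeled = np.where(labels < 0)[0]
    best_t, best_mi = None, -np.inf
    for t in t_candidates:
        P0t = np.linalg.matrix_power(A, t)
        P0t = P0t / P0t.sum(axis=0, keepdims=True)      # P_{0|t}(i|k), uniform prior
        post = np.clip(classify(P0t, labels)[unlabeled], 1e-12, 1.0)
        p_y = post.mean(axis=0)                          # node identity k ~ uniform
        h_y = -(p_y * np.log2(p_y)).sum()
        h_y_given_k = -(post * np.log2(post)).sum(axis=1).mean()
        mi = h_y - h_y_given_k                           # I(y; k) = H(y) - H(y|k)
        if mi > best_mi:
            best_t, best_mi = t, mi
    return best_t
```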
50. 50 Outline The partially labeled data problem
Data representations
Kernel expansion
Markov random walk
Classification criteria
Information Regularization
51. 51 Information Regularization Overview Markov random walk
Linked P(x) to P(y|x) indirectly through the classification model
Information Regularization
Explicitly and directly links P(x) to P(y|x)
Makes no parametric assumptions on the link
-----
Minimizes information about the labels in covering regions
Is computationally feasible for continuous P(x)
52. 52 Assumption:
Inside small regions with a large number of points, the labeling should not change
Regularization approach:
Cover the domain with small regions, and penalize inhomogeneous labelings in the regions
cluster assumption
--
title: Explicitly linking marginal and conditional
53. 53 Mutual information Mutual information I(x; y) over a region
I(x; y) = how many bits of information does knowledge about x contribute to knowledge about y, on average
I(x ; y) = H(y) – H(y|x), a function of P(x) and P(y|x)
a measure of homogeneity of labels
---
homogeneity – not only; I(x;y) depends on the value of P(y|x). If P(y|x) is close to 0.5, then the same change in P(y|x) gives lower mutual information than if P(y|x) is close to 1.
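A small sketch of I(x; y) = H(y) − H(y|x) for a region containing a finite set of points, with binary labels; `weights` stands in for P(x) restricted to the region, and the names are illustrative.

```python
import numpy as np

def region_mutual_information(p_y_given_x, weights=None):
    """I(x; y) within one region (binary y; sketch).
    p_y_given_x[j] = P(y=1 | x_j) for the points x_j in the region;
    weights[j] is proportional to P(x_j) within the region (uniform if None)."""
    p = np.asarray(p_y_given_x, dtype=float)
    w = np.ones_like(p) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()

    def H(q):  # binary entropy in bits, safe at 0 and 1
        q = np.clip(q, 1e-12, 1 - 1e-12)
        return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

    p_y = np.dot(w, p)                    # P(y=1) averaged over the region
    return H(p_y) - np.dot(w, H(p))       # I(x; y) = H(y) - H(y | x)
```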
54. 54 Mutual Information – a homogeneity measure Example: x = location within the circle; y = {+, –}
Regularizer does not consider spatial configuration within the region, hence the regions must be small to provide spatial locality.
55. 55 Penalize weighted mutual information over a small region Q in the input domain
M_Q = probability mass of x in region Q
high density region → penalize more
V_Q = variance of x in region Q
I_Q/V_Q is independent of the size of Q as Q shrinks
Information Regularization (in small region)
in 1D
M_Q – penalize
---
high-D infomargin
should formula for limiting arg be included?
equals Fisher information of x about the labels
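Putting the pieces of this slide together, a sketch of the per-region penalty M_Q · I_Q / V_Q when the region is represented by the sample points falling inside it (an empirical stand-in for the continuous definitions; it reuses the mutual-information sketch above, and all names are illustrative).

```python
import numpy as np

def region_penalty(x_region, p_y_given_x, total_points):
    """M_Q * I_Q / V_Q for one small region (sketch; x_region is 1D here).
    M_Q ~ fraction of all points falling in the region,
    V_Q ~ empirical variance of x within the region,
    I_Q ~ mutual information within the region (region_mutual_information above)."""
    m_q = len(x_region) / float(total_points)
    v_q = np.var(x_region)
    i_q = region_mutual_information(p_y_given_x)
    return m_q * i_q / (v_q + 1e-12)
```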
56. 56 Information Regularization (whole domain) Cover the domain with small overlapping regions
Regularize each region
Cover should be connected
Example cover: balls centered
at each data point
---
fix multiple redraws, maybe by pasting new figure
kNN ngh size
picture size: 5”7
57. 57 Minimize Max Information Content Minimize the maximum information contained in any region Q in the cover
Mention average margin formulation.
58. 58 Incorporating Noisy Labels Noise level b: from prior knowledge, or cross-validate
Expected error – better!
--
if we have two conflicting labeled points at the same location – use label error
59. 59 Solution Properties
Atomic subregions
solution P(y|x) is constant inside atomic subregions
need only introduce one variable P(y|x) for each atomic subregion, and only for non-empty subregions
can work with a given continuous P(x)
computational feasibility: depends on cover and P(x)
60. 60 Implementation Constrained nonlinear optimization
convex
Newton method (BFGS)
Dual problem shows structure of solution
P(y|x) in an atomic subregion is a weighted geometric mean of label averages P(y|Q) of the regions Q that the subregion belongs to
Cover: preliminary implementation in 1D, for a given continuous density
61. 61 Solution for mixture of Gaussians ---
Filename: D1gauss2R40-talk.fig
62. 62
63. 63 How many regions are needed in the cover?
64. 64 Summary Partially labeled data problem: link P(x) with P(y|x)
Kernel classifier for partially labeled data
Markov random walk representation
Associated parameter inference criteria
Information regularization
Experiments: partially labeled data helps!
65. 65 Conclusions Solutions to the partially labeled problem rely on assumptions at the core of machine learning
Classification with the Markov random walk representation:
works well for text and images; possibly a general alternative to Gaussian mixture models?
Discriminative training via large margin techniques: can be done in closed form!
Information regularization: a very general method of linking P(x) to P(y|x)
Partially labeled data can significantly improve classification performance; enables new applications
Implications:
goes to the very core of machine learning – what are the fundamental assumptions between P(x) and P(y|x)
66. 66 Future directions Related learning tasks
Regression with partially labeled data
Classification with known marginal P(x)
Other types of missing label problems
Noisy labels
incorrect or probabilistic labels
coarse labels (hierarchical labels)
No labels
anomaly detection
positive labels, but no negative labels
Active learning (query learning)
learner asks for labels of unlabeled points it expects to be informative
data comes predominantly from one class, but may contain outliers from other classes
Missing data problems contain partial features x and/or labels y.
---
anomaly detection – can data come from the other class?
does not seem like it!
Multiple instance learning
only label sets of points
example: label image positive if it contains target object anywhere
67. 67 Acknowledgements Tommi Jaakkola
Tommy Poggio
Tom Minka
Andy Crane
---
Media lab logo
68. 68 Xtra slides Text dimensionality