1. 1 Learning from Partially Labeled Data Martin Szummer
MIT AI lab & CBCL
szummer@ai.mit.edu
http://www.ai.mit.edu/people/szummer/
Notes:
above divider: notes to say
----
below divider: possible changes to slide
TODO:
Summary slides to show location; recap slides
Design: MS powerpoint help has nice formatting
Checklist:
. Use L+U instead of N (explicit notation)
Tony Ezzat comments: . most questions concerned manifold learning (he predicted it)
must reference and talk about Sam Roweis & Tenenbaum – that’s the work people now
2. 2 Detecting cars
---
Sequences from F:\szummer\data\cars\cd1\Hpn1\04 and 22
3. 3 Outline The partially labeled data problem
Data representations
Markov random walk
Classification criteria
Information Regularization
data representations: modeling assumptions
---
Video sequences before this
4. 4 Learning from partially labeled data - semi-supervised learning
Big question: how can unlabeled data help
Want: to improve classification accuracy – learn with fewer examples
“learning” algorithm = clustering OR classification
------------
5. 5 Semi-supervised learning from an unsupervised perspective labels constrain and repair clusters
really 4 clusters
Example: biologist with a task in mind:
Let’s cluster gene expression data
I already know genes TN3X and TN4L have similar function; cluster so that they fall into the same cluster!
---
TODO: repair examples
example a bit broken since x axis is not scaled; only distances between clusters are increased
(would need to use Matlab to generate something better)
Less important:
Constraints have form: A & B should belong to same/different cluster (pairwise)
6. 6 Semi-supervised learning from a supervised perspective
7. 7 Benefits of semi-supervised learning Labeled data can be
expensive
may require human labor, and additional experiments / measurements
impossible to obtain
labels unavailable at the present time; e.g. for prediction
Unlabeled data can be
abundant and cheap!
e.g. image sequences from video cameras, text documents from the web
Humans can learn with limited feedback
---------
How?
example: novel words in text can be understood using context
8. 8 Can we always benefit from partially labeled data? Not always!
Assumptions required
Labeled and unlabeled data drawn IID from same distribution
Ignorable missingness mechanism
and…
might seem impossible!
word statistics, new words and contexts of words
ignorable missingness mechanism
---
Draw graphical representation of text example
9. 9 Key assumption The structure in the unlabeled data must relate to the desired classification; specifically:
A link between the marginal P(x) and the conditional P(y|x), which our classifier is equipped to exploit
Marginal distribution P(x):
describes the input domain
Conditional distribution P(y|x):
describes the classification
Example assumption: points in the same cluster should have the same label
Speculate: assumptions made for supervised learning – same as for semi-supervised learning, but just used in a stronger way
--
Old: used to explain the joint
10. 10 The learning task Transduction: not for real-time systems, but only way to fully exploit unlabeled data
11. 11 The learning task: notation tilde over y: denotes observed label
here presented as transduction, but may want to learn function when some test points are not yet available
and yet avoid retraining later
Task: transduction – only need to predict values of function at particular points [Vapnik]
---
Greedy approach to semi-supervised learning; label the most confident at each stage
- no confidence information
(could have its own slide)
12. 12 Previous approach: missing data with EM Maximize likelihood of a generative model that accounts for P(x) and P(x,y)
Models P(x) and P(x,y) can be mixtures of Gaussians [Miller & Uyar], or Naïve Bayes [Nigam et al]
Issues: what model? How to weight unlabeled vs. labeled?
[Kowalski – extend the representation]
13. 13 Previous approach: Large margin on unlabeled data Transduction with SVM or MED (max entropy discrimination)
Issues: computational cost
Link between P(x) and P(y|x)
Large margin methods (SVM, boosting)
Decision boundary preferentially lies in low-density regions of P(x)
Optional:
Semi-supervised boosting
--
labels constrain and repair clusters
unlabeled points regularize
Constraints: this point should belong to a given class (pointwise)
14. 14 Outline The partially labeled data problem
Data representations
Markov random walk
Classification criteria
Information Regularization
15. 15 Unsupervised – uses x of all data points
Supervised – uses y of all labeled data points; needs representation of only labeled data points
Fisher kernel – uses a similar approach: train a generative model; then apply it in a classifier [Hofmann]
Theorem: will get a good discriminative classifier
Example:
1) Representation: as given, but normalize each data point
2a) Clustering: spectral method
b) Classification: linear classifier
-----
describe how linear classifier uses similarity of representation
16. 16 Clusters and low-dimensional structures Partially labeled learning can work by:
unlabeled points uncover structure of the data, e.g. clusters, assumed to have generally homogeneous but unknown labels
labeled points suggest class of the clusters
---
add kernel expansion output
Focus more on clusters, not only on manifolds!
17. 17 Representation desiderata Conditional should follow the data manifold – data may lie in a low-dimensional subspace
Example: neighborhood graph
Robustly measure similarity between points. Consider volume of all paths, not just shortest path.
Example: Markov random walk
Variable resolution: adjustable cluster size or number (differentiate points at coarser scales, not at finer scales). Example: number of time steps t of Markov random walk determines whether two points appear indistinguishable
Construct a representation P(i|x_k) that satisfies these goals.
----
“Follow” data manifold (= Respect, Represent, Capture)
Find structure of data
Explicit / implicit clustering
----
18. 18 Example: Markov random walk representation
Local metrics are easier to define.
How do we go from a local metric to a global one?
RHS:
d – we use Euclidean metric
d not d^2 – additive metric
Global representation:
mixture model: each component “generates” / “causes” another point
P_{0|t}(i|k) = P_{t|0}(k|i) P(i) / P(k); with a uniform starting distribution P(i) = 1/N, the LHS is just the forward probability renormalized over i
represent a point as the probability of being generated by a set of components
here: one component for each point; uniform starting probability
given that the random walk reaches a point k, what is the probability of having started at i?
P_{0|t}(i|k) is normalized so that it sums to 1 over i (starting point) instead of k (end point); this is what we'll need for classification
Note: all points must be available at training time
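To make the construction in these notes concrete, here is a minimal numpy sketch of the representation (the function name and the defaults K, sigma, t are illustrative, not values from the talk): build a symmetrized K-nearest-neighbor graph with weights exp(-d/sigma), run the Markov chain for t steps, and invert with a uniform starting distribution to obtain P_{0|t}(i|k).

```python
import numpy as np

def random_walk_representation(X, K=5, sigma=1.0, t=8):
    """Sketch of the Markov random walk representation.
    Returns P0t with P0t[i, k] = P_{0|t}(start = i | end = k)."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # Euclidean distances
    # Symmetrized K-nearest-neighbor graph with self-transitions
    W = np.zeros((N, N))
    nn = np.argsort(D, axis=1)[:, :K + 1]            # includes the point itself
    for i in range(N):
        W[i, nn[i]] = np.exp(-D[i, nn[i]] / sigma)   # d, not d^2 (additive metric)
    W = np.maximum(W, W.T)                           # undirected neighbor relation
    # One-step transitions A[i, k] = P(k | i), then the t-step walk
    A = W / W.sum(axis=1, keepdims=True)
    At = np.linalg.matrix_power(A, t)                # P_{t|0}(k | i)
    # Bayes' rule with a uniform starting distribution:
    # P_{0|t}(i | k) is the forward probability renormalized over starting points i
    return At / At.sum(axis=0, keepdims=True)
```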
19. 19 Representation
Each point k is represented as a vector of (conditional) probabilities over the possible starting states i of a t step random walk ending up in k.
Two points are similar ⇔ their random walks have indistinguishable starting points
If you cannot tell points apart, they will have the same coordinate vector -> low distance.
20. 20 Parameter: length of random walk t Higher t → coarser representation; fewer clusters
Limits: t = 0, ∞ (degenerate)
Choosing t – based on unlabeled data alone
diameter of graph
mixing time of graph (2nd eigenvalue of transition matrix)
Choosing t – based on both labeled + unlabeled data
when labels are consistent over large regions → t is high
criteria: maximize likelihood, or margin, or cross-validation
t regulates scale of clusters (indirectly their number)
t=1 (just 1 time step transition)
diameter of graph: can also use distance time to nearest labeled point
guarantees we can transition from any point to any other point (w/i each connected component)
Limits: t = 0, 1 (look at the formula for the matrix; notice A^0 = I; A^1 = A; A^t → stationary as t → ∞)
but recall we must condition to get the formula
t does not change spectral decomposition of graph
Mixing time in graph: topology dependent
level of mixing: L1 dist or rel L1 dist from stationary dist (TODO Check)
Q: but how set parameters
t >= max 1/(1-lambda_2) * (ln 1/p_i^infty + ln 1/epsilon)
-----
display formula for mixing time?
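As a rough illustration of the mixing-time heuristic quoted above, the following sketch evaluates the bound t ≥ 1/(1−λ₂)·(ln 1/π_min + ln 1/ε) directly from the transition matrix. It assumes a connected graph (so λ₂ < 1); the function name and ε default are illustrative.

```python
import numpy as np

def mixing_time_lower_bound(A, eps=0.01):
    """Rough sketch of t >= 1/(1 - lambda_2) * (ln 1/pi_min + ln 1/eps).
    Assumes A is row-stochastic and its graph is connected."""
    evals, evecs = np.linalg.eig(A.T)
    # Stationary distribution: left eigenvector of A for eigenvalue 1
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = np.abs(pi) / np.abs(pi).sum()
    # Second-largest eigenvalue magnitude
    lam = np.sort(np.abs(evals))[::-1]
    lam2 = lam[1]
    return (1.0 / (1.0 - lam2)) * (np.log(1.0 / pi.min()) + np.log(1.0 / eps))
```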
21. 21 Parameter: local neighborhood size K, kernel width σ
K = number of nearest neighbors
K too low: disconnected components; distorted sense of distances
K too high: local neighborhood relation becomes inaccurate
neighbor relation is made symmetric; self-transitions allowed
σ = local distance decay (random walk)
Influences local smoothness of representation
K too high: leaks in manifold
symmetric neighborhood – want undirected graph
Cross-validation for parameter settings – NO!
----
---
DO insert pictures instead
sigma - kernel width (kernel expansion)
Select kernel widths
a) based on distance to K nearest neighbor
global widths: use median distance [to opposite class]
adaptive widths: shrink in high density regions, expand in low-density ones
b) by theoretical analysis of smoothness of P(i|x)
---
old 1/beta notation - removed
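A small sketch of option (a) above: pick per-point kernel widths from the distance to the K-th nearest neighbor, so widths shrink in high-density regions and expand in low-density ones (the function name and default K are illustrative; the global/median variant on the slide would additionally use the labels).

```python
import numpy as np

def adaptive_kernel_widths(X, K=5):
    """sigma_i = distance from x_i to its K-th nearest neighbor (sketch)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    D_sorted = np.sort(D, axis=1)      # column 0 is the point itself (distance 0)
    return D_sorted[:, K]
```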
22. 22 A Generative Model for the Labels Given: nodes i (corresponding to points x_i)
Given: label distributions Q(y|i) at each node i
Model generates a node identity and a label
1. Draw a node identity i uniformly and a label y ~ Q(y|i)
2. Add t rounds of identity noise: node i is confused with node k according to P(k|i). Label y is intact.
3. Output final identity k, and the label y
During classification: only the noisy node identity is observed, and we want to determine the label y.
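A tiny sketch of the generative process above, assuming a one-step transition matrix A with A[i, k] = P(k|i) and node label distributions Q (all names are illustrative).

```python
import numpy as np

def sample_node_and_label(A, Q, t, rng=None):
    """Generate one (noisy node identity, label) pair as described above."""
    rng = np.random.default_rng() if rng is None else rng
    N = A.shape[0]
    i = rng.integers(N)                    # 1. draw a node identity uniformly
    y = rng.choice(Q.shape[1], p=Q[i])     #    ...and a label y ~ Q(y|i)
    k = i
    for _ in range(t):                     # 2. t rounds of identity noise P(k|i)
        k = rng.choice(N, p=A[k])
    return k, y                            # 3. output the final identity and the label
```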
23. 23 Given the noisy node identity k, infer possible starting node identities i,
and weight their label distributions
Question: how do we obtain Q(y|i)?
Classification model
Consider all random walks from other points that could have ended up in k in t steps.
Assign a label to k based on the conditional probabilities over the starting points.
How do we obtain parameter distributions:
for labeled points, we have the label – but for unlabeled points, we know nothing
24. 24 Classification model (2) Unlike a linear classifier
parameters Q(y|i) are bounded, limiting the effects of outliers
classifier is directly applicable to multiple classes
Link between P(x) and P(y|x): smoothness of the representation
Q(y|i) estimated for labeled points too (previously just assumed a model for it)
Benefits (from hidden slide)
. uses the unlabeled examples
. when kernels are chosen correctly, the estimate is consistent, Bayes optimal classifier
###################
Other estimation criteria: * max joint likelihood * Bayesian estimation * any alg that maintains a probabilistic interpretation of P(y|i)
---
Recall kernel density motivation:
So far: labeled data only in kernel density estimate
Now: relax kernel density assumption of labels for all points
Unlike linear classifier:
averaging behavior
25. 25 Maximize conditional log-likelihood EM algorithm
Much easier than EM with Gaussian mixtures – here we only estimate labels for component i
Conditional / Discriminative model : since we only affect y part of the model
i -> p(i|x_l) fixed
-> p(y|i) learned
Similar to estimating mixture weights
Solution properties: P(y|i) become hard 0, 1 (or if cannot reach labels from a point – soft at 0.5,0.5 – initial condition)
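A minimal sketch of these EM updates, assuming the classifier P(y|x_k) = Σ_i P_{0|t}(i|x_k) Q(y|i) and hard observed labels (labels[k] = class index, or -1 if unlabeled; names and the iteration count are illustrative). Only Q(y|i) is re-estimated; P_{0|t}(i|k) stays fixed, and nodes that no labeled point can reach stay near the uniform initial condition.

```python
import numpy as np

def em_label_distributions(P0t, labels, n_classes=2, n_iter=100):
    """EM over Q(y|i) only; the representation P_{0|t}(i|k) is fixed (sketch)."""
    N = P0t.shape[0]
    labeled = np.where(labels >= 0)[0]
    Q = np.full((N, n_classes), 1.0 / n_classes)        # uniform initial Q(y|i)
    for _ in range(n_iter):
        # E-step: responsibility of start node i for each labeled end point k
        R = np.zeros((N, len(labeled)))
        for j, k in enumerate(labeled):
            num = P0t[:, k] * Q[:, labels[k]]
            R[:, j] = num / (num.sum() + 1e-12)
        # M-step: Q(y|i) proportional to total responsibility from points labeled y
        Qnew = np.zeros((N, n_classes))
        for j, k in enumerate(labeled):
            Qnew[:, labels[k]] += R[:, j]
        Q = (Qnew + 1e-12) / (Qnew + 1e-12).sum(axis=1, keepdims=True)
    post = P0t.T @ Q                                     # P(y|x_k) = sum_i P(i|x_k) Q(y|i)
    return Q, post
```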
26. 26 Swiss roll problem
27. 27 Swiss roll problem K=5 (symmetrized)
28. 28 t=20
29. 29 t=10
30. 30 t=3
31. 31 Summary: Markov Random Walk representation Points are expressed as vectors of probabilities of having been generated by every other point
Related work:
Clustering
Markovian relaxation [Tishby & Slonim 00]
Spectral clustering [Shi & Malik 97; Meila & Shi 00; ++]
Visualization:
Isomap [Tenenbaum 99]
Locally linear embedding [Roweis & Saul 00]
32. 32 Outline The partially labeled data problem
Data representations
Kernel expansion
Markov random walk
Classification criteria
conditional maximum likelihood with EM
maximize average margin
…
Information Regularization
How to train the classifier: how to train the Q(y|i)
have already talked about maximum likelihood estimates
33. 33 Discriminative boundaries Focus on classification decisions more directly than maximum likelihood does
Classify labeled points with a margin
Margin at point xk : confidence of the classifier
ML – objective is not related to classification task
---
When EM slides present:
(even more discriminative than
34. 34 Margin based estimation
maximize average margin
margin definition – as in boosting (Schapire)
margin – confidence measure – few functions f achieve high margin
correct classification with margin gamma
Unbalanced classes – must fix esp for average margin
Closed form not surprising: linear programming has optima at extreme points
---
Margin vs likelihood (see my notes)
Q: How can we get away with a linear program when SVM needs QP?
35. 35 Average margin solution has a closed form Closed form: assign weight 1 to the class with largest total “flow” to point m.
Two rounds of a weighted neighbor classifier
Classify all points based on the labeled points
Classify all points based on the previous classification
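A sketch of this closed form with the same P_{0|t} matrix and label vector as before (names are illustrative): round 1 assigns each start node the class with the largest total flow from the labeled points, round 2 classifies every point from those hard node labels.

```python
import numpy as np

def average_margin_solution(P0t, labels, n_classes=2):
    """Closed-form maximizer of the average margin (sketch)."""
    N = P0t.shape[0]
    labeled = np.where(labels >= 0)[0]
    # Round 1: total flow into node i from labeled points of each class
    flow = np.zeros((N, n_classes))
    for k in labeled:
        flow[:, labels[k]] += P0t[:, k]          # P_{0|t}(i | k), labeled end point k
    Q = np.zeros((N, n_classes))
    Q[np.arange(N), flow.argmax(axis=1)] = 1.0   # hard 0/1 weight for the winning class
    # Round 2: classify every point from the hard node label distributions
    post = P0t.T @ Q                             # P(y | x_k) = sum_i P(i | x_k) Q(y | i)
    return post.argmax(axis=1), post
```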
36. 36 Text classification with Markov random walks ---
no EM on this figure
37. 37 Choosing t based on margin
38. 38 Gene splice sites classification (t=1) Gene splice sites (500 examples, 100 dimensions)
Leukemia (38 training, 34 test examples, 7000 dimensions)
Procedure:
vary # labeled examples; averaged over 20 test runs
Respond to objections: very many labeled points required; representation too flexible
---
SVM reaches 8% error level after ____ examples
MED unlabeled error bars
NN
Fix so that ps previewer works on Postscript (copy directly from Matlab fig instead of inserting Postscript?)
39. 39 Leukemia classification with kernel expansion Promising!
40. 40 Gene splice site (2) ---
Q: Exponentially fast: Cover & Castelli
41. 41 Car Detection
42. 42 Haar wavelet features
43. 43
44. 44
45. 45
46. 46
47. 47
48. 48
49. 49 Adaptive time scales Set time scale to maximize mutual information between label y and node identity k
for unlabeled points only --------
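One plausible sketch of this criterion (all names are illustrative, and `classify` stands for any routine returning posteriors P(y|x_k) from a P_{0|t} matrix, e.g. the posterior output of the average-margin solution sketched earlier): sweep candidate time scales and keep the t whose posteriors, over the unlabeled points only, carry the most mutual information between label y and node identity k.

```python
import numpy as np

def choose_t_by_mutual_information(A, labels, t_candidates, classify):
    """Pick t maximizing I(y; k) over unlabeled points (sketch).
    `classify(P0t, labels)` must return an (N, n_classes) array of posteriors."""
    unlabeled = np.where(labels < 0)[0]
    best_t, best_mi = None, -np.inf
    for t in t_candidates:
        P0t = np.linalg.matrix_power(A, t)
        P0t = P0t / P0t.sum(axis=0, keepdims=True)      # P_{0|t}(i|k), uniform prior
        post = np.clip(classify(P0t, labels)[unlabeled], 1e-12, 1.0)
        p_y = post.mean(axis=0)                          # node identity k ~ uniform
        h_y = -(p_y * np.log2(p_y)).sum()
        h_y_given_k = -(post * np.log2(post)).sum(axis=1).mean()
        mi = h_y - h_y_given_k                           # I(y; k) = H(y) - H(y|k)
        if mi > best_mi:
            best_t, best_mi = t, mi
    return best_t
```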
50. 50 Outline The partially labeled data problem
Data representations
Kernel expansion
Markov random walk
Classification criteria
Information Regularization
51. 51 Information Regularization Overview Markov random walk
Linked P(x) to P(y|x) indirectly through the classification model
Information Regularization
Explicitly and directly links P(x) to P(y|x)
Makes no parametric assumptions on the link
-----
Minimizes information about the labels in covering regions
Is computationally feasible for continuous P(x)
52. 52 Assumption:
Inside small regions with a large number of points, the labeling should not change
Regularization approach:
Cover the domain with small regions, and penalize inhomogeneous labelings in the regions
cluster assumption
--
title: Explicitly linking marginal and conditional
53. 53 Mutual information Mutual information I(x; y) over a region
I(x; y) = how many bits of information does knowledge about x contribute to knowledge about y, on average
I(x ; y) = H(y) – H(y|x), a function of P(x) and P(y|x)
a measure of homogeneity of labels
---
homogeneity – not only; I(x;y) depends on the value of P(y|x). If P(y|x) is close to 0.5, then the same change in P(y|x) gives lower mutual information than if P(y|x) is close to 1.
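A small sketch of I(x; y) = H(y) − H(y|x) for a region containing a finite set of points, with binary labels; `weights` stands in for P(x) restricted to the region, and the names are illustrative.

```python
import numpy as np

def region_mutual_information(p_y_given_x, weights=None):
    """I(x; y) within one region (binary y; sketch).
    p_y_given_x[j] = P(y=1 | x_j) for the points x_j in the region;
    weights[j] is proportional to P(x_j) within the region (uniform if None)."""
    p = np.asarray(p_y_given_x, dtype=float)
    w = np.ones_like(p) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()

    def H(q):  # binary entropy in bits, safe at 0 and 1
        q = np.clip(q, 1e-12, 1 - 1e-12)
        return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

    p_y = np.dot(w, p)                    # P(y=1) averaged over the region
    return H(p_y) - np.dot(w, H(p))       # I(x; y) = H(y) - H(y | x)
```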
54. 54 Mutual Information – a homogeneity measure Example: x = location within the circle; y = {+, –}
Regularizer does not consider spatial configuration within the region, hence the regions must be small to provide spatial locality.
55. 55 Penalize weighted mutual information over a small region Q in the input domain
M_Q = probability mass of x in region Q
high density region → penalize more
V_Q = variance of x in region Q
I_Q/V_Q is independent of the size of Q as Q shrinks
Information Regularization (in small region)
in 1D
M_Q – penalize
---
high-D infomargin
should formula for limiting arg be included?
equals Fisher information of x about the labels
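Putting the pieces of this slide together, a sketch of the per-region penalty M_Q · I_Q / V_Q when the region is represented by the sample points falling inside it (an empirical stand-in for the continuous definitions; it reuses the mutual-information sketch above, and all names are illustrative).

```python
import numpy as np

def region_penalty(x_region, p_y_given_x, total_points):
    """M_Q * I_Q / V_Q for one small region (sketch; x_region is 1D here).
    M_Q ~ fraction of all points falling in the region,
    V_Q ~ empirical variance of x within the region,
    I_Q ~ mutual information within the region (region_mutual_information above)."""
    m_q = len(x_region) / float(total_points)
    v_q = np.var(x_region)
    i_q = region_mutual_information(p_y_given_x)
    return m_q * i_q / (v_q + 1e-12)
```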
56. 56 Information Regularization (whole domain) Cover the domain with small overlapping regions
Regularize each region
Cover should be connected
Example cover: balls centered
at each data point
---
fix multiple redraws, maybe by pasting new figure
kNN ngh size
picture size: 5”7
57. 57 Minimize Max Information Content Minimize the maximum information contained in any region Q in the cover
Mention average margin formulation.
58. 58 Incorporating Noisy Labels Noise level b: from prior knowledge, or cross-validate
Expected error – better!
--
if we have two conflicting labeled points at the same location – use label error
59. 59 Solution Properties
Atomic subregions
solution P(y|x) is constant inside atomic subregions
need only introduce one variable P(y|x) for each atomic subregion, and only for non-empty subregions
can work with a given continuous P(x)
computational feasibility: depends on cover and P(x)
60. 60 Implementation Constrained nonlinear optimization
convex
Newton method (BFGS)
Dual problem shows structure of solution
P(y|x) in an atomic subregion is a weighted geometric mean of label averages P(y|Q) of the regions Q that the subregion belongs to
Cover: preliminary implementation in 1D, for a given continuous density
61. 61 Solution for mixture of Gaussians ---
Filename: D1gauss2R40-talk.fig
62. 62
63. 63 How many regions are needed in the cover?
64. 64 Summary Partially labeled data problem: link P(x) with P(y|x)
Kernel classifier for partially labeled data
Markov random walk representation
Associated parameter inference criteria
Information regularization
Experiments: partially labeled data helps!
65. 65 Conclusions Solutions to the partially labeled problem rely on assumptions at the core of machine learning
Classification with the Markov random walk representation:
works well for text and images; possibly a general alternative to Gaussian mixture models?
Discriminative training via large margin techniques: can be done in closed form!
Information regularization: a very general method of linking P(x) to P(y|x)
Partially labeled data can significantly improve classification performance; enables new applications
Implications:
goes to the very core of machine learning – what are the fundamental assumptions between P(x) and P(y|x)
66. 66 Future directions Related learning tasks
Regression with partially labeled data
Classification with known marginal P(x)
Other types of missing label problems
Noisy labels
incorrect or probabilistic labels
coarse labels (hierarchical labels)
No labels
anomaly detection
positive labels, but no negative labels
Active learning (query learning)
learner asks for labels of unlabeled points it expects to be informative
data comes predominantly from one class, but may contain outliers from other classes
Missing data problems contain partial features x and/or labels y.
---
anomaly detection – can data come from the other class?
does not seem like it!
Multiple instance learning
only label sets of points
example: label image positive if it contains target object anywhere
67. 67 Acknowledgements Tommi Jaakkola
Tommy Poggio
Tom Minka
Andy Crane
---
Media lab logo
68. 68 Xtra slides Text dimensionality