
Learning with Limited Supervision by Input and Output Coding

Presentation Transcript


  1. Learning with Limited Supervision by Input and Output Coding Yi Zhang Machine Learning Department Carnegie Mellon University April 30th, 2012

  2. Thesis Committee • Jeff Schneider, Chair • Geoff Gordon • Tom Mitchell • Xiaojin (Jerry) Zhu, University of Wisconsin-Madison

  3. Introduction • Learning a prediction system, usually based on examples • Training examples are usually limited • Cost of obtaining high-quality examples • Complexity of the prediction problem [Diagram: training examples (x1,y1) … (xn,yn) → Learning → a predictor mapping X to Y]

  4. Introduction • Solution: exploit extra information about the input and output space • Improve the prediction performance • Reduce the cost of collecting training examples [Diagram: training examples (x1,y1) … (xn,yn) → Learning → a predictor mapping X to Y]

  5. Introduction • Solution: exploit extra information about the input and output space • Representation and discovery? • Incorporation? [Diagram: training examples (x1,y1) … (xn,yn), plus extra input and output information (marked "?"), → Learning → a predictor mapping X to Y]

  6. Outline • Part I: Encoding Input Information by Regularization • Learning with word correlation • A matrix-normal penalty for multi-task learning • Learn compressible models • Projection penalties • Part II: Encoding Output Information by Output Codes • Composite likelihood for pairwise coding • Multi-label output codes with CCA • Maximum-margin output coding

  7. Regularization • The general formulation • Ridge regression • Lasso
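The formulas on slide 7 were rendered as images and did not survive the transcript; as a reminder, the standard forms (with λ denoting the regularization weight) are:

    % General regularized formulation: empirical loss plus a penalty R(w)
    \min_{w} \; \sum_{i=1}^{n} L(y_i, w^\top x_i) + \lambda R(w)

    % Ridge regression: squared loss with an L2 penalty
    \min_{w} \; \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \|w\|_2^2

    % Lasso: squared loss with an L1 penalty
    \min_{w} \; \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \|w\|_1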

  8. Outline • Part I: Encoding Input Information by Regularization • Learning with word correlation • A matrix-normal penalty for multi-task learning • Learn compressible models • Projection penalties • Part II: Encoding Output Information by Output Codes • Composite likelihood for pairwise coding • Multi-label output codes with CCA • Maximum-margin output coding

  9. Learning with unlabeled text • For a text classification task • There is plenty of unlabeled text on the Web • It is seemingly unrelated to the task • What can we gain from such unlabeled text? Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text. NIPS 2008

  10. A motivating example for text learning • Humans learn text classification effectively! • Two training examples: • +: [gasoline, truck] • -: [vote, election] • Query: • [gallon, vehicle] • Seems very easy! But why?

  11. A motivating example for text learning • Humans learn text classification effectively! • Two training examples: • +: [gasoline, truck] • -: [vote, election] • Query: • [gallon, vehicle] • Seems very easy! But why? • Gasoline ~ gallon, truck ~ vehicle

  12. A covariance operator for regularization • Covariance structure of model coefficients • Usually unknown -- learn from unlabeled text?

  13. Learning with unlabeled text • Infer the covariance operator • Extract latent topics from unlabeled text (with resampling) • Observe the contribution of words in each topic [gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …] • Estimate the correlation (covariance) of words

  14. Learning with unlabeled text • Infer the covariance operator • Extract latent topics from unlabeled text (with resampling) • Observe the contribution of words in each topic [gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …] • Estimate the correlation (covariance) of words • For a new task, we learn with regularization
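The objective on slide 14 was an image; a sketch consistent with the Gaussian-prior view of slide 12, where Σ is the word covariance estimated from the unlabeled text, is:

    % Regularize with the inverse of the estimated word covariance Σ
    % (equivalently, a Gaussian prior on w with covariance proportional to Σ)
    \min_{w} \; \sum_{i=1}^{n} L(y_i, w^\top x_i) + \lambda\, w^\top \Sigma^{-1} w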

  15. Experiments • Empirical results on 20 newsgroups • 190 1-vs-1 classification tasks, 2% labeled examples • For any given task, the majority of the unlabeled text (18 of the 20 newsgroups) is irrelevant • Similar results with logistic regression and least squares [1] V. Sindhwani and S. Keerthi. Large scale semi-supervised linear SVMs. In SIGIR, 2006

  16. Outline • Part I: Encoding Input Information by Regularization • Multi-task generalization • Learning with word correlation • A matrix-normal penalty for multi-task learning • Learn compressible models • Projection penalties • Part II: Encoding Output Information by Output Codes • Composite likelihood for pairwise coding • Multi-label output codes with CCA • Maximum-margin output coding

  17. Multi-task learning • Different but related prediction tasks • An example • Landmine detection using radar images • Multiple tasks: different landmine fields • Geographic conditions • Landmine types • Goal: information sharing among tasks

  18. Regularization for multi-task learning • Our approach: view MTL as estimating a parameter matrix W, with one row per task and one column per feature dimension

  19. Regularization for multi-task learning • Our approach: view MTL as estimating a parameter matrix W • A covariance operator for regularizing a matrix? • Vector w: a covariance operator gives a Gaussian prior • Matrix W: what is the analogous prior? Yi Zhang and Jeff Schneider. Learning Multiple Tasks with a Sparse Matrix-Normal Penalty. NIPS 2010

  20. Matrix-normal distributions • Consider a 2 by 3 matrix W • The full covariance of its entries = the Kronecker product of the row covariance and the column covariance

  21. Matrix-normal distributions • Consider a 2 by 3 matrix W • The full covariance of its entries = the Kronecker product of the row covariance and the column covariance • The matrix-normal density offers a compact form for the full covariance
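The density on slide 21 was an image; the standard matrix-normal form for an m-by-n matrix W with mean M, row covariance Σ (m by m), and column covariance Ω (n by n) is:

    % vec(W) is Gaussian with the Kronecker-structured full covariance Ω ⊗ Σ
    \mathrm{vec}(W) \sim \mathcal{N}\big(\mathrm{vec}(M),\; \Omega \otimes \Sigma\big)

    % Matrix-normal density: a compact form of that full covariance
    p(W \mid M, \Sigma, \Omega) =
      \frac{\exp\!\big(-\tfrac{1}{2}\,\mathrm{tr}\big[\Sigma^{-1}(W-M)\,\Omega^{-1}(W-M)^\top\big]\big)}
           {(2\pi)^{mn/2}\, |\Sigma|^{n/2}\, |\Omega|^{m/2}}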

  22. Learning with a matrix-normal penalty • Joint learning of multiple tasks under a matrix-normal prior • Alternating optimization

  23. Learning with a matrix-normal penalty • Joint learning of multiple tasks under a matrix-normal prior • Alternating optimization • Other recent work can be viewed as variants of special cases • Multi-task feature learning [Argyriou et al., NIPS 06]: learning with the feature covariance • Clustered multi-task learning [Jacob et al., NIPS 08]: learning with the task covariance and spectral constraints • Multi-task relationship learning [Zhang et al., UAI 10]: learning with the task covariance
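The objective on slides 22-23 was shown as an image; a sketch of the general shape such a formulation takes (rows of W are the task parameter vectors w_t; the exact parameterization and constants are as in the NIPS 2010 paper) is:

    % Joint multi-task objective with a matrix-normal penalty on W
    \min_{W} \; \sum_{t=1}^{m} \sum_{i=1}^{n_t} L\big(y_i^{(t)},\, w_t^\top x_i^{(t)}\big)
        + \lambda\, \mathrm{tr}\big(\Sigma^{-1} W\, \Omega^{-1} W^\top\big)

    % Alternating optimization:
    %   1. Fix Σ and Ω, solve the (convex) regularized problem for W.
    %   2. Fix W, update Σ ∝ W Ω^{-1} W^T and Ω ∝ W^T Σ^{-1} W (flip-flop updates).
    %   3. Repeat until convergence.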

  24. Sparse covariance selection • Sparse covariance selection in matrix-normal penalties • Sparsity of the inverse row and column covariances • Conditional independence among rows (tasks) and among columns (feature dimensions) of W

  25. Sparse covariance selection • Sparse covariance selection in matrix-normal penalties • Sparsity of the inverse row and column covariances • Conditional independence among rows (tasks) and among columns (feature dimensions) of W • Alternating optimization • Estimating W: same as before • Estimating the row and column covariances: L1-penalized covariance estimation
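Slide 25's "L1-penalized covariance estimation" is stated without a formula; a graphical-lasso-style sketch for the row covariance (and analogously for the column covariance) is:

    % Sparse inverse covariance estimation for the task (row) covariance:
    % S is the current sample covariance of the rows of W, ρ the sparsity weight.
    \min_{\Sigma^{-1} \succ 0} \; \mathrm{tr}\big(S\, \Sigma^{-1}\big) - \log\big|\Sigma^{-1}\big| + \rho\, \big\|\Sigma^{-1}\big\|_1

    % Zeros in Σ^{-1} encode conditional independences among tasks;
    % zeros in Ω^{-1} encode conditional independences among feature dimensions.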

  26. Results on multi-task learning • Landmine detection: multiple landmine fields • Face recognition: multiple 1-vs-1 tasks [1] Jacob, Bach, and Vert. Clustered multi-task learning: A convex formulation. NIPS, 2008 [2] Argyriou, Evgeniou, and Pontil. Multi-task feature learning. NIPS, 2006

  27. Outline • Part I: Encoding Input Information by Regularization • Multi-task generalization • Learning with word correlation • A matrix-normal penalty for multi-task learning • Go beyond covariance and correlation structures • Learn compressible models • Projection penalties • Part II: Encoding Output Information by Output Codes • Composite likelihood for pairwise coding • Multi-label output codes with CCA • Maximum-margin output coding

  28. Learning compressible models • A compression operator P inside the penalty, in place of penalizing the coefficients w directly • Bias: model compressibility Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning Compressible Models. SDM 2010
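The penalty on slide 28 was an image; one form consistent with the slides, applying the lasso's L1 penalty to the compressed coefficients Pw rather than to w itself, is:

    % Lasso biases toward sparse w:  λ ||w||_1
    % Compressible-model penalty biases toward sparse Pw, where P is a
    % compression operator such as the 2D-DCT:
    \min_{w} \; \sum_{i=1}^{n} L(y_i, w^\top x_i) + \lambda \|P w\|_1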

  29. Energy compaction • Image energy is concentrated at a few frequencies [Figure: JPEG (2D-DCT), 46:1 compression]

  30. Energy compaction • Image energy is concentrated at a few frequencies • Models need to operate at the relevant frequencies [Figure: JPEG (2D-DCT), 46:1 compression; 2D-DCT]
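A minimal, illustrative sketch of the energy-compaction idea (not code from the thesis): keep only a small fraction of 2D-DCT coefficients and check how much of the image energy survives. The function name and the synthetic image are placeholders.

    # Illustrative sketch: 2D-DCT energy compaction.
    # Keeping only a few DCT coefficients retains most of the image energy,
    # which is the compressibility the slides refer to.
    import numpy as np
    from scipy.fft import dctn, idctn

    def energy_compaction(image, keep_fraction=0.02):
        """Zero out all but the largest-magnitude DCT coefficients and reconstruct."""
        coeffs = dctn(image, norm='ortho')               # 2D-DCT of the image
        k = max(1, int(keep_fraction * coeffs.size))     # number of coefficients to keep
        threshold = np.sort(np.abs(coeffs), axis=None)[-k]
        compressed = np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)
        reconstruction = idctn(compressed, norm='ortho')
        # With the orthonormal DCT, coefficient energy equals image energy (Parseval),
        # so this ratio is the fraction of image energy retained.
        retained_energy = (compressed ** 2).sum() / (coeffs ** 2).sum()
        return reconstruction, retained_energy

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # A smooth synthetic "image": smooth signals compress well under the DCT.
        x, y = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
        image = np.sin(4 * x) + np.cos(3 * y) + 0.05 * rng.standard_normal((64, 64))
        _, energy = energy_compaction(image, keep_fraction=0.02)
        print(f"Energy retained with 2% of DCT coefficients: {energy:.3f}")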

  31. Digit recognition: sparse vs. compressible • Model coefficients w [Figure: sparse vs. compressible models, comparing the coefficients w, the coefficients w displayed as an image, and the compressed coefficients Pw]

  32. Outline • Part I: Encoding Input Information by Regularization • Multi-task generalization • Learning with word correlation • A matrix-normal penalty for multi-task learning • Go beyond covariance and correlation structures • Encode a dimension reduction • Learn compressible models • Projection penalties • Part II: Encoding Output Information by Output Codes • Composite likelihood for pairwise coding • Multi-label output codes with CCA • Maximum-margin output coding

  33. Dimension reduction • Dimension reduction conveys information about the input space • Feature selection → importance • Feature clustering → granularity • Feature extraction → more general structures

  34. How to use a dimension reduction? • However, any reduction loses some information, which may be relevant to the prediction task • Goal of projection penalties: • Encode useful information from a dimension reduction • Control the risk of potential information loss Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010

  35. Projection penalties: the basic idea • The basic idea: • Observation: reduce the feature space → restrict the model search to a model subspace MP • Solution: still search in the full model space M, and penalize the projection distance to the model subspace MP

  36. Projection penalties: linear cases • Learn with a (linear) dimension reduction P

  37. Projection penalties: linear cases • Learn with projection penalties • Optimization: minimize the loss plus the projection distance to the reduced model subspace
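The formulas on slides 36-37 were images that did not survive the transcript; a sketch of the linear case consistent with the stated idea (P maps R^d to R^p, and the penalty is the squared distance from w to the reduced model subspace {P^T v}) is:

    % Learning only in the reduced space:
    \min_{v} \; \sum_{i=1}^{n} L\big(y_i,\, v^\top P x_i\big) + \lambda \|v\|_2^2

    % Projection penalty: search the full model space, penalize the projection distance
    \min_{w} \; \sum_{i=1}^{n} L\big(y_i,\, w^\top x_i\big)
        + \lambda \min_{v} \big\| w - P^\top v \big\|_2^2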

  38. Projection penalties: nonlinear cases [Diagram: model space M with subspace MP, models w and wP, a reduction P from Rd to Rp, and feature spaces F and F' (the nonlinear case marked by question marks)] Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010

  39. Projection penalties: nonlinear cases [Diagram: model space M with subspace MP, models w and wP, and the reduction P, now defined in the feature spaces F and F'] Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010

  40. Empirical results • Text classification (20 newsgroups), using logistic regression • Dimension reduction: latent Dirichlet allocation [Figure: classification errors at 2%, 5%, and 10% training data, comparing Projection Penalty, Original, and Reduction]

  41. Empirical results • Text classification (20 newsgroups), using logistic regression • Dimension reduction: latent Dirichlet allocation [Figure: classification errors at 2%, 5%, and 10% training data for Original (Orig), Reduction (Red), and Projection Penalty (Proj)] • Similar results on face recognition, using SVM (poly-2) • Dimension reduction: KPCA, KDA, OLaplacian Face • Similar results on house price prediction, using regression • Dimension reduction: PCA and partial least squares

  42. Outline • Part I: Encoding Input Information by Regularization • Multi-task generalization • Learning with word correlation • A matrix-normal penalty for multi-task learning • Go beyond covariance and correlation structures • Encode a dimension reduction • Learn compressible models • Projection penalties • Part II: Encoding Output Information by Output Codes • Composite likelihood for pairwise coding • Multi-label output codes with CCA • Maximum-margin output coding

  43. Outline • Part I: Encoding Input Information by Regularization • Multi-task generalization • Learning with word correlation • A matrix-normal penalty for multi-task learning • Go beyond covariance and correlation structures • Encode a dimension reduction • Learn compressible models • Projection penalties • Part II: Encoding Output Information by Output Codes • Composite likelihood for pairwise coding • Multi-label output codes with CCA • Maximum-margin output coding

  44. Multi-label classification • Existence of certain label dependencies • Example: classify an image into scenes (desert, river, forest, etc.) • The multi-class problem is a special case: only one class is true [Diagram: learn to predict labels y1, y2, …, yq from x, with dependencies among the labels]

  45. Output coding • d < q: compression, i.e., source coding • d > q: error-correcting codes, i.e., channel coding • Use the redundancy to correct prediction (“transmission”) errors [Diagram: encode the labels y1, …, yq into a code z1, …, zd, learn to predict the code from x, then decode back to y]

  46. Error-correcting output codes (ECOCs) • Multi-class ECOCs [Dietterich & Bakiri, 1994] [Allwein, Schapire & Singer, 2001] • Encode into a (redundant) set of binary problems, e.g., {y1, y2} vs. y3, or {y3, y4} vs. y7 • Learn to predict the code • Decode the predictions • Our goal: design ECOCs for multi-label classification [Diagram: encode labels y1, …, yq into code bits z1, …, zt, learn to predict the code from x, then decode]
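A minimal sketch of the multi-class ECOC recipe described on this slide: a code matrix defines the binary problems, one classifier is trained per code bit, and decoding picks the nearest codeword in Hamming distance. The code matrix, classifier choice, and function names here are illustrative placeholders, not the constructions developed in the thesis.

    # Illustrative multi-class ECOC sketch: each column of the code matrix
    # defines one binary problem; prediction decodes by nearest codeword.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_ecoc(X, y, code_matrix):
        """code_matrix: (num_classes, num_bits) array with entries in {-1, +1}.
        Assumes every column splits the classes into two non-empty groups."""
        classifiers = []
        for bit in range(code_matrix.shape[1]):
            z = code_matrix[y, bit]              # relabel each example by its class's code bit
            clf = LogisticRegression().fit(X, z)
            classifiers.append(clf)
        return classifiers

    def predict_ecoc(X, classifiers, code_matrix):
        # Predicted codeword for each example, one bit per binary classifier.
        Z = np.column_stack([clf.predict(X) for clf in classifiers])
        # Decode: choose the class whose codeword is closest in Hamming distance.
        hamming = (Z[:, None, :] != code_matrix[None, :, :]).sum(axis=2)
        return hamming.argmin(axis=1)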

  47. Outline • Part I: Encoding Input Information by Regularization • Multi-task generalization • Learning with word correlation • A matrix-normal penalty for multi-task learning • Go beyond covariance and correlation structures • Encode a dimension reduction • Learn compressible models • Projection penalties • Part II: Encoding Output Information by Output Codes • Composite likelihood for pairwise coding • Multi-label output codes with CCA • Maximum-margin output coding

  48. Composite likelihood • The composite likelihood (CL): a partial specification of the likelihood as a product of simple component likelihoods • e.g., the pairwise likelihood • e.g., the full conditional likelihood • Estimation using composite likelihoods • Computational and statistical efficiency • Robustness under model misspecification
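The two example likelihoods on slide 48 were images; their standard forms for labels y1, …, yq given input x are:

    % Pairwise composite likelihood: product of bivariate component likelihoods
    \mathrm{CL}_{\mathrm{pair}}(\theta) = \prod_{j < k} p\big(y_j, y_k \mid x;\, \theta\big)

    % Full conditional composite likelihood: each label given all the others
    \mathrm{CL}_{\mathrm{cond}}(\theta) = \prod_{j=1}^{q} p\big(y_j \mid y_{-j}, x;\, \theta\big)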

  49. Multi-label problem decomposition • Problem decomposition methods • Decomposition into subproblems (encoding) • Decision making by combining subproblem predictions (decoding) • Examples: 1-vs-all, 1-vs-1, 1-vs-1 + 1-vs-all, etc. [Diagram: learn to predict the labels y1, …, yq from x through a set of subproblems]

  50. 1-vs-All (Binary Relevance) • Classify each label independently • The composite likelihood view [Diagram: predict each label y1, …, yq from x independently]
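A sketch of the composite likelihood view mentioned on slide 50: 1-vs-all (binary relevance) corresponds to a composite likelihood built only from the univariate component likelihoods, so all label dependencies are ignored:

    % Binary relevance as a composite likelihood of univariate marginals
    \mathrm{CL}_{\text{1-vs-all}}(\theta) = \prod_{j=1}^{q} p\big(y_j \mid x;\, \theta\big)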
