

  1. 1 / 28 CS6772 Advanced Machine Learning, Fall 2006. Extending Maximum Entropy Discrimination on Mixtures of Gaussians With Transduction. Final Project by Barry Rafkind.

  2. 2 / 28 Presentation Outline • Maximum Entropy Discrimination • Transduction with MED / SVMs • Application to Yeast Protein Classification • Toy Experiments with Transduction • Conclusions

  3. 3 / 28 Discriminative Classifier with Gaussians. Discriminative classifier: log-likelihood of a ratio of Gaussians.
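For reference, the standard form of such a discriminant (generic notation, not necessarily the slide's exact symbols) is the log-likelihood ratio of two Gaussians plus a bias:

\[
  \mathcal{L}(X;\Theta) \;=\; \log\frac{\mathcal{N}(X;\,\mu_{+},\Sigma_{+})}{\mathcal{N}(X;\,\mu_{-},\Sigma_{-})} \;+\; b,
  \qquad
  \hat{y} \;=\; \mathrm{sign}\,\mathcal{L}(X;\Theta),
\]

where \(\Theta\) collects the two Gaussians' parameters and the bias \(b\).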

  4. 4 / 28 Discriminative Classifier with Gaussians. Specify the discriminant function by a choice of parameters Θ. One approach from regularization theory would choose Θ to agree with the labels, subject to constraints that determine the margin, while minimizing a regularization function (sketched below).
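In generic notation (a sketch of the standard regularization setup, with symbols that are mine rather than the slide's), this reads:

\[
  y_t\,\mathcal{L}(X_t;\Theta) \;\ge\; \gamma \quad \text{for } t = 1,\dots,T,
  \qquad
  \min_{\Theta}\; R(\Theta),
\]

where \(\gamma\) determines the margin and \(R(\Theta)\) is the regularization function.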

  5. 5 / 28 Maximum Entropy Discrimination (MED) In MED, we solve for a distribution over solutions, P(Θ), such that the expected value of the discriminant under this distribution agrees with the labeling.
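Written out (again in generic notation), the constraints hold in expectation:

\[
  \int P(\Theta)\,\big[\, y_t\,\mathcal{L}(X_t;\Theta) - \gamma \,\big]\, d\Theta \;\ge\; 0,
  \qquad t = 1,\dots,T.
\]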

  6. 6 / 28 Maximum Entropy Discrimination (MED) In addition to finding a P(Θ) that satisfies the classification constraints in expectation, MED regularizes the solution distribution P(Θ) by either maximizing its entropy or minimizing its relative entropy toward some prior target distribution P0(Θ).

  7. 7 / 28 Maximum Entropy Discrimination (MED) Minimize the relative entropy toward some prior target distribution P0(Θ); the relative Shannon entropy (KL divergence) is recalled below. Note that minimizing relative entropy is more general, since choosing a uniform P0(Θ) recovers maximum entropy.
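The relative entropy in question is presumably the usual KL divergence:

\[
  D\!\left(P \,\|\, P_0\right) \;=\; \int P(\Theta)\,\log\frac{P(\Theta)}{P_0(\Theta)}\, d\Theta ,
\]

which, up to a constant, reduces to the negative entropy of \(P\) when \(P_0(\Theta)\) is uniform.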

  8. 8 / 28 Maximum Entropy Discrimination (MED) Thus, MED solves a constrained optimization problem that projects the prior P0(Θ) to the closest point in the admissible set, the convex region defined by the t = 1..T constraints above.
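Putting the pieces together, the MED projection can be sketched (in the same generic notation) as:

\[
  \min_{P(\Theta)} \; D\!\left(P \,\|\, P_0\right)
  \quad \text{subject to} \quad
  \int P(\Theta)\,\big[\, y_t\,\mathcal{L}(X_t;\Theta) - \gamma_t \,\big]\, d\Theta \;\ge\; 0,
  \quad t = 1,\dots,T.
\]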

  9. 9 / 28 Maximum Entropy Discrimination (MED) The solution for the posterior P(Θ) takes the standard maximum entropy form.

  10. 10 / 28 Maximum Entropy Discrimination (MED) The solution for the posterior P(Θ) takes the standard maximum entropy form. The partition function Z(λ) normalizes P(Θ). MED finds the optimal setting of the Lagrange multipliers (λt for t = 1..T) by maximizing the concave objective function J(λ) = -ln Z(λ).
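In the usual maximum entropy form (generic notation), posterior and objective are:

\[
  P(\Theta) \;=\; \frac{1}{Z(\lambda)}\, P_0(\Theta)\,
  \exp\!\Big( \textstyle\sum_{t=1}^{T} \lambda_t \big[\, y_t\,\mathcal{L}(X_t;\Theta) - \gamma_t \,\big] \Big),
  \qquad
  J(\lambda) \;=\; -\ln Z(\lambda), \quad \lambda_t \ge 0 .
\]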

  11. 11 / 28 Maximum Entropy Discrimination (MED) Given λ, the solution distribution P(Θ) is fully specified. We can then use it to predict the label of a new data point X from the expected value of the discriminant (see below).
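The prediction rule is presumably the sign of the expected discriminant under the posterior:

\[
  \hat{y} \;=\; \mathrm{sign} \int P(\Theta)\, \mathcal{L}(X;\Theta)\, d\Theta .
\]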

  12. 12 / 28 Maximum Entropy Discrimination (MED) SVMs as a Special Case of MED • Interestingly, applying MED to a ratio of Gaussians exactly reproduces support vector machines. • We simply assume that the prior distribution factorizes into priors over the vector parameters (the means) and a prior over the scalar bias. • The first two priors are white, zero-mean Gaussians over the means, which encourage means of low magnitude for our Gaussians. • The last prior is a non-informative (i.e., flat) prior, indicating that any scalar bias is equally probable a priori. • The resulting objective function recovers the SVM dual (sketched below).
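As I understand the MED-SVM correspondence (treat the exact form as a sketch rather than the slide's equation), with fixed margins the objective reduces to the familiar SVM dual:

\[
  J(\lambda) \;=\; \sum_{t} \lambda_t \;-\; \tfrac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'}\, y_t y_{t'}\, X_t^{\top} X_{t'},
  \qquad \lambda_t \ge 0, \quad \sum_{t} \lambda_t y_t = 0 ,
\]

where the equality constraint emerges in the non-informative limit of the bias prior.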

  13. 13 / 28 Maximum Entropy Discrimination (MED) Non-Separable Cases • To handle non-separable problems, we use a distribution over margins in the prior and posterior instead of simply setting them equal to a constant (which is like using a delta-function prior). • The MED solution distribution then involves an augmented Θ that includes all of the margin variables. • The formula for the partition function Z(λ) is as above, except that we now have a prior distribution that also factorizes over the margins. • The margin priors are chosen to favor large margins (one common choice is sketched below).
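One common choice, following my reading of the MED paper (so treat the constants as an assumption), is an exponential prior that favors margins near one:

\[
  P_0(\gamma_t) \;=\; c\, e^{-c\,(1-\gamma_t)} \quad \text{for } \gamma_t \le 1 .
\]

Integrating the margins out of the partition function then contributes \(\sum_t \big[\lambda_t + \log(1 - \lambda_t/c)\big]\) to \(J(\lambda)\), which keeps each \(\lambda_t\) below \(c\) and so plays the role of the SVM slack penalty.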

  14. 14 / 28 Maximum Entropy Discrimination (MED) Discriminative Latent Likelihood Ratio Classifiers Consider a discriminant that is a ratio of two mixture models • Computing the partition function for mixtures becomes intractable with exponentially many terms. • To compensate, we can use Jensen's inequality and variational methods. • Jensen is first applied in the primal MED problem to tighten the classification constraints. • Then, Jensen is applied to the dual MED problem to yield a tractable projection.
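For reference, the relevant tool is the standard Jensen / variational lower bound on the log of a mixture (generic notation; exactly how latent MED applies it to the ratio and to the dual is only gestured at here):

\[
  \log \sum_{m} p(X, m \mid \theta) \;\ge\; \sum_{m} q(m)\, \log \frac{p(X, m \mid \theta)}{q(m)}
  \quad \text{for any distribution } q(m),
\]

with equality when \(q(m) = p(m \mid X, \theta)\). Replacing the mixture terms with such bounds turns the constraints and the partition-function integral into tractable surrogates.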

  15. 15 / 28 Transduction with Gaussian Ratio Discriminants Classification is Straightforward When All Labels are Known

  16. 16 / 28 Transduction with Gaussian Ratio Discriminants But labeling data is expensive and usually we have many unlabeled data points that might still be useful to us.

  17. 17 / 28 Transduction with Gaussian Ratio Discriminants Transductive learners can take advantage of unlabeled data to better capture the distribution of each class... but how?

  18. 18 / 28 A Principled Approach to Transduction • Uncertain labels can be handled in a principled way within the MED formalism: • Let y = {y1,...,yT} be a set of binary variables corresponding to the labels of the training examples. • We can define a prior uncertainty over the labels by specifying P0(y). • For simplicity, we can take this to be a product distribution. • Pt,0(yt) puts all of its mass on the observed label when it is known, and 1/2 on each label otherwise (see below).
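In symbols (my notation for the observed label), the product prior over labels is:

\[
  P_0(\mathbf{y}) \;=\; \prod_{t=1}^{T} P_{t,0}(y_t),
  \qquad
  P_{t,0}(y_t) \;=\;
  \begin{cases}
    \delta(y_t, \hat{y}_t) & \text{if example } t \text{ has an observed label } \hat{y}_t,\\
    1/2 & \text{if example } t \text{ is unlabeled.}
  \end{cases}
\]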

  19. 19 / 28 A Principled Approach to Transduction The MED solution is found by calculating the relative entropy projection from the overall prior distribution to the admissible set of distributions P (no longer directly a function of the labels) that are consistent with the constraints for all t = 1…T (sketched below). A feasible solution has been proposed for this using a mean-field approximation in a two-step process.
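A sketch of this projection in generic notation (the exact form of the prior and constraints is my assumption):

\[
  \min_{P} \; D\!\Big(P(\Theta,\boldsymbol{\gamma},\mathbf{y}) \,\Big\|\, P_0(\Theta)\,P_0(\boldsymbol{\gamma})\,P_0(\mathbf{y})\Big)
  \;\;\text{s.t.}\;\;
  \sum_{\mathbf{y}} \int P(\Theta,\boldsymbol{\gamma},\mathbf{y})\,
  \big[\, y_t\,\mathcal{L}(X_t;\Theta) - \gamma_t \,\big]\, d\Theta\, d\boldsymbol{\gamma} \;\ge\; 0
\]

for all t = 1…T.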

  20. 20 / 28 Thorsten Joachims’ Approach to Transductive SVMs

  21. 21 / 28 Thorsten Joachims’ Approach to Transductive SVMs • Start by training an inductive SVM on the labeled training data and classifying the unlabeled test data accordingly. • Then uniformly increase the influence of the test examples by incrementing the cost factors C*- and C*+ up to the user-defined value of C*. • A criterion identifies pairs of examples for which switching the class labels decreases the current objective function, and switches their labels (a rough sketch follows).
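The loop can be caricatured in code. This is only a rough sketch, not Joachims' exact algorithm: it assumes scikit-learn's SVC, collapses C*- and C*+ into a single growing cost c_star, uses a simplified slack-based switching test, and all function and variable names are illustrative.

import numpy as np
from sklearn.svm import SVC

def tsvm_sketch(X_lab, y_lab, X_unl, C=1.0, C_star=1.0):
    # Labels are assumed to be in {-1, +1}.
    # Inductive SVM on the labeled data; label the test points with its predictions.
    clf = SVC(kernel="linear", C=C).fit(X_lab, y_lab)
    y_unl = np.where(clf.decision_function(X_unl) >= 0, 1, -1)

    c_star = 1e-3 * C_star  # start with a small influence for the test examples
    while True:
        X_all = np.vstack([X_lab, X_unl])
        y_all = np.concatenate([y_lab, y_unl])
        # sample_weight scales C per example: labeled points keep C, test points get c_star.
        w = np.concatenate([np.full(len(y_lab), C), np.full(len(y_unl), c_star)])
        clf = SVC(kernel="linear", C=1.0).fit(X_all, y_all, sample_weight=w)

        # Switching test: a (+, -) pair of test points whose combined slack exceeds 2
        # lowers the objective if their labels are swapped.
        slack = np.maximum(0.0, 1.0 - y_unl * clf.decision_function(X_unl))
        pos = [i for i in np.where(y_unl == 1)[0] if slack[i] > 0]
        neg = [j for j in np.where(y_unl == -1)[0] if slack[j] > 0]
        pair = next(((i, j) for i in pos for j in neg if slack[i] + slack[j] > 2.0), None)
        if pair is not None:
            i, j = pair
            y_unl[i], y_unl[j] = -1, 1      # swap the pair and retrain
            continue
        if c_star >= C_star:                # stop once test influence has reached C*
            break
        c_star = min(2.0 * c_star, C_star)  # uniformly increase test influence
    return clf, y_unl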

  22. 22 / 28 Application of MED to Yeast Protein Classification • A comparison was performed among three methods: • The latent MED approach (without transduction) • SVMs with single kernels • Semi-Definite Programming (SDP) with a stationary mixture of kernels • Trained one-versus-all classifiers on three functional classes of yeast genetic data from the Noble Research Lab, University of Washington. Classes: • Energy • Interaction with Cellular Environment • Control of Cellular Organization • Found that MED surpassed the performance of SVMs with single kernels, but SDP still did the best. My goal is to extend MED with transduction to try to improve its accuracy further.

  23. 23 / 28 Toy Experiments with Transduction I have been working with Darrin Lewis in Prof. Jebara’s Machine Learning research lab. Since he already has MED code that works, we would like to extend it to incorporate transduction. Before we start changing his code, I am familiarizing myself with some simple transductive SVM algorithms on toy data.

  24. 24 / 28 Toy Experiments with Transduction

  25. 25 / 28 Toy Experiments with Transduction • Idea: simple transductive methods should be evaluated first, if only for comparison with the more complex, principled approaches. • A simple transduction algorithm (a code sketch follows this list): • Step 1: Train on the labeled data. • Step 2: Test on all data (labeled + unlabeled) to get the inductive accuracy. • Step 3: Apply the predicted labels to the unlabeled data and retrain. • Step 4: Test the new classifier on all data and find its accuracy. • Step 5: Repeat from Step 3 for a certain number of iterations.
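A minimal sketch of those five steps, assuming scikit-learn's SVC on a toy dataset; y_true_all (ground truth for every point, used only to report accuracy) and the other names are illustrative.

import numpy as np
from sklearn.svm import SVC

def simple_transduction(X_lab, y_lab, X_unl, y_true_all, n_iters=10):
    X_all = np.vstack([X_lab, X_unl])

    # Step 1: train on the labeled data only.
    clf = SVC(kernel="linear").fit(X_lab, y_lab)

    # Step 2: test on all data (labeled + unlabeled) to get the inductive accuracy.
    print("inductive accuracy:", np.mean(clf.predict(X_all) == y_true_all))

    for it in range(n_iters):
        # Step 3: apply the predicted labels to the unlabeled data and retrain.
        y_pseudo = np.concatenate([y_lab, clf.predict(X_unl)])
        clf = SVC(kernel="linear").fit(X_all, y_pseudo)

        # Step 4: test the new classifier on all data and find its accuracy.
        acc = np.mean(clf.predict(X_all) == y_true_all)
        print(f"iteration {it + 1}: accuracy = {acc:.3f}")
        # Step 5: repeat from Step 3 for a fixed number of iterations.
    return clf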

  26. 26 / 28 Toy Experiments with Transduction In this case, transduction does worse than induction (the first observation)

  27. 27 / 28 Conclusions • Latent Maximum Entropy Discrimination with Mixtures of Gaussians can be extended with Transduction by incorporating distributions over the labels • Transduction can sometimes be helpful for incorporating knowledge about the distribution of unlabeled data into our learning approach • MED is currently inferior to SDP for the protein classification task. Perhaps transduction can improve MED’s results. • Further analysis should be done on simple transductive methods for comparison with more complicated, more principled ones. • I need more sleep. Good night!

  28. 28 / 28 References • Jebara, T., Lewis, D., and Noble, W., "Max Margin Mixture Models and Non-Stationary Kernel Selection", NIPS 2005, Columbia University. • Jaakkola, T., Meila, M., and Jebara, T., "Maximum Entropy Discrimination", Neural Information Processing Systems 12 (NIPS '99), Denver, CO, December 1999. • Joachims, T., "Transductive Inference for Text Classification using Support Vector Machines", Proceedings of the International Conference on Machine Learning (ICML), 1999. Questions?
