Bayesian Learning

Presentation Transcript


  1. Bayesian Learning Rong Jin

  2. Outline • MAP learning vs. ML learning • Minimum description length principle • Bayes optimal classifier • Bagging

  3. Maximum Likelihood Learning (ML) • Find the model that best fits the data by maximizing the log-likelihood of the training data • Example: logistic regression • Parameters are found by maximizing the likelihood of the training data
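
  The equations on this slide are not preserved in the transcript. A standard way to write the logistic regression likelihood, assuming binary labels y_i in {-1, +1}, inputs x_i, and weight vector w, is:

      p(y \mid x; w) = \frac{1}{1 + \exp(-y\, w^\top x)}, \qquad
      w_{ML} = \arg\max_w \; l(w) = \arg\max_w \sum_{i=1}^{m} \log p(y_i \mid x_i; w)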

  4. Maximum A Posteriori Learning (MAP) • In ML learning, models are determined solely by the training examples • Very often, we have prior knowledge/preference about parameters/models • ML learning is unable to incorporate this prior knowledge/preference • Maximum a posteriori (MAP) learning • Knowledge/preference about parameters/models is incorporated through a prior for the parameters
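
  The defining equation did not survive the transcript; the standard MAP criterion, with D the training data and h the hypothesis, is:

      h_{MAP} = \arg\max_h P(h \mid D) = \arg\max_h P(D \mid h)\, P(h)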

  5. Example: Logistic Regression • ML learning • Prior knowledge/preference • No feature should dominate over all other features → prefer small weights • Gaussian prior for parameters/models:
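
  The prior itself appears only as an image on the original slide; a zero-mean isotropic Gaussian of the usual form would be:

      p(w) \propto \exp\!\left( -\frac{\|w\|^2}{2\sigma^2} \right)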

  7. Example (cont’d) • MAP learning for logistic regression • Compared to regularized logistic regression
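
  The comparison on this slide is carried by equations that the transcript drops. Under the Gaussian prior above, MAP learning for logistic regression maximizes (a standard reconstruction):

      w_{MAP} = \arg\max_w \left[ \sum_{i=1}^{m} \log p(y_i \mid x_i; w) - \frac{\|w\|^2}{2\sigma^2} \right]

  which is exactly the objective of L2-regularized logistic regression with regularization weight \lambda = 1/(2\sigma^2).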

  9. Minimum Description Length Principle • Occam’s razor: prefer the simplest hypothesis • Simplest hypothesis → hypothesis with the shortest description length • Minimum description length: prefer the hypothesis h_{MDL} = \arg\min_h [ L_{C1}(h) + L_{C2}(D \mid h) ], where L_C(x) is the description length of message x under coding scheme C • L_{C1}(h) is the # of bits to encode hypothesis h (the complexity of the model) • L_{C2}(D \mid h) is the # of bits to encode the data D given h (the # of mistakes)

  10. Minimum Description Length Principle • Coding view: a sender must transmit the training labels D to a receiver • Send only D? • Send only h? • Send h plus D|h, i.e., h together with the exceptions h gets wrong?

  11. Example: Decision Tree • H = decision trees, D = training data labels • L_{C1}(h) is the # of bits to describe tree h • L_{C2}(D \mid h) is the # of bits to describe D given tree h • Note L_{C2}(D \mid h) = 0 if the examples are classified perfectly by h; we only need to describe the exceptions • h_{MDL} trades off tree size against training errors

  12. MAP vs. MDL • MAP learning: h_{MAP} = \arg\max_h P(D \mid h) P(h) = \arg\min_h [ -\log_2 P(D \mid h) - \log_2 P(h) ] • Fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses -\log_2 p bits • Interpreting MAP with the MDL principle: -\log_2 P(D \mid h) is the description length of the exceptions under optimal coding, and -\log_2 P(h) is the description length of h under optimal coding

  13. Problems with Maximum Approaches • Consider three possible hypotheses with known posteriors P(h \mid D) • Maximum approaches (ML/MAP) will pick h1, the single most probable hypothesis • Given a new instance x, maximum approaches will output h1's prediction, + • However, is this the most probable result?
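
  The concrete numbers on this slide are lost in the transcript. The example commonly used to make this point, and consistent with the conclusion on the next slide, assumes posteriors and predictions such as:

      P(h_1 \mid D) = 0.4, \quad P(h_2 \mid D) = 0.3, \quad P(h_3 \mid D) = 0.3, \qquad
      h_1(x) = +, \quad h_2(x) = -, \quad h_3(x) = -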

  14. Bayes Optimal Classifier (Bayesian Average) • Bayes optimal classification: average the predictions of all hypotheses, weighted by their posteriors P(h \mid D) • Example (continuing the three-hypothesis case above) • The most probable class is -
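
  The classification rule and the arithmetic behind the slide's conclusion, written out under the illustrative numbers above:

      P(c \mid x, D) = \sum_{h \in H} P(c \mid x, h)\, P(h \mid D), \qquad
      c_{BO} = \arg\max_c P(c \mid x, D)

      P(+ \mid x, D) = 0.4, \qquad P(- \mid x, D) = 0.3 + 0.3 = 0.6 \;\Rightarrow\; \text{predict } -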

  16. When Do We Need Bayesian Average? • Bayes optimal classification (Bayesian averaging) is needed when the posterior P(h \mid D) has multiple modes, or when the optimal mode is flat • When NOT to use the Bayesian average: when Pr(h \mid D) cannot be estimated accurately

  17. Computational Issues with the Bayes Optimal Classifier • Bayes optimal classification requires a sum over all possible models/hypotheses h • This is expensive or impossible when the model/hypothesis space is large (e.g., the space of decision trees) • Solution: sampling!

  18. Gibbs Classifier • Gibbs algorithm: choose one hypothesis at random according to P(h \mid D), and use it to classify the new instance • Surprising fact: the expected error of the Gibbs classifier is at most twice the expected error of the Bayes optimal classifier • Improve on this by sampling multiple hypotheses from P(h \mid D) and averaging their classification results • Sampling methods: Markov chain Monte Carlo (MCMC) sampling, importance sampling
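
  A minimal sketch of the two strategies, assuming we already have a finite list of hypotheses and their posterior weights (the function and variable names here are illustrative, not from the slides):

      import numpy as np

      def gibbs_classify(x, hypotheses, posterior, rng):
          # Gibbs classifier: draw a single hypothesis h ~ P(h|D) and return h(x).
          i = rng.choice(len(hypotheses), p=posterior)
          return hypotheses[i](x)

      def averaged_classify(x, hypotheses, posterior, rng, n_samples=100):
          # Approximate Bayesian averaging: sample many hypotheses from P(h|D)
          # and take a majority vote of their predictions.
          votes = [gibbs_classify(x, hypotheses, posterior, rng) for _ in range(n_samples)]
          labels, counts = np.unique(votes, return_counts=True)
          return labels[np.argmax(counts)]

      # Toy usage with the three-hypothesis example: h1 votes +1, h2 and h3 vote -1.
      rng = np.random.default_rng(0)
      hypotheses = [lambda x: +1, lambda x: -1, lambda x: -1]
      posterior = [0.4, 0.3, 0.3]
      print(averaged_classify(None, hypotheses, posterior, rng))   # prints -1 with high probability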

  19. Bagging Classifiers • In general, sampling from P(h \mid D) is difficult because: P(h \mid D) is rather difficult to compute (for example, how would we compute P(h \mid D) for a decision tree?); P(h \mid D) is impossible to compute for a non-probabilistic classifier such as an SVM; and P(h \mid D) is extremely small when the hypothesis space is large • Bagging classifiers: approximate sampling from P(h \mid D) by sampling the training examples instead

  20. Bootstrap Sampling • Bagging = Bootstrap aggregating • Bootstrap sampling: given a set D containing m training examples • Create D_i by drawing m examples at random with replacement from D • Each D_i is expected to leave out about 37% of the examples in D
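
  The 37% figure follows from a short calculation: the probability that a particular example is never chosen in m draws with replacement is

      \left(1 - \frac{1}{m}\right)^{m} \;\longrightarrow\; e^{-1} \approx 0.368 \quad \text{as } m \to \infty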

  21. Bagging Algorithm • Create k bootstrap samples D_1, D_2, …, D_k • Train a distinct classifier h_i on each D_i • Classify a new instance by a vote of the classifiers with equal weights
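
  A from-scratch sketch of this algorithm, assuming a scikit-learn-style base estimator with fit/predict and NumPy arrays X, y (the helper names are mine, not from the slides):

      import numpy as np
      from sklearn.base import clone
      from sklearn.tree import DecisionTreeClassifier

      def bagging_fit(X, y, base_estimator, k=50, seed=0):
          # Train k classifiers, each on its own bootstrap sample of (X, y).
          rng = np.random.default_rng(seed)
          m = len(X)
          models = []
          for _ in range(k):
              idx = rng.integers(0, m, size=m)          # m draws with replacement
              models.append(clone(base_estimator).fit(X[idx], y[idx]))
          return models

      def bagging_predict(models, X):
          # Classify each instance by an equal-weight majority vote of the k models.
          votes = np.stack([h.predict(X) for h in models])    # shape: (k, n_instances)
          def majority(column):
              labels, counts = np.unique(column, return_counts=True)
              return labels[np.argmax(counts)]
          return np.array([majority(votes[:, j]) for j in range(votes.shape[1])])

      # Usage sketch:
      #   models = bagging_fit(X_train, y_train, DecisionTreeClassifier(), k=50)
      #   y_pred = bagging_predict(models, X_test)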

  22. Bayesian Average vs. Bagging • Bayesian average: from D, form the posterior P(h \mid D) and sample hypotheses h_1, h_2, …, h_k from it • Bagging: from D, draw bootstrap samples D_1, D_2, …, D_k and train h_1, h_2, …, h_k on them • Bootstrap sampling is almost equivalent to sampling from the posterior P(h \mid D), so bagging ≈ Bayesian average

  23. Empirical Study of Bagging • Bagging decision trees • Bootstrap 50 different samples from the original training data • Learn a decision tree over each bootstrap sample • Predict the class labels of test instances by the majority vote of the 50 decision trees • The bagged decision trees perform better than a single decision tree
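
  This experimental setup maps directly onto standard tooling; a sketch of the same 50-tree comparison with scikit-learn, assuming train/test arrays X_train, y_train, X_test, y_test are already prepared:

      from sklearn.ensemble import BaggingClassifier
      from sklearn.tree import DecisionTreeClassifier

      # 50 decision trees, each trained on its own bootstrap sample, combined by voting.
      bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
      bagged.fit(X_train, y_train)
      print("bagged trees accuracy:", bagged.score(X_test, y_test))

      # Baseline: a single decision tree trained on the full training set.
      single = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
      print("single tree accuracy:", single.score(X_test, y_test))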

  24. Bias-Variance Tradeoff • Why does bagging work better than a single classifier? The bias-variance tradeoff • Real-valued case: the output y for input x follows y = f(x) + \epsilon with \epsilon \sim N(0, \sigma^2) • \hat{f}(x \mid D) is a predictor learned from the training data D • Bias-variance decomposition: the expected error splits into the irreducible variance, the model bias (the simpler \hat{f}(x \mid D), the larger the bias), and the model variance (the simpler \hat{f}(x \mid D), the smaller the variance)
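
  The decomposition itself appears on the slide only as an image; in standard form, with expectations over the draw of D and the noise:

      E\big[(y - \hat{f}(x \mid D))^2\big]
        = \underbrace{\sigma^2}_{\text{irreducible variance}}
        + \underbrace{\big(f(x) - E_D[\hat{f}(x \mid D)]\big)^2}_{\text{model bias}^2}
        + \underbrace{E_D\big[(\hat{f}(x \mid D) - E_D[\hat{f}(x \mid D)])^2\big]}_{\text{model variance}}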

  25. Bias-Variance Tradeoff • [Figure: true model fit with complicated models] Small model bias, large model variance

  26. Bias-Variance Tradeoff • [Figure: true model fit with simple models] Large model bias, small model variance

  27. Bagging • Bagging performs better than a single classifier because it effectively reduces the model variance • [Figure: bias and variance of a single decision tree vs. a bagged decision tree]
