
Param. Learning (MLE) Structure Learning The Good


Presentation Transcript


  1. Readings: K&F: 16.1, 16.2, 17.1, 17.2, 17.3.1, 17.4.1
  Param. Learning (MLE) – Structure Learning – The Good
  Graphical Models – 10708, Carlos Guestrin, Carnegie Mellon University, October 1st, 2008

  2. Learning the CPTs
  For each discrete variable Xi
  Data: x(1), …, x(m)

  3. Learning the CPTs
  For each discrete variable Xi
  Data: x(1), …, x(m)
  WHY?
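
The estimator the slide builds toward is an equation image in the original deck; below is a minimal Python sketch of the standard counting-based MLE for one CPT (the data layout and the helper name mle_cpt are illustrative assumptions, not from the course):

    # Minimal sketch: MLE of one CPT P(X | Pa_X) from fully observed data, by counting.
    from collections import Counter

    def mle_cpt(data, child, parents):
        """data: list of dicts mapping variable name -> value (one dict per sample x(j)).
        Returns theta[(u, x)] = Count(x, u) / Count(u), the MLE of P(X = x | Pa_X = u)."""
        joint_counts = Counter()   # Count(x, u)
        parent_counts = Counter()  # Count(u)
        for sample in data:
            u = tuple(sample[p] for p in parents)  # assignment to Pa_X in this sample
            x = sample[child]
            joint_counts[(u, x)] += 1
            parent_counts[u] += 1
        return {(u, x): c / parent_counts[u] for (u, x), c in joint_counts.items()}

    # Toy example with the Flu/Sinus variables from the slides (data values made up):
    data = [{"Flu": 1, "Sinus": 1}, {"Flu": 1, "Sinus": 0}, {"Flu": 0, "Sinus": 0}]
    print(mle_cpt(data, child="Sinus", parents=["Flu"]))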

  4. Maximum likelihood estimation (MLE) of BN parameters – example
  (Example network over Flu, Allergy, Sinus, Nose)
  • Given structure, log likelihood of data:

  5. Maximum likelihood estimation (MLE) of BN parameters – General case
  • Data: x(1), …, x(m)
  • Restriction: x(j)[PaXi] → assignment to PaXi in x(j)
  • Given structure, log likelihood of data:
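
The log-likelihood itself is an equation image in the original; a standard way to write it with the notation above, using θ_{xi|u} for the CPT entry P(Xi = xi | PaXi = u) and Count(xi, u) for the number of samples with Xi = xi and PaXi = u, is

    \log P(D \mid \theta, G)
      \;=\; \sum_{j=1}^{m} \sum_{i} \log P\!\left(x_i^{(j)} \,\middle|\, x^{(j)}[\mathrm{Pa}_{X_i}]\right)
      \;=\; \sum_{i} \sum_{x_i, u} \mathrm{Count}(x_i, u)\,\log \theta_{x_i \mid u}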

  6. Taking derivatives of MLE of BN parameters – General case

  7. General MLE for a CPT
  • Take a CPT: P(X|U)
  • Log likelihood term for this CPT
  • Parameter θ_{X=x|U=u}:
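
The closed-form maximizer is an image on the slide; maximizing the log-likelihood term above subject to the normalization constraint Σ_x θ_{X=x|U=u} = 1 gives the familiar count ratio:

    \hat{\theta}_{X=x \mid U=u}
      \;=\; \frac{\mathrm{Count}(X = x,\, U = u)}{\sum_{x'} \mathrm{Count}(X = x',\, U = u)}
      \;=\; \frac{\mathrm{Count}(x, u)}{\mathrm{Count}(u)}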

  8. Where are we with learning BNs?
  • Given structure, estimate parameters
    • Maximum likelihood estimation
    • Later: Bayesian learning
  • What about learning structure?

  9. Learning the structure of a BN
  (Figure: possible structures over Flu, Allergy, Sinus, Nose, Headache; data <x1(1),…,xn(1)> … <x1(m),…,xn(m)>; learn structure and parameters)
  • Constraint-based approach
    • BN encodes conditional independencies
    • Test conditional independencies in data
    • Find an I-map
  • Score-based approach
    • Finding a structure and parameters is a density estimation task
    • Evaluate model as we evaluated parameters: maximum likelihood, Bayesian, etc.

  10. Remember: Obtaining a P-map?
  • Given the independence assertions that are true for P
    • Obtain skeleton
    • Obtain immoralities
    • From skeleton and immoralities, obtain every (and any) BN structure from the equivalence class
  • Constraint-based approach:
    • Use Learn PDAG algorithm
    • Key question: independence test

  11. Score-based approach
  (Figure: data <x1(1),…,xn(1)> … <x1(m),…,xn(m)> and possible structures over Flu, Allergy, Sinus, Nose, Headache; learn parameters, score structure)

  12. Information-theoretic interpretation of maximum likelihood
  (Example network over Flu, Allergy, Sinus, Nose, Headache)
  • Given structure, log likelihood of data:

  13. Information-theoretic interpretation of maximum likelihood 2
  (Example network over Flu, Allergy, Sinus, Nose, Headache)
  • Given structure, log likelihood of data:
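
The derivation steps on slides 12–13 are equation images; the standard endpoint, writing Î and Ĥ for the empirical mutual information and entropy computed from the data, is

    \frac{1}{m}\,\log P(D \mid \hat{\theta}, G)
      \;=\; \sum_{i} \hat{I}\!\left(X_i ;\, \mathrm{Pa}_{X_i}\right) \;-\; \sum_{i} \hat{H}(X_i)

Since the entropy term does not depend on the graph, maximizing likelihood over structures amounts to choosing parent sets with high mutual information with each variable.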

  14. Decomposable score
  • Log data likelihood
  • Decomposable score:
    • Decomposes over families in BN (node and its parents)
    • Will lead to significant computational efficiency!!!
    • Score(G : D) = Σi FamScore(Xi | PaXi : D)

  15. Announcements
  • Recitation tomorrow
    • Don't miss it!
  • HW2
    • Out today
    • Due in 2 weeks
  • Projects!!!
    • Proposals due Oct. 8th in class
    • Individually or groups of two
    • Details on course website
    • Project suggestions will be up soon!!!

  16. BN code release!!!!
  • Pre-release of a C++ library for probabilistic inference and learning
  • Features:
    • basic datastructures (random variables, processes, linear algebra)
    • distributions (Gaussian, multinomial, ...)
    • basic graph structures (directed, undirected)
    • graphical models (Bayesian network, MRF, junction trees)
    • inference algorithms (variable elimination, loopy belief propagation, filtering)
    • limited amount of learning (IPF, Chow-Liu, order-based search)
  • Supported platforms:
    • Linux (tested on Ubuntu 8.04)
    • MacOS X (tested on 10.4/10.5)
    • limited Windows support
  • Will be made available to the class early next week.

  17. How many trees are there?
  Nonetheless – Efficient optimal algorithm finds best tree
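
The count shown on the slide is not in the transcript; for reference, Cayley's formula gives n^{n-2} distinct labeled (undirected) trees over n variables, i.e. super-exponentially many, yet the next slides show the best-scoring tree can still be found efficiently.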

  18. Scoring a tree 1: I-equivalent trees

  19. Scoring a tree 2: similar trees
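
Slides 18–19 work the score out as equation images; the key fact, in the notation of the information-theoretic slides above, is that for a tree T (each node has at most one parent)

    \frac{1}{m}\,\log P(D \mid \hat{\theta}, T)
      \;=\; \sum_{(i,j)\in T} \hat{I}(X_i ; X_j) \;-\; \sum_{i} \hat{H}(X_i)

Because mutual information is symmetric, the score depends only on which undirected edges are present, so I-equivalent trees (trees differing only in edge directions) score the same, and the entropy term is identical for every structure.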

  20. Chow-Liu tree learning algorithm 1
  • For each pair of variables Xi, Xj
    • Compute empirical distribution:
    • Compute mutual information:
  • Define a graph
    • Nodes X1, …, Xn
    • Edge (i,j) gets weight
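
The formulas are images on the slide; the standard definitions are

    \hat{P}(x_i, x_j) \;=\; \frac{\mathrm{Count}(x_i, x_j)}{m},
    \qquad
    \hat{I}(X_i ; X_j) \;=\; \sum_{x_i, x_j} \hat{P}(x_i, x_j)\,\log\frac{\hat{P}(x_i, x_j)}{\hat{P}(x_i)\,\hat{P}(x_j)}

and edge (i,j) gets weight Î(Xi ; Xj).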

  21. Chow-Liu tree learning algorithm 2
  • Optimal tree BN
    • Compute maximum weight spanning tree
    • Directions in BN: pick any node as root, breadth-first-search defines directions
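
Putting slides 20–21 together, a minimal Python sketch of the procedure on fully observed discrete data (pure Python, not the course's C++ code release; the names mutual_information and chow_liu are mine):

    # Chow-Liu sketch: MI edge weights, maximum-weight spanning tree (Prim),
    # then a BFS from an arbitrary root to orient the edges parent -> child.
    import math
    from collections import Counter, deque

    def mutual_information(data, i, j):
        """Empirical I(Xi; Xj) from data, a list of tuples of discrete values."""
        m = len(data)
        pij = Counter((row[i], row[j]) for row in data)
        pi = Counter(row[i] for row in data)
        pj = Counter(row[j] for row in data)
        return sum((c / m) * math.log((c / m) / ((pi[a] / m) * (pj[b] / m)))
                   for (a, b), c in pij.items())

    def chow_liu(data, n_vars, root=0):
        """Returns the directed edges (parent, child) of the Chow-Liu tree."""
        weight = {(i, j): mutual_information(data, i, j)
                  for i in range(n_vars) for j in range(i + 1, n_vars)}
        # Prim's algorithm for the maximum-weight spanning tree.
        in_tree, edges = {root}, []
        while len(in_tree) < n_vars:
            best = max(((i, j) for (i, j) in weight
                        if (i in in_tree) != (j in in_tree)),
                       key=lambda e: weight[e])
            edges.append(best)
            in_tree.update(best)
        # BFS from the root assigns directions.
        neighbors = {v: [] for v in range(n_vars)}
        for i, j in edges:
            neighbors[i].append(j)
            neighbors[j].append(i)
        directed, visited, queue = [], {root}, deque([root])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in visited:
                    directed.append((u, v))
                    visited.add(v)
                    queue.append(v)
        return directed

    # Tiny made-up dataset over 3 binary variables:
    data = [(0, 0, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
    print(chow_liu(data, n_vars=3))

Prim's algorithm is used here for the spanning tree; Kruskal's works equally well, and any choice of root yields an I-equivalent tree.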

  22. Can we extend Chow-Liu 1
  • Tree augmented naïve Bayes (TAN) [Friedman et al. ’97]
    • Naïve Bayes model overcounts, because correlation between features not considered
    • Same as Chow-Liu, but score edges with:
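
The edge score itself is an image; in Friedman et al.'s TAN it is the conditional mutual information between features given the class variable C:

    \hat{I}(X_i ; X_j \mid C) \;=\; \sum_{x_i, x_j, c} \hat{P}(x_i, x_j, c)\,\log\frac{\hat{P}(x_i, x_j \mid c)}{\hat{P}(x_i \mid c)\,\hat{P}(x_j \mid c)}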

  23. Can we extend Chow-Liu 2
  • (Approximately learning) models with tree-width up to k
    • [Chechetka & Guestrin ’07]
    • But, O(n^(2k+6))

  24. What you need to know about learning BN structures so far
  • Decomposable scores
    • Maximum likelihood
    • Information-theoretic interpretation
  • Best tree (Chow-Liu)
  • Best TAN
  • Nearly best k-treewidth (in O(n^(2k+6)))

  25. Can we really trust MLE?
  • What is better?
    • 3 heads, 2 tails
    • 30 heads, 20 tails
    • 3×10^23 heads, 2×10^23 tails
  • Many possible answers, we need distributions over possible parameters

  26. Bayesian Learning
  • Use Bayes rule:
  • Or equivalently:
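
The two forms are images on the slide; written out they are presumably the usual

    P(\theta \mid D) \;=\; \frac{P(D \mid \theta)\,P(\theta)}{P(D)}
    \qquad\text{or equivalently}\qquad
    P(\theta \mid D) \;\propto\; P(D \mid \theta)\,P(\theta)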

  27. Bayesian Learning for Thumbtack
  • Likelihood function is simply Binomial:
  • What about prior?
    • Represent expert knowledge
    • Simple posterior form
  • Conjugate priors:
    • Closed-form representation of posterior (more details soon)
    • For Binomial, conjugate prior is Beta distribution

  28. Beta prior distribution – P(θ)
  • Likelihood function:
  • Posterior:

  29. Posterior distribution
  • Prior:
  • Data: mH heads and mT tails
  • Posterior distribution:
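
The distributions on slides 28–29 are images; the standard Beta–Binomial update they illustrate, writing the Beta hyperparameters as α_H, α_T, is

    P(\theta) = \mathrm{Beta}(\theta;\, \alpha_H, \alpha_T) \propto \theta^{\alpha_H - 1}(1-\theta)^{\alpha_T - 1},
    \qquad
    P(D \mid \theta) \propto \theta^{m_H}(1-\theta)^{m_T}

    P(\theta \mid D) \propto \theta^{\alpha_H + m_H - 1}(1-\theta)^{\alpha_T + m_T - 1}
      = \mathrm{Beta}(\theta;\, \alpha_H + m_H,\, \alpha_T + m_T)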

  30. Conjugate prior
  • Given likelihood function P(D|θ)
  • (Parametric) prior of the form P(θ|α) is conjugate to likelihood function if posterior is of the same parametric family, and can be written as:
    • P(θ|α’), for some new set of parameters α’
  • Prior:
  • Data: mH heads and mT tails (binomial likelihood)
  • Posterior distribution:

  31. Using Bayesian posterior
  • Posterior distribution:
  • Bayesian inference:
    • No longer single parameter:
    • Integral is often hard to compute
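
The inference expression is an image; the point of the slide is presumably the predictive integral

    P(x \mid D) \;=\; \int_0^1 P(x \mid \theta)\, P(\theta \mid D)\, d\theta

i.e., predictions are averaged over the whole posterior rather than computed from a single parameter value.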

  32. Bayesian prediction of a new coin flip
  • Prior:
  • Observed mH heads, mT tails, what is the probability that flip m+1 is heads?
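
With the Beta(α_H, α_T) prior from the previous slides, the integral works out to

    P(x_{m+1} = \mathrm{heads} \mid D) \;=\; \frac{\alpha_H + m_H}{\alpha_H + \alpha_T + m_H + m_T}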

  33. Asymptotic behavior and equivalent sample size
  • Beta prior equivalent to extra thumbtack flips:
  • As m → ∞, prior is “forgotten”
  • But, for small sample size, prior is important!
  • Equivalent sample size:
    • Prior parameterized by α_H, α_T, or
    • m’ (equivalent sample size) and the prior mean θ_0
  (Plots: fix m’, change θ_0; fix θ_0, change m’)
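
The relationship between the two parameterizations (writing θ_0 for the prior mean, a symbol of my choosing) is presumably the standard one:

    m' \;=\; \alpha_H + \alpha_T,
    \qquad
    \theta_0 \;=\; \frac{\alpha_H}{\alpha_H + \alpha_T},
    \qquad
    \alpha_H = m'\theta_0,\;\; \alpha_T = m'(1-\theta_0)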

  34. Bayesian learning corresponds to smoothing
  • m = 0 ⇒ prior parameter
  • m → ∞ ⇒ MLE

  35. Bayesian learning for multinomial
  • What if you have a k-sided coin???
  • Likelihood function is multinomial:
  • Conjugate prior for multinomial is Dirichlet:
  • Observe m data points, mi from assignment i, posterior:
  • Prediction:
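
The densities are images on the slide; the standard Dirichlet–multinomial forms are

    P(D \mid \theta) \propto \prod_{i=1}^{k} \theta_i^{\,m_i},
    \qquad
    P(\theta) = \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_k) \propto \prod_{i=1}^{k} \theta_i^{\,\alpha_i - 1}

    P(\theta \mid D) = \mathrm{Dirichlet}(\alpha_1 + m_1, \ldots, \alpha_k + m_k),
    \qquad
    P(x_{m+1} = i \mid D) = \frac{\alpha_i + m_i}{\sum_j (\alpha_j + m_j)}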

  36. Bayesian learning for two-node BN
  • Parameters θ_X, θ_Y|X
  • Priors:
    • P(θ_X):
    • P(θ_Y|X):

  37. Very important assumption on prior: Global parameter independence
  • Global parameter independence:
    • Prior over parameters is product of prior over CPTs

  38. Global parameter independence, d-separation and local prediction
  (Meta BN over Flu, Allergy, Sinus, Nose, Headache and their CPT parameters)
  • Independencies in meta BN:
  • Proposition: For fully observable data D, if prior satisfies global parameter independence, then
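
The conclusion of the proposition is an equation image on the slide; presumably it is that the posterior factorizes over CPTs as well:

    P(\theta \mid D) \;=\; \prod_i P\!\left(\theta_{X_i \mid \mathrm{Pa}_{X_i}} \,\middle|\, D\right)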

  39. Within a CPT
  • Meta BN including CPT parameters:
  • Are θ_Y|X=t and θ_Y|X=f d-separated given D?
  • Are θ_Y|X=t and θ_Y|X=f independent given D?
    • Context-specific independence!!!
  • Posterior decomposes:

  40. Priors for BN CPTs (more when we talk about structure learning)
  • Consider each CPT: P(X|U=u)
  • Conjugate prior:
    • Dirichlet(α_X=1|U=u, …, α_X=k|U=u)
  • More intuitive:
    • “prior data set” D’ with m’ equivalent sample size
    • “prior counts”:
    • prediction:
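
The “prior counts” and prediction formulas are images on the slide; below is a minimal Python sketch of one standard reading, where the Dirichlet hyperparameters are taken to be counts from a hypothetical prior data set D’ (the function name and data layout are my assumptions, not the slide's):

    # Bayesian prediction for a CPT entry using prior counts from D' plus real counts from D:
    #   P(X = x | U = u, D) ~ (Count_D'(x,u) + Count_D(x,u)) / (Count_D'(u) + Count_D(u))
    from collections import Counter

    def dirichlet_cpt_prediction(data, prior_data, child, parents):
        """Both data sets are lists of dicts (variable -> value).
        Only configurations seen in D or D' are returned."""
        def counts(rows):
            joint, parent = Counter(), Counter()
            for row in rows:
                u = tuple(row[p] for p in parents)
                joint[(u, row[child])] += 1
                parent[u] += 1
            return joint, parent

        joint_p, parent_p = counts(prior_data)  # "prior counts" from D'
        joint_d, parent_d = counts(data)        # real counts from D
        keys = set(joint_p) | set(joint_d)
        return {(u, x): (joint_p[(u, x)] + joint_d[(u, x)]) /
                        (parent_p[u] + parent_d[u])
                for (u, x) in keys}

With a prior data set that contains each (x, u) configuration exactly once, this reduces to Laplace (add-one) smoothing of the MLE counts.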

  41. An example

  42. What you need to know about parameter learning
  • MLE:
    • score decomposes according to CPTs
    • optimize each CPT separately
  • Bayesian parameter learning:
    • motivation for Bayesian approach
    • Bayesian prediction
    • conjugate priors, equivalent sample size
    • Bayesian learning ⇒ smoothing
  • Bayesian learning for BN parameters
    • Global parameter independence
    • Decomposition of prediction according to CPTs
    • Decomposition within a CPT
