
Crash Course on Machine Learning Part VI


Presentation Transcript


  1. Crash Course on Machine Learning, Part VI. Several slides from Derek Hoiem, Ben Taskar, Christopher Bishop, Lise Getoor

  2. Graphical Models • Representations • Bayes Nets • Markov Random Fields • Factor Graphs • Inference • Exact • Variable Elimination • BP for trees • Approximate inference • Sampling • Loopy BP • Optimization • Learning (today)

  3. Summary of Inference Methods [Table: inference methods grouped into three columns: exact, deterministic approximation, and stochastic approximation.] Abbreviations: BP = belief propagation, EP = expectation propagation, ADF = assumed density filtering, EKF = extended Kalman filter, UKF = unscented Kalman filter, VarElim = variable elimination, Jtree = junction tree, EM = expectation maximization, VB = variational Bayes, NBP = non-parametric BP. Slide credit: Kevin Murphy

  4. Variable Elimination [Figure: a Bayesian network over the variables J, M, B, E, A used as the running variable-elimination example.]
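To make the procedure concrete, here is a minimal variable-elimination sketch, assuming the slide's J, M, B, E, A are the usual alarm-network variables (JohnCalls, MaryCalls, Burglary, Earthquake, Alarm). The CPT numbers, the query, and the elimination order are illustrative, not taken from the slides.

```python
# A variable-elimination sketch on the alarm network, using numpy arrays as factors
# and np.einsum for the multiply-and-sum-out step.  All CPT numbers are illustrative.
import numpy as np

# Each factor is (table, tuple of variable names for its axes).
p_B = (np.array([0.999, 0.001]), ("B",))                       # P(B)
p_E = (np.array([0.998, 0.002]), ("E",))                       # P(E)
p_A = (np.array([[[0.999, 0.001],                              # B=0, E=0
                  [0.710, 0.290]],                             # B=0, E=1
                 [[0.060, 0.940],                              # B=1, E=0
                  [0.050, 0.950]]]), ("B", "E", "A"))          # P(A | B, E)
p_J = (np.array([[0.95, 0.05], [0.10, 0.90]]), ("A", "J"))     # P(J | A)
p_M = (np.array([[0.99, 0.01], [0.30, 0.70]]), ("A", "M"))     # P(M | A)

def eliminate(factors, order):
    """Sum out each variable in `order`, multiplying only the factors that mention it."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[1]]
        rest = [f for f in factors if var not in f[1]]
        out_vars = sorted({v for _, vs in touching for v in vs} - {var})
        subs = ",".join("".join(v.lower() for v in vs) for _, vs in touching)
        out = "".join(v.lower() for v in out_vars)
        table = np.einsum(f"{subs}->{out}", *[t for t, _ in touching])
        factors = rest + [(table, tuple(out_vars))]
    return factors

# Query P(B | J = 1): absorb the evidence by slicing P(J | A), eliminate M, A, E,
# then multiply the remaining factors over B and normalise.
evidence_J1 = (p_J[0][:, 1], ("A",))
remaining = eliminate([p_B, p_E, p_A, evidence_J1, p_M], order=["M", "A", "E"])
posterior = np.ones(2)
for table, _ in remaining:
    posterior = posterior * table
print("P(B | J=1) =", posterior / posterior.sum())
```

The choice of elimination order determines how large the intermediate factors get, which is exactly the quantity variable elimination tries to keep small.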

  5. The Sum-Product Algorithm Objective: • to obtain an efficient, exact inference algorithm for finding marginals; • in situations where several marginals are required, to allow computations to be shared efficiently. Key idea: Distributive Law

  6. The Sum-Product Algorithm

  7. The Sum-Product Algorithm

  8. The Sum-Product Algorithm • Initialization

  9. The Sum-Product Algorithm To compute local marginals: • Pick an arbitrary node as root • Compute and propagate messages from the leaf nodes to the root, storing received messages at every node. • Compute and propagate messages from the root to the leaf nodes, storing received messages at every node. • Compute the product of received messages at each node for which the marginal is required, and normalize if necessary.
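As a concrete illustration of this two-pass schedule, here is a minimal sum-product sketch on a three-node chain (a tree) of binary variables; the node and edge potentials are made up. The forward messages play the role of the leaf-to-root pass and the backward messages the root-to-leaf pass.

```python
# A sum-product sketch on the chain x1 - x2 - x3 (a tree), with binary variables and
# illustrative node potentials phi and edge potentials psi.
import numpy as np

phi = [np.array([0.7, 0.3]), np.array([0.5, 0.5]), np.array([0.2, 0.8])]
psi = [np.array([[0.9, 0.1], [0.1, 0.9]]),
       np.array([[0.8, 0.2], [0.2, 0.8]])]            # psi[i] couples x_{i+1} and x_{i+2}

n = len(phi)
fwd = [np.ones(2)] + [None] * (n - 1)                  # message arriving from the left
bwd = [None] * (n - 1) + [np.ones(2)]                  # message arriving from the right

for i in range(1, n):                                  # leaf-to-root pass (left to right)
    fwd[i] = psi[i - 1].T @ (phi[i - 1] * fwd[i - 1])
for i in range(n - 2, -1, -1):                         # root-to-leaf pass (right to left)
    bwd[i] = psi[i] @ (phi[i + 1] * bwd[i + 1])

for i in range(n):                                     # marginal = product of incoming messages
    belief = phi[i] * fwd[i] * bwd[i]
    print(f"p(x{i + 1}) =", belief / belief.sum())
```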

  10. Loopy Belief Propagation • Sum-Product on general graphs. • Initial unit messages passed across all links, after which messages are passed around until convergence (not guaranteed!). • Approximate but tractable for large graphs. • Sometimes works well, sometimes not at all.
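A minimal sketch of that procedure on the smallest loopy graph, a 3-cycle of binary variables, with unit initial messages and synchronous updates for a fixed iteration budget; all potentials are made up.

```python
# A loopy-BP sketch on a 3-cycle MRF.  Messages start as all-ones ("unit messages")
# and are updated synchronously; convergence is not guaranteed.
import numpy as np

phi = {i: np.array([0.6, 0.4]) for i in range(3)}            # node potentials
psi = np.array([[1.2, 0.8], [0.8, 1.2]])                     # same edge potential on every edge
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

msgs = {(i, j): np.ones(2) for i in nbrs for j in nbrs[i]}   # message from i to j

for _ in range(50):                                          # fixed iteration budget
    new = {}
    for (i, j) in msgs:
        incoming = np.ones(2)
        for k in nbrs[i]:
            if k != j:
                incoming = incoming * msgs[(k, i)]           # all messages into i except from j
        m = psi.T @ (phi[i] * incoming)                      # sum over x_i
        new[(i, j)] = m / m.sum()                            # normalise for numerical stability
    msgs = new

for i in nbrs:
    belief = phi[i].copy()
    for k in nbrs[i]:
        belief = belief * msgs[(k, i)]
    print(f"approximate p(x{i}) =", belief / belief.sum())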

  11. Learning Bayesian Networks [Figure: data together with prior information are fed to an Inducer, which outputs a Bayesian network over E, B, A, R, C along with its CPTs, e.g. P(A | E, B).]

  12. Known Structure, Complete Data [Figure: the Inducer takes the data E, B, A: <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y> together with the fixed structure E → A ← B, and fills in the CPT P(A | E, B).] • Network structure is specified • Inducer needs to estimate parameters • Data does not contain missing values

  13. Unknown Structure, Complete Data [Figure: the Inducer takes the data E, B, A: <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y> and must produce both the arcs and the CPTs, e.g. P(A | E, B).] • Network structure is not specified • Inducer needs to select arcs & estimate parameters • Data does not contain missing values

  14. Known Structure, Incomplete Data [Figure: the Inducer takes the partially observed data E, B, A: <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, <?,Y,Y> together with the fixed structure E → A ← B, and estimates the CPT P(A | E, B).] • Network structure is specified • Data contains missing values • We consider assignments to missing values

  15. Known Structure / Complete Data • Given a network structure G • And a choice of parametric family for P(Xi | Pai) • Learn the parameters of the network Goal: • Construct a network that is “closest” to the distribution that generated the data

  16. Learning Parameters for a Bayesian Network [Figure: a network over the variables B, E, A, C.] • Training data has the form D = { (b[m], e[m], a[m], c[m]) : m = 1, …, M }

  17. Learning Parameters for a Bayesian Network [Figure: the same network over B, E, A, C.] • Since we assume i.i.d. samples, the likelihood function is L(θ : D) = ∏m P(b[m], e[m], a[m], c[m] : θ)

  18. Learning Parameters for a Bayesian Network [Figure: the same network over B, E, A, C.] • By the definition of the network (here B and E are parents of A, and A is the parent of C), each term factorizes: L(θ : D) = ∏m P(b[m] : θ) P(e[m] : θ) P(a[m] | b[m], e[m] : θ) P(c[m] | a[m] : θ)

  19. Learning Parameters for a Bayesian Network [Figure: the same network over B, E, A, C.] • Rewriting terms, the product over samples regroups into one local likelihood per variable: L(θ : D) = [∏m P(b[m] : θ)] · [∏m P(e[m] : θ)] · [∏m P(a[m] | b[m], e[m] : θ)] · [∏m P(c[m] | a[m] : θ)]

  20. General Bayesian Networks Generalizing to any Bayesian network: L(θ : D) = ∏m P(x[m] : θ) (i.i.d. samples) = ∏i ∏m P(xi[m] | pai[m] : θi) (network factorization) = ∏i Li(θi : D) • The likelihood decomposes according to the structure of the network.
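A small sketch of this decomposition in code: the total log-likelihood is a sum of per-family terms, each computed from the counts of a child given its parents. The tiny dataset, the variable names, and the structure E → A ← B are made up for illustration.

```python
# A sketch of the decomposed log-likelihood: one term per family (X_i, Pa_i), each
# computed from counts with ML (frequency) estimates.
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "E": [0, 0, 1, 0, 0, 1, 0, 0],
    "B": [0, 1, 0, 0, 1, 0, 0, 0],
    "A": [0, 1, 1, 0, 1, 1, 0, 0],
})
parents = {"E": [], "B": [], "A": ["E", "B"]}          # illustrative structure E -> A <- B

def family_loglik(df, child, pa):
    """Sum over samples of log P_hat(child | parents), with P_hat the ML estimate."""
    ll = 0.0
    groups = df.groupby(pa) if pa else [((), df)]      # one group per parent configuration
    for _, sub in groups:
        counts = sub[child].value_counts()
        probs = counts / counts.sum()
        ll += float((counts * np.log(probs)).sum())
    return ll

total = sum(family_loglik(data, child, pa) for child, pa in parents.items())
print("decomposed log-likelihood:", total)
```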

  21. General Bayesian Networks (Cont.) Decomposition ⇒ Independent Estimation Problems If the parameters for each family are not related, then they can be estimated independently of each other.

  22. From Binomial to Multinomial • For example, suppose X can take the values 1, 2, …, K • We want to learn the parameters θ1, θ2, …, θK Sufficient statistics: • N1, N2, …, NK - the number of times each outcome is observed Likelihood function: L(θ : D) = ∏k θk^Nk MLE: θ̂k = Nk / ∑l Nl
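A minimal sketch of these sufficient statistics and the ML estimate, with a made-up sequence of observations:

```python
# Multinomial sufficient statistics (counts) and the ML estimate.
from collections import Counter
import math

observations = [1, 3, 2, 1, 1, 3, 2, 1, 3, 3]          # X takes values in {1, ..., K}
K = 3
N = Counter(observations)                              # sufficient statistics N_1, ..., N_K
M = sum(N.values())

theta_mle = {k: N[k] / M for k in range(1, K + 1)}
log_lik = sum(N[k] * math.log(theta_mle[k]) for k in range(1, K + 1) if N[k] > 0)
print("MLE:", theta_mle, " log-likelihood at the MLE:", log_lik)
```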

  23. Likelihood for Multinomial Networks • When we assume that P(Xi | Pai) is multinomial, we get a further decomposition: Li(θi : D) = ∏pai ∏xi θxi|pai^N(xi, pai)

  24. Likelihood for Multinomial Networks • When we assume that P(Xi | Pai) is multinomial, we get a further decomposition: Li(θi : D) = ∏pai ∏xi θxi|pai^N(xi, pai) • For each value pai of the parents of Xi we get an independent multinomial problem • The MLE is θ̂xi|pai = N(xi, pai) / N(pai)

  25. Bayesian Approach: Dirichlet Priors • Recall that the likelihood function is L(θ : D) = ∏k θk^Nk • A Dirichlet prior with hyperparameters α1, …, αK is defined as P(θ) ∝ ∏k θk^(αk − 1) for legal θ1, …, θK (θk ≥ 0, ∑k θk = 1) • Then the posterior has the same form, with hyperparameters α1 + N1, …, αK + NK

  26. Dirichlet Priors (cont.) • We can compute the prediction on a new event in closed form: • If P(θ) is Dirichlet with hyperparameters α1, …, αK then P(X[1] = k) = ∫ θk P(θ) dθ = αk / ∑l αl • Since the posterior is also Dirichlet, we get P(X[M+1] = k | D) = (αk + Nk) / ∑l (αl + Nl)
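A sketch of this closed-form prediction: combine the prior hyperparameters with the observed counts. The numbers are illustrative.

```python
# Closed-form Dirichlet prediction from hyperparameters alpha_k and counts N_k.
alpha = [1.0, 1.0, 1.0]                                # Dirichlet hyperparameters (a uniform prior)
N = [5, 0, 2]                                          # observed counts for the K outcomes

posterior_alpha = [a + n for a, n in zip(alpha, N)]    # posterior is Dirichlet(alpha_k + N_k)
total = sum(posterior_alpha)
predictive = [a / total for a in posterior_alpha]      # P(X[M+1] = k | D)
print("predictive distribution:", predictive)
```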

  27. Bayesian Prediction (cont.) • Given these observations, we can compute the posterior for each multinomial Xi | pai independently • The posterior is Dirichlet with parameters α(Xi=1 | pai) + N(Xi=1 | pai), …, α(Xi=K | pai) + N(Xi=K | pai) • The predictive distribution is then represented by the parameters θ̃xi|pai = (α(xi, pai) + N(xi, pai)) / (α(pai) + N(pai))

  28. Learning Parameters: Summary • Estimation relies on sufficient statistics • For multinomials these are counts of the form N(xi, pai) • Parameter estimation: MLE θ̂xi|pai = N(xi, pai) / N(pai); Bayesian (Dirichlet) θ̃xi|pai = (α(xi, pai) + N(xi, pai)) / (α(pai) + N(pai)) • Bayesian methods also require a choice of priors • Both MLE and Bayesian estimates are asymptotically equivalent and consistent • Both can be implemented in an on-line manner by accumulating sufficient statistics

  29. Learning Structure from Complete Data

  30. Benefits of Learning Structure • Efficient learning -- more accurate models with less data • Compare: P(A) and P(B) vs. joint P(A,B) • Discover structural properties of the domain • Ordering of events • Relevance • Identifying independencies ⇒ faster inference • Predict effect of actions • Involves learning causal relationships among variables

  31. Why Struggle for Accurate Structure? [Figure: the true network over Earthquake, Burglary, Alarm Set, and Sound, compared with variants that add an arc or miss an arc.] • Adding an arc: increases the number of parameters to be fitted; encodes wrong assumptions about causality and domain structure • Missing an arc: cannot be compensated for by accurate fitting of parameters; also misses the causality and domain structure

  32. Approaches to Learning Structure • Constraint based • Perform tests of conditional independence • Search for a network that is consistent with the observed dependencies and independencies • Pros & Cons • Intuitive, follows closely the construction of BNs • Separates structure learning from the form of the independence tests • Sensitive to errors in individual tests

  33. Approaches to Learning Structure • Score based • Define a score that evaluates how well the (in)dependencies in a structure match the observations • Search for a structure that maximizes the score • Pros & Cons • Statistically motivated • Can make compromises • Takes the structure of conditional probabilities into account • Computationally hard

  34. Likelihood Score for Structures First-cut approach: • Use the likelihood function • Since we know how to maximize the parameters, from now on we assume they are set to their ML values, i.e. Score(G : D) = ℓ(θ̂G : D) = maxθ ℓ(θ : D)

  35. Likelihood Score for Structure (cont.) Rearranging terms: ℓ(θ̂G : D) = M ∑i I(Xi ; Pai) − M ∑i H(Xi) where • H(X) is the entropy of X • I(X;Y) is the mutual information between X and Y • I(X;Y) measures how much “information” each variable provides about the other • I(X;Y) ≥ 0 • I(X;Y) = 0 iff X and Y are independent • I(X;Y) = H(X) iff X is totally predictable given Y
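A quick sketch of the empirical mutual information that drives this score, computed from joint counts of a made-up sample of (x, y) pairs:

```python
# Empirical mutual information I(X;Y) from joint counts.
from collections import Counter
import math

pairs = [(0, 0), (0, 0), (1, 1), (0, 1), (1, 1), (0, 0), (1, 0), (1, 1)]
joint = Counter(pairs)
M = len(pairs)

px, py = Counter(), Counter()
for (x, y), c in joint.items():
    px[x] += c
    py[y] += c

mi = sum((c / M) * math.log((c / M) / ((px[x] / M) * (py[y] / M)))
         for (x, y), c in joint.items())
print("I(X;Y) =", mi)      # always >= 0, and 0 iff X and Y are empirically independent
```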

  36. Likelihood Score for Structure (cont.) Good news: • Intuitive explanation of likelihood score: • The larger the dependency of each variable on its parents, the higher the score • Likelihood as a compromise among dependencies, based on their strength

  37. Likelihood Score for Structure (cont.) Bad news: • Adding arcs always helps • I(X;Y) ≤ I(X; Y,Z) • Maximal score attained by fully connected networks • Such networks can overfit the data --- parameters capture the noise in the data

  38. Avoiding Overfitting “Classic” issue in learning. Approaches: • Restricting the hypothesis space • Limits the overfitting capability of the learner • Example: restrict # of parents or # of parameters • Minimum description length • Description length measures complexity • Prefer models that compactly describe the training data • Bayesian methods • Average over all possible parameter values • Use prior knowledge

  39. Bayesian Inference • Bayesian reasoning --- compute the expectation over the unknown G: P(x[M+1] | D) = ∑G P(G | D) P(x[M+1] | G, D) • Assumption: the Gs are mutually exclusive and exhaustive • We know how to compute P(x[M+1] | G, D) • Same as prediction with a fixed structure • How do we compute P(G | D)?

  40. Posterior Score • Using Bayes rule: P(G | D) = P(D | G) P(G) / P(D), where P(G) is the prior over structures, P(D | G) is the marginal likelihood, and P(D) is the probability of the data • P(D) is the same for all structures G • It can be ignored when comparing structures

  41. Marginal Likelihood • By introducing the parameters as a variable and integrating them out, we have P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ, where P(D | G, θ) is the likelihood and P(θ | G) is the prior over parameters • This integral measures the sensitivity to the choice of parameters

  42. Marginal Likelihood for General Networks The marginal likelihood has the form P(D | G) = ∏i ∏pai [ Γ(α(pai)) / Γ(α(pai) + N(pai)) ] ∏xi [ Γ(α(xi, pai) + N(xi, pai)) / Γ(α(xi, pai)) ] with α(pai) = ∑xi α(xi, pai) and N(pai) = ∑xi N(xi, pai), where • N(..) are the counts from the data • α(..) are the hyperparameters for each family given G • each inner factor is the Dirichlet marginal likelihood for the sequence of values of Xi observed when Xi’s parents take a particular value
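A small sketch of this closed form for a single family, using log-Gamma functions; the counts and hyperparameters below are made up.

```python
# Closed-form Dirichlet marginal likelihood for one family (X_i | pa_i), in log space.
from scipy.special import gammaln

def log_marginal_family(counts, alphas):
    """counts[j][k], alphas[j][k]: child value k under parent configuration j."""
    logml = 0.0
    for N_j, a_j in zip(counts, alphas):
        logml += gammaln(sum(a_j)) - gammaln(sum(a_j) + sum(N_j))   # Gamma(alpha(pa)) / Gamma(alpha(pa) + N(pa))
        logml += sum(gammaln(a + n) - gammaln(a) for a, n in zip(a_j, N_j))
    return logml

counts = [[3, 1], [0, 4]]              # N(x_i, pa_i) for two parent configurations
alphas = [[1.0, 1.0], [1.0, 1.0]]      # alpha(x_i, pa_i)
print("log-marginal-likelihood contribution of this family:",
      log_marginal_family(counts, alphas))
```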

  43. Bayesian Score Theorem: If the prior P(θ | G) is “well-behaved”, then, as M → ∞, log P(D | G) = ℓ(θ̂G : D) − (log M / 2) dim(G) + O(1), where dim(G) is the number of independent parameters of G.

  44. Asymptotic Behavior: Consequences • Bayesian score is consistent • As M → ∞ the “true” structure G* maximizes the score (almost surely) • For sufficiently large M, the maximal scoring structures are equivalent to G* • Observed data eventually overrides prior information • Assuming that the prior assigns positive probability to all cases

  45. Asymptotic Behavior • This score can also be justified by the Minimum Description Length (MDL) principle • The equation explicitly shows the tradeoff between • Fitness to the data --- the likelihood term • Penalty for complexity --- the regularization term
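A sketch of that tradeoff as a BIC/MDL-style score: the fit term is the maximized log-likelihood and the penalty is (log M / 2) times the number of free parameters. It reuses the hypothetical family_loglik helper and the data / parents from the earlier decomposition sketch; the variable cardinalities are assumed binary.

```python
# BIC/MDL-style structure score: fit minus complexity penalty.
import math

def bic_score(df, parents, cardinality):
    M = len(df)
    score = 0.0
    for child, pa in parents.items():
        score += family_loglik(df, child, pa)                        # fit term (earlier sketch)
        n_parent_configs = math.prod(cardinality[p] for p in pa)     # 1 if there are no parents
        n_params = n_parent_configs * (cardinality[child] - 1)       # free parameters of the family
        score -= 0.5 * math.log(M) * n_params                        # complexity penalty
    return score

print("BIC score:", bic_score(data, parents, {"E": 2, "B": 2, "A": 2}))
```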

  46. Priors • Over network structure • Minor role • e.g. uniform • Over parameters • A separate prior for each structure is not feasible • Fixed Dirichlet distributions • BDe prior

  47. BDe Score Possible solution: The BDe prior • Represent prior using two elements M0, B0 • M0 - equivalent sample size • B0 - network representing the prior probability of events

  48. BDe Score Intuition: M0 prior examples distributed according to B0 • Set α(xi, pai) = M0 · P(xi, pai | B0) • Compute P(xi, pai | B0) using standard inference procedures • Such priors have desirable theoretical properties • Equivalent networks are assigned the same score

  49. Scores -- Summary • Likelihood, MDL, and (log) BDe all decompose into a sum of per-family terms • BDe requires assessing a prior network. It can naturally incorporate prior knowledge and previous experience • BDe is consistent and asymptotically equivalent (up to a constant) to MDL • All are score-equivalent: G equivalent to G’ ⇒ Score(G) = Score(G’)

  50. Optimization Problem Input: • Training data • Scoring function (including priors, if needed) • Set of possible structures • Including prior knowledge about structure Output: • A network (or networks) that maximize the score Key Property: • Decomposability: the score of a network is a sum of terms.
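The transcript stops at the problem statement. As one standard local-search approach that exploits decomposability (not necessarily the one the later slides use), here is a hedged sketch of greedy hill-climbing over structures; score_fn can be any structure score, for instance the hypothetical bic_score above, and the acyclicity check is a simple depth-first search.

```python
# Greedy hill-climbing structure search with single-arc add/delete moves.
import itertools

def is_acyclic(parents):
    seen, stack = set(), set()
    def visit(v):
        if v in stack:
            return False                      # back-edge: a directed cycle
        if v in seen:
            return True
        stack.add(v)
        ok = all(visit(p) for p in parents[v])
        stack.discard(v)
        seen.add(v)
        return ok
    return all(visit(v) for v in parents)

def hill_climb(variables, score_fn, max_iters=100):
    parents = {v: [] for v in variables}      # start from the empty graph
    best = score_fn(parents)
    for _ in range(max_iters):
        improved = False
        for x, y in itertools.permutations(variables, 2):
            cand = {v: list(ps) for v, ps in parents.items()}
            if x in cand[y]:
                cand[y].remove(x)             # candidate move: delete the arc x -> y
            else:
                cand[y].append(x)             # candidate move: add the arc x -> y
            if not is_acyclic(cand):
                continue
            s = score_fn(cand)                # decomposability would let us rescore only y's family
            if s > best:
                parents, best, improved = cand, s, True
        if not improved:
            break                             # local maximum reached
    return parents, best

# Hypothetical usage, reusing the earlier sketches:
# structure, score = hill_climb(["E", "B", "A"],
#                               lambda g: bic_score(data, g, {"E": 2, "B": 2, "A": 2}))
```

Because the score is a sum of per-family terms, a real implementation would cache the score of every family and rescore only the family whose parent set changed, which is exactly what the slide's decomposability property buys.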
