
CS b553: Algorithms for Optimization and Learning


Presentation Transcript


  1. CS b553: Algorithms for Optimization and Learning Structure Learning

  2. Agenda • Learning probability distributions from example data • To what extent can Bayes net structure be learned? • Constraint methods (inferring conditional independence) • Scoring methods (learning => optimization)

  3. Basic Question • Given examples drawn from a distribution P* with independence relations given by the Bayesian structure G*, can we recover G*?

  4. Basic Question • Given examples drawn from a distribution P* with independence relations given by the Bayesian structure G*, can we recover G*? Revised question: can we construct a network that encodes the same independence relations as G*? (Figure: G* and two alternative networks G1, G2 encoding the same independences)

  5. Learning in the face of Noisy Data • Ex: flip two independent coins • Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT • Two candidate models: Model 1 (X and Y disconnected) and Model 2 (X → Y)

  6. Learning in the face of Noisy Data • Ex: flip two independent coins • Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT • Model 1 (X, Y disconnected), ML parameters: P(X=H) = 9/20, P(Y=H) = 8/20 • Model 2 (X → Y), ML parameters: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11

  7. Learning in the face of Noisy Data • Ex: flip two independent coins • Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT • Model 1 (X, Y disconnected), ML parameters: P(X=H) = 9/20, P(Y=H) = 8/20 • Model 2 (X → Y), ML parameters: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11 • Errors are likely to be larger!
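To make the slide's numbers concrete, here is a minimal sketch (variable names are mine, not from the slides) that recovers the ML parameters of both models from the 20-flip dataset.

```python
# ML parameters for the two-coin example: dataset of 20 flips (3 HH, 6 HT, 5 TH, 6 TT).
counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
M = sum(counts.values())                                    # 20 samples

n_x_h = sum(c for (x, _), c in counts.items() if x == "H")  # 9 flips with X = H
n_y_h = sum(c for (_, y), c in counts.items() if y == "H")  # 8 flips with Y = H

# Model 1: X and Y independent.
p_x_h = n_x_h / M                                   # P(X=H) = 9/20
p_y_h = n_y_h / M                                   # P(Y=H) = 8/20

# Model 2: X -> Y, so Y has a separate parameter for each value of X.
p_y_h_given_x_h = counts[("H", "H")] / n_x_h        # P(Y=H|X=H) = 3/9
p_y_h_given_x_t = counts[("T", "H")] / (M - n_x_h)  # P(Y=H|X=T) = 5/11

print(p_x_h, p_y_h, p_y_h_given_x_h, p_y_h_given_x_t)
```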

  8. Principle • Learning structure must trade off fit of data vs. complexity of network • Complex networks • More parameters to learn • More data fragmentation = greater sensitivity to noise

  9. Approach #1: Constraint-based learning • First, identify an undirected skeleton of edges in G* • If an edge X-Y is in G*, then no subset of evidence variables can make X and Y independent • If X-Y is not in G*, then we can find evidence variables to make X and Y independent • Then, assign directionality to preserve independences

  10. Build-Skeleton algorithm • Given X = {X1,…,Xn} and a query Independent?(X,Y,U) • H = complete graph over X • For all pairs Xi, Xj, test separation as follows: • Enumerate all possible separating sets U • If Independent?(Xi,Xj,U) then remove Xi—Xj from H • In practice: • Must restrict to bounded-size subsets |U| ≤ d (i.e., assume G* has bounded degree) => O(n²(n−2)^d) tests • Independence can't be tested exactly
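A sketch of the Build-Skeleton loop described above, assuming a black-box oracle `independent(xi, xj, U)` and a degree bound `d` (both names are mine; in practice the oracle is a statistical test, as on the following slides, and real implementations restrict candidate separators further, e.g. to current neighbors).

```python
from itertools import combinations

def build_skeleton(variables, independent, d):
    """Return an undirected skeleton (set of frozenset edges) and the separating sets found.

    variables   -- list of variable names X1..Xn
    independent -- oracle: independent(xi, xj, U) -> True if xi is independent of xj given U
    d           -- bound on the size of candidate separating sets, |U| <= d
    """
    edges = {frozenset(p) for p in combinations(variables, 2)}   # start from the complete graph
    sepset = {}
    for xi, xj in combinations(variables, 2):
        others = [v for v in variables if v not in (xi, xj)]
        for k in range(d + 1):                      # enumerate candidate separators by size
            found = False
            for U in combinations(others, k):
                if independent(xi, xj, set(U)):
                    edges.discard(frozenset((xi, xj)))
                    sepset[frozenset((xi, xj))] = set(U)
                    found = True
                    break
            if found:
                break
    return edges, sepset
```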

  11. Assigning Directionality • Note that v-structures X→Y←Z introduce a dependency between X and Z given Y • In structures X→Y→Z, X←Y←Z, and X←Y→Z, X and Z are independent given Y; in fact, Y must be given for X and Z to be independent • Idea: look at separating sets for all triples X—Y—Z in the skeleton without edge X—Z • (Figure: triangle on X, Y, Z: directionality is irrelevant)

  12. Assigning Directionality • Note that v-structures X→Y←Z introduce a dependency between X and Z given Y • In structures X→Y→Z, X←Y←Z, and X←Y→Z, X and Z are independent given Y; in fact, Y must be given for X and Z to be independent • Idea: look at separating sets for all triples X—Y—Z in the skeleton without edge X—Z • (Figure: triangle on X, Y, Z: directionality is irrelevant; X—Y—Z where Y separates X, Z: not a v-structure)

  13. Assigning Directionality • Note that v-structures X→Y←Z introduce a dependency between X and Z given Y • In structures X→Y→Z, X←Y←Z, and X←Y→Z, X and Z are independent given Y; in fact, Y must be given for X and Z to be independent • Idea: look at separating sets for all triples X—Y—Z in the skeleton without edge X—Z • (Figure: triangle on X, Y, Z: directionality is irrelevant; X—Y—Z where Y separates X, Z: not a v-structure; X—Y—Z where some U with Y ∉ U separates X, Z: a v-structure X→Y←Z)

  14. Assigning Directionality • Note that v-structures X→Y←Z introduce a dependency between X and Z given Y • In structures X→Y→Z, X←Y←Z, and X←Y→Z, X and Z are independent given Y; in fact, Y must be given for X and Z to be independent • Idea: look at separating sets for all triples X—Y—Z in the skeleton without edge X—Z • (Figure: triangle on X, Y, Z: directionality is irrelevant; X—Y—Z where Y separates X, Z: not a v-structure; X—Y—Z where some U with Y ∉ U separates X, Z: a v-structure X→Y←Z)
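Continuing the sketch: given the skeleton and separating sets produced by `build_skeleton` above, the v-structure rule on these slides can be applied to unshielded triples. This is again a sketch under my own naming, not a full PC-style orientation procedure (the remaining edges would still need to be oriented without creating new v-structures or cycles).

```python
def orient_v_structures(edges, sepset, variables):
    """Mark X -> Y <- Z for every unshielded triple X - Y - Z (no X-Z edge)
    in which Y is NOT in the separating set that made X and Z independent."""
    directed = set()                                  # (parent, child) pairs
    for y in variables:
        neighbors = [v for v in variables if frozenset((v, y)) in edges]
        for i, x in enumerate(neighbors):
            for z in neighbors[i + 1:]:
                if frozenset((x, z)) in edges:
                    continue                          # triangle: directionality not determined here
                if y not in sepset.get(frozenset((x, z)), set()):
                    directed.add((x, y))              # v-structure: x -> y <- z
                    directed.add((z, y))
    return directed
```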

  15. Statistical Independence Testing • Question: are X and Y independent? • Null hypothesis H0: X and Y are independent • Alternative hypothesis HA: X and Y are not independent

  16. Statistical Independence Testing • Question: are X and Y independent? • Null hypothesis H0: X and Y are independent • Alternative hypothesis HA: X and Y are not independent • χ² test: use the statistic χ² = Σx,y M (P̂(x,y) − P̂(x)P̂(y))² / (P̂(x)P̂(y)), with P̂ the empirical probability and M the number of samples • Can compute (table lookup) the probability of getting a value at least this extreme if H0 is true (the p-value) • If p < some threshold, e.g., 1 − 0.95 = 0.05, H0 is rejected
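For discrete variables the test can be run directly on a contingency table; here is a sketch using SciPy's chi2_contingency on the coin data from the earlier slides (correction=False gives the plain χ² statistic, without Yates' continuity correction).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table for the coin data: rows = X in {H, T}, columns = Y in {H, T}.
table = np.array([[3, 6],
                  [5, 6]])

chi2, p_value, dof, expected = chi2_contingency(table, correction=False)

# Reject H0 (independence) only if the p-value falls below the chosen threshold.
alpha = 0.05   # i.e., 1 - 0.95
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, reject H0: {p_value < alpha}")
```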

  17. Approach #2: Score-based Methods • Learning => optimization • Define scoring function Score(G;D) that evaluates quality of structure G, and optimize it • Combinatorial optimization problem • Issues: • Choice of scoring function: maximum likelihood score, Bayesian score • Efficient optimization techniques

  18. Maximum-Likelihood Scores • ScoreL(G;D) = likelihood of the BN with the most likely parameter settings under structure G • Let L(θG, G; D) be the likelihood of the data using parameters θG with structure G • Let θ*G = arg maxθ L(θ, G; D), as described in the last lecture • Then ScoreL(G;D) = L(θ*G, G; D)
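A sketch of ScoreL computed from complete data by plugging the ML parameters back into the likelihood. The decomposition over families (one term per variable and parent configuration) is standard, but the data representation (a list of dicts) and the function name are my own.

```python
import math
from collections import Counter

def loglik_score(data, parents):
    """ScoreL(G;D): log-likelihood of structure G with ML parameters plugged in.

    data    -- list of dicts mapping variable name -> value (complete data)
    parents -- dict mapping each variable to a tuple of its parents in G
    """
    score = 0.0
    for x, pa in parents.items():
        # Count joint configurations of (parents, x) and of the parents alone.
        joint = Counter((tuple(row[p] for p in pa), row[x]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        # ML estimate of P(x | pa) is joint / marg; plug it back into the likelihood.
        score += sum(n * math.log(n / marg[u]) for (u, _), n in joint.items())
    return score

# Coin example: G1 (independent coins) vs. G2 (X -> Y).
data = ([{"X": "H", "Y": "H"}] * 3 + [{"X": "H", "Y": "T"}] * 6 +
        [{"X": "T", "Y": "H"}] * 5 + [{"X": "T", "Y": "T"}] * 6)
print(loglik_score(data, {"X": (), "Y": ()}))      # G1
print(loglik_score(data, {"X": (), "Y": ("X",)}))  # G2 (never smaller than G1's score)
```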

  19. Issue with ML score

  20. Issue with ML Score • Independent coin example: G1 (X, Y disconnected) vs. G2 (X → Y) • ML parameters: G1: P(X=H) = 9/20, P(Y=H) = 8/20; G2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11 • Likelihood scores: log L(θ*G1, G1; D) = 9 log(9/20) + 11 log(11/20) + 8 log(8/20) + 12 log(12/20); log L(θ*G2, G2; D) = 9 log(9/20) + 11 log(11/20) + 3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)

  21. Issue with ML Score • G1 (X, Y disconnected) vs. G2 (X → Y) • Likelihood score difference: log L(θ*G1, G1; D) − log L(θ*G2, G2; D) = 8 log(8/20) + 12 log(12/20) − [3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)]

  22. Issue with ML Score • G1 (X, Y disconnected) vs. G2 (X → Y) • Likelihood score difference: log L(θ*G1, G1; D) − log L(θ*G2, G2; D) = 8 log(8/20) + 12 log(12/20) − 8 [3/8 log(3/9) + 5/8 log(5/11)] − 12 [6/12 log(6/9) + 6/12 log(6/11)]

  23. Issue with ML Score • G1 (X, Y disconnected) vs. G2 (X → Y) • Likelihood score difference: log L(θ*G1, G1; D) − log L(θ*G2, G2; D) = 8 log(8/20) + 12 log(12/20) − 8 [3/8 log(3/9) + 5/8 log(5/11)] − 12 [6/12 log(6/9) + 6/12 log(6/11)] = 8 [log(8/20) − 3/8 log(3/9) − 5/8 log(5/11)] + 12 [log(12/20) − 6/12 log(6/9) − 6/12 log(6/11)]

  24. Issue with ML Score • G1 (X, Y disconnected) vs. G2 (X → Y) • Likelihood score difference: log L(θ*G1, G1; D) − log L(θ*G2, G2; D) = 8 log(8/20) + 12 log(12/20) − 8 [3/8 log(3/9) + 5/8 log(5/11)] − 12 [6/12 log(6/9) + 6/12 log(6/11)] = 8 [log(8/20) − 3/8 log(3/9) − 5/8 log(5/11)] + 12 [log(12/20) − 6/12 log(6/9) − 6/12 log(6/11)] = −Σx,y M(x,y) log [ P̂(y|x) / P̂(y) ], where M(x,y) is the count of (x,y) in the data and P̂ the empirical distribution

  25. Issue with ML Score • G1 (X, Y disconnected) vs. G2 (X → Y) • Likelihood score difference: log L(θ*G1, G1; D) − log L(θ*G2, G2; D) = 8 log(8/20) + 12 log(12/20) − 8 [3/8 log(3/9) + 5/8 log(5/11)] − 12 [6/12 log(6/9) + 6/12 log(6/11)] = 8 [log(8/20) − 3/8 log(3/9) − 5/8 log(5/11)] + 12 [log(12/20) − 6/12 log(6/9) − 6/12 log(6/11)] = −Σx,y M(x,y) log [ P̂(y|x) / P̂(y) ] = −M Σx,y P̂(x,y) log [ P̂(x,y) / (P̂(x) P̂(y)) ]

  26. Issue with ML Score • G1 (X, Y disconnected) vs. G2 (X → Y) • Likelihood score difference: log L(θ*G1, G1; D) − log L(θ*G2, G2; D) = 8 log(8/20) + 12 log(12/20) − 8 [3/8 log(3/9) + 5/8 log(5/11)] − 12 [6/12 log(6/9) + 6/12 log(6/11)] = 8 [log(8/20) − 3/8 log(3/9) − 5/8 log(5/11)] + 12 [log(12/20) − 6/12 log(6/9) − 6/12 log(6/11)] = −Σx,y M(x,y) log [ P̂(y|x) / P̂(y) ] = −M Σx,y P̂(x,y) log [ P̂(x,y) / (P̂(x) P̂(y)) ] = −M · IP̂(X;Y) ≤ 0, the (scaled, negated) empirical mutual information defined on the next slide
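A quick numeric check of the identity above, using the counts from the running example: the difference of maximized log-likelihoods equals −M times the empirical mutual information (both evaluate to about −0.153 in natural logs).

```python
import math

# Counts from the running example: 3 HH, 6 HT, 5 TH, 6 TT; M = 20.
counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
M = sum(counts.values())
px = {"H": 9 / 20, "T": 11 / 20}
py = {"H": 8 / 20, "T": 12 / 20}

# log L(G1*) - log L(G2*), written out exactly as on the slides.
diff = (8 * math.log(8 / 20) + 12 * math.log(12 / 20)
        - 3 * math.log(3 / 9) - 6 * math.log(6 / 9)
        - 5 * math.log(5 / 11) - 6 * math.log(6 / 11))

# Empirical mutual information I(X;Y) under the empirical joint distribution.
mi = sum((n / M) * math.log((n / M) / (px[x] * py[y]))
         for (x, y), n in counts.items())

print(diff, -M * mi)   # the two values agree (about -0.1525 each)
```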

  27. Mutual Information Properties • IP(X;Y) = Σx,y P(x,y) log [ P(x,y) / (P(x) P(y)) ] (the mutual information between X and Y) • IP(X;Y) = D( P(X,Y) || Q(X,Y) ) with Q(x,y) = P(x)P(y), so IP(X;Y) ≥ 0 by nonnegativity of the KL divergence • Implication: ML scores do not decrease for more connected graphs => overfitting to data!

  28. Possible solutions • Fix complexity of graphs (e.g., bounded in-degree) • See HW7 • Penalize complex graphs • Bayesian scores

  29. Idea of Bayesian Scoring • Note that parameters are uncertain • Bayesian approach: put a prior on parameter values and marginalize them out • P(D|G) = ∫ P(D | θG, G) P(θG | G) dθG • For example, use Beta/Dirichlet priors => the marginal is manageable to compute • E.g., uniform hyperparameter α over the network • Set virtual counts to α · 2^−|PaXi|
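With Dirichlet priors the marginal P(D|G) has a closed form in Gamma functions, one factor per variable and parent configuration. Below is a sketch of that per-node factor with a single scalar virtual count per table cell; the function and argument names are mine, and setting the virtual counts in proportion to 2^−|PaXi| as on the slide is one standard choice.

```python
from math import lgamma

def log_marginal_node(counts, alpha):
    """Log of the Dirichlet marginal-likelihood factor for one node.

    counts -- dict: parent configuration u -> dict: value x -> count M[u, x]
    alpha  -- virtual count used for every (u, x) cell
    """
    total = 0.0
    for u, per_value in counts.items():
        m_u = sum(per_value.values())        # total count for this parent configuration
        a_u = alpha * len(per_value)         # total virtual count for this configuration
        total += lgamma(a_u) - lgamma(a_u + m_u)
        for x, m_ux in per_value.items():
            total += lgamma(alpha + m_ux) - lgamma(alpha)
    return total

# Coin example, structure X -> Y: Y's factor with virtual count 0.5 per cell.
y_counts = {("H",): {"H": 3, "T": 6}, ("T",): {"H": 5, "T": 6}}
print(log_marginal_node(y_counts, alpha=0.5))
```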

  30. Large Sample Approximation • log P(D|G) = log L(θ*G, G; D) − ½ (log M) Dim[G] + O(1) • With M the number of samples and Dim[G] the number of free parameters of G • Bayesian Information Criterion (BIC) score: • ScoreBIC(G;D) = log L(θ*G, G; D) − ½ (log M) Dim[G]

  31. Large Sample Approximation • log P(D|G) = log L(θ*G, G; D) − ½ (log M) Dim[G] + O(1) • With M the number of samples and Dim[G] the number of free parameters of G • Bayesian Information Criterion (BIC) score: • ScoreBIC(G;D) = log L(θ*G, G; D) − ½ (log M) Dim[G] • (First term: fits the data set; second term: prefers simple models)
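A sketch of the BIC score built from the ML log-likelihood (as in the earlier sketch) plus the Dim[G] penalty; cardinality gives the number of values of each variable, and the names are again my own.

```python
import math
from collections import Counter

def bic_score(data, parents, cardinality):
    """ScoreBIC(G;D) = ML log-likelihood - (1/2) (log M) Dim[G]."""
    M = len(data)
    loglik, dim = 0.0, 0
    for x, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[x]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        loglik += sum(n * math.log(n / marg[u]) for (u, _), n in joint.items())
        # Free parameters of this family: (|Val(X)| - 1) * (number of parent configurations).
        dim += (cardinality[x] - 1) * math.prod(cardinality[p] for p in pa)
    return loglik - 0.5 * math.log(M) * dim

# Coin example: the complexity penalty now favors the simpler structure G1.
data = ([{"X": "H", "Y": "H"}] * 3 + [{"X": "H", "Y": "T"}] * 6 +
        [{"X": "T", "Y": "H"}] * 5 + [{"X": "T", "Y": "T"}] * 6)
card = {"X": 2, "Y": 2}
print(bic_score(data, {"X": (), "Y": ()}, card))      # G1
print(bic_score(data, {"X": (), "Y": ("X",)}, card))  # G2
```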

  32. Structure Optimization, Given a Score… • The problem is well-defined, but combinatorially complex! • Superexponential in # of variables • Idea: search locally through the space of graphs using graph operators • Add edge • Delete edge • Reverse edge

  33. Search Strategies • Greedy • Pick the operator that leads to the greatest Δ in score • Local minima? Plateaux? • Overcoming plateaux • Search with basin flooding • Tabu search • Perturbation methods (similar to simulated annealing, except on data weighting) • Implementation details: • Evaluate Δ's between structures quickly (local decomposability)
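A sketch of the greedy variant: hill-climbing with the three edge operators from the previous slide, taking the move with the greatest score improvement until none improves. It assumes a black-box score(parents) function (e.g., the BIC sketch above) and rescores whole candidate structures, so it does not exploit local decomposability or the tabu/perturbation ideas listed above; all names are my own.

```python
from itertools import permutations

def is_acyclic(parents):
    """Check the parent-set representation for directed cycles (depth-first search)."""
    visiting, done = set(), set()
    def dfs(v):
        if v in done:
            return True
        if v in visiting:
            return False                       # back edge: cycle found
        visiting.add(v)
        for p in parents[v]:
            if not dfs(p):
                return False
        visiting.discard(v)
        done.add(v)
        return True
    return all(dfs(v) for v in parents)

def greedy_search(variables, score):
    """Repeatedly apply the add/delete/reverse edge move with the greatest score gain."""
    parents = {v: set() for v in variables}    # start from the empty graph
    current = score(parents)
    while True:
        best_move, best_score = None, current
        for x, y in permutations(variables, 2):
            for op in ("add", "delete", "reverse"):
                cand = {v: set(ps) for v, ps in parents.items()}
                if op == "add" and x not in cand[y]:
                    cand[y].add(x)                       # add edge x -> y
                elif op == "delete" and x in cand[y]:
                    cand[y].discard(x)                   # delete edge x -> y
                elif op == "reverse" and x in cand[y]:
                    cand[y].discard(x)                   # reverse x -> y into y -> x
                    cand[x].add(y)
                else:
                    continue
                if is_acyclic(cand):
                    s = score(cand)
                    if s > best_score:
                        best_move, best_score = cand, s
        if best_move is None:
            return parents, current            # local optimum (or plateau) reached
        parents, current = best_move, best_score
```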

  34. Recap • Bayes net structure learning: at best we recover an equivalence class of networks that encode the same conditional independences • Constraint-based methods • Statistical independence tests • Score-based methods • Learning => optimization
