
Presentation Transcript


  1. The sample complexity of learning Bayesian Networks. Or Zuk*^, Shiri Margel* and Eytan Domany*. *Dept. of Physics of Complex Systems, Weizmann Inst. of Science. ^Broad Inst. of MIT and Harvard.

  2. Introduction
  • Let X1,..,Xn be binary random variables.
  • A Bayesian Network is a pair B ≡ <G, θ>.
  • G – a Directed Acyclic Graph (DAG), G = <V,E>, with vertex set V = {X1,..,Xn}. PaG(i) is the set of vertices Xj s.t. (Xj,Xi) ∈ E.
  • θ – the parameterization, representing the conditional probabilities P(Xi | PaG(i)). Example CPT for an edge X1 → X2:

        P(X2 | X1)    X2 = 0    X2 = 1
        X1 = 0        0.95      0.05
        X1 = 1        0.2       0.8

  • Together, G and θ define a unique joint probability distribution PB over the n random variables.
  [Figure: an example DAG over X1,..,X5, illustrating a conditional independence between {X1,X4} and {X2,X3}.]
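To make the factorization concrete, here is a minimal sketch (illustrative, not from the presentation) that encodes the two-node fragment X1 → X2 with the CPT above and evaluates the joint distribution PB(x1,x2) = P(x1)·P(x2|x1); the marginal for X1 is an assumed value, since the slide only shows the conditional table.

    # Minimal sketch of the DAG factorization P_B(x) = prod_i P(x_i | Pa_G(i)),
    # for the two-node example X1 -> X2 with the CPT shown on the slide.
    import itertools

    p_x1 = {0: 0.6, 1: 0.4}              # assumed marginal for X1 (not given on the slide)
    p_x2_given_x1 = {0: [0.95, 0.05],     # P(X2=0 | X1=0), P(X2=1 | X1=0)
                     1: [0.20, 0.80]}     # P(X2=0 | X1=1), P(X2=1 | X1=1)

    def joint(x1, x2):
        """P_B(X1=x1, X2=x2) via the factorization along the DAG X1 -> X2."""
        return p_x1[x1] * p_x2_given_x1[x1][x2]

    # Sanity check: the factorization defines a proper joint distribution.
    total = sum(joint(a, b) for a, b in itertools.product([0, 1], repeat=2))
    assert abs(total - 1.0) < 1e-12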

  3. Structure Learning
  • We looked at a score-based approach: for each graph G, one gives a score based on the data, S(G) ≡ SN(G; D) (N is the sample size).
  • The score is composed of two components:
    1. Data fitting (log-likelihood): LLN(G;D) = maxθ LLN(G,θ;D)
    2. Model complexity: Ψ(N) |G|, where the dimension |G| is the number of parameters in (G,θ).
    SN(G) = LLN(G;D) - Ψ(N) |G|
  • This is known as the MDL (Minimum Description Length) score. Assumption: 1 << Ψ(N) << N. The score is consistent.
  • Of special interest: Ψ(N) = ½ log N. The resulting BIC score (Bayesian Information Criterion) is asymptotically equivalent to the Bayesian score.
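The following is a minimal sketch (not the authors' implementation) of how the MDL score SN(G) = LLN(G;D) − Ψ(N)|G| could be computed for binary data, with the BIC penalty Ψ(N) = ½ log N as the default; the function name mdl_score and the dictionary-of-parent-sets representation of G are illustrative choices.

    # Sketch: MDL score S_N(G) = LL_N(G;D) - Psi(N)*|G| for binary data,
    # using the BIC penalty Psi(N) = 0.5 * log2(N) by default.
    import numpy as np
    from itertools import product

    def mdl_score(data, parent_sets, psi=None):
        """data: (N, n) array of 0/1 samples; parent_sets: dict {node i: tuple of parent indices}."""
        N, n = data.shape
        psi = 0.5 * np.log2(N) if psi is None else psi
        loglik, dim = 0.0, 0
        for i, parents in parent_sets.items():
            dim += 2 ** len(parents)                # one free parameter per parent configuration
            for pa_cfg in product([0, 1], repeat=len(parents)):
                mask = (np.all(data[:, list(parents)] == pa_cfg, axis=1)
                        if parents else np.ones(N, dtype=bool))
                n_pa = mask.sum()
                if n_pa == 0:
                    continue
                n1 = data[mask, i].sum()            # count of X_i = 1 in this parent configuration
                for c in (n1, n_pa - n1):           # maximized (plug-in) log-likelihood terms
                    if c > 0:
                        loglik += c * np.log2(c / n_pa)
        return loglik - psi * dim

The log-likelihood here is measured in bits; any fixed base works as long as the penalty term uses the same base.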

  4. Previous Work
  • [Friedman & Yakhini 96] Unknown structure, no hidden variables. [Dasgupta 97] Known structure, hidden variables. [Hoeffgen 93] Unknown structure, no hidden variables. [Abbeel et al. 05] Factor graphs. [Greiner et al. 97] Classification error.
  • These works concentrated on approximating the generative distribution. Typical result: for N > N0(ε,δ), D(Ptrue || Plearned) < ε with probability > 1-δ. Here D is some distance between distributions, usually the relative entropy (which we use from now on).
  • We are interested in learning the correct structure. Intuition and practice suggest this is a difficult problem (both computationally and statistically).
  • Empirical study: [Dai et al. IJCAI 97]. New: [Wainwright et al. 06], [Bresler et al. 08] – undirected graphs.

  5. Structure Learning
  • Assume the data is generated from B* = <G*,θ*>, with generative distribution PB*. Assume further that G* is minimal with respect to PB*: |G*| = min {|G| : PB* ∈ M(G)}.
  • An error occurs when any 'wrong' graph G is preferred over G*. There are many possible G's, with complicated relations between them.
  • Observation: directed graphical models (with no hidden variables) are curved exponential families [Geiger et al. 01].
  • [Haughton 88] – The MDL score is consistent.
  • [Haughton 89] – Bounds on the error probabilities: P(N)(under-fitting) ~ O(e^(-αN)); P(N)(over-fitting) ~ O(N^(-β)). Previously: bounds only on β, not on α, nor on the multiplicative constants.

  6. Structure Learning
  Simulations: 4-node networks. In total 543 DAGs, in 185 equivalence classes.
  • Draw a DAG G* at random.
  • Draw all parameters θ uniformly from [0,1].
  • Generate 5,000 samples from P<G*,θ>.
  • Give scores SN(G) to all G's and look at SN(G*) (see the sketch below).
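A sketch of this simulation protocol (an illustrative reconstruction, not the authors' code); it reuses the hypothetical mdl_score function from the sketch after slide 3, fixes the natural node order when drawing the random DAG, and generates the samples by ancestral sampling.

    # Sketch of the simulation protocol: random DAG, uniform [0,1] parameters,
    # 5,000 samples by ancestral sampling, then score candidate structures.
    import numpy as np
    from itertools import product

    rng = np.random.default_rng(0)
    n, N = 4, 5000

    # Random DAG G*: fix the order X1,..,Xn and include each edge j -> i (j < i) with prob. 1/2.
    parents = {i: tuple(j for j in range(i) if rng.random() < 0.5) for i in range(n)}

    # theta[i][cfg] = P(X_i = 1 | Pa(i) = cfg), drawn uniformly from [0,1].
    theta = {i: {cfg: rng.random() for cfg in product([0, 1], repeat=len(pa))}
             for i, pa in parents.items()}

    # Ancestral sampling: nodes are already in topological order.
    data = np.zeros((N, n), dtype=int)
    for s in range(N):
        for i in range(n):
            cfg = tuple(int(v) for v in data[s, list(parents[i])])
            data[s, i] = int(rng.random() < theta[i][cfg])

    # Score the true structure; in the full experiment every DAG G would be scored the same way.
    print("S_N(G*) =", mdl_score(data, parents))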

  7. Structure Learning
  [Figures: simulation results showing (i) the relative entropy between the true and learned distributions, (ii) the fraction of edges learned correctly, and (iii) the rank of the correct structure (equivalence class).]

  8. All DAGs and Equivalence Classes for 3 Nodes

  9. Two Types of Error
  • An error occurs when any 'wrong' graph G is preferred over G*. There are many possible G's; we study them one by one.
  • Distinguish between two types of errors:
    1. Graphs G which are not I-maps for PB* ('under-fitting'). These graphs impose too many independence relations, some of which do not hold in PB*.
    2. Graphs G which are I-maps for PB* ('over-fitting'), yet are over-parameterized: |G| > |G*|.
  • Study each error separately.

  10. 'Under-fitting' Errors
  1. Graphs G which are not I-maps for PB*.
  • Intuitively, in order to get SN(G*) > SN(G), we need:
    a. The sample distribution P(N) to be closer to PB* than to any point Q in M(G).
    b. The penalty difference Ψ(N) (|G| - |G*|) to be small enough (only relevant for |G*| > |G|).
  • For a., use concentration bounds (Sanov). For b., simple algebraic manipulations.

  11. 'Under-fitting' Errors
  1. Graphs G which are not I-maps for PB*.
  • Sanov's Theorem [Sanov 57]: Draw N samples from a probability distribution P over a finite alphabet X, and let P(N) be the sample (empirical) distribution. Then:
    Pr( D(P(N) || P) > ε ) ≤ (N+1)^|X| 2^(-εN)
    (in our setting of n binary variables, |X| = 2^n).
  • Used in our case to show that the probability of preferring such a G over G* decays exponentially, O(2^(-cN)) for some c > 0.
  • For |G| ≤ |G*|, we are able to bound c (see the next slide).
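As a purely numerical illustration (not from the paper), one can check the flavor of the bound for a single binary variable, where the alphabet size is |X| = 2 and the bound reads (N+1)² 2^(−εN); the constants p, N, ε and the trial count below are arbitrary illustrative choices.

    # Numerical illustration of a Sanov-type bound for one binary variable (|X| = 2):
    # Pr( D(P^(N) || P) > eps ) <= (N+1)^2 * 2^(-N*eps).
    import numpy as np

    rng = np.random.default_rng(1)
    p, N, eps, trials = 0.3, 500, 0.05, 20000

    def kl_binary(q, p):
        """Relative entropy D(q || p) in bits between Bernoulli(q) and Bernoulli(p)."""
        out = 0.0
        if q > 0:
            out += q * np.log2(q / p)
        if q < 1:
            out += (1 - q) * np.log2((1 - q) / (1 - p))
        return out

    hits = sum(kl_binary(rng.binomial(N, p) / N, p) > eps for _ in range(trials))
    print("empirical tail:", hits / trials, "  Sanov bound:", (N + 1) ** 2 * 2.0 ** (-N * eps))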

  12. 'Under-fitting' Errors
  • Upper bound on the decay exponent: c ≤ D(G || PB*) log 2. The decay could be very slow if G is close to PB*.
  • Lower bound: use Chernoff bounds to bound the difference between the true and sample entropies.
  • Two important parameters of the network:
    a. The 'minimal probability'.
    b. The 'minimal edge information'.

  13. 'Over-fitting' Errors
  2. Graphs G which are over-parameterized I-maps for PB*.
  • Here errors are moderate-deviations events, as opposed to the large-deviations events in the previous case.
  • The probability of error does not decay exponentially with N, but is O(N^(-β)).
  • By [Woodroofe 78], β = ½(|G| - |G*|).
  • Therefore, for large enough values of N, the error is dominated by over-fitting.
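A heuristic sketch of where this polynomial rate comes from (based on standard Wilks-type χ² asymptotics; this is an illustration, not a quote from [Woodroofe 78]): with d = |G| − |G*| extra parameters, the over-fitting event is roughly a χ²_d tail event at threshold d ln N.

    % Heuristic sketch: G is an over-parameterized I-map with d = |G| - |G^*| extra parameters.
    % By Wilks-type asymptotics, 2 \ln 2 \cdot [ LL_N(G;D) - LL_N(G^*;D) ] is approximately \chi^2_d,
    % so with the BIC penalty \Psi(N) = \tfrac{1}{2}\log_2 N the over-fitting event becomes a chi-square tail:
    \Pr\bigl(S_N(G) > S_N(G^*)\bigr)
      \;\approx\; \Pr\Bigl(\chi^2_d > 2\ln 2 \cdot \tfrac{d}{2}\log_2 N\Bigr)
      \;=\; \Pr\bigl(\chi^2_d > d \ln N\bigr)
      \;\sim\; N^{-d/2}\,\mathrm{poly}(\log N),
    % consistent with the decay exponent \beta = \tfrac{1}{2}(|G| - |G^*|) quoted above.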

  14. Example
  [Figure: three DAGs over X1,..,X4 – the true structure G* and two wrong models G1 and G2.]
  What happens for small values of N?
  • Perform simulations:
  • Take a BN over 4 binary nodes.
  • Look at the two wrong models G1 and G2.

  15. Example
  Errors become rare events. Simulate using importance sampling (30 iterations) [Zuk et al. UAI 06].

  16. Recent Results / Future Directions
  • We want to minimize the sum of errors ('over-fitting' + 'under-fitting'). Change the penalty in the MDL score to Ψ(N) = ½ log N - c log log N.
  • Number of variables n >> 1, with small maximum degree: # parents ≤ d.
  • Simulations for trees (computationally efficient: Chow-Liu); see the sketch below.
  • Hidden variables – even more basic questions (e.g. identifiability, consistency) are generally unknown.
  • Requiring the exact model was perhaps too strict – it may be acceptable to learn wrong models which are close to the correct one. If we require learning only a fraction 1-ε of the edges, how much does this reduce the sample complexity?
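Since the slide points to trees and Chow-Liu as the computationally efficient case, here is a minimal sketch (illustrative, not the authors' code) of the Chow-Liu step: estimate pairwise mutual information from binary data and take a maximum-weight spanning tree via Kruskal's algorithm with union-find.

    # Sketch of the Chow-Liu step: maximum-weight spanning tree over pairwise mutual information.
    import numpy as np
    from itertools import combinations

    def mutual_info(data, i, j):
        """Empirical mutual information (bits) between binary columns i and j of data."""
        mi = 0.0
        for a in (0, 1):
            for b in (0, 1):
                p_ab = np.mean((data[:, i] == a) & (data[:, j] == b))
                p_a, p_b = np.mean(data[:, i] == a), np.mean(data[:, j] == b)
                if p_ab > 0:
                    mi += p_ab * np.log2(p_ab / (p_a * p_b))
        return mi

    def chow_liu_tree(data):
        """Return the edge list of a maximum-MI spanning tree (Kruskal with union-find)."""
        n = data.shape[1]
        edges = sorted(((mutual_info(data, i, j), i, j)
                        for i, j in combinations(range(n), 2)), reverse=True)
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        tree = []
        for _, i, j in edges:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj
                tree.append((i, j))
        return tree

For example, chow_liu_tree(data) on the samples generated in the simulation sketch above returns n − 1 undirected tree edges, which can then be oriented away from an arbitrary root to give a directed tree model.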
