
Probabilistic Graphical Models



Presentation Transcript


  1. Probabilistic Graphical Models • Tool for representing complex systems and performing sophisticated reasoning tasks • Fundamental notion: modularity • Complex systems are built by combining simpler parts • Why have a model? • Compact and modular representation of complex systems • Ability to execute complex reasoning patterns • Make predictions • Generalize from particular problems

  2. Probability Theory • A probability distribution P over (Ω, S) is a mapping from events in S such that: • P(α) ≥ 0 for all α ∈ S • P(Ω) = 1 • If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β) • Conditional Probability: P(α | β) = P(α ∩ β) / P(β) • Chain Rule: P(α ∩ β) = P(α) P(β | α) • Bayes Rule: P(α | β) = P(β | α) P(α) / P(β) • Conditional Independence: P(α | β ∩ γ) = P(α | γ)
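A minimal Python sketch of these definitions, using an assumed toy joint distribution over two binary events (the numbers are illustrative, not from the slides); it checks that Bayes rule recovers the conditional probability computed directly from the joint.

```python
# Sketch: conditional probability, chain rule, and Bayes rule on a toy joint.
# The joint table is an illustrative assumption.
joint = {          # P(alpha, beta) over two binary events
    (True, True): 0.2, (True, False): 0.3,
    (False, True): 0.1, (False, False): 0.4,
}

def marginal(event_index, value):
    """Sum out the other event to get P(alpha) or P(beta)."""
    return sum(p for outcome, p in joint.items() if outcome[event_index] == value)

def conditional(a_value, b_value):
    # Conditional probability: P(alpha | beta) = P(alpha, beta) / P(beta)
    return joint[(a_value, b_value)] / marginal(1, b_value)

# Bayes rule: P(alpha | beta) = P(beta | alpha) P(alpha) / P(beta)
p_b_given_a = joint[(True, True)] / marginal(0, True)
bayes = p_b_given_a * marginal(0, True) / marginal(1, True)
assert abs(bayes - conditional(True, True)) < 1e-12
```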

  3. Random Variables & Notation • Random variable: function from Ω to a value • Categorical / Ordinal / Continuous • Val(X) – set of possible values of RV X • Upper case letters denote RVs (e.g., X, Y, Z) • Upper case bold letters denote sets of RVs (e.g., X, Y) • Lower case letters denote RV values (e.g., x, y, z) • Lower case bold letters denote RV set values (e.g., x) • Values for categorical RVs with |Val(X)|=k: x1, x2, …, xk • Marginal distribution over X: P(X) • Conditional independence: X is independent of Y given Z if P(X | Y, Z) = P(X | Z)

  4. Expectation • Discrete RVs: E[X] = Σx x·P(x) • Continuous RVs: E[X] = ∫ x·p(x) dx • Linearity of expectation: E[X+Y] = E[X] + E[Y] • Expectation of products: E[X·Y] = E[X]·E[Y] (independence assumption: when X ⊥ Y in P)

  5. Variance • Variance of RV: Var[X] = E[(X − E[X])²] • If X and Y are independent: Var[X+Y] = Var[X] + Var[Y] • Var[aX+b] = a²Var[X]

  6. Information Theory • Entropy: Hp(X) = −Σx P(x) log P(x) • We use log base 2 to interpret entropy as bits of information • Entropy of X is a lower bound on the avg. # of bits needed to encode values of X • 0 ≤ Hp(X) ≤ log|Val(X)| for any distribution P(X) • Conditional entropy: Hp(X|Y) = −Σx,y P(x,y) log P(x|y) • Information only helps: Hp(X|Y) ≤ Hp(X) • Mutual information: Ip(X;Y) = Hp(X) − Hp(X|Y) • 0 ≤ Ip(X;Y) ≤ Hp(X) • Symmetry: Ip(X;Y) = Ip(Y;X) • Ip(X;Y) = 0 iff X and Y are independent • Chain rule of entropies: Hp(X,Y) = Hp(X) + Hp(Y|X)
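A short sketch of the entropy and mutual information definitions above, on an assumed 2×2 joint distribution (the probabilities are illustrative):

```python
import math

# Sketch: entropy and mutual information for a small joint P(X, Y).
P_xy = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
        ('x1', 'y0'): 0.1, ('x1', 'y1'): 0.4}

def H(dist):
    """Entropy in bits of a distribution given as {value: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

P_x, P_y = {}, {}
for (x, y), p in P_xy.items():
    P_x[x] = P_x.get(x, 0.0) + p
    P_y[y] = P_y.get(y, 0.0) + p

H_x, H_y, H_xy = H(P_x), H(P_y), H(P_xy)
I_xy = H_x + H_y - H_xy          # mutual information Ip(X;Y) >= 0
H_x_given_y = H_xy - H_y         # chain rule: H(X,Y) = H(Y) + H(X|Y)
assert I_xy >= 0 and H_x_given_y <= H_x   # "information only helps"
```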

  7. Distances Between Distributions • Relative Entropy: D(P||Q) = Σx P(x) log(P(x)/Q(x)) • D(P||Q) ≥ 0 • D(P||Q) = 0 iff P = Q • Not a distance metric (no symmetry, no triangle inequality) • L1 distance: ||P−Q||1 = Σx |P(x) − Q(x)| • L2 distance: ||P−Q||2 = √(Σx (P(x) − Q(x))²) • L∞ distance: ||P−Q||∞ = maxx |P(x) − Q(x)|
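A small sketch computing the four distances above for two assumed distributions P and Q on the same three-element support (log base 2, matching the entropy slide):

```python
import math

# Sketch: distances between two distributions; P and Q are illustrative.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

kl   = sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)   # D(P||Q)
l1   = sum(abs(p - q) for p, q in zip(P, Q))                      # L1
l2   = math.sqrt(sum((p - q) ** 2 for p, q in zip(P, Q)))         # L2
linf = max(abs(p - q) for p, q in zip(P, Q))                      # L-infinity

# Relative entropy is not symmetric: D(P||Q) != D(Q||P) in general.
kl_rev = sum(q * math.log2(q / p) for p, q in zip(P, Q) if q > 0)
```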

  8. Independent Random Variables • Two variables X and Y are independent if • P(X=x|Y=y) = P(X=x) for all values x, y • Equivalently, knowing Y does not change predictions of X • If X and Y are independent then: • P(X, Y) = P(X|Y)P(Y) = P(X)P(Y) • If X1,…,Xn are independent then: • P(X1,…,Xn) = P(X1)…P(Xn) • O(n) parameters • All 2^n probabilities are implicitly defined • Cannot represent many types of distributions

  9. Conditional Independence • X and Y are conditionally independent given Z if • P(X=x|Y=y, Z=z) = P(X=x|Z=z) for all values x, y, z • Equivalently, if we know Z, then knowing Y does not change predictions of X • Notation: Ind(X;Y | Z) or (X ⊥ Y | Z)

  10. Conditional Parameterization • S = Score on test, Val(S) = {s0, s1} • I = Intelligence, Val(I) = {i0, i1} • Joint parameterization P(I,S): 3 parameters • Conditional parameterization P(I) P(S|I): 3 parameters • Alternative parameterization: P(S) and P(I|S)

  11. Conditional Parameterization • S = Score on test, Val(S) = {s0, s1} • I = Intelligence, Val(I) = {i0, i1} • G = Grade, Val(G) = {g0, g1, g2} • Assume that G and S are independent given I • Joint parameterization: 2·2·3 = 12 outcomes, so 12 − 1 = 11 independent parameters • Conditional parameterization: P(I,S,G) = P(I)P(S|I)P(G|I,S) = P(I)P(S|I)P(G|I) • P(I) – 1 independent parameter • P(S|I) – 2·1 = 2 independent parameters • P(G|I) – 2·2 = 4 independent parameters • In total: 7 independent parameters
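A minimal sketch of the factored model above; the structure (1 + 2 + 4 = 7 independent parameters) follows the slide, while the numeric CPD entries are illustrative assumptions:

```python
# Sketch of the conditional parameterization P(I, S, G) = P(I) P(S|I) P(G|I).
# Numeric values are illustrative assumptions; only the structure is from the slide.
P_I = {'i0': 0.7, 'i1': 0.3}                               # 1 free parameter
P_S_given_I = {'i0': {'s0': 0.95, 's1': 0.05},             # 2 free parameters
               'i1': {'s0': 0.20, 's1': 0.80}}
P_G_given_I = {'i0': {'g0': 0.2, 'g1': 0.5, 'g2': 0.3},    # 4 free parameters
               'i1': {'g0': 0.7, 'g1': 0.2, 'g2': 0.1}}

def joint(i, s, g):
    # G is conditionally independent of S given I, so P(G|I,S) = P(G|I).
    return P_I[i] * P_S_given_I[i][s] * P_G_given_I[i][g]

total = sum(joint(i, s, g) for i in P_I
            for s in ('s0', 's1') for g in ('g0', 'g1', 'g2'))
assert abs(total - 1.0) < 1e-12   # the factored joint is a valid distribution
```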

  12. Biased Coin Toss Example • Coin can land in two positions: Head or Tail • Estimation task • Given toss examples x[1],...,x[m], estimate P(H) = θ and P(T) = 1 − θ • Assumption: i.i.d. samples • Tosses are controlled by an (unknown) parameter θ • Tosses are sampled from the same distribution • Tosses are independent of each other

  13. L(D:)  0 0.2 0.4 0.6 0.8 1 Biased Coin Toss Example • Goal: find [0,1] that predicts the data well • “Predicts the data well” = likelihood of the data given  • Example: probability of sequence H,T,T,H,H

  14. L(D:)  0 0.2 0.4 0.6 0.8 1 Maximum Likelihood Estimator • Parameter  that maximizes L(D:) • In our example, =0.6 maximizes the sequence H,T,T,H,H

  15. Maximum Likelihood Estimator • General case • Observations: MH heads and MT tails • Find θ maximizing the likelihood L(D:θ) = θ^MH (1−θ)^MT • Equivalent to maximizing the log-likelihood: log L(D:θ) = MH log θ + MT log(1−θ) • Differentiating the log-likelihood and solving for θ, the maximum likelihood parameter is θ̂ = MH / (MH + MT)
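A small sketch of the closed-form MLE on the H,T,T,H,H sequence used in the slides (the grid of comparison points in the sanity check is an arbitrary choice):

```python
import math

# Sketch: maximum likelihood estimate for a biased coin from toss data.
tosses = ['H', 'T', 'T', 'H', 'H']        # the H,T,T,H,H example
M_H = tosses.count('H')
M_T = tosses.count('T')

def log_likelihood(theta):
    """log L(D:theta) = M_H log(theta) + M_T log(1 - theta)."""
    return M_H * math.log(theta) + M_T * math.log(1 - theta)

theta_mle = M_H / (M_H + M_T)             # closed-form MLE, here 3/5 = 0.6

# Sanity check: the closed form is at least as good as nearby values.
assert all(log_likelihood(theta_mle) >= log_likelihood(t)
           for t in (0.1, 0.3, 0.5, 0.7, 0.9))
```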

  16. Sufficient Statistics • For computing the parameter θ of the coin toss example, we only needed MH and MT, since L(D:θ) = θ^MH (1−θ)^MT • ⇒ MH and MT are sufficient statistics

  17. Sufficient Statistics • A function s(D) mapping instances to a vector in R^k is a sufficient statistic if, for any two datasets D and D’ and any θ, s(D) = s(D’) implies L(D:θ) = L(D’:θ)

  18. Sufficient Statistics for Multinomial • A sufficient statistic for a dataset D over a variable Y with k values is the tuple of counts <M1,...,Mk> such that Mi is the number of times that Y = yi occurs in D • Per-instance sufficient statistic: define s(x[i]) as a tuple of dimension k • s(x[i]) = (0,...,0,1,0,...,0), where the single 1 appears in the coordinate of the observed value yj and coordinates 1,...,j−1 and j+1,...,k are zero • Summing over instances gives s(D) = <M1,...,Mk>

  19. Sufficient Statistic for Gaussian • Gaussian distribution: p(x) = (1/(√(2π)·σ)) exp(−(x−μ)²/(2σ²)) • Rewrite as an exponential whose exponent is quadratic in x • ⇒ sufficient statistics for the Gaussian: s(x[i]) = <1, x, x²>

  20. Maximum Likelihood Estimation • MLE Principle: choose θ that maximizes L(D:θ) • Multinomial MLE: θ̂i = Mi / Σj Mj • Gaussian MLE: μ̂ = (1/M) Σm x[m], σ̂ = √((1/M) Σm (x[m] − μ̂)²)
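A brief sketch of both MLE formulas, computed from sufficient statistics on assumed toy datasets:

```python
import math

# Sketch: MLE from sufficient statistics for multinomial and Gaussian data.
# Both datasets are illustrative assumptions.
ys = ['a', 'b', 'a', 'c', 'a', 'b']                 # multinomial data
counts = {y: ys.count(y) for y in set(ys)}          # sufficient statistics M_i
theta_multinomial = {y: m / len(ys) for y, m in counts.items()}  # M_i / sum M_j

xs = [1.2, 0.7, 1.9, 1.4, 0.8]                      # Gaussian data
s1 = len(xs)                                        # sufficient statistics:
sx = sum(xs)                                        #   sum of x
sx2 = sum(x * x for x in xs)                        #   sum of x^2
mu_mle = sx / s1
sigma_mle = math.sqrt(sx2 / s1 - mu_mle ** 2)       # MLE uses 1/M, not 1/(M-1)
```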

  21. MLE for Bayesian Networks • Network: X → Y • Parameters • θx0, θx1 • θy0|x0, θy1|x0, θy0|x1, θy1|x1 • Data instance m is a tuple <x[m], y[m]> • Likelihood: L(D:θ) = Πm P(x[m], y[m]:θ) = Πm P(x[m]:θ) P(y[m]|x[m]:θ) = (Πm P(x[m]:θ)) · (Πm P(y[m]|x[m]:θ)) • ⇒ Likelihood decomposes into two separate terms, one for each variable

  22. MLE for Bayesian Networks • Terms further decompose by CPDs: Πm P(y[m]|x[m]:θ) = Πm:x[m]=x0 P(y[m]|x0:θ) · Πm:x[m]=x1 P(y[m]|x1:θ) • By sufficient statistics: Πm:x[m]=x0 P(y[m]|x0:θ) = θy0|x0^M[x0,y0] · θy1|x0^M[x0,y1], where M[x0,y0] is the number of data instances in which X takes the value x0 and Y takes the value y0 • MLE: θ̂y0|x0 = M[x0,y0] / (M[x0,y0] + M[x0,y1]) = M[x0,y0] / M[x0]

  23. MLE for Bayesian Networks • Likelihood for a Bayesian network decomposes into local likelihoods: L(D:θ) = Πi Li(D:θXi|Pa(Xi)) • ⇒ If the parameter sets θXi|Pa(Xi) are disjoint, then the MLE can be computed by maximizing each local likelihood separately

  24. MLE for Table CPD BayesNets • Multinomial CPD P(X | Pa(X)) • For each value u of X’s parents we get an independent multinomial problem, whose MLE is θ̂x|u = M[u, x] / M[u]
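A minimal sketch of table-CPD estimation for the two-node network X → Y used in the earlier slides; the data tuples are illustrative assumptions:

```python
from collections import Counter

# Sketch: MLE for the two-node network X -> Y from complete data.
data = [('x0', 'y0'), ('x0', 'y1'), ('x0', 'y0'),
        ('x1', 'y1'), ('x1', 'y1'), ('x0', 'y0')]

M_xy = Counter(data)                       # sufficient statistics M[x, y]
M_x = Counter(x for x, _ in data)          # sufficient statistics M[x]
M = len(data)

theta_x = {x: M_x[x] / M for x in M_x}     # MLE for P(X)
theta_y_given_x = {                        # MLE for P(Y|X): M[x, y] / M[x]
    (y, x): M_xy[(x, y)] / M_x[x]
    for (x, y) in M_xy
}
```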

  25. Limitations of MLE • Two teams play 10 times, and the first wins 7 of the 10 matches ⇒ probability of the first team winning = 0.7 • A coin is tossed 10 times and comes up heads in 7 of the 10 tosses ⇒ probability of heads = 0.7 • Would you place the same bet on the next game as you would on the next coin toss? • We need to incorporate prior knowledge • Prior knowledge should only be used as a guide

  26. Bayesian Inference • Assumptions • Given a fixed θ, tosses are independent • If θ is unknown, tosses are not marginally independent – each toss tells us something about θ • The following network captures our assumptions: [Diagram: parameter node θ with children X[1], X[2], …, X[M]]

  27. Bayesian Inference • [Diagram: parameter node θ with children X[1], X[2], …, X[M]] • Joint probabilistic model: P(D, θ) = P(D|θ)P(θ) • Posterior probability over θ: P(θ|D) = P(D|θ)P(θ) / P(D), where P(D|θ) is the likelihood, P(θ) is the prior, and P(D) is the normalizing factor • For a uniform prior, the posterior is the normalized likelihood

  28. Bayesian Prediction • Predict the next data instance from the previous ones: P(x[M+1] | D) = ∫ P(x[M+1] | θ) P(θ | D) dθ • Solving for a uniform prior P(θ) = 1 and a binary variable gives P(X[M+1] = H | D) = (MH + 1) / (MH + MT + 2)

  29. Example: Binomial Data • [Plot: posterior P(θ|D) over θ ∈ [0, 1] for (MH, MT) = (4, 1)] • Prior: uniform for θ in [0,1], P(θ) = 1 • ⇒ P(θ|D) is proportional to the likelihood L(D:θ) • MLE for P(X=H) is 4/5 = 0.8 • Bayesian prediction is (4+1)/(5+2) = 5/7
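A tiny sketch contrasting the MLE with the Bayesian prediction under a uniform prior for the (MH, MT) = (4, 1) example:

```python
# Sketch: MLE vs. Bayesian prediction with a uniform prior,
# for the (M_H, M_T) = (4, 1) example from the slide.
M_H, M_T = 4, 1

theta_mle = M_H / (M_H + M_T)                    # 4/5 = 0.8

# With a uniform prior, the predictive probability of the next head is
# (M_H + 1) / (M_H + M_T + 2), i.e. 5/7 here.
p_next_head = (M_H + 1) / (M_H + M_T + 2)
print(theta_mle, p_next_head)                    # 0.8 vs. ~0.714
```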

  30. Dirichlet Priors • A Dirichlet prior is specified by a set of (non-negative) hyperparameters α1,...,αk, so that θ ~ Dirichlet(α1,...,αk) if P(θ) ∝ Πi θi^(αi − 1), where Σi θi = 1 • Intuitively, the hyperparameters correspond to the number of imaginary examples we saw before starting the experiment

  31. Dirichlet Priors – Example • [Plot: densities of Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2), and Dirichlet(5,5) over θ ∈ [0, 1]]

  32. Dirichlet Priors • Dirichlet priors have the property that the posterior is also Dirichlet • Data counts are M1,...,Mk • Prior is Dir(α1,...,αk) • ⇒ Posterior is Dir(α1+M1,...,αk+Mk) • The hyperparameters α1,…,αk can be thought of as “imaginary” counts from our prior experience • Equivalent sample size = α1+…+αk • The larger the equivalent sample size, the more confident we are in our prior
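A short sketch of the Dirichlet update; the prior hyperparameters and counts are illustrative assumptions:

```python
# Sketch: Dirichlet prior -> Dirichlet posterior for a k-valued variable.
alpha_prior = [2.0, 2.0, 2.0]          # Dirichlet(alpha_1, ..., alpha_k)
counts = [10, 3, 1]                    # data counts M_1, ..., M_k

# Conjugacy: posterior is Dirichlet(alpha_i + M_i).
alpha_post = [a + m for a, m in zip(alpha_prior, counts)]

# Predictive probability of value i: (alpha_i + M_i) / (sum alpha + sum M).
total = sum(alpha_post)
predictive = [a / total for a in alpha_post]

equivalent_sample_size = sum(alpha_prior)   # strength of the prior
```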

  33. Effect of Priors • Prediction of P(X=H) after seeing data with MH = ¼·MT, as a function of the sample size • [Plots: left – different prior strengths αH + αT with a fixed ratio αH/αT; right – fixed prior strength αH + αT with different ratios αH/αT]

  34. Effect of Priors (cont.) • In real data, Bayesian estimates are less sensitive to noise in the data • [Plot: P(X = 1|D) as a function of N for the MLE and for Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(5,5), and Dirichlet(10,10) priors, together with the toss results]

  35. General Formulation • Joint distribution over D, θ: P(D, θ) = P(D|θ)P(θ) • Posterior distribution over parameters: P(θ|D) = P(D|θ)P(θ) / P(D) • P(D) is the marginal likelihood of the data • As we saw, the likelihood can be described compactly using sufficient statistics • We want conditions under which the posterior is also compact

  36. Conjugate Families • A family of priors P(θ:α) is conjugate to a model P(ξ|θ) if, for any possible dataset D of i.i.d. samples from P(ξ|θ) and any choice of hyperparameters α for the prior over θ, there are hyperparameters α’ that describe the posterior, i.e., P(θ:α’) ∝ P(D|θ)P(θ:α) • The posterior has the same parametric form as the prior • The Dirichlet prior is a conjugate family for the multinomial likelihood • Conjugate families are useful since: • Many distributions can be represented with hyperparameters • They allow for sequential updates within the same representation • In many cases we have closed-form solutions for prediction

  37. Parameter Estimation Summary • Estimation relies on sufficient statistics • For multinomials these are the counts M[xi, pai] • Parameter estimation: • MLE: θ̂xi|pai = M[xi, pai] / M[pai] • Bayesian (Dirichlet): (M[xi, pai] + αxi|pai) / (M[pai] + αpai) • Bayesian methods also require a choice of priors • MLE and Bayesian estimates are asymptotically equivalent • Both can be implemented in an online manner by accumulating sufficient statistics

  38. This Week’s Assignment • Compute P(S) • Decompose as a Markov model of order k • Collect sufficient statistics • Use ratio to genome background • Evaluation & Deliverable • Test set likelihood ratio to random locations & sequences • ROC analysis (ranking)
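One possible sketch of the assignment pipeline, assuming DNA sequences over the alphabet ACGT; the function names (kmer_counts, log_prob), the pseudocount smoothing, and the commented scoring line are hypothetical choices, not specified in the slides:

```python
import math
from collections import Counter

def kmer_counts(seqs, k):
    """Sufficient statistics: counts of (k-long context, next symbol)."""
    counts = Counter()
    for s in seqs:
        for i in range(k, len(s)):
            counts[(s[i - k:i], s[i])] += 1
    return counts

def log_prob(seq, counts, k, alphabet='ACGT', pseudo=1.0):
    """log P(S) under an order-k Markov model, with pseudocount smoothing."""
    lp = 0.0
    for i in range(k, len(seq)):
        ctx, x = seq[i - k:i], seq[i]
        num = counts[(ctx, x)] + pseudo
        den = sum(counts[(ctx, a)] for a in alphabet) + pseudo * len(alphabet)
        lp += math.log(num / den)
    return lp

# Log-likelihood ratio of a test sequence: foreground model vs. genome
# background model (fg_seqs / bg_seqs are assumed training sets).
# score = log_prob(test_seq, kmer_counts(fg_seqs, k), k) \
#       - log_prob(test_seq, kmer_counts(bg_seqs, k), k)
```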

  39. Hidden Markov Model • Special case of Dynamic Bayesian network • Single (hidden) state variable • Single (observed) observation variable • Transition probability P(S’|S) assumed to be sparse • Usually encoded by a state transition graph • [Diagram: template network over S → S’ and S’ → O’, and the unrolled network S0 → S1 → S2 → S3 with observations O0, O1, O2, O3]

  40. Hidden Markov Model • Special case of Dynamic Bayesian network • Single (hidden) state variable • Single (observed) observation variable • Transition probability P(S’|S) assumed to be sparse • Usually encoded by a state transition graph • [Diagram: state transition graph over states S1–S4 encoding P(S’|S)]
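A minimal sketch of an HMM with a sparse transition table and a single filtering (forward) step; the states, probability tables, and observation are illustrative assumptions, and the forward recursion is the standard one rather than something given in the slides:

```python
# Sketch of an HMM: one hidden state variable, one observation variable,
# and a sparse transition distribution P(S'|S).
states = ['s0', 's1', 's2']
transition = {                      # P(S'|S); missing entries are 0 (sparse)
    's0': {'s0': 0.7, 's1': 0.3},
    's1': {'s1': 0.6, 's2': 0.4},
    's2': {'s2': 1.0},
}
emission = {                        # P(O|S)
    's0': {'a': 0.9, 'b': 0.1},
    's1': {'a': 0.2, 'b': 0.8},
    's2': {'a': 0.5, 'b': 0.5},
}

def forward_step(belief, obs):
    """One filtering step: P(S_t | o_1..t) from P(S_{t-1} | o_1..t-1)."""
    new = {}
    for s2 in states:
        pred = sum(belief[s1] * transition[s1].get(s2, 0.0) for s1 in states)
        new[s2] = pred * emission[s2].get(obs, 0.0)
    z = sum(new.values())
    return {s: p / z for s, p in new.items()} if z > 0 else new

belief = {'s0': 1.0, 's1': 0.0, 's2': 0.0}   # assume a known initial state
belief = forward_step(belief, 'b')           # update after observing 'b'
```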
