

CS b553: Algorithms for Optimization and Learning

Bayesian Networks


Agenda

  • Bayesian networks

    • Chain rule for Bayes nets

    • Naïve Bayes models

  • Independence declarations

    • D-separation

  • Probabilistic inference queries


Purposes of Bayesian Networks

  • Efficient and intuitive modeling of complex causal interactions

  • Compact representation of joint distributions: O(n) rather than O(2^n)

  • Algorithms for efficient inference with given evidence (more on this next time)


Independence of random variables

  • Two random variables A and B are independent if P(A,B) = P(A) P(B), hence P(A|B) = P(A)

  • Knowing B doesn't give you any information about A

  • [This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent; a numeric check follows below]
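As a concrete check, here is a minimal Python sketch that tests this definition against an explicit joint table; the table's numbers are made-up for illustration and are not from the slides:

```python
# Check pairwise independence from an explicit joint table.
import itertools

joint = {  # P(A=a, B=b); chosen so that P(A=0)=0.6 and P(B=0)=0.7
    (0, 0): 0.42, (0, 1): 0.18,
    (1, 0): 0.28, (1, 1): 0.12,
}

def marginal(index, value):
    """Sum the joint over all assignments that fix one variable."""
    return sum(p for assignment, p in joint.items() if assignment[index] == value)

# Independence requires P(A=a, B=b) = P(A=a) P(B=b) for *every* pair (a, b).
independent = all(
    abs(joint[(a, b)] - marginal(0, a) * marginal(1, b)) < 1e-12
    for a, b in itertools.product((0, 1), repeat=2)
)
print(independent)  # True for this table
```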


Significance of independence

  • If A and B are independent, then P(A,B) = P(A) P(B)

  • ⇒ The joint distribution over A and B can be defined as the product of the distribution of A and the distribution of B

  • ⇒ Store two much smaller probability tables rather than one large table over all combinations of A and B


Conditional Independence

  • Two random variables A and B are conditionally independent given C if P(A,B|C) = P(A|C) P(B|C), hence P(A|B,C) = P(A|C)

  • Once you know C, learning B doesn't give you any information about A

  • [Again, this has to hold for all combinations of values that A, B, and C can take on]


Significance of Conditional Independence

  • Consider Grade(CS101), Intelligence, and SAT

  • Ostensibly, the grade in a course doesn’t have a direct relationship with SAT scores

  • but good students are more likely to get good SAT scores, so they are not independent…

  • It is reasonable to believe that Grade(CS101) and SAT are conditionally independent given Intelligence


Bayesian Network

  • Explicitly represent independence among propositions

  • Notice that Intelligence is the “cause” of both Grade and SAT, and the causality is represented explicitly

P(I,G,S) = P(G,S|I) P(I)

= P(G|I) P(S|I) P(I)

[Figure: network Intelligence → Grade, Intelligence → SAT]

6 probabilities, instead of 11
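A short sketch of how this factored form is used: the joint is assembled from the three small factors. Only the structure P(G|I) P(S|I) P(I) comes from the slide; the CPT numbers below are invented placeholders, with binary variables for simplicity:

```python
# Joint assembled from the factored form P(I, G, S) = P(G|I) P(S|I) P(I).
P_I = {1: 0.3, 0: 0.7}               # P(Intelligence); placeholder numbers
P_G_given_I = {1: {1: 0.8, 0: 0.2},  # P(Grade | Intelligence)
               0: {1: 0.3, 0: 0.7}}
P_S_given_I = {1: {1: 0.9, 0: 0.1},  # P(SAT | Intelligence)
               0: {1: 0.2, 0: 0.8}}

def joint(i, g, s):
    """P(I=i, G=g, S=s), using conditional independence of G and S given I."""
    return P_G_given_I[i][g] * P_S_given_I[i][s] * P_I[i]

# With binary variables we store 1 + 2 + 2 = 5 free numbers instead of
# 2**3 - 1 = 7 for the raw joint, and the joint still sums to 1.
total = sum(joint(i, g, s) for i in (0, 1) for g in (0, 1) for s in (0, 1))
print(round(total, 6))  # 1.0
```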


Definition: Bayesian Network

  • Set of random variables X={X1,…,Xn} with domains Val(X1),…,Val(Xn)

  • Each node X has a set of parents Pa(X)

    • Graph must be a DAG

  • Each node also maintains a conditional probability distribution (often, a table)

    • P(X | Pa(X))

    • 2^k entries for a binary-valued variable with k parents

  • Overall: O(n·2^k) storage for binary variables (see the count below)

  • Encodes the joint probability over X1,…,Xn
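As a quick sanity check of the storage claim, the count below uses the burglary network that appears later in the lecture, assuming binary variables:

```python
# Each binary node with k parents needs 2**k stored probabilities (one per
# assignment of its parents), so the whole network costs O(n * 2**k),
# versus 2**n - 1 free parameters for the raw joint distribution.
parents = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}

bn_params = sum(2 ** len(pa) for pa in parents.values())
joint_params = 2 ** len(parents) - 1
print(bn_params, joint_params)  # 10 versus 31
```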


Calculation of Joint Probability

[Figure: burglary network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

P(jmabe) = ??



  • P(jmabe)= P(jm|a,b,e) P(abe)= P(j|a,b,e)  P(m|a,b,e)  P(abe)(J and M are independent given A)

  • P(j|a,b,e) = P(j|a)(J and Band Jand E are independent given A)

  • P(m|a,b,e) = P(m|a)

  • P(abe) = P(a|b,e)  P(b|e)  P(e) = P(a|b,e)  P(b)  P(e)(B and Eare independent)

  • P(jmabe) = P(j|a)P(m|a)P(a|b,e)P(b)P(e)


Calculation of Joint Probability

P(jmabe)= P(j|a)P(m|a)P(a|b,e)P(b)P(e)= 0.9 x 0.7 x 0.001 x 0.999 x 0.998= 0.00062


Calculation of Joint Probability

[Figure: the same burglary network]

P(x1 ∧ x2 ∧ … ∧ xn) = Π_{i=1,…,n} P(xi | pa(Xi))   (the full joint distribution)


Chain Rule for Bayes Nets

  • Joint distribution is a product of all CPTs

  • P(X1, X2, …, Xn) = Π_{i=1,…,n} P(Xi | Pa(Xi))
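A generic chain-rule evaluator as a short sketch; the CPT layout (a map from parent-value tuples to P(X=1)) is my own convention, not anything fixed by the slides:

```python
def joint_prob(assignment, parents, cpt):
    """Chain rule: multiply each node's CPT entry given its parents' values."""
    p = 1.0
    for var, value in assignment.items():
        pa_values = tuple(assignment[pa] for pa in parents[var])
        p_true = cpt[var][pa_values]          # P(var = 1 | parents)
        p *= p_true if value == 1 else 1.0 - p_true
    return p

# Burglary network with the usual textbook numbers.
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
cpt = {
    "B": {(): 0.001}, "E": {(): 0.002},
    "A": {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001},
    "J": {(1,): 0.90, (0,): 0.05},
    "M": {(1,): 0.70, (0,): 0.01},
}
print(joint_prob({"B": 0, "E": 0, "A": 1, "J": 1, "M": 1}, parents, cpt))
# ≈ 0.00062, as computed on the previous slide
```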


Example: Naïve Bayes Models

  • P(Cause, Effect1, …, Effectn) = P(Cause) Π_i P(Effecti | Cause)

[Figure: Cause → Effect1, Effect2, …, Effectn]


Advantages of Bayes Nets (and other graphical models)

  • More manageable # of parameters to set and store

  • Incremental modeling

  • Explicit encoding of independence assumptions

  • Efficient inference techniques


Arcs do not necessarily encode causality

[Figure: three 3-node networks over A, B, and C with different arc directions]

2 BNs with the same expressive power, and a 3rd with greater power (exercise)


Reading off independence relationships

  • Given B, does the value of A affect the probability of C?

    • P(C|B,A) = P(C|B)?

  • No!

  • C's parent (B) is given, so C is independent of its non-descendants (A)

  • Independence is symmetric: C ⊥ A | B ⇒ A ⊥ C | B

[Figure: three-node network over A, B, C, with B the parent of C]


Basic Rule

  • A node is independent of its non-descendants given its parents (and given nothing else)
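The rule can be applied mechanically. The sketch below (my own illustration on the burglary network, not course code) lists, for a given node, the variables it is independent of once its parents are observed:

```python
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

def descendants(node):
    """All nodes reachable from `node` by following arcs forward."""
    children = {v: [c for c, pa in parents.items() if v in pa] for v in parents}
    found, stack = set(), list(children[node])
    while stack:
        c = stack.pop()
        if c not in found:
            found.add(c)
            stack.extend(children[c])
    return found

def independent_given_parents(node):
    """Everything that is neither the node, a parent, nor a descendant."""
    exclude = {node} | set(parents[node]) | descendants(node)
    return sorted(set(parents) - exclude)

print(independent_given_parents("J"))  # ['B', 'E', 'M']: J ⊥ B, E, M | A
```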


What does the BN encode?

[Figure: the burglary network]

Burglary ⊥ Earthquake

JohnCalls ⊥ MaryCalls | Alarm

JohnCalls ⊥ Burglary | Alarm

JohnCalls ⊥ Earthquake | Alarm

MaryCalls ⊥ Burglary | Alarm

MaryCalls ⊥ Earthquake | Alarm

A node is independent of its non-descendants, given its parents


Reading off independence relationships

  • How about Burglary ⊥ Earthquake | Alarm?

  • No! Why?


Reading off independence relationships

  • How about Burglary ⊥ Earthquake | Alarm?

  • No! Why?

  • P(BE|A) = P(A|B,E)P(BE)/P(A) = 0.00075

  • P(B|A)P(E|A) = 0.086
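These numbers can be reproduced by brute force: enumerate the joint over Burglary, Earthquake, and Alarm with the usual textbook CPTs (JohnCalls and MaryCalls hang off Alarm alone, so they sum out to 1):

```python
from itertools import product

P_alarm = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def joint(b, e, a):
    """P(B=b, E=e, A=a) with the textbook priors P(b)=0.001, P(e)=0.002."""
    pb = 0.001 if b else 0.999
    pe = 0.002 if e else 0.998
    pa = P_alarm[(b, e)] if a else 1 - P_alarm[(b, e)]
    return pb * pe * pa

p_a = sum(joint(b, e, 1) for b, e in product((0, 1), repeat=2))
p_be_a = joint(1, 1, 1) / p_a                       # P(b, e | a)
p_b_a = sum(joint(1, e, 1) for e in (0, 1)) / p_a   # P(b | a)
p_e_a = sum(joint(b, 1, 1) for b in (0, 1)) / p_a   # P(e | a)
print(round(p_be_a, 6), round(p_b_a * p_e_a, 3))    # 0.000755 vs 0.086
```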


Reading off independence relationships

  • How about Burglary ⊥ Earthquake | JohnCalls?

  • No! Why?

  • Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent


Independence relationships

  • For polytrees, there exists a unique undirected path between A and B. For each node E on the path:

    • Evidence on the directed road X → E → Y or X ← E ← Y makes X and Y independent

    • Evidence on a common cause X ← E → Y makes X and Y independent

    • Evidence on a "V" node, or below the V (X → E ← Y, or X → W ← Y with a directed path W → … → E), makes X and Y dependent (otherwise they are independent)


General case

  • Formal property in general case:

    • D-separation : the above properties hold for all (acyclic) paths between A and B

    • D-separation ⇒ independence

  • That is, we can’t read off any more independence relationships from the graph than those that are encoded in D-separation

    • The CPTs may indeed encode additional independences
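To make d-separation concrete, here is a small checker, my own sketch rather than course code. It enumerates the undirected paths between two nodes and applies the blocking rules above: a chain or fork node blocks a path when observed; a collider blocks it unless the collider or one of its descendants is observed:

```python
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

def descendants(n):
    kids = [c for c, pa in parents.items() if n in pa]
    out = set(kids)
    for k in kids:
        out |= descendants(k)
    return out

def paths(x, y, visited):
    """All undirected paths from x to y without repeating nodes."""
    if x == y:
        return [[y]]
    nbrs = set(parents[x]) | {c for c, pa in parents.items() if x in pa}
    found = []
    for n in nbrs - visited:
        found += [[x] + rest for rest in paths(n, y, visited | {n})]
    return found

def blocked(path, z):
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = prev in parents[node] and nxt in parents[node]
        if collider:
            if node not in z and not (descendants(node) & z):
                return True   # unobserved collider blocks the path
        elif node in z:
            return True       # observed chain/fork node blocks the path
    return False

def d_separated(x, y, z):
    return all(blocked(p, set(z)) for p in paths(x, y, {x}))

print(d_separated("J", "M", {"A"}))  # True:  JohnCalls ⊥ MaryCalls | Alarm
print(d_separated("B", "E", {"A"}))  # False: observing the collider Alarm
```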


Probability Queries

  • Given: some probabilistic model over variables X

  • Find: the distribution over Y ⊆ X given evidence E = e for some subset E ⊆ X \ Y

    • P(Y|E=e)

  • Inference problem


Answering Inference Problems with the Joint Distribution

  • Easiest case: Y = X \ E

    • P(Y | E=e) = P(Y, e) / P(e)

    • Denominator makes the probabilities sum to 1

    • Determine P(e) by marginalizing: P(e) = Σ_y P(Y=y, e)

  • Otherwise, let Z = X \ (E ∪ Y)

    • P(Y | E=e) = Σ_z P(Y, Z=z, e) / P(e)

    • P(e) = Σ_y Σ_z P(Y=y, Z=z, e)

  • Inference with the joint distribution: O(2^|X \ E|) for binary variables (see the enumeration sketch below)
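A direct implementation of this enumeration scheme for the burglary network; the CPT layout and names are my own choices, and the cost is the O(2^|X \ E|) just noted:

```python
from itertools import product

parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
cpt = {
    "B": {(): 0.001}, "E": {(): 0.002},
    "A": {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001},
    "J": {(1,): 0.90, (0,): 0.05},
    "M": {(1,): 0.70, (0,): 0.01},
}

def joint(x):
    """Chain rule over one full assignment x (a dict var -> 0/1)."""
    p = 1.0
    for v, val in x.items():
        pt = cpt[v][tuple(x[pa] for pa in parents[v])]
        p *= pt if val else 1 - pt
    return p

def query(y_var, evidence):
    """P(y_var | evidence), summing the joint over the hidden variables."""
    hidden = [v for v in parents if v != y_var and v not in evidence]
    dist = {}
    for y_val in (0, 1):
        dist[y_val] = sum(
            joint({**evidence, y_var: y_val, **dict(zip(hidden, vals))})
            for vals in product((0, 1), repeat=len(hidden))
        )
    z = sum(dist.values())            # this is P(e), the normalizer
    return {k: v / z for k, v in dist.items()}

print(query("B", {"J": 1, "M": 1}))  # P(B | j, m) ≈ 0.284 with these CPTs
```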


Naïve Bayes Classifier

  • P(Class, Feature1, …, Featuren) = P(Class) Π_i P(Featurei | Class)

Given features, what class?

P(C | F1, …, Fn) = P(C, F1, …, Fn) / P(F1, …, Fn)

= (1/Z) P(C) Π_i P(Fi | C)

[Figure: Class → Feature1, Feature2, …, Featuren; example classes: Spam / Not Spam, English / French / Latin; features: word occurrences]


Naïve Bayes Classifier

  • P(Class, Feature1, …, Featuren) = P(Class) Π_i P(Featurei | Class)

Given some features, what is the distribution over the class?

P(C | F1, …, Fk) = (1/Z) P(C, F1, …, Fk)

= (1/Z) Σ_{f_{k+1},…,f_n} P(C, F1, …, Fk, f_{k+1}, …, f_n)

= (1/Z) P(C) Σ_{f_{k+1},…,f_n} Π_{i=1…k} P(Fi | C) Π_{j=k+1…n} P(f_j | C)

= (1/Z) P(C) Π_{i=1…k} P(Fi | C) Π_{j=k+1…n} Σ_{f_j} P(f_j | C)

= (1/Z) P(C) Π_{i=1…k} P(Fi | C)   (each Σ_{f_j} P(f_j | C) = 1)
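A toy spam-filter version of this computation; the classes, words, and probabilities below are made-up placeholders, illustrating only the 1/Z normalization and the product over observed features:

```python
P_class = {"spam": 0.4, "ham": 0.6}
P_word = {                      # P(word appears | class); invented numbers
    "spam": {"offer": 0.30, "meeting": 0.02},
    "ham":  {"offer": 0.01, "meeting": 0.20},
}

def posterior(observed_words):
    """P(Class | observed words); unobserved words sum out to 1."""
    scores = {}
    for c, prior in P_class.items():
        p = prior
        for w in observed_words:
            p *= P_word[c][w]
        scores[c] = p
    z = sum(scores.values())    # the 1/Z normalization
    return {c: s / z for c, s in scores.items()}

print(posterior(["offer"]))  # spam ≈ 0.952 with these made-up numbers
```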


For General Queries

  • For BNs and queries in general, it’s not that simple… more in later lectures.

  • Next class: skim 5.1-3, begin reading 9.1-4

