# CS b553: Algorithms for Optimization and Learning


### CS b553: Algorithms for Optimization and Learning

Bayesian Networks

Agenda

• Bayesian networks

• Chain rule for Bayes nets

• Naïve Bayes models

• Independence declarations

• D-separation

• Probabilistic inference queries

Purposes of Bayesian Networks

• Efficient and intuitive modeling of complex causal interactions

• Compact representation of joint distributions: O(n) rather than O(2^n)

• Algorithms for efficient inference with given evidence (more on this next time)

• Two random variables A and B are independent if P(A, B) = P(A) P(B), hence P(A|B) = P(A)

• Knowing B doesn't give you any information about A

• [This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent]

• If A and B are independent, then P(A,B) = P(A) P(B)

• => The joint distribution over A and B can be defined as a product of the distribution of A and the distribution of B

• => Store two much smaller probability tables rather than one large probability table over all combinations of A and B (see the sketch below)
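
To make the storage point concrete, here is a minimal sketch (the probability values are made-up illustrations, not from the lecture):

```python
from itertools import product

# Two independent binary variables A and B: store only their marginals
# (illustrative made-up values), not the full joint table.
P_A = {True: 0.3, False: 0.7}   # P(A)
P_B = {True: 0.6, False: 0.4}   # P(B)

# Recover any joint entry on demand: P(A=a, B=b) = P(A=a) P(B=b).
joint = {(a, b): P_A[a] * P_B[b] for a, b in product((True, False), repeat=2)}
print(joint)  # four joint entries, reconstructed from the two marginal tables
```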

• Two random variables A and B are conditionally independent given C if P(A, B | C) = P(A|C) P(B|C), hence P(A | B, C) = P(A|C)

• Once you know C, learning B doesn't give you any information about A

• [Again, this has to hold for all combinations of values that A, B, and C can take on]

• Consider Grade(CS101), Intelligence, and SAT

• Ostensibly, the grade in a course doesn’t have a direct relationship with SAT scores

• but good students are more likely to get good SAT scores, so they are not independent…

• It is reasonable to believe that Grade(CS101) and SAT are conditionally independent given Intelligence

Bayesian Network

• Explicitly represent independence among propositions

• Notice that Intelligence is the “cause” of both Grade and SAT, and the causality is represented explicitly

P(I,G,S) = P(G,S|I) P(I)

= P(G|I) P(S|I) P(I)

[Figure: network with Intelligence as the parent of both Grade and SAT]
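
A minimal sketch of this factorization in code; all CPT numbers below are made-up illustrations, not values from the lecture:

```python
# Factored joint P(I, G, S) = P(G|I) P(S|I) P(I); I = intelligent,
# G = good grade, S = high SAT (all binary, all numbers illustrative).
P_I = {True: 0.3, False: 0.7}
P_G = {True: 0.9, False: 0.5}    # P(G=true | I)
P_S = {True: 0.8, False: 0.2}    # P(S=true | I)

def joint(i, g, s):
    pg = P_G[i] if g else 1 - P_G[i]
    ps = P_S[i] if s else 1 - P_S[i]
    return pg * ps * P_I[i]

# G and S are conditionally independent given I by construction,
# yet marginally dependent: knowing S=true raises P(G=true).
print(joint(True, True, True))   # P(i, g, s) = 0.9 * 0.8 * 0.3 = 0.216
```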

Definition: bayesian network

• Set of random variables X={X1,…,Xn} with domains Val(X1),…,Val(Xn)

• Each node has a set of parents Pa(X)

• Graph must be a DAG

• Each node also maintains a conditional probability distribution (often, a table)

• P(X | Pa(X))

• 2^k entries for a binary-valued variable with k parents

• Overall: O(n·2^k) storage for binary variables

• Encodes the joint probability over X1,…,Xn
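
As an illustration of this storage bound, here is a quick count for the five-node burglary network introduced below, assuming 2^k entries per node as in the definition above:

```python
# Parameter count for a Bayesian net over binary variables:
# each node needs 2^k entries, where k is its number of parents.
parents = {"Burglary": [], "Earthquake": [],
           "Alarm": ["Burglary", "Earthquake"],
           "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"]}

bn_entries = sum(2 ** len(p) for p in parents.values())
joint_entries = 2 ** len(parents) - 1
print(bn_entries, joint_entries)  # 10 entries vs. 31 for the full joint
```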

[Figure: the burglary network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

Calculation of Joint Probability

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = ?


• P(j∧m∧a∧¬b∧¬e) = P(j∧m | a,¬b,¬e) P(a∧¬b∧¬e) = P(j | a,¬b,¬e) · P(m | a,¬b,¬e) · P(a∧¬b∧¬e)   (J and M are independent given A)

• P(j | a,¬b,¬e) = P(j|a)   (J and B, and J and E, are independent given A)

• P(m | a,¬b,¬e) = P(m|a)

• P(a∧¬b∧¬e) = P(a | ¬b,¬e) · P(¬b | ¬e) · P(¬e) = P(a | ¬b,¬e) · P(¬b) · P(¬e)   (B and E are independent)

• P(j∧m∧a∧¬b∧¬e) = P(j|a) P(m|a) P(a | ¬b,¬e) P(¬b) P(¬e)


P(j∧m∧a∧¬b∧¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e) = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00062

P(x1 ∧ x2 ∧ … ∧ xn) = Πi=1,…,n P(xi | pa(Xi)) → the full joint distribution


• Joint distribution is a product of all CPTs

• P(X1, X2, …, Xn) = Πi=1,…,n P(Xi | Pa(Xi))
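
A minimal sketch that evaluates this product with the CPT entries quoted above (variable names are mine):

```python
# Evaluate one row of the joint by multiplying CPT entries
# (the chain rule for the burglary network, numbers from the slide).
p_j_given_a    = 0.90   # P(J=true | A=true)
p_m_given_a    = 0.70   # P(M=true | A=true)
p_a_given_nbne = 0.001  # P(A=true | B=false, E=false)
p_not_b        = 0.999  # P(B=false)
p_not_e        = 0.998  # P(E=false)

joint = p_j_given_a * p_m_given_a * p_a_given_nbne * p_not_b * p_not_e
print(f"P(j, m, a, ¬b, ¬e) = {joint:.6f}")  # the slide rounds this to 0.00062
```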

Example: Naïve Bayes Models

• P(Cause, Effect1, …, Effectn) = P(Cause) Πi P(Effecti | Cause)

[Figure: Cause node with arcs to Effect1, Effect2, …, Effectn]

• More manageable # of parameters to set and store

• Incremental modeling

• Explicit encoding of independence assumptions

• Efficient inference techniques

Arcs do not necessarily encode causality

[Figure: three candidate networks over nodes A, B, and C]

2 BNs with the same expressive power, and a 3rd with greater power (exercise)

• Given B, does the value of A affect the probability of C?

• No: P(C | B, A) = P(C | B)

• C's parent (B) is given, so C is independent of its non-descendants (A)

• Independence is symmetric: C ⊥ A | B => A ⊥ C | B

[Figure: chain A → B → C]

• A node is independent of its non-descendants given its parents (and given nothing else)


What does the BN encode?

Burglary ⊥ Earthquake

JohnCalls ⊥ MaryCalls | Alarm

JohnCalls ⊥ Burglary | Alarm

JohnCalls ⊥ Earthquake | Alarm

MaryCalls ⊥ Burglary | Alarm

MaryCalls ⊥ Earthquake | Alarm

A node is independent of its non-descendants, given its parents



• How about Burglary ⊥ Earthquake | Alarm?

• No! Why?

• P(BE|A) = P(A|B,E)P(BE)/P(A) = 0.00075

• P(B|A)P(E|A) = 0.086
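
The sketch below reproduces both numbers by summing the joint. The slides only quote the results, so the remaining CPT entries are assumed to be the standard textbook values for this network:

```python
from itertools import product

# Assumed CPTs (standard textbook values; the slides only quote the results).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,   # P(A=true | B, E)
       (False, True): 0.29, (False, False): 0.001}

def joint(b, e, a):
    """P(B=b, E=e, A=a) by the chain rule."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    return pb * pe * pa

p_a = sum(joint(b, e, True) for b, e in product((True, False), repeat=2))
p_be_a = joint(True, True, True) / p_a                        # P(b,e | a)
p_b_a = sum(joint(True, e, True) for e in (True, False)) / p_a
p_e_a = sum(joint(b, True, True) for b in (True, False)) / p_a

# 0.00076 vs 0.086 (the slide rounds the first to 0.00075)
print(round(p_be_a, 5), round(p_b_a * p_e_a, 3))
```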


• How about Burglary ⊥ Earthquake | JohnCalls?

• No! Why?

• Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent

• For polytrees, there exists a unique undirected path between A and B. For each node on the path:

• Evidence on a node E in a directed chain X → E → Y or X ← E ← Y makes X and Y independent

• Evidence on a common cause E in X ← E → Y makes its descendants X and Y independent

• Evidence on a "V" node or below the V (X → E ← Y, or X → W ← Y with a directed path W → … → E) makes X and Y dependent (otherwise they are independent); see the toy sketch below
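
A toy encoding of these three cases; the function and argument names are mine, purely illustrative:

```python
# D-separation status of a single triple X - E - Y along a path.
def path_blocked(kind, e_observed, e_descendant_observed=False):
    """Return True if this triple blocks the path between X and Y."""
    if kind in ("chain", "fork"):     # X → E → Y, X ← E ← Y, or X ← E → Y
        return e_observed             # evidence on E blocks the path
    if kind == "collider":            # X → E ← Y (the "V" node)
        # Blocked unless E or one of its descendants is observed.
        return not (e_observed or e_descendant_observed)
    raise ValueError(f"unknown triple kind: {kind}")

assert path_blocked("chain", e_observed=True)
assert path_blocked("fork", e_observed=True)
assert path_blocked("collider", e_observed=False)      # blocked by default
assert not path_blocked("collider", e_observed=True)   # evidence unblocks it
```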

• Formal property in general case:

• D-separation: the above properties hold for all (acyclic) paths between A and B

• D-separation => independence

• That is, we can’t read off any more independence relationships from the graph than those that are encoded in D-separation

• The CPTs may indeed encode additional independences

• Given: some probabilistic model over variables X

• Find: distribution over Y ⊆ X given evidence E = e for some subset E ⊆ X \ Y

• P(Y|E=e)

• Inference problem

• Easiest case: Y = X \ E

• P(Y|E=e) = P(Y,e)/P(e)

• Denominator makes the probabilities sum to 1

• Determine P(e) by marginalizing: P(e) = Σy P(Y=y, e)

• Otherwise, let Z = X \ (E ∪ Y)

• P(Y | E=e) = Σz P(Y, Z=z, e) / P(e)

• P(e) = Σy Σz P(Y=y, Z=z, e)

• Inference with the joint distribution: O(2^|X \ E|) for binary variables (see the sketch below)
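
A minimal inference-by-enumeration sketch for the query P(Burglary | JohnCalls = true, MaryCalls = true), summing out Z = {Earthquake, Alarm}. CPT entries not quoted on the slides are assumed to be the standard textbook values:

```python
from itertools import product

# CPTs for the burglary network (unquoted values assumed from the textbook).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(A=true | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                    # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                    # P(M=true | A)

def joint(b, e, a, j, m):
    """Product of all CPTs: the chain rule for this network."""
    v = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    v *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    v *= P_J[a] if j else 1 - P_J[a]
    v *= P_M[a] if m else 1 - P_M[a]
    return v

# P(B | j, m) = (1/P(j, m)) * sum over z of P(B, Z=z, j, m)
unnorm = {b: sum(joint(b, e, a, True, True)
                 for e, a in product((True, False), repeat=2))
          for b in (True, False)}
norm = sum(unnorm.values())                     # P(j, m), the normalizer
print({b: round(p / norm, 3) for b, p in unnorm.items()})
# {True: 0.284, False: 0.716}
```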

P(C | F1, …, Fn) = P(C, F1, …, Fn) / P(F1, …, Fn)

= (1/Z) P(C) Πi P(Fi | C)

Given features, what class?

Naïve Bayes Classifier

• P(Class, Feature1, …, Featuren) = P(Class) Πi P(Featurei | Class)

[Figure: Class node (e.g., Spam / Not Spam, or English / French / Latin) with arcs to Feature1, Feature2, …, Featuren, the word occurrences]


Given some features, what is the distribution over class?

P(C | F1, …, Fk) = (1/Z) P(C, F1, …, Fk)

= (1/Z) Σfk+1,…,fn P(C, F1, …, Fk, fk+1, …, fn)

= (1/Z) P(C) Σfk+1,…,fn Πi=1…k P(Fi|C) Πj=k+1…n P(fj|C)

= (1/Z) P(C) Πi=1…k P(Fi|C) Πj=k+1…n Σfj P(fj|C)

= (1/Z) P(C) Πi=1…k P(Fi|C)   (each Σfj P(fj|C) = 1, so unobserved features drop out)
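
A minimal classifier sketch following this derivation; the priors and word probabilities are made-up illustrative numbers, and unobserved words never enter the product, just as the sums above collapse to 1:

```python
import math

# Naive Bayes classification with made-up illustrative parameters:
# P(C | observed features) ∝ P(C) * product of P(F_i | C) over observed F_i.
prior = {"spam": 0.4, "ham": 0.6}
p_word = {  # P(word present | class)
    "spam": {"offer": 0.30, "meeting": 0.02, "free": 0.25},
    "ham":  {"offer": 0.03, "meeting": 0.20, "free": 0.02},
}

def posterior(observed_words):
    # Work in log space for numerical stability with many features.
    log_score = {c: math.log(prior[c]) +
                    sum(math.log(p_word[c][w]) for w in observed_words)
                 for c in prior}
    z = sum(math.exp(s) for s in log_score.values())   # the 1/Z normalizer
    return {c: math.exp(s) / z for c, s in log_score.items()}

print(posterior(["offer", "free"]))  # strongly favors "spam"
```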

For General Queries

• For BNs and queries in general, it’s not that simple… more in later lectures.

• Next class: skim 5.1-3, begin reading 9.1-4