CS b553: Algorithms for Optimization and Learning

Bayesian Networks

- Bayesian networks
- Chain rule for Bayes nets
- Naïve Bayes models

- Independence declarations
- D-separation

- Probabilistic inference queries

- Efficient and intuitive modeling of complex causal interactions
- Compact representation of joint distributions: $O(n)$ rather than $O(2^n)$
- Algorithms for efficient inference with given evidence (more on this next time)

- Two random variables A and B are independent if P(A,B) = P(A) P(B), and hence P(A|B) = P(A)
- Knowing B doesn’t give you any information about A
- [This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent]

- If A and B are independent, then P(A,B) = P(A) P(B)
- => The joint distribution over A and B can be defined as the product of the distribution of A and the distribution of B
- => Store two much smaller probability tables rather than one large probability table over all combinations of A and B
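The storage saving can be made concrete with a small sketch. The marginals below are made-up numbers chosen only for illustration; the point is that the joint table is rebuilt from the two small tables, and that conditioning on B leaves the distribution of A unchanged.

```python
from itertools import product

# Hypothetical marginals for two independent variables (illustrative numbers only).
p_a = {"a0": 0.2, "a1": 0.5, "a2": 0.3}
p_b = {"b0": 0.6, "b1": 0.4}

# Under independence, every joint entry is the product of the two marginals.
joint = {(a, b): p_a[a] * p_b[b] for a, b in product(p_a, p_b)}

# Numbers stored: |A| + |B| for the two small tables vs |A| * |B| for the joint.
factored_size = len(p_a) + len(p_b)
joint_size = len(joint)

def conditional_a_given_b(b):
    """Recover P(A | B=b) from the joint table."""
    p_b_val = sum(joint[(a, b)] for a in p_a)
    return {a: joint[(a, b)] / p_b_val for a in p_a}

print(factored_size, joint_size)
print(conditional_a_given_b("b0"))
```

With more variables the gap widens quickly, since the joint table grows multiplicatively while the factored tables grow additively.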

- Two random variables A and B are conditionally independent given C if P(A,B|C) = P(A|C) P(B|C), and hence P(A|B,C) = P(A|C)
- Once you know C, learning B doesn’t give you any information about A
- [Again, this has to hold for all combinations of values that A, B, C can take on]

- Consider Grade(CS101), Intelligence, and SAT
- Ostensibly, the grade in a course doesn’t have a direct relationship with SAT scores
- but good students are more likely to get good SAT scores, so they are not independent…
- It is reasonable to believe that Grade(CS101) and SAT are conditionally independent given Intelligence

- Explicitly represent independence among propositions
- Notice that Intelligence is the “cause” of both Grade and SAT, and the causality is represented explicitly

P(I,G,S) = P(G,S|I) P(I)

= P(G|I) P(S|I) P(I)

[Figure: Intelligence is the parent of both Grade and SAT]

6 probabilities, instead of 11
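As a rough numerical check of this factorization, the sketch below builds the joint from P(I), P(G|I), and P(S|I). The slides give no numbers for this example, so the CPT values and the choice of binary variables are assumptions. It shows that Grade and SAT are dependent marginally but independent once Intelligence is known.

```python
from itertools import product

# Hypothetical CPTs (all variables binary; True = high intelligence / good grade / good SAT).
p_i = {True: 0.3, False: 0.7}
p_g_given_i = {True: 0.9, False: 0.4}   # P(G=True | I)
p_s_given_i = {True: 0.8, False: 0.2}   # P(S=True | I)

def bern(p_true, value):
    """Probability of a binary value given P(value=True)."""
    return p_true if value else 1.0 - p_true

def joint(i, g, s):
    """P(I,G,S) = P(G|I) P(S|I) P(I)."""
    return p_i[i] * bern(p_g_given_i[i], g) * bern(p_s_given_i[i], s)

def prob(**fixed):
    """Sum the joint over all assignments consistent with the fixed values."""
    total = 0.0
    for i, g, s in product([True, False], repeat=3):
        world = {"i": i, "g": g, "s": s}
        if all(world[k] == v for k, v in fixed.items()):
            total += joint(i, g, s)
    return total

# Marginally, Grade and SAT are dependent: P(g, s) != P(g) P(s).
p_gs = prob(g=True, s=True)
p_g_times_p_s = prob(g=True) * prob(s=True)

# Given Intelligence they are independent: P(g, s | i) == P(g | i) P(s | i).
p_gs_given_i = prob(g=True, s=True, i=True) / prob(i=True)
p_g_i_times_p_s_i = (prob(g=True, i=True) / prob(i=True)) * (prob(s=True, i=True) / prob(i=True))

print(p_gs, p_g_times_p_s)
print(p_gs_given_i, p_g_i_times_p_s_i)
```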

- Set of random variables X={X1,…,Xn} with domains Val(X1),…,Val(Xn)
- Each node has a set of parents PaX
- Graph must be a DAG

- Each node also maintains a conditional probability distribution (often, a table)
- P(X|PaX)
- About $2^k$ entries for a binary-valued variable with $k$ parents (one probability per assignment of the parents)

- Overall: $O(n 2^k)$ storage for $n$ binary variables with at most $k$ parents each
- Encodes the joint probability over X1,…,Xn
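A quick way to see the storage difference is to count independent probabilities. The helper names below are hypothetical, and the count assumes binary variables with one stored probability per parent assignment. The example graph is the five-node burglary network, with the parent sets read off the factorization used in this lecture.

```python
def full_joint_params(n):
    """Independent probabilities in a full joint table over n binary variables."""
    return 2 ** n - 1

def bayes_net_params(parents):
    """Independent probabilities in a Bayes net over binary variables.

    `parents` maps each node to a list of its parents; a node with k parents
    needs one probability per parent assignment, i.e. 2**k numbers.
    """
    return sum(2 ** len(pa) for pa in parents.values())

burglary_parents = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}

print(bayes_net_params(burglary_parents), full_joint_params(len(burglary_parents)))
```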

[Figure: burglary network. Burglary and Earthquake are the parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls]

$P(j, m, a, \neg b, \neg e) = ??$

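- $P(j, m, a, \neg b, \neg e) = P(j, m \mid a, \neg b, \neg e)\, P(a, \neg b, \neg e) = P(j \mid a, \neg b, \neg e)\, P(m \mid a, \neg b, \neg e)\, P(a, \neg b, \neg e)$ (J and M are independent given A)
- $P(j \mid a, \neg b, \neg e) = P(j \mid a)$ (J and B, and J and E, are independent given A)
- $P(m \mid a, \neg b, \neg e) = P(m \mid a)$
- $P(a, \neg b, \neg e) = P(a \mid \neg b, \neg e)\, P(\neg b \mid \neg e)\, P(\neg e) = P(a \mid \neg b, \neg e)\, P(\neg b)\, P(\neg e)$ (B and E are independent)
- $P(j, m, a, \neg b, \neg e) = P(j \mid a)\, P(m \mid a)\, P(a \mid \neg b, \neg e)\, P(\neg b)\, P(\neg e)$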

$P(j, m, a, \neg b, \neg e) = P(j \mid a)\, P(m \mid a)\, P(a \mid \neg b, \neg e)\, P(\neg b)\, P(\neg e) = 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.00063$

$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid pa_{X_i})$ (full joint distribution)

- Joint distribution is a product of all CPTs
- $P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa_{X_i})$
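The chain rule translates almost directly into code. The sketch below is one possible encoding, not the course's implementation. The `network` layout and the function name `joint_prob` are made up for illustration. The CPT values are the usual textbook numbers for the burglary example and are assumed here; only five of them appear in the product evaluated in these slides.

```python
from itertools import product

# Each node maps to (parents, cpt), where cpt[parent_values] = P(node=True | parents).
# Assumed textbook CPT values for the burglary example.
network = {
    "B": ((), {(): 0.001}),
    "E": ((), {(): 0.002}),
    "A": (("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (("A",), {(True,): 0.90, (False,): 0.05}),
    "M": (("A",), {(True,): 0.70, (False,): 0.01}),
}

def joint_prob(net, assignment):
    """Chain rule for Bayes nets: multiply P(x_i | pa_Xi) over all nodes."""
    p = 1.0
    for node, (parents, cpt) in net.items():
        p_true = cpt[tuple(assignment[pa] for pa in parents)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

# Probability of j, m, a with no burglary and no earthquake.
p_example = joint_prob(network, {"J": True, "M": True, "A": True, "B": False, "E": False})

# Sanity check: the joint sums to 1 over all 2^5 assignments.
names = list(network)
total = sum(joint_prob(network, dict(zip(names, values)))
            for values in product([True, False], repeat=len(names)))

print(f"{p_example:.6f}", total)
```

Storing only P(node=True | parents) halves each table for binary variables; a multi-valued variable would need a full distribution per parent assignment instead.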

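- $P(\text{Cause}, \text{Effect}_1, \ldots, \text{Effect}_n) = P(\text{Cause}) \prod_{i} P(\text{Effect}_i \mid \text{Cause})$

[Figure: Cause is the parent of Effect1, Effect2, …, Effectn]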

- More manageable # of parameters to set and store
- Incremental modeling
- Explicit encoding of independence assumptions
- Efficient inference techniques

[Figure: three Bayesian networks over A, B, and C]

2 BNs with the same expressive power, and a 3rd with greater power (exercise)

- Given B, does the value of A affect the probability of C?
- P(C|B,A) = P(C|B)?

- No!
- C’s parent (B) is given, so C is independent of its non-descendants (A)
- Independence is symmetric: $C \perp A \mid B \Rightarrow A \perp C \mid B$
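The claim can be checked numerically on a small example. The sketch below assumes a chain A → B → C, which is one structure consistent with B being C's parent and A a non-descendant. The CPT numbers are hypothetical. It compares P(C | B, A) for both values of A against P(C | B).

```python
# Hypothetical CPTs for a chain A -> B -> C over binary variables.
p_a = 0.3
p_b_given_a = {True: 0.8, False: 0.1}   # P(B=True | A)
p_c_given_b = {True: 0.6, False: 0.2}   # P(C=True | B)

def bern(p_true, value):
    """Probability of a binary value given P(value=True)."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c):
    """P(A,B,C) = P(A) P(B|A) P(C|B)."""
    return bern(p_a, a) * bern(p_b_given_a[a], b) * bern(p_c_given_b[b], c)

def p_c_given(b, a=None):
    """P(C=True | B=b) if a is None, otherwise P(C=True | B=b, A=a)."""
    a_values = [True, False] if a is None else [a]
    num = sum(joint(av, b, True) for av in a_values)
    den = sum(joint(av, b, cv) for av in a_values for cv in [True, False])
    return num / den

for b in [True, False]:
    print(b, p_c_given(b, a=True), p_c_given(b, a=False), p_c_given(b))
```

All three numbers on each printed line agree up to floating-point rounding: once B is fixed, A carries no further information about C.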

[Figure: network over A, B, and C in which B is the parent of C]

- A node is independent of its non-descendants given its parents (and given nothing else)

Burglary $\perp$ Earthquake

JohnCalls $\perp$ MaryCalls | Alarm

JohnCalls $\perp$ Burglary | Alarm

JohnCalls $\perp$ Earthquake | Alarm

MaryCalls $\perp$ Burglary | Alarm

MaryCalls $\perp$ Earthquake | Alarm

A node is independent of its non-descendants, given its parents

- How about Burglary $\perp$ Earthquake | Alarm?
- No! Why?
- $P(b, e \mid a) = P(a \mid b, e)\, P(b, e) / P(a) \approx 0.00075$
- $P(b \mid a)\, P(e \mid a) \approx 0.086$
- The two values differ, so Burglary and Earthquake are not independent given Alarm

- How about Burglary $\perp$ Earthquake | JohnCalls?
- No! Why?
- Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent
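Both "No! Why?" answers can be checked by brute-force enumeration. The sketch below assumes the usual textbook CPT values for the burglary network, which these slides do not list in full. It compares P(b, e | evidence) with P(b | evidence) P(e | evidence) for evidence on Alarm, on JohnCalls, and for no evidence at all.

```python
from itertools import product

# Assumed textbook CPTs for the burglary network (P(var=True | parents)).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def bern(p_true, value):
    """Probability of a binary value given P(value=True)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """Full joint via the chain rule for this network."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

def prob(**fixed):
    """Marginal probability of the fixed values, summing out everything else."""
    total = 0.0
    for values in product([True, False], repeat=5):
        world = dict(zip("beajm", values))
        if all(world[k] == v for k, v in fixed.items()):
            total += joint(**world)
    return total

def compare(evidence):
    """Return (P(b, e | evidence), P(b | evidence) * P(e | evidence))."""
    p_ev = prob(**evidence)
    p_be = prob(b=True, e=True, **evidence) / p_ev
    p_b = prob(b=True, **evidence) / p_ev
    p_e = prob(e=True, **evidence) / p_ev
    return p_be, p_b * p_e

given_alarm = compare({"a": True})
given_john = compare({"j": True})
no_evidence = compare({})

print(given_alarm, given_john, no_evidence)
```

With no evidence the two quantities coincide, which is the marginal independence of Burglary and Earthquake. With evidence on Alarm or on its descendant JohnCalls they do not.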

- For polytrees, there exists a unique undirected path between A and B. For each node on the path:
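- Evidence on a directed chain $X \to E \to Y$ or $X \leftarrow E \leftarrow Y$ makes X and Y independent
- Evidence on a common cause $X \leftarrow E \to Y$ makes the descendants X and Y independent
- Evidence on a “V” node, or below the V ($X \to E \leftarrow Y$, or $X \to W \leftarrow Y$ with $W \to \cdots \to E$), makes X and Y dependent (otherwise they are independent)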

- Formal property in general case:
- D-separation: the above properties hold for all (acyclic) paths between A and B
- D-separation $\Rightarrow$ independence
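For general DAGs, checking every path by hand gets tedious, so a reachability search is commonly used instead. The sketch below is a standard "Bayes ball" style procedure, not necessarily the one intended in this lecture. The function name and graph encoding are hypothetical. It assumes neither endpoint is itself observed.

```python
def d_separated(parents, x, y, observed):
    """True if x and y are d-separated given the set `observed`.

    `parents` maps every node of the DAG to a list of its parents.
    """
    observed = set(observed)
    children = {n: [] for n in parents}
    for node, pas in parents.items():
        for pa in pas:
            children[pa].append(node)

    # Observed nodes and their ancestors: a "V" node is active iff it is in this set.
    active_v, stack = set(), list(observed)
    while stack:
        n = stack.pop()
        if n not in active_v:
            active_v.add(n)
            stack.extend(parents[n])

    # Search over (node, direction): "up" = reached from a child, "down" = reached from a parent.
    visited, stack = set(), [(x, "up")]
    while stack:
        node, direction = stack.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node == y and node not in observed:
            return False  # found an active path
        if direction == "up" and node not in observed:
            stack.extend((pa, "up") for pa in parents[node])      # chain, continuing upward
            stack.extend((ch, "down") for ch in children[node])   # common cause
        elif direction == "down":
            if node not in observed:
                stack.extend((ch, "down") for ch in children[node])  # chain, continuing downward
            if node in active_v:
                stack.extend((pa, "up") for pa in parents[node])     # activated "V" node
    return True

burglary = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}
print(d_separated(burglary, "Burglary", "Earthquake", []))
print(d_separated(burglary, "Burglary", "Earthquake", ["JohnCalls"]))
```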

- That is, we can’t read off any more independence relationships from the graph than those that are encoded in D-separation
- The CPTs may indeed encode additional independences

- Given: some probabilistic model over variables X
- Find: distribution over $Y \subseteq X$ given evidence $E = e$ for some subset $E \subseteq X \setminus Y$
- P(Y|E=e)

- Inference problem

- Easiest case: $Y = X \setminus E$
- P(Y|E=e) = P(Y,e)/P(e)
- Denominator makes the probabilities sum to 1
- Determine P(e) by marginalizing: $P(e) = \sum_{y} P(Y=y, e)$

- Otherwise, let $Z = X \setminus (E \cup Y)$
- $P(Y \mid E=e) = \sum_{z} P(Y, Z=z, e) / P(e)$
- $P(e) = \sum_{y} \sum_{z} P(Y=y, Z=z, e)$

- Inference with joint distribution: $O(2^{|X \setminus E|})$ for binary variables
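Inference by enumeration can be sketched in a few lines. The example reuses the burglary network with the usual textbook CPT values, which are an assumption here. For brevity it handles a single binary query variable rather than a general set Y, and the names `network`, `joint_prob`, and `query` are hypothetical.

```python
from itertools import product

# Each node maps to (parents, cpt) with cpt[parent_values] = P(node=True | parents).
network = {
    "B": ((), {(): 0.001}),
    "E": ((), {(): 0.002}),
    "A": (("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (("A",), {(True,): 0.90, (False,): 0.05}),
    "M": (("A",), {(True,): 0.70, (False,): 0.01}),
}

def joint_prob(assignment):
    """Full joint probability of a complete assignment (chain rule for Bayes nets)."""
    p = 1.0
    for node, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[pa] for pa in parents)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

def query(target, evidence):
    """P(target | evidence), summing the joint over the hidden variables Z."""
    hidden = [n for n in network if n != target and n not in evidence]
    unnormalized = {}
    for y in (True, False):
        total = 0.0
        for z in product([True, False], repeat=len(hidden)):
            world = {target: y, **evidence, **dict(zip(hidden, z))}
            total += joint_prob(world)
        unnormalized[y] = total                 # P(Y=y, e)
    p_e = sum(unnormalized.values())            # P(e) = sum over y of P(Y=y, e)
    return {y: p / p_e for y, p in unnormalized.items()}

posterior = query("B", {"J": True, "M": True})
print(posterior)
```

The inner loop visits every assignment of the hidden variables, which is where the exponential cost comes from.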

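$P(C \mid F_1, \ldots, F_n) = P(C, F_1, \ldots, F_n) / P(F_1, \ldots, F_n)$

$= \frac{1}{Z} P(C) \prod_{i} P(F_i \mid C)$, with normalizer $Z = P(F_1, \ldots, F_n)$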

Given features, what class?

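- $P(\text{Class}, \text{Feature}_1, \ldots, \text{Feature}_n) = P(\text{Class}) \prod_{i} P(\text{Feature}_i \mid \text{Class})$

[Figure: Class is the parent of Feature1, Feature2, …, Featuren. Example classes: Spam / Not Spam, English / French / Latin, … Example features: word occurrences]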

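Given some features, what is the distribution over class?

$P(C \mid F_1, \ldots, F_k) = \frac{1}{Z} P(C, F_1, \ldots, F_k)$

$= \frac{1}{Z} \sum_{f_{k+1}, \ldots, f_n} P(C, F_1, \ldots, F_k, f_{k+1}, \ldots, f_n)$

$= \frac{1}{Z} P(C) \sum_{f_{k+1}, \ldots, f_n} \prod_{i=1}^{k} P(F_i \mid C) \prod_{j=k+1}^{n} P(f_j \mid C)$

$= \frac{1}{Z} P(C) \prod_{i=1}^{k} P(F_i \mid C) \prod_{j=k+1}^{n} \sum_{f_j} P(f_j \mid C)$

$= \frac{1}{Z} P(C) \prod_{i=1}^{k} P(F_i \mid C)$ (each $\sum_{f_j} P(f_j \mid C) = 1$)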
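The result says that unobserved features can simply be dropped from the product. The sketch below checks this on a toy spam model. The classes, words, and probabilities are all made up for illustration. One function uses the shortcut; the other sums out the missing features explicitly, and the two posteriors are compared.

```python
from itertools import product

# Hypothetical spam model: P(Class) and P(word present | Class).
p_class = {"spam": 0.4, "ham": 0.6}
p_feature = {
    "offer":   {"spam": 0.7, "ham": 0.1},
    "meeting": {"spam": 0.1, "ham": 0.5},
    "free":    {"spam": 0.6, "ham": 0.2},
}

def likelihood(name, value, c):
    """P(Feature=value | Class=c) for a binary feature."""
    p = p_feature[name][c]
    return p if value else 1.0 - p

def posterior(observed):
    """P(C | observed features): unobserved features are simply left out."""
    scores = dict(p_class)
    for name, value in observed.items():
        for c in scores:
            scores[c] *= likelihood(name, value, c)
    z = sum(scores.values())  # normalizer Z
    return {c: s / z for c, s in scores.items()}

def posterior_by_marginalizing(observed):
    """Same query the long way: sum the joint over every unobserved feature."""
    hidden = [f for f in p_feature if f not in observed]
    scores = {}
    for c in p_class:
        total = 0.0
        for values in product([True, False], repeat=len(hidden)):
            full = {**observed, **dict(zip(hidden, values))}
            p = p_class[c]
            for name, value in full.items():
                p *= likelihood(name, value, c)
            total += p
        scores[c] = total
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

short = posterior({"offer": True})
long = posterior_by_marginalizing({"offer": True})
print(short, long)
```

The shortcut costs time linear in the number of observed features, while the explicit sum is exponential in the number of missing ones.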

- For BNs and queries in general, it’s not that simple… more in later lectures.
- Next class: skim 5.1-3, begin reading 9.1-4