
### CS b553: Algorithms for Optimization and Learning

Bayesian Networks

Agenda

- Bayesian networks
- Chain rule for Bayes nets
- Naïve Bayes models

- Independence declarations
- D-separation

- Probabilistic inference queries

Purposes of Bayesian networks

- Efficient and intuitive modeling of complex causal interactions
- Compact representation of joint distributions: $O(n)$ parameters rather than $O(2^n)$ when each node has a bounded number of parents
- Algorithms for efficient inference with given evidence (more on this next time)

Independence of random variables

- Two random variables A and B are independent if P(A,B) = P(A) P(B), hence P(A|B) = P(A)
- Knowing B doesn’t give you any information about A
- [This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent]

Significance of independence

- If A and B are independent, then P(A,B) = P(A) P(B)
- => The joint distribution over A and B can be defined as a product of the distribution of A and the distribution of B
- => Store two much smaller probability tables rather than one large probability table over all combinations of A and B

Conditional Independence

- Two random variables A and B are conditionally independent given C if P(A,B|C) = P(A|C) P(B|C), hence P(A|B,C) = P(A|C)
- Once you know C, learning B doesn’t give you any information about A
- [again, this has to hold for all combinations of values that A,B,C can take on]
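
The definition can be checked numerically on a small joint table. The following is a minimal sketch, assuming binary variables and a hypothetical model in which C is a common cause of A and B; all names and numbers are illustrative and do not come from the slides.

```python
from itertools import product

# Hypothetical CPTs (illustrative numbers): C is a common cause of A and B.
p_c = {True: 0.3, False: 0.7}            # P(C = c)
p_a_given_c = {True: 0.9, False: 0.2}    # P(A = true | C = c)
p_b_given_c = {True: 0.8, False: 0.1}    # P(B = true | C = c)


def bern(p_true, value):
    """Probability of a binary value, given P(value = True)."""
    return p_true if value else 1.0 - p_true


# Joint P(A, B, C) built as P(C) P(A|C) P(B|C); keys are (a, b, c).
joint = {
    (a, b, c): p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)
    for a, b, c in product([True, False], repeat=3)
}


def marginal(keep):
    """Sum the joint over every variable whose index is not in `keep`."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def cond_independent_given_c(tol=1e-12):
    """Check P(A,B|C) == P(A|C) P(B|C) for every combination of values."""
    p_ac, p_bc, pc = marginal((0, 2)), marginal((1, 2)), marginal((2,))
    return all(
        abs(joint[(a, b, c)] / pc[(c,)]
            - (p_ac[(a, c)] / pc[(c,)]) * (p_bc[(b, c)] / pc[(c,)])) < tol
        for a, b, c in joint
    )


def independent(tol=1e-12):
    """Check P(A,B) == P(A) P(B) for every combination of values."""
    p_ab, p_a, p_b = marginal((0, 1)), marginal((0,)), marginal((1,))
    return all(abs(p_ab[(a, b)] - p_a[(a,)] * p_b[(b,)]) < tol for a, b in p_ab)


print(cond_independent_given_c())  # True: A and B are independent given C
print(independent())               # False: A and B are not marginally independent
```

In this toy model A and B are dependent on their own but become independent once C is known, which is the same pattern the Grade / SAT / Intelligence example describes.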

Significance of Conditional independence

- Consider Grade(CS101), Intelligence, and SAT
- Ostensibly, the grade in a course doesn’t have a direct relationship with SAT scores
- But good students are more likely to get good SAT scores, so the two are not independent…
- It is reasonable to believe that Grade(CS101) and SAT are conditionally independent given Intelligence

Bayesian network

- Explicitly represent independence among propositions
- Notice that Intelligence is the “cause” of both Grade and SAT, and the causality is represented explicitly

P(I,G,S) = P(G,S|I) P(I) = P(G|I) P(S|I) P(I)

[Figure: network with arcs Intel. → Grade and Intel. → SAT]

6 probabilities, instead of 11

Definition: Bayesian network

- Set of random variables $X=\{X_1,\ldots,X_n\}$ with domains $\mathrm{Val}(X_1),\ldots,\mathrm{Val}(X_n)$
- Each node $X$ has a set of parents $Pa_X$
- Graph must be a DAG

- Each node also maintains a conditional probability distribution (often, a table)
- $P(X \mid Pa_X)$
- $2^{k-1}$ entries for binary-valued variables

- Overall: $O(n 2^k)$ storage for binary variables
- Encodes the joint probability over X1,…,Xn
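
The storage saving can be made concrete with a small count. This is a minimal sketch, assuming binary variables and one stored number P(X = true | parents) per assignment of the parents; that counting convention and the function names are assumptions for illustration, and the example structure is the burglary network that the following slides use.

```python
def joint_table_entries(n):
    """Entries in a full joint table over n binary variables."""
    return 2 ** n


def bayes_net_entries(parents):
    """Total CPT entries, storing one P(X = true | parents) per parent assignment.

    `parents` maps each binary variable to the list of its parents.
    """
    return sum(2 ** len(pa) for pa in parents.values())


# Structure of the burglary network (Burglary, Earthquake -> Alarm -> calls).
burglary_parents = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}

print(joint_table_entries(len(burglary_parents)))  # 32
print(bayes_net_entries(burglary_parents))          # 10
```

With at most k parents per node the second count grows as n times 2^k, while the first grows as 2^n.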

[Figure: burglary network with arcs Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls]

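- $P(j \wedge m \wedge a \wedge \neg b \wedge \neg e) = P(j \wedge m \mid a, \neg b, \neg e)\,P(a \wedge \neg b \wedge \neg e) = P(j \mid a, \neg b, \neg e)\,P(m \mid a, \neg b, \neg e)\,P(a \wedge \neg b \wedge \neg e)$ (J and M are independent given A)
- $P(j \mid a, \neg b, \neg e) = P(j \mid a)$ (J and B, and J and E, are independent given A)
- $P(m \mid a, \neg b, \neg e) = P(m \mid a)$
- $P(a \wedge \neg b \wedge \neg e) = P(a \mid \neg b, \neg e)\,P(\neg b \mid \neg e)\,P(\neg e) = P(a \mid \neg b, \neg e)\,P(\neg b)\,P(\neg e)$ (B and E are independent)
- $P(j \wedge m \wedge a \wedge \neg b \wedge \neg e) = P(j \mid a)\,P(m \mid a)\,P(a \mid \neg b, \neg e)\,P(\neg b)\,P(\neg e)$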

Calculation of joint probability

[Figure: the burglary network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls), annotated with the full joint distribution $P(x_1 \wedge x_2 \wedge \ldots \wedge x_n) = \prod_{i=1}^{n} P(x_i \mid pa_{X_i})$]

$P(j \wedge m \wedge a \wedge \neg b \wedge \neg e) = P(j \mid a)\,P(m \mid a)\,P(a \mid \neg b, \neg e)\,P(\neg b)\,P(\neg e) = 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.000628$

Chain Rule for Bayes Nets

- Joint distribution is a product of all CPTs
- $P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa_{X_i})$
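
The chain rule translates directly into a loop over the CPTs. This is a minimal sketch for binary variables on the burglary network; only the values 0.9, 0.7, 0.001, 0.999 and 0.998 appear in the slides, and the remaining CPT entries, the `network` layout and the function name are assumptions (the usual textbook values for this example).

```python
# Each variable maps to (parents, CPT), where the CPT gives
# P(variable = true | parent values). Entries not shown in the slides
# are assumed textbook values.
network = {
    "B": ((), {(): 0.001}),
    "E": ((), {(): 0.002}),
    "A": (("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (("A",), {(True,): 0.90, (False,): 0.05}),
    "M": (("A",), {(True,): 0.70, (False,): 0.01}),
}


def joint_probability(assignment):
    """Chain rule for Bayes nets: multiply P(x_i | pa_i) over all nodes."""
    p = 1.0
    for var, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[pa] for pa in parents)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p


# P(j, m, a, not b, not e) from the worked calculation.
p = joint_probability({"J": True, "M": True, "A": True, "B": False, "E": False})
print(f"{p:.6f}")  # 0.000628
```

Keeping the structure and the tables in one dictionary means the same function works for any network of binary variables written in this layout.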

Example: Naïve Bayes models

- $P(\text{Cause}, \text{Effect}_1, \ldots, \text{Effect}_n) = P(\text{Cause}) \prod_i P(\text{Effect}_i \mid \text{Cause})$

[Figure: Cause with arcs to Effect1, Effect2, …, Effectn]

Advantages of Bayes Nets (and other graphical models)

- More manageable # of parameters to set and store
- Incremental modeling
- Explicit encoding of independence assumptions
- Efficient inference techniques

Arcs do not necessarily encode causality

[Figure: three Bayesian networks over the nodes A, B, and C]

2 BNs with the same expressive power, and a 3rd with greater power (exercise)

Reading off independence relationships

- Given B, does the value of A affect the probability of C?
- That is, does P(C|B,A) differ from P(C|B)?

- No!
- C’s parent (B) is given, so C is independent of its non-descendant (A)
- Independence is symmetric: $C \perp A \mid B \Rightarrow A \perp C \mid B$

[Figure: network over A, B, and C in which B is the parent of C]

Basic Rule

- A node is independent of its non-descendants given its parents (and given nothing else)

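What does the BN encode?

[Figure: the burglary network]

- $\text{Burglary} \perp \text{Earthquake}$
- $\text{JohnCalls} \perp \text{MaryCalls} \mid \text{Alarm}$
- $\text{JohnCalls} \perp \text{Burglary} \mid \text{Alarm}$
- $\text{JohnCalls} \perp \text{Earthquake} \mid \text{Alarm}$
- $\text{MaryCalls} \perp \text{Burglary} \mid \text{Alarm}$
- $\text{MaryCalls} \perp \text{Earthquake} \mid \text{Alarm}$

A node is independent of its non-descendants, given its parents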

Reading off independence relationships

[Figure: the burglary network]

- How about $\text{Burglary} \perp \text{Earthquake} \mid \text{Alarm}$?
- No! Why?
- $P(b \wedge e \mid a) = P(a \mid b, e)\,P(b \wedge e)/P(a) \approx 0.00075$
- $P(b \mid a)\,P(e \mid a) \approx 0.086$
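
The two numbers can be reproduced by summing the joint over Burglary and Earthquake. This is a minimal sketch, assuming the usual textbook CPT values for the burglary network; the slides only quote the results, so the table entries and helper names here are assumptions.

```python
from itertools import product

# Assumed textbook CPT values (not listed in full in the slides).
P_B = 0.001                                   # P(Burglary = true)
P_E = 0.002                                   # P(Earthquake = true)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(Alarm = true | B, E)


def p_bea(b, e, a):
    """P(B=b, E=e, A=a) = P(a | b, e) P(b) P(e)."""
    pa = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
    return pa * (P_B if b else 1.0 - P_B) * (P_E if e else 1.0 - P_E)


# P(a) by marginalizing over Burglary and Earthquake.
p_a = sum(p_bea(b, e, True) for b, e in product([True, False], repeat=2))

p_be_given_a = p_bea(True, True, True) / p_a
p_b_given_a = sum(p_bea(True, e, True) for e in (True, False)) / p_a
p_e_given_a = sum(p_bea(b, True, True) for b in (True, False)) / p_a

print(p_be_given_a)               # P(b, e | a)
print(p_b_given_a * p_e_given_a)  # P(b | a) P(e | a)
```

The two quantities differ by roughly two orders of magnitude, so Burglary and Earthquake are clearly not independent once Alarm is observed.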

Reading off independence relationships

[Figure: the burglary network]

- How about $\text{Burglary} \perp \text{Earthquake} \mid \text{JohnCalls}$?
- No! Why?
- Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent

Independence relationships

- For polytrees, there exists a unique undirected path between A and B. For each node on the path:
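- Evidence on a directed road $X \rightarrow E \rightarrow Y$ or $X \leftarrow E \leftarrow Y$ makes X and Y independent
- Evidence on a common parent $X \leftarrow E \rightarrow Y$ makes the descendants X and Y independent
- Evidence on a “V” node, or below the V ($X \rightarrow E \leftarrow Y$, or $X \rightarrow W \leftarrow Y$ with $W \rightarrow \ldots \rightarrow E$), makes X and Y dependent (otherwise they are independent)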

General case

- Formal property in general case:
- D-separation: the above properties hold for all (acyclic) paths between A and B
- D-separation $\Rightarrow$ independence

- That is, we can’t read off any more independence relationships from the graph than those that are encoded in D-separation
- The CPTs may indeed encode additional independences
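
The path-based rules above can be checked mechanically. This is a minimal sketch that uses a swapped-in but equivalent graph test instead of enumerating paths: restrict the graph to the query nodes, the evidence and their ancestors, "moralize" it (link co-parents and drop arc directions), delete the evidence nodes, and test whether the two query nodes are still connected. The function name `d_separated` and the `parents` dictionary layout are assumptions for illustration.

```python
def d_separated(parents, x, y, given=()):
    """Return True if x and y are d-separated by the set `given`.

    `parents` maps every node of the DAG to a list of its parents.
    Assumes x and y are distinct and not in `given`.
    """
    given = set(given)

    # 1. Keep only x, y, the evidence, and all of their ancestors.
    keep, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents[node])

    # 2. Moralize: undirected child-parent edges plus edges between co-parents.
    neighbors = {n: set() for n in keep}
    for child in keep:
        pas = list(parents[child])
        for p in pas:
            neighbors[child].add(p)
            neighbors[p].add(child)
            for q in pas:
                if p != q:
                    neighbors[p].add(q)

    # 3. Search from x, never passing through an evidence node.
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False          # still connected, so not d-separated
        if node in seen or node in given:
            continue
        seen.add(node)
        stack.extend(neighbors[node])
    return True


burglary = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}

print(d_separated(burglary, "Burglary", "Earthquake"))               # True
print(d_separated(burglary, "Burglary", "Earthquake", ["Alarm"]))    # False
print(d_separated(burglary, "JohnCalls", "MaryCalls", ["Alarm"]))    # True
```

The results match the relationships read off the burglary network earlier: observing Alarm, or its descendant JohnCalls, couples Burglary and Earthquake, while observing Alarm separates the two calls.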

Probability Queries

- Given: some probabilistic model over variables X
- Find: distribution over $Y \subseteq X$ given evidence $E=e$ for some subset $E \subseteq X \setminus Y$
- P(Y|E=e)

- Inference problem

Answering Inference Problems with the Joint Distribution

- Easiest case: $Y = X \setminus E$
- $P(Y \mid E=e) = P(Y,e)/P(e)$
- Denominator makes the probabilities sum to 1
- Determine $P(e)$ by marginalizing: $P(e) = \sum_y P(Y=y, e)$

- Otherwise, let $Z = X \setminus (E \cup Y)$
- $P(Y \mid E=e) = \sum_z P(Y, Z=z, e)/P(e)$
- $P(e) = \sum_y \sum_z P(Y=y, Z=z, e)$

- Inference with joint distribution: $O(2^{|X \setminus E|})$ for binary variables
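
Inference with the joint distribution can be sketched as brute-force enumeration over the hidden variables Z. This is a minimal illustration on the burglary network, restricted to a single binary query variable; the CPT entries are assumed textbook values that the slides do not list in full, and the names `joint` and `query` are hypothetical.

```python
from itertools import product

VARS = ("B", "E", "A", "J", "M")

# Assumed textbook CPT values: P(variable = true | parents).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}


def bern(p_true, value):
    """Probability of a binary value, given P(value = True)."""
    return p_true if value else 1.0 - p_true


def joint(x):
    """Full joint P(b, e, a, j, m) as the product of the CPT entries."""
    return (bern(P_B, x["B"]) * bern(P_E, x["E"])
            * bern(P_A[(x["B"], x["E"])], x["A"])
            * bern(P_J[x["A"]], x["J"]) * bern(P_M[x["A"]], x["M"]))


def query(target, evidence):
    """P(target | evidence): sum the joint over hidden variables, then normalize."""
    hidden = [v for v in VARS if v not in evidence and v != target]
    unnormalized = {}
    for value in (True, False):
        total = 0.0
        for values in product((True, False), repeat=len(hidden)):
            x = dict(evidence, **dict(zip(hidden, values)))
            x[target] = value
            total += joint(x)
        unnormalized[value] = total
    p_e = sum(unnormalized.values())   # P(e), the normalizing denominator
    return {value: p / p_e for value, p in unnormalized.items()}


posterior_b = query("B", {"J": True, "M": True})
print(posterior_b)  # distribution over Burglary given both calls
```

The inner loop visits every assignment of the non-evidence variables, which is where the exponential cost in the number of non-evidence variables comes from.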

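Naïve Bayes Classifier

- $P(\text{Class}, \text{Feature}_1, \ldots, \text{Feature}_n) = P(\text{Class}) \prod_i P(\text{Feature}_i \mid \text{Class})$

[Figure: Class with arcs to Feature1, Feature2, …, Featuren. Example classes: Spam / Not Spam, English / French / Latin, …; example features: word occurrences]

Given features, what class?

$P(C \mid F_1, \ldots, F_n) = P(C, F_1, \ldots, F_n)/P(F_1, \ldots, F_n) = \frac{1}{Z} P(C) \prod_i P(F_i \mid C)$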

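Naïve Bayes Classifier

- $P(\text{Class}, \text{Feature}_1, \ldots, \text{Feature}_n) = P(\text{Class}) \prod_i P(\text{Feature}_i \mid \text{Class})$

Given some features, what is the distribution over class?

$$
\begin{aligned}
P(C \mid F_1, \ldots, F_k) &= \frac{1}{Z} P(C, F_1, \ldots, F_k) \\
&= \frac{1}{Z} \sum_{f_{k+1}, \ldots, f_n} P(C, F_1, \ldots, F_k, f_{k+1}, \ldots, f_n) \\
&= \frac{1}{Z} P(C) \sum_{f_{k+1}, \ldots, f_n} \prod_{i=1}^{k} P(F_i \mid C) \prod_{j=k+1}^{n} P(f_j \mid C) \\
&= \frac{1}{Z} P(C) \prod_{i=1}^{k} P(F_i \mid C) \prod_{j=k+1}^{n} \sum_{f_j} P(f_j \mid C) \\
&= \frac{1}{Z} P(C) \prod_{i=1}^{k} P(F_i \mid C)
\end{aligned}
$$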
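
The derivation says that unobserved features can simply be dropped from the product. This is a minimal sketch that compares the shortcut against explicit marginalization; the spam model, its probabilities and the function names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical naive Bayes spam model (illustrative numbers only).
p_class = {"spam": 0.4, "ham": 0.6}
p_word_given_class = {          # P(word present | class)
    "spam": {"free": 0.6, "meeting": 0.05, "winner": 0.3},
    "ham": {"free": 0.1, "meeting": 0.4, "winner": 0.01},
}
WORDS = ("free", "meeting", "winner")


def normalize(scores):
    """Divide by Z so that the scores sum to 1."""
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


def posterior(observed):
    """P(Class | observed features), dropping the unobserved features."""
    scores = {}
    for c, prior in p_class.items():
        score = prior
        for word, present in observed.items():
            p = p_word_given_class[c][word]
            score *= p if present else 1.0 - p
        scores[c] = score
    return normalize(scores)


def posterior_by_marginalizing(observed):
    """Same query, but summing the full joint over the unobserved features."""
    hidden = [w for w in WORDS if w not in observed]
    scores = {}
    for c, prior in p_class.items():
        total = 0.0
        for values in product((True, False), repeat=len(hidden)):
            full = dict(observed, **dict(zip(hidden, values)))
            p = prior
            for w in WORDS:
                pw = p_word_given_class[c][w]
                p *= pw if full[w] else 1.0 - pw
            total += p
        scores[c] = total
    return normalize(scores)


print(posterior({"free": True}))
print(posterior_by_marginalizing({"free": True}))
```

Both functions return the same distribution; the first needs time linear in the number of observed features, while the second enumerates every assignment of the unobserved ones.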

For General Queries

- For BNs and queries in general, it’s not that simple… more in later lectures.
- Next class: skim 5.1-3, begin reading 9.1-4
