
Bayesian Networks



  1. Bayesian Networks

  2. Overview • Introduction to Bayesian Networks • Inference • Learning Parameters • Learning Topology • Decision Making

  3. Bayesian networks: AN INTRODUCTION

  4. Overview • A Bayesian Network • Instantiation • Probability Flows • Bayesian Networks and Causality • Edges and Conditional Independence

  5. Bayesian Networks: Introduction A Bayesian network consists of two parts: • The qualitative, which is given in the form of a directed acyclic graph (DAG). • Each node of the graph represents a variable of the system the Bayesian Network is modeling. • The edges of the graph represent independency relations between the variables (more on this later) • The quantitative, which is given by probability distributions associated with each variable, which is to say with each node in the graph. (From now on I will talk interchangeably about nodes and variables). • These distributions give the probability that the associated variable takes a particular value given the values of its parent nodes in the graph.

  6. [Figure: an example six-node DAG over the variables A, B, C, D, E, F]

  7. Instantiation • We will talk about nodes being ‘instantiated’ when we know that they have a particular value, and uninstantiated when we do not. • Let’s look at what occurs when a node is instantiated…

  8. Probability Flows • Downwards: Pr(D=1) = (.7)(.3) + (.3)(.9) = .21 + .27 = .48

  9. Probability Flows • Downwards: Let A=1. Pr(D=1) = .3

  10. Probability Flows • Upwards: Pr(A=1) = .7

  11. Probability Flows • Upwards: Let D=1. Pr(A=1) = (.7)(.3) / ((.7)(.3) + (.3)(.9)) = .21/(.21 + .27) = .4375
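
As a sanity check, the downward and upward flows on slides 8–11 can be reproduced with a few lines of Python. The only inputs are the numbers read off the slides (Pr(A=1) = .7, Pr(D=1|A=1) = .3, Pr(D=1|A=0) = .9); everything else is arithmetic, so treat this as an illustrative sketch rather than part of the original deck.

```python
# Prior and conditional probabilities read off slides 8-11.
p_a1 = 0.7                       # Pr(A = 1)
p_d1_given_a = {1: 0.3, 0: 0.9}  # Pr(D = 1 | A)

# Downward flow: marginalize A out of Pr(A, D).
p_d1 = p_a1 * p_d1_given_a[1] + (1 - p_a1) * p_d1_given_a[0]
print(p_d1)  # 0.48

# Upward flow: Bayes' rule once D = 1 is observed.
p_a1_given_d1 = p_a1 * p_d1_given_a[1] / p_d1
print(p_a1_given_d1)  # 0.4375
```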

  12. Probability Flows • Sideways! • A Priori: Pr(B=1) = .2 • If C is instantiated and E is not, this is unchanged: Let C=0. Pr(B=1) = .2

  13. Probability Flows • Sideways! • But if E is instantiated, then knowing the value of C affects our knowledge of B…

  14. Probability Flows • Sideways! • Let E=1. Pr(B=1) = [(.2)(.6)(.9) + (.2)(.4)(.8)] / [(.2)(.6)(.9) + (.2)(.4)(.8) + (.8)(.6)(.6) + (.8)(.4)(.1)] = .172/.588 = .293

  15. Probability Flows • Sideways! • Let E=1, C=0. Pr(B=1) = (.2)(.4)(.8) / [(.2)(.4)(.8) + (.8)(.4)(.1)] = .064/.096 ≈ .667
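
The same arithmetic can be scripted. The factors below are read off slide 15's calculation, so the assumed CPT layout (B and C both parents of E, with Pr(C=0) = .4) is an inference from the slide rather than something stated explicitly in the transcript.

```python
# Factors read off slide 15's arithmetic (the CPT layout is inferred):
p_b1 = 0.2                          # Pr(B = 1)
p_c0 = 0.4                          # Pr(C = 0)
p_e1 = {(1, 0): 0.8, (0, 0): 0.1}   # Pr(E = 1 | B, C) for the C = 0 column

# Explaining away: condition on E = 1 and C = 0.
num = p_b1 * p_c0 * p_e1[(1, 0)]              # 0.064
den = num + (1 - p_b1) * p_c0 * p_e1[(0, 0)]  # 0.096
print(num / den)  # ~0.667: knowing C = 0 raises Pr(B = 1 | E = 1) from 0.293
```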

  16. Probability Flows • What is going on? • Sideways inference is akin to ‘explaining away’ Hypothesis 1: Mr N.N. suffered a stroke. Hypothesis 2: Mr N.N. had a heart attack. Event: Mr N.N. died.

  18. Bayesian Networks and Causality • So do edges represent causation? NO!!! Bayesian Networks are not (in general) causal maps. • When creating a BN from expert knowledge, the network is often constructed from known causal connections, since humans tend to think in terms of causal relations. • When learning a BN from data we cannot assume that an edge represents a causal relationship. • There have been controversial methodologies suggested for reading causal relationships off non-causal Bayesian Networks. Cf. Neapolitan.

  19. Edges and Conditional Independence • We said ‘The graph represents independency relations between the variables.’ • It does so through the edges, or, more accurately, through the ABSENCE of edges, between nodes. • Recall that two variables, A and B, are independent if: P(A,B)=P(A).P(B) • And they are conditionally independent given a variable C if: P(A,B|C)=P(A|C).P(B|C)

  20. Bayesian networks: Technicalities and definitions

  21. Overview • The Markov Condition • D-Separation • The Markov Blanket • Markov Equivalence

  22. The Markov Condition The Markov Condition: A node in a Bayesian Network is conditionally independent of its non-descendants given its parents. Be careful: • Does the sideways flow of probabilities clash with what you think the Markov Condition claims? • A node is NOT conditionally independent of its non-descendants given its parents AND its descendants!

  23. The Markov Condition • We can think of a Bayesian Network as ‘pulling apart’ a Joint Probability Distribution by its conditional independencies, and thereby rendering it tractable. • This permits us to use probability theory to reason about systems in a tractable way. • Imagine each variable in our six-node network can take ten values. The conditional probability tables would then have a total of 11,130 values. (10,000 of which would be for node F). • The full joint probability table would have 1,000,000 values. BUT THEY REPRESENT THE SAME DISTRIBUTION
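
To make the counting concrete, here is the factorization the Markov Condition licenses, written out in LaTeX. The parent counts (three root nodes, one node with one parent, one with two, and F with three) are inferred from the slide's numbers and figures rather than stated in the transcript, so read them as an assumption.

```latex
P(A,B,C,D,E,F) \;=\; \prod_{X \in \{A,\dots,F\}} P\bigl(X \mid \mathrm{parents}(X)\bigr),
\qquad
\underbrace{3 \times 10}_{\text{three roots}}
+ \underbrace{10^{2}}_{\text{one parent}}
+ \underbrace{10^{3}}_{\text{two parents}}
+ \underbrace{10^{4}}_{\text{node } F}
= 11{,}130
\quad\text{versus}\quad
10^{6} = 1{,}000{,}000.
```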

  24. The Markov Condition A BN can do this only because it meets the Markov Condition. In fact, meeting this condition is the formal definition of a Bayesian Network: A DAG G and a probability distribution P form a Bayesian Network if and only if the pair <G,P> satisfies the Markov Condition. The Markov Condition also entails further conditional independencies…

  25. D-Separation Some definitions: Where we have a set of nodes {X1, X2, …, Xk}, where k ≥ 2, such that Xi-1 -> Xi or Xi -> Xi-1 for 2 ≤ i ≤ k, we call the set of edges connecting these nodes a chain between X1 and Xk. Let the head of an edge be the side next to the child (where the arrow is on our graphs) and the tail be the side next to the parent. We will talk of the edges of a chain meeting at a node on the chain.

  26. D-Separation Some definitions: Let A be a set of nodes, and X and Y be distinct nodes not in A, and c be a chain between X and Y. Then c is blocked by A if one of the following holds: • There is a node Z ∈ A on c, and the edges that meet at Z on c meet head-to-tail. • There is a node Z ∈ A on c, and the edges that meet at Z on c meet tail-to-tail. • There is a node Z on c, such that Z and all of Z’s descendants are not in A, and the edges that meet at Z on c meet head-to-head.

  27. D-Separation D-Separation Let A be a set of nodes, and X and Y be distinct nodes not in A. X and Y are d-separated by A if and only if every chain between X and Y is blocked by A. (This can be generalized: Let G = (V, E) be a DAG, and A, B, and C be mutually disjoint subsets of V. We say A and B are d-separated by C in G if and only if, for every X ∈ A and Y ∈ B, X and Y are d-separated by C.)
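
For concreteness, here is a minimal Python sketch of the blocking test, assuming the DAG is given as a set of (parent, child) edges and the chain as an explicit list of nodes. The function names are mine, and the code follows the three clauses above directly rather than any optimized d-separation routine.

```python
def descendants(node, edges):
    """All nodes reachable from `node` by following edge directions."""
    out, frontier = set(), [node]
    while frontier:
        n = frontier.pop()
        for parent, child in edges:
            if parent == n and child not in out:
                out.add(child)
                frontier.append(child)
    return out

def chain_blocked(chain, A, edges):
    """Is the chain (a list of nodes) between chain[0] and chain[-1] blocked by the node set A?"""
    for i in range(1, len(chain) - 1):
        left, z, right = chain[i - 1], chain[i], chain[i + 1]
        into_z = {(left, z) in edges, (right, z) in edges}
        head_to_head = into_z == {True}
        if not head_to_head and z in A:
            return True   # head-to-tail or tail-to-tail meeting at a node of A
        if head_to_head and z not in A and not (descendants(z, edges) & A):
            return True   # head-to-head meeting with Z and its descendants outside A
    return False

# Example, assuming B and C are both parents of E (as the explaining-away slides suggest):
edges = {("A", "D"), ("B", "E"), ("C", "E")}
print(chain_blocked(["B", "E", "C"], A=set(), edges=edges))  # True: E unobserved blocks the chain
print(chain_blocked(["B", "E", "C"], A={"E"}, edges=edges))  # False: instantiating E opens it
```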

  28. D-Separation • The Markov condition entails that all d-separations are conditional independencies; • Every conditional independency entailed by the Markov condition is identified by a d-separation.

  29. The Markov Blanket The Markov Blanket A node is conditionally independent of every other node in the graph given its parents, its children, and the other parents of its children. These form the Markov Blanket of the node. • Why the other parents of the node’s children? • Because of the sideways flow of probabilities.
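
The blanket can be read straight off the edge set. A minimal sketch, assuming the DAG is given as a set of (parent, child) pairs; the example edges are the ones the earlier slides suggest, not a structure stated in the transcript.

```python
def markov_blanket(node, edges):
    """Parents, children, and the children's other parents of `node`."""
    parents   = {p for p, c in edges if c == node}
    children  = {c for p, c in edges if p == node}
    coparents = {p for p, c in edges if c in children} - {node}
    return parents | children | coparents

# With the assumed edges B->E, C->E, E->F:
edges = {("B", "E"), ("C", "E"), ("E", "F")}
print(markov_blanket("B", edges))  # {'E', 'C'}: child E plus E's other parent C
```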

  30. The Markov Blanket

  31. Markov Equivalence A definition: If edges from nodes A and B meet at a node C, we say that this meeting is coupled if and only if there is also an edge between nodes A and B. Otherwise this meeting is uncoupled.

  32. Markov Equivalence Markov Equivalence Two DAGs are Markov Equivalent: ⇔ They encode the same conditional independencies; ⇔ They entail the same d-separations; ⇔ They have the same links (edges without regard for direction) and the same set of uncoupled head-to-head meetings.
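
The third characterization gives a direct test: compare the links (skeletons) and the uncoupled head-to-head meetings. A minimal sketch, again assuming each DAG is a set of (parent, child) pairs; the function names are mine.

```python
def skeleton(edges):
    """Edges with direction dropped."""
    return {frozenset(e) for e in edges}

def uncoupled_head_to_head(edges):
    """Pairs ({A, B}, C) with A -> C <- B and no edge between A and B."""
    skel = skeleton(edges)
    meetings = set()
    for a, c in edges:
        for b, c2 in edges:
            if c2 == c and a != b and frozenset((a, b)) not in skel:
                meetings.add((frozenset((a, b)), c))
    return meetings

def markov_equivalent(edges1, edges2):
    return (skeleton(edges1) == skeleton(edges2)
            and uncoupled_head_to_head(edges1) == uncoupled_head_to_head(edges2))

# A -> B <- C and A <- B -> C share a skeleton but differ in their
# uncoupled head-to-head meetings, so they are not Markov equivalent:
print(markov_equivalent({("A", "B"), ("C", "B")}, {("B", "A"), ("B", "C")}))  # False
```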

  33. Inference 1: The Variable Elimination Algorithm

  34. Potentials We define a potential to be a function mapping value combinations of a set of variables to the non-negative real numbers. For example, f is a potential: f(A, D) = [table shown on the slide]. Notice this is the conditional probability table of node D. • Conditional Probability tables are potentials. • Joint Probability tables are potentials.

  35. Potentials • A potential’s ‘input’ variables (those variables which can take more than one value) are called its scheme; here Scheme(f) = {A, D}. • So, if we know D is true, then the potential corresponding to our a posteriori knowledge from D’s conditional probability table would be: g(A) = [table shown on the slide], with Scheme(g) = {A}.

  36. Multiplication of Potentials We next define two operations on potentials: multiplication and marginalization. Given a set of potentials, F, the multiplication of these potentials is itself a potential. The value of each row, r, of a potential formed in this manner is obtained from the product of the row of each function f in F which assigns the variables of Scheme(f) the same values as they are assigned in r; that is, the product assigns to r the value Πf∈F f(r restricted to Scheme(f)). Obviously, the scheme of the product is the union of the schemes of the potentials in F. If f and g are potentials, when we perform the calculation/assignment f=f.g (possible only when Scheme(g) ⊆ Scheme(f)) we will say we multiply g into f.

  37. Multiplication of Potentials [Worked example on the slide: tables for f(A, D) and g(A, X), and the table for their product h = f.g, with h(A, D, X); the tables themselves are not in the transcript.]

  38. Marginalization of Potentials Given a potential f, we can marginalize out a set of variables from f and the result is itself a potential. If i is such a potential, then: Each row, r, in i is computed by summing the rows of f where the variables in Scheme(i) have the values assigned to them by r. If we wish to assign to a function variable the result of marginalizing out one of its variables, we will simply say we marginalize out from the function variable. So if g is the potential that results from marginalizing out the variable D from our example potential, then: g(A) = [table shown on the slide].
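
Here is a minimal sketch of both operations in Python, representing a potential as an ordered scheme plus a dict from value tuples to numbers. The class and method names are mine, and the variable domains are inferred from the table rather than passed in, purely to keep the example short.

```python
from itertools import product

class Potential:
    """A potential: a map from value combinations of its scheme to non-negative reals."""
    def __init__(self, scheme, table):
        self.scheme = tuple(scheme)   # ordered variable names, e.g. ("A", "D")
        self.table = dict(table)      # {(a_value, d_value): number, ...}

    def _domains(self):
        return {v: sorted({row[i] for row in self.table}) for i, v in enumerate(self.scheme)}

    def multiply(self, other):
        """Pointwise product; the result's scheme is the union of both schemes."""
        doms = {**self._domains(), **other._domains()}
        scheme = tuple(doms)          # union of the variables
        table = {}
        for combo in product(*(doms[v] for v in scheme)):
            assign = dict(zip(scheme, combo))
            table[combo] = (self.table[tuple(assign[v] for v in self.scheme)]
                            * other.table[tuple(assign[v] for v in other.scheme)])
        return Potential(scheme, table)

    def marginalize_out(self, var):
        """Sum over all values of `var`; the result's scheme drops `var`."""
        keep = [i for i, v in enumerate(self.scheme) if v != var]
        table = {}
        for row, value in self.table.items():
            key = tuple(row[i] for i in keep)
            table[key] = table.get(key, 0.0) + value
        return Potential([self.scheme[i] for i in keep], table)

# The CPT of D read off slide 8, times the prior on A, marginalized over A,
# reproduces the downward flow: Pr(D=1) = 0.48, Pr(D=0) = 0.52 (up to float noise).
f = Potential(("A", "D"), {(1, 1): 0.3, (1, 0): 0.7, (0, 1): 0.9, (0, 0): 0.1})
prior_a = Potential(("A",), {(1,): 0.7, (0,): 0.3})
print(f.multiply(prior_a).marginalize_out("A").table)
```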

  39. The Variable Elimination Algorithm • Perform a Topological Sort (Ancestral Ordering) on the Graph. This provides an ordering in which every node appears after all of its ancestors. This is always possible since the graph is a DAG. • Construct a set of 'buckets', where there is one bucket associated with each variable in the Network, b(i), and one additional bucket b∅. Each bucket will hold a set of potentials (or constant functions in the case of b∅). The buckets are ordered according to the topological order obtained in step one, with b∅ at the beginning. • Convert the conditional probability tables of the network into potentials and place them in the bucket associated with the largest variable in their scheme, based on the ordering. If there are no variables in a potential's scheme, it is placed in the null bucket. • Proceed in reverse order through the buckets: • Multiply all potentials in the bucket, producing a new potential, p. • Marginalize out the variable associated with the bucket from p, producing the potential p′. • Place p′ in the bucket associated with the largest variable in its scheme (or in the null bucket if its scheme is empty). • Process the null bucket, which simply amounts to multiplying the constant functions together (plain scalar multiplication).
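
Below is a compact sketch of this bucket-elimination scheme in Python, for a small discrete network given as CPTs. The factor representation, the function names (multiply_all, marginalize_out, variable_elimination) and the evidence-restriction step are my own, so treat it as an illustration of the procedure above rather than a reference implementation.

```python
from itertools import product

# A factor is a (scheme, table) pair: scheme is a tuple of variable names,
# table maps value tuples over that scheme to numbers.

def multiply_all(factors, domains):
    """Pointwise product of a list of factors."""
    scheme = tuple(dict.fromkeys(v for s, _ in factors for v in s))
    table = {}
    for combo in product(*(domains[v] for v in scheme)):
        assign = dict(zip(scheme, combo))
        val = 1.0
        for s, t in factors:
            val *= t[tuple(assign[v] for v in s)]
        table[combo] = val
    return scheme, table

def marginalize_out(var, factor):
    """Sum `var` out of a factor."""
    scheme, table = factor
    keep = [i for i, v in enumerate(scheme) if v != var]
    out = {}
    for row, val in table.items():
        key = tuple(row[i] for i in keep)
        out[key] = out.get(key, 0.0) + val
    return tuple(scheme[i] for i in keep), out

def variable_elimination(cpts, order, domains, evidence):
    """Return Pr(evidence). cpts: {node: (scheme, table)}; order: topological order."""
    buckets = {v: [] for v in order}
    null_bucket = 1.0
    # Restrict each CPT to the evidence, then drop it into the bucket of the
    # latest variable (in `order`) still appearing in its scheme.
    for scheme, table in cpts.values():
        free = [v for v in scheme if v not in evidence]
        restricted = {tuple(r[i] for i, v in enumerate(scheme) if v not in evidence): val
                      for r, val in table.items()
                      if all(r[i] == evidence[v] for i, v in enumerate(scheme) if v in evidence)}
        if free:
            buckets[max(free, key=order.index)].append((tuple(free), restricted))
        else:
            null_bucket *= restricted[()]
    # Process buckets in reverse topological order.
    for var in reversed(order):
        if not buckets[var]:
            continue
        scheme, table = marginalize_out(var, multiply_all(buckets[var], domains))
        if scheme:
            buckets[max(scheme, key=order.index)].append((scheme, table))
        else:
            null_bucket *= table[()]
    return null_bucket

# Two-node example (A -> D) with the CPTs read off slides 8-11:
domains = {"A": (0, 1), "D": (0, 1)}
cpts = {
    "A": (("A",), {(0,): 0.3, (1,): 0.7}),
    "D": (("A", "D"), {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.7, (1, 1): 0.3}),
}
p_d1 = variable_elimination(cpts, ["A", "D"], domains, {"D": 1})
p_a1_d1 = variable_elimination(cpts, ["A", "D"], domains, {"A": 1, "D": 1})
print(p_d1, p_a1_d1 / p_d1)   # 0.48 and 0.4375, matching slides 8 and 11
```

Dividing Pr(A=1, D=1) by Pr(D=1), as in the last line, is exactly the divide-by-the-evidence trick described two slides below.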

  40. The Variable Elimination Algorithm

  41. Points to note The algorithm produces the probability of the evidence. So if it is run without any evidence, it simply marginalizes all variables out and returns 1! To actually get, say, the A Priori probabilities of each variable, we have to run the algorithm repeatedly, assigning each value to each variable (one at a time). Likewise, to get A Posteriori probabilities, we run the algorithm on the evidence, then with the evidence plus the variable–value pair we are interested in, and divide the second result by the first.

  42. Conclusion The Variable Elimination Algorithm is… • VERY INEFFICIENT!!! • (When supplemented) the only algorithm that can estimate error bars for arbitrary nodes in a Bayesian Network: • Once we have completed the algorithm as given, we proceed backwards, calculating the derivatives of the functions involved. From these we can produce an effective approximation of the variance of the probability distribution, from which we estimate the error bars.

  43. Inference 2: The Junction Tree Algorithm

  44. Junction Trees • A Junction Tree is a secondary structure that we construct from a Bayesian Network. • Take a copy of the DAG and undirect its edges • Moralize • Triangulate • Find the cliques of the triangulated graph. • Insert sepsets between cliques. (Triangulation and finding the cliques are performed in a single step.)

  45. Build an optimal Junction Tree • Begin with a set of n trees, each consisting of a single clique, and an empty set S. • For each distinct pair of cliques X and Y, insert a candidate sepset into S, containing all and only nodes in both X and Y. • Repeat until n-1 sepsets have been inserted into the forest. • Choose the candidate sepset, C, which contains the largest number of nodes, breaking ties by choosing the sepset which has the smaller value product (the product of the number of values of the nodes/variables in the sepset). • Delete C from S. • Insert C between the cliques X and Y only if X and Y are on different trees in the forest. (NB This merges the two trees into a larger tree.)
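
A sketch of this sepset-selection loop in Python, assuming the cliques are given as sets of node names and the number of values per node as a dict; the tiny union-find stands in for "on different trees in the forest", and sorting the candidates once is equivalent to repeatedly choosing and deleting the best one.

```python
def build_junction_tree(cliques, num_values):
    """Greedy sepset insertion; returns the chosen (clique, clique, sepset) triples."""
    n = len(cliques)

    def weight(candidate):
        # Prefer the largest sepset; break ties on the smaller product of value counts.
        _, _, sep = candidate
        prod = 1
        for node in sep:
            prod *= num_values[node]
        return (-len(sep), prod)

    # One candidate sepset per distinct pair of cliques.
    candidates = sorted(((i, j, cliques[i] & cliques[j])
                         for i in range(n) for j in range(i + 1, n)), key=weight)
    parent = list(range(n))          # minimal union-find over the forest of cliques

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    chosen = []
    for i, j, sep in candidates:
        if len(chosen) == n - 1:
            break
        if find(i) != find(j):       # insert only between cliques on different trees
            parent[find(i)] = find(j)
            chosen.append((cliques[i], cliques[j], sep))
    return chosen

# The three cliques from slide 50, with (say) ten values per node:
cliques = [{"A", "D"}, {"B", "D", "E", "F"}, {"B", "C", "E"}]
num_values = {v: 10 for v in "ABCDEF"}
for x, y, sep in build_junction_tree(cliques, num_values):
    print(sorted(x), sorted(y), sorted(sep))   # sepsets {B, E} and {D}
```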

  46. Our DAG [Figure: the example DAG over A, B, C, D, E, F]

  47. Undirected [Figure: the same graph with edge directions dropped]

  48. Moralized [Figure: the moral graph, with parents of common children linked]

  49. Obtain Cliques from Triangulated Graph whilst Triangulating • Take the moral graph, G1, and make a copy of it, G2. • While there are still nodes left in G2: • Select a node V from G2, such that V causes the fewest edges to be added in Step 2b, breaking ties by choosing the node that induces the cluster with the smallest weight, where: • The weight of a node V is the number of values of V. • The weight of a cluster is the product of the weights of its constituent nodes. • The node V and its neighbors in G2 form a cluster, C. Connect all of the nodes in this cluster. • If C is not a sub-graph of a previous cluster, store C. • Remove V from G2.
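
A sketch of this combined triangulate-and-collect-cliques loop, assuming the moral graph is given as an adjacency dict of sets; only the working copy G2 is maintained here, which is enough to recover the cliques. The example adjacency is my reconstruction of the moral graph from slide 48 (chosen to be consistent with the cliques on slide 50), since the figure itself is not in the transcript.

```python
from itertools import combinations

def triangulate_and_find_cliques(adj, num_values):
    """Greedy node elimination on a copy of the moral graph (G2 on the slide);
    the stored clusters are the cliques of the triangulated graph."""
    g = {v: set(nbrs) for v, nbrs in adj.items()}   # working copy (G2)
    cliques = []

    def fill_in(v):
        # Number of edges that eliminating v would add between its neighbours.
        return sum(1 for a, b in combinations(g[v], 2) if b not in g[a])

    def cluster_weight(v):
        w = num_values[v]
        for n in g[v]:
            w *= num_values[n]
        return w

    while g:
        v = min(g, key=lambda u: (fill_in(u), cluster_weight(u)))
        cluster = {v} | g[v]
        for a, b in combinations(g[v], 2):          # connect v's neighbours
            g[a].add(b)
            g[b].add(a)
        if not any(cluster <= previous for previous in cliques):
            cliques.append(cluster)
        for n in g[v]:                              # remove v from the working copy
            g[n].discard(v)
        del g[v]
    return cliques

# Reconstructed moral graph (an assumption consistent with slide 50's cliques):
adj = {
    "A": {"D"},
    "B": {"C", "D", "E", "F"},
    "C": {"B", "E"},
    "D": {"A", "B", "E", "F"},
    "E": {"B", "C", "D", "F"},
    "F": {"B", "D", "E"},
}
print(triangulate_and_find_cliques(adj, {v: 10 for v in adj}))
# e.g. [{'A', 'D'}, {'B', 'C', 'E'}, {'B', 'D', 'E', 'F'}] (set ordering may vary)
```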

  50. Cliques: {A, D}, {B, D, E, F}, {B, C, E}
