
Reasoning Under Uncertainty



Presentation Transcript


  1. Reasoning Under Uncertainty Artificial Intelligence Chapter 9

  2. Part 2 Reasoning

  3. Notation
  • Random variable (RV): a variable (uppercase) that takes on values (lowercase) from a domain of mutually exclusive and exhaustive values
  • A=a: a proposition, world state, event, effect, etc.
  • abbreviate: P(A=true) to P(a)
  • abbreviate: P(A=false) to P(¬a)
  • abbreviate: P(A=value) to P(value)
  • abbreviate: P(A≠value) to P(¬value)
  • Atomic event: a complete specification of the state of the world about which the agent is uncertain

  4. Notation
  • P(a): a prior probability of RV A=a, which is the degree of belief in proposition a in the absence of any other relevant information
  • P(a|e): conditional probability of RV A=a given E=e, which is the degree of belief in proposition a when all that is known is evidence e
  • P(A): probability distribution, i.e. the set of P(ai) for all i
  • Joint probabilities are for conjunctions of propositions

  5. Reasoning under Uncertainty
  • Rather than reasoning about the truth or falsity of a proposition, instead reason about the belief that the proposition is true.
  • Use a knowledge base of known probabilities to determine probabilities for query propositions.

  6. Reasoning under Uncertainty using Full Joint Distributions
  Assume a simplified Clue game having two characters (plum, green), two weapons (rope, pipe) and two rooms (hall, kitchen). The eight atomic events are:
  plum rope hall, plum rope kitchen, plum pipe hall, plum pipe kitchen,
  green rope hall, green rope kitchen, green pipe hall, green pipe kitchen
  • each row is an atomic event: one of these must be true; the list must be mutually exclusive and exhaustive
  • the prior probability for each is 1/8: each is equally likely, e.g. P(plum, rope, hall) = 1/8
  • Σ P(atomic_eventi) = 1, since each RV's domain is exhaustive and mutually exclusive
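
A minimal Python sketch (not from the original slides) of this full joint distribution; the variable and dict names are illustrative choices.

```python
# Full joint distribution for the simplified Clue game: a dict mapping each
# atomic event (Who, What, Where) to its probability.
from itertools import product

characters = ["plum", "green"]
weapons = ["rope", "pipe"]
rooms = ["hall", "kitchen"]

# With no other information, all 8 atomic events are equally likely.
joint = {event: 1 / 8 for event in product(characters, weapons, rooms)}

# The atomic events are mutually exclusive and exhaustive, so they sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9
print(joint[("plum", "rope", "hall")])  # 0.125
```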

  7. Determining Marginal Probabilities using Full Joint Distributions
  • The probability of any proposition a is equal to the sum of the probabilities of the atomic events in which it holds, a set called e(a):
  P(a) = Σ P(ei), where ei is an element of e(a)
  • it's the disjunction of the atomic events in the set e(a)
  • recall this property of atomic events: any proposition is logically equivalent to the disjunction of all atomic events that entail the truth of that proposition

  8. Determining Marginal Probabilities using Full Joint Distributions
  Assume the simplified Clue game with two characters, two weapons and two rooms.
  P(a) = Σ P(ei), where ei is an element of e(a)
  P(plum) = ? The four atomic events containing plum each have probability 1/8, so
  P(plum) = 1/8 + 1/8 + 1/8 + 1/8 = 1/2
  • a probability obtained in this manner is called a marginal probability
  • it can be just a prior probability (shown) or something more complex (next)
  • this process is called marginalization or summing out

  9. Reasoning under Uncertainty using Full Joint Distributions
  Assume the simplified Clue game with two characters, two weapons and two rooms.
  P(green, pipe) = 1/8 + 1/8 = 1/4
  P(rope, ¬hall) = 1/8 + 1/8 = 1/4
  P(rope ∨ hall) = 1/8 + 1/8 + 1/8 + 1/8 + 1/8 + 1/8 = 3/4
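
A small sketch (illustrative, not from the slides) that computes the marginal probabilities above by summing the matching atomic events.

```python
from itertools import product

joint = {e: 1 / 8 for e in product(["plum", "green"],
                                   ["rope", "pipe"],
                                   ["hall", "kitchen"])}

def prob(condition):
    """Sum the probabilities of the atomic events where `condition` holds."""
    return sum(p for event, p in joint.items() if condition(event))

print(prob(lambda e: e[0] == "plum"))                      # P(plum) = 0.5
print(prob(lambda e: e[0] == "green" and e[1] == "pipe"))  # P(green, pipe) = 0.25
print(prob(lambda e: e[1] == "rope" and e[2] != "hall"))   # P(rope, ¬hall) = 0.25
print(prob(lambda e: e[1] == "rope" or e[2] == "hall"))    # P(rope ∨ hall) = 0.75
```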

  10. Independence
  • Using the game Clue for an example is uninteresting! Why?
  • Because the random variables Who, What, Where are independent.
  • Does picking the murderer from the deck of cards affect which weapon is chosen? The location? No! Each is randomly selected.

  11. Independence
  • Unconditional (absolute) independence: the RVs have no effect on each other's probabilities
  1. P(X|Y) = P(X)
  2. P(Y|X) = P(Y)
  3. P(X,Y) = P(X) P(Y)
  • Example (full Clue, with 6 characters, 6 weapons and 9 rooms):
  P(green | hall) = P(green, hall) / P(hall) = (6/324) / (1/9) = 1/6 = P(green)
  P(hall | green) = P(hall) = 1/9
  P(green, hall) = P(green) P(hall) = 1/54
  • We need a more interesting example!

  12. Independence
  • Conditional independence: RVs X and Y both depend on another RV Z, but given Z they are independent of each other
  1. P(X|Y,Z) = P(X|Z)
  2. P(Y|X,Z) = P(Y|Z)
  3. P(X,Y|Z) = P(X|Z) P(Y|Z)
  • Idea: sneezing (x) and itchy eyes (y) are both directly caused by hayfever (z), but neither sneezing nor itchy eyes has a direct effect on the other

  13. Reasoning under Uncertainty using Full Joint Distributions
  Assume three boolean RVs, Hayfever (HF), Sneeze (SN), ItchyEyes (IE), and fictional probabilities.
  P(a) = Σ P(ei), where ei is an element of e(a)
  P(sn) = 0.1 + 0.1 + 0.04 + 0.1 = 0.34
  P(hf) = 0.01 + 0.06 + 0.04 + 0.1 = 0.21
  P(sn,ie) = 0.1 + 0.1 = 0.20
  P(hf,sn) = 0.04 + 0.1 = 0.14
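
The full joint table itself is not visible in this transcript; the eight entries in the sketch below are a reconstruction chosen to reproduce every number quoted on slides 13-17, so treat them as the slides' "fictional probabilities".

```python
joint = {
    # (HF, SN, IE): probability
    (True,  True,  True):  0.10,
    (True,  True,  False): 0.04,
    (True,  False, True):  0.06,
    (True,  False, False): 0.01,
    (False, True,  True):  0.10,
    (False, True,  False): 0.10,
    (False, False, True):  0.09,
    (False, False, False): 0.50,
}

def prob(condition):
    """Marginal probability: sum over the atomic events where `condition` holds."""
    return sum(p for event, p in joint.items() if condition(*event))

print(prob(lambda hf, sn, ie: sn))         # P(sn)    = 0.34
print(prob(lambda hf, sn, ie: hf))         # P(hf)    = 0.21
print(prob(lambda hf, sn, ie: sn and ie))  # P(sn,ie) = 0.20
print(prob(lambda hf, sn, ie: hf and sn))  # P(hf,sn) = 0.14
```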

  14. Reasoning under Uncertainty using Full Joint Distributions
  Assume three boolean RVs, Hayfever (HF), Sneeze (SN), ItchyEyes (IE), and fictional probabilities.
  P(a|e) = P(a,e) / P(e)
  P(hf | sn) = P(hf,sn) / P(sn) = 0.14 / 0.34 = 0.41
  P(hf | ie) = P(hf,ie) / P(ie) = 0.16 / 0.35 = 0.46

  15. Reasoning under Uncertainty using Full Joint Distributions
  Assume three boolean RVs, Hayfever (HF), Sneeze (SN), ItchyEyes (IE), and fictional probabilities.
  P(a|e) = P(a,e) / P(e)
  Instead of computing P(e), we could use normalization:
  P(hf | sn) = 0.14 / P(sn)
  also compute: P(¬hf | sn) = 0.20 / P(sn)
  since P(hf | sn) + P(¬hf | sn) = 1, substituting and solving gives P(sn) = 0.34!
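
A brief sketch (illustrative) of the normalization trick: compute the unnormalized values P(hf,sn) and P(¬hf,sn), then divide by their sum instead of computing P(sn) separately.

```python
p_hf_and_sn = 0.14       # P(hf, sn)  from the joint table
p_nothf_and_sn = 0.20    # P(¬hf, sn) from the joint table

norm = p_hf_and_sn + p_nothf_and_sn       # this sum is P(sn) = 0.34
p_hf_given_sn = p_hf_and_sn / norm        # ≈ 0.41
p_nothf_given_sn = p_nothf_and_sn / norm  # ≈ 0.59
print(norm, p_hf_given_sn, p_nothf_given_sn)
```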

  16. Combining Multiple Evidence
  • As evidence describing the state of the world is accumulated, we'd like to be able to easily update the degree of belief in a conclusion.
  • Using the full joint probability distribution table:
  P(v1,...,vk | vk+1,...,vn) = P(V1=v1,...,Vn=vn) / P(Vk+1=vk+1,...,Vn=vn)
  • the sum of all entries in the table where V1=v1, ..., Vn=vn
  • divided by the sum of all entries in the table corresponding to the evidence, where Vk+1=vk+1, ..., Vn=vn

  17. Combining Multiple Evidence using Full Joint Distributions
  Assume three boolean RVs, Hayfever (HF), Sneeze (SN), ItchyEyes (IE), and fictional probabilities.
  P(a | b,c) = P(a,b,c) / P(b,c), as described on the prior slide
  P(hf | sn, ie) = P(hf,sn,ie) / P(sn,ie) = 0.10 / (0.1 + 0.1) = 0.5
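
A sketch (illustrative) of the general formula from slides 16-17 as a helper over the reconstructed HF/SN/IE table: sum the entries matching query plus evidence, divided by the sum of the entries matching the evidence.

```python
def cond_prob(joint, query, evidence):
    """P(query | evidence) over a full joint table of (HF, SN, IE) tuples."""
    p_evidence = sum(p for e, p in joint.items() if evidence(*e))
    p_both = sum(p for e, p in joint.items() if evidence(*e) and query(*e))
    return p_both / p_evidence

joint = {   # the reconstructed table from the earlier sketch
    (True, True, True): 0.10,   (True, True, False): 0.04,
    (True, False, True): 0.06,  (True, False, False): 0.01,
    (False, True, True): 0.10,  (False, True, False): 0.10,
    (False, False, True): 0.09, (False, False, False): 0.50,
}
print(cond_prob(joint, lambda hf, sn, ie: hf, lambda hf, sn, ie: sn and ie))  # 0.5
```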

  18. Combining Multiple Evidence (cont.)
  • Full joint distribution table (FJDT) techniques are intractable in general because the table size grows exponentially.
  • Independence assertions can help reduce the size of the domain and the complexity of the inference problem.
  • Independence assertions are usually based on knowledge of the domain, enabling the FJD table to be factored into separate joint distribution tables.
  • it's a good thing that problem domains contain independencies
  • but typically the subsets of dependent RVs are still quite large

  19. Probability Rules for Multi-valued Variables
  • Summing out: P(Y) = Σ P(Y, z), summed over all values z of RV Z
  • Conditioning: P(Y) = Σ P(Y|z) P(z), summed over all values z of RV Z
  • Product rule: P(X, Y) = P(X|Y) P(Y) = P(Y|X) P(X)
  • Chain rule: P(X, Y, Z) = P(X|Y, Z) P(Y|Z) P(Z)
  • this is a generalization of the product rule with Y := Y,Z
  • the order of the RVs doesn't matter, i.e. any ordering gives the same result
  • Conditionalized chain rule (let Y := A|B):
  P(X, A|B) = P(X|A, B) P(A|B) (order doesn't matter)
            = P(A|X, B) P(X|B)
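
A quick numeric check (using only numbers already quoted on the hayfever slides) that the conditioning rule reproduces P(hf).

```python
p_sn = 0.34                           # P(sn)
p_hf = 0.21                           # P(hf)
p_hf_and_sn = 0.14                    # P(hf, sn)
p_hf_and_notsn = p_hf - p_hf_and_sn   # P(hf, ¬sn) = 0.07

p_hf_given_sn = p_hf_and_sn / p_sn               # product rule, rearranged
p_hf_given_notsn = p_hf_and_notsn / (1 - p_sn)

# Conditioning: P(hf) = P(hf|sn) P(sn) + P(hf|¬sn) P(¬sn)
print(p_hf_given_sn * p_sn + p_hf_given_notsn * (1 - p_sn))  # 0.21
```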

  20. Bayes' Rule
  • Bayes' Rule: P(b|a) = P(a|b) P(b) / P(a)
  • derived from P(a∧b) = P(b|a) P(a) = P(a|b) P(b); just divide both sides of the equation by P(a)
  • the basis of AI systems using probabilistic reasoning
  • For example: a=happy, b=sun and a=sneeze, b=fall
  P(happy|sun) = 0.95        P(sneeze|fall) = 0.85
  P(sun) = 0.5               P(fall) = 0.25
  P(happy) = 0.75            P(sneeze) = 0.3
  P(sun|happy) = (0.95 * 0.5) / 0.75 = 0.63
  P(fall|sneeze) = (0.85 * 0.25) / 0.3 = 0.71
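
A minimal sketch (illustrative) of Bayes' rule as a function, checked against the two examples on this slide.

```python
def bayes(p_a_given_b, p_b, p_a):
    """P(b|a) = P(a|b) P(b) / P(a)."""
    return p_a_given_b * p_b / p_a

print(bayes(0.95, 0.5, 0.75))  # P(sun|happy)   ≈ 0.63
print(bayes(0.85, 0.25, 0.3))  # P(fall|sneeze) ≈ 0.71
```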

  21. Bayes' Rule
  P(b|a) = P(a|b) P(b) / P(a)
  What's the benefit of being able to calculate P(b|a) from the three probabilities on the right?
  • Usefulness of Bayes' Rule:
  • many problems have good estimates of the probabilities on the right
  • P(b|a) is needed to identify a cause, a classification, a diagnosis, etc.
  • a typical use is to calculate diagnostic knowledge from causal knowledge

  22. Bayes' Rule
  • Causal knowledge: from causes to effects
  • e.g. P(sneeze|cold), the probability of the effect sneeze given the cause common cold
  • the doctor obtains this probability from experience treating patients and from understanding the disease process
  • Diagnostic knowledge: from effects to causes
  • e.g. P(cold|sneeze), the probability of the cause common cold given the effect sneeze
  • knowing this probability helps a doctor make a disease diagnosis based on a patient's symptoms
  • diagnostic knowledge is more fragile than causal knowledge since it can change significantly over time given variations in the rate of occurrence of its causes (due to epidemics, etc.)

  23. Bayes' Rule
  • Using Bayes' Rule with causal knowledge:
  • we want to determine diagnostic knowledge (diagnostic reasoning) that is difficult to obtain from a general population
  • e.g. the symptom is s=stiffNeck, the disease is m=meningitis
  P(s|m) = 1/2          the causal knowledge
  P(m) = 1/50000, P(s) = 1/20          prior probabilities
  P(m|s) = ?          the desired diagnostic knowledge
  P(m|s) = (1/2 * 1/50000) / (1/20) = 1/5000
  • the doctor can now use P(m|s) to guide diagnosis
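
The meningitis computation above, checked with exact fractions (a small illustrative sketch).

```python
from fractions import Fraction

p_s_given_m = Fraction(1, 2)   # P(s|m), the causal knowledge
p_m = Fraction(1, 50000)       # prior P(m)
p_s = Fraction(1, 20)          # prior P(s)

p_m_given_s = p_s_given_m * p_m / p_s   # Bayes' rule
print(p_m_given_s)             # 1/5000
```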

  24. Combining Multiple Evidence using Bayes' Rule
  How do you update the conditional probability of Y given two pieces of evidence A and B?
  • General Bayes' Rule for multi-valued RVs: P(Y|X) = P(X|Y) P(Y) / P(X)
  let X = A,B:
  P(Y|A,B) = P(A,B|Y) P(Y) / P(A,B)
           = P(Y) (P(B|A,Y) P(A|Y)) / (P(B|A) P(A))     conditionalized chain rule and product rule
           = P(Y) * (P(A|Y)/P(A)) * (P(B|A,Y)/P(B|A))
  • Problems:
  • P(B|A,Y) is generally hard to compute or obtain
  • it doesn't scale well for n evidence RVs; the table size grows O(2^n)

  25. Combining Multiple Evidence using Bayes' Rule
  The problems can be circumvented:
  • If A and B are conditionally independent given Y, then P(A,B|Y) = P(A|Y) P(B|Y), and for P(A,B) use the product rule:
  P(Y|A,B) = P(Y) P(A,B|Y) / P(A,B)     multi-evidence Bayes' rule
  P(Y|A,B) = P(Y) * (P(A|Y)/P(A)) * (P(B|Y)/P(B|A))
  • No joint probabilities are needed, and the representation grows O(n).
  • If, in addition, A is unconditionally independent of B, then P(A,B|Y) = P(A|Y) P(B|Y) and P(A,B) = P(A) P(B):
  P(Y|A,B) = P(Y) P(A,B|Y) / P(A,B)     multi-evidence Bayes' rule
  P(Y|A,B) = P(Y) * (P(A|Y)/P(A)) * (P(B|Y)/P(B))
  • This last equation is used to define a naïve Bayes classifier.

  26. Combining Multiple Evidence using Bayes' Rule
  • Example: What is the likelihood that a patient has sclerosing cholangitis?
  • the doctor's initial belief: P(sc) = 1/1,000,000
  • an examination reveals jaundice: P(j) = 1/10,000, P(j|sc) = 1/5
  • the doctor's belief given this result: P(sc|j) = P(sc) P(j|sc) / P(j) = 2/1000
  • tests reveal fibrosis of the bile ducts: P(f|sc) = 4/5, P(f) = 1/100
  • the doctor naïvely assumes jaundice and fibrosis are independent
  • the doctor's belief now rises to P(sc|j,f) = 16/100, using
  P(Y|A,B) = P(Y) * (P(A|Y)/P(A)) * (P(B|Y)/P(B))
  P(sc|j,f) = P(sc) * (P(j|sc)/P(j)) * (P(f|sc)/P(f))
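
A quick check (illustrative) of the two belief updates above, using exact fractions.

```python
from fractions import Fraction

p_sc = Fraction(1, 1_000_000)
p_j, p_j_given_sc = Fraction(1, 10_000), Fraction(1, 5)
p_f, p_f_given_sc = Fraction(1, 100), Fraction(4, 5)

p_sc_given_j = p_sc * p_j_given_sc / p_j   # Bayes' rule with one piece of evidence
print(p_sc_given_j)                        # 1/500 = 2/1000

# naïve (independence) combination of both pieces of evidence
p_sc_given_j_f = p_sc * (p_j_given_sc / p_j) * (p_f_given_sc / p_f)
print(p_sc_given_j_f)                      # 4/25 = 16/100
```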

  27. Naïve Bayes Classifier
  • A naïve Bayes classifier is used where a single class is predicted from a number of features, or where a single cause influences a number of effects
  • based on P(Y|A,B) = P(Y) * (P(A|Y)/P(A)) * (P(B|Y)/P(B))
  • given RV C whose domain is the possible classifications, say {c1, c2, c3}
  • to classify an input example with features F1, …, Fn:
  • compute P(c1|F1, …, Fn), P(c2|F1, …, Fn), P(c3|F1, …, Fn)
  • naïvely assume the features are independent
  • choose the value of C that gives the maximum probability
  • works surprisingly well in practice even when the independence assumptions aren't true (see the sketch below)
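
A minimal naïve Bayes classifier sketch under the assumptions above; the class priors and per-class feature likelihoods below are made-up placeholders, not values from the slides.

```python
def naive_bayes_classify(features, priors, likelihoods):
    """Return the class c maximizing P(c) * Π_i P(f_i | c).

    P(F1,...,Fn) is the same for every class, so it can be dropped
    from the comparison (it only normalizes the scores).
    """
    scores = {}
    for c, p_c in priors.items():
        score = p_c
        for f in features:
            score *= likelihoods[c].get(f, 1e-6)  # tiny floor for unseen features
        scores[c] = score
    return max(scores, key=scores.get)

priors = {"allergy": 0.3, "cold": 0.7}           # P(C=c), made-up
likelihoods = {                                  # P(feature | C=c), made-up
    "allergy": {"sneeze": 0.9, "itchy_eyes": 0.8, "fever": 0.05},
    "cold":    {"sneeze": 0.8, "itchy_eyes": 0.2, "fever": 0.4},
}
print(naive_bayes_classify(["sneeze", "itchy_eyes"], priors, likelihoods))  # allergy
```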

  28. Bayesian Networks
  • AKA: Bayes nets, belief nets, causal nets, etc.
  • Encode the full joint probability distribution (FJPD) for the set of RVs defining a problem domain
  • Use a space-efficient data structure by exploiting:
  • the fact that dependencies between RVs are generally local
  • which results in lots of conditionally independent RVs
  • Capture both qualitative and quantitative relationships between RVs

  29. Bayesian Networks
  • Can be used to compute any value in the FJPD
  • Can be used to reason:
  • predictive/causal reasoning: forward (top-down) from causes to effects
  • diagnostic reasoning: backward (bottom-up) from effects to causes

  30. Bayesian Network Representation
  • A Bayesian network is an augmented DAG (i.e. a directed, acyclic graph)
  • Represented by (V, E) where:
  • V is a set of vertices
  • E is a set of directed edges joining vertices, with no loops
  • Each vertex contains:
  • the RV's name
  • either a prior probability distribution or a conditional probability distribution table (CDT) that quantifies the effects of the parents on this RV
  • Each directed arc:
  • goes from a cause (parent) to its immediate effects (children)
  • represents a direct causal relationship between RVs

  31. Bayesian Network Representation
  • Example: in class
  • each row in a conditional probability table must sum to 1
  • columns don't need to sum to 1
  • values are obtained from experts
  • The number of probabilities required is typically far fewer than the number required for a FJDT
  • Quantitative information is usually given by an expert or determined empirically from data

  32. Conditional Independence
  • Assume effects are conditionally independent of each other given their common cause
  • The net is constructed so that, given its parents, a node is conditionally independent of its non-descendant RVs in the net:
  P(X1=x1, ..., Xn=xn) = P(x1 | parents(X1)) * ... * P(xn | parents(Xn))
  • Note that the full joint probability distribution isn't needed; we only need conditionals relative to each RV's parents

  33. Algorithm for Constructing Bayesian Networks
  • Choose a set of relevant random variables
  • Choose an ordering for them; assume they're X1 .. Xm where X1 is first, X2 is second, etc.
  • For i = 1 to m:
  • add a new node for Xi to the network
  • set Parents(Xi) to be a minimal subset of {X1 .. Xi-1} such that Xi is conditionally independent of all other members of {X1 .. Xi-1} given Parents(Xi)
  • add a directed arc from each node in Parents(Xi) to Xi
  • non-root nodes: define a conditional probability table P(Xi = x | combinations of Parents(Xi)); root nodes: define a prior probability distribution at Xi: P(Xi)

  34. Algorithm for Constructing Bayesian Networks
  • For a given set of random variables (RVs) there is not, in general, a unique Bayesian net, but all of them represent the same information
  • For the best net, topologically sort the RVs in step 2:
  • each RV comes before all of its children
  • the first nodes are the roots, then the nodes they directly influence
  • The best Bayesian network for a problem has:
  • the fewest probabilities and arcs
  • probabilities that are easy to determine for the CDTs
  • The algorithm won't construct a net that violates the rules of probability

  35. Computing Joint Probabilities using a Bayesian Network
  [Example network: A and B are root nodes, C has parents A and B, D has parent C]
  • Use the product rule
  • Simplify using independence
  For example, compute P(a,b,c,d) = P(d,c,b,a):
  • order the RVs in the joint probability bottom up: D, C, B, A
  P(d,c,b,a) = P(d|c,b,a) P(c,b,a)          product rule
             = P(d|c) P(c,b,a)              conditional independence of D given C
             = P(d|c) P(c|b,a) P(b,a)       product rule on P(c,b,a)
             = P(d|c) P(c|b,a) P(b|a) P(a)  product rule on P(b,a)
             = P(d|c) P(c|b,a) P(b) P(a)    independence of B and A given no evidence
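
A sketch of this 4-node network as a plain Python data structure. The CPT numbers are made-up placeholders (the slide's figure isn't in the transcript); only the structure, which matches the factorization P(a) P(b) P(c|b,a) P(d|c), comes from the slide.

```python
p_a = 0.6                                                  # P(a), root node
p_b = 0.3                                                  # P(b), root node
p_c_given_ab = {(True, True): 0.9, (True, False): 0.5,
                (False, True): 0.4, (False, False): 0.1}   # P(c | A, B)
p_d_given_c = {True: 0.7, False: 0.2}                      # P(d | C)

def joint(a, b, c, d):
    """P(a,b,c,d) = P(a) P(b) P(c|a,b) P(d|c), simplified by independence as on the slide."""
    pa = p_a if a else 1 - p_a
    pb = p_b if b else 1 - p_b
    pc = p_c_given_ab[(a, b)] if c else 1 - p_c_given_ab[(a, b)]
    pd = p_d_given_c[c] if d else 1 - p_d_given_c[c]
    return pa * pb * pc * pd

print(joint(True, True, True, True))  # P(a,b,c,d) = 0.6 * 0.3 * 0.9 * 0.7 = 0.1134
```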

  36. Computing Joint Probabilities using a Bayesian Network
  [Example network with nodes A, B, C, D, E, F, G, H, K, L, M, N, O, P; its structure is given by the factorization below]
  • Any entry in the full joint distribution table (i.e. any atomic event) can be computed:
  P(v1,...,vn) = Π P(vi | Parents(Vi)), over i from 1 to n
  e.g. given boolean RVs, what is P(a,...,h,k,...,p)?
  = P(a) P(b) P(c) P(d|a,b) P(e|b,c) P(f) P(g|d,e) P(h) * P(k|f,g) P(l|g,h) P(m|k) P(n|k) P(o|k,l) P(p|l)
  • Note this is fast, i.e. linear in the number of nodes in the net!

  37. Computing Joint Probabilities using a Bayesian Network
  [Same 4-node network: A, B, C, D]
  How is any joint probability computed? Sum the relevant joint probabilities:
  e.g. compute P(a,b) = P(a,b,c,d) + P(a,b,c,¬d) + P(a,b,¬c,d) + P(a,b,¬c,¬d)
  e.g. compute P(c) = P(a,b,c,d) + P(a,¬b,c,d) + P(¬a,b,c,d) + P(¬a,¬b,c,d) + P(a,b,c,¬d) + P(a,¬b,c,¬d) + P(¬a,b,c,¬d) + P(¬a,¬b,c,¬d)
  • A BN can answer any query (i.e. any probability) about the domain by summing the relevant joint probabilities.
  • Enumeration can require many computations!
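
A sketch of answering queries such as P(a,b) or P(c) by enumeration over the same 4-node network (the CPT numbers are the same made-up placeholders as above).

```python
from itertools import product

p_a, p_b = 0.6, 0.3
p_c_given_ab = {(True, True): 0.9, (True, False): 0.5,
                (False, True): 0.4, (False, False): 0.1}
p_d_given_c = {True: 0.7, False: 0.2}

def joint(a, b, c, d):
    # P(a,b,c,d) = P(a) P(b) P(c|a,b) P(d|c)
    pa = p_a if a else 1 - p_a
    pb = p_b if b else 1 - p_b
    pc = p_c_given_ab[(a, b)] if c else 1 - p_c_given_ab[(a, b)]
    pd = p_d_given_c[c] if d else 1 - p_d_given_c[c]
    return pa * pb * pc * pd

def marginal(fixed):
    """Sum the joint over every complete assignment consistent with `fixed`,
    e.g. fixed = {'a': True, 'b': True} gives P(a,b)."""
    total = 0.0
    for vals in product([True, False], repeat=4):
        assignment = dict(zip("abcd", vals))
        if all(assignment[n] == v for n, v in fixed.items()):
            total += joint(**assignment)
    return total

print(marginal({"a": True, "b": True}))  # P(a,b) = 0.6 * 0.3 = 0.18
print(marginal({"c": True}))             # P(c), summing out A, B and D
```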

  38. Computing Conditional Probabilities using a Bayesian Network
  • The basic task of a probabilistic system is to compute conditional probabilities.
  • Any conditional probability can be computed:
  P(v1,...,vk | vk+1,...,vn) = P(V1=v1,...,Vn=vn) / P(Vk+1=vk+1,...,Vn=vn)
  • The key problem is that the technique of enumerating joint probabilities can make the computations intractable (exponential in the number of RVs).

  39. Computing Conditional Probabilities using a Bayesian Network
  • These computations generally rely on the simplifications resulting from the independence of the RVs.
  • Every variable that isn't an ancestor of a query variable or an evidence variable is irrelevant to the query.
  • Which ancestors are irrelevant?

  40. Independence in a Bayesian Network
  [Figure: the 14-node network (A-H, K-P) from slide 36]
  Given a Bayesian network, how is independence established?
  A node is conditionally independent (CI) of its non-descendants, given its parents.
  e.g. Given D and E, G is CI of ?   A, B, C, F, H

  41. Independence in a Bayesian Network
  [Figure: the same 14-node network]
  Given a Bayesian network, how is independence established?
  A node is conditionally independent (CI) of its non-descendants, given its parents.
  e.g. Given D and E, G is CI of ?   A, B, C, F, H
  e.g. Given F and G, K is CI of ?   A, B, C, D, E, H, L, P

  42. Independence in a Bayesian Network
  [Figure: the same 14-node network]
  Given a Bayesian network, how is independence established?
  A node is conditionally independent of all other nodes in the network given its parents, its children, and its children's parents, which together are called its Markov blanket.
  e.g. What is the Markov blanket for G?   D, E, F, H, K, L
  Given this blanket, G is CI of ?   A, B, C, M, N, O, P
  What about absolute independence?

  43. Computing Conditional Probabilities using a Bayesian Network
  • The general algorithm for computing conditional probabilities is complicated.
  • It is easy if the query involves nodes that are directly connected to each other.
  (the examples that follow assume boolean RVs)
  • Simple causal inference: P(E|C)
  • the conditional probability distribution of effect E given cause C as evidence
  • reasoning in the same direction as the arc, e.g. from disease to symptom
  • Simple diagnostic inference: P(Q|E)
  • the conditional probability distribution of query Q given effect E as evidence
  • reasoning in the direction opposite the arc, e.g. from symptom to disease

  44. Computing Conditional Probabilities: Causal (Top-Down) Inference
  [Figure: effect E has two parents, the causes C and C2]
  Compute P(e|c), the conditional probability of effect E=e given cause C=c as evidence; assume arcs to E from both C and C2.
  • Rewrite the conditional probability of e in terms of e and all of its parents (that aren't evidence), given the evidence c
  • Re-express each joint probability back into the probability of e given all of its parents
  • Simplify using independence, and look up the required values in the Bayesian network

  45. Computing Conditional Probabilities: Causal (Top-Down) Inference
  Compute P(e|c):
  P(e|c) = P(e,c) / P(c)                                product rule
         = (P(e,c,c2) + P(e,c,¬c2)) / P(c)              marginalizing
         = P(e,c,c2) / P(c) + P(e,c,¬c2) / P(c)         algebra
         = P(e,c2|c) + P(e,¬c2|c)                       product rule, e.g. X = e,c2
         = P(e|c2,c) P(c2|c) + P(e|¬c2,c) P(¬c2|c)      conditionalized chain rule
  Simplify given that C and C2 are independent: P(c2|c) = P(c2) and P(¬c2|c) = P(¬c2)
         = P(e|c2,c) P(c2) + P(e|¬c2,c) P(¬c2)
  Now look up these values in the network to finish the computation.
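
A numeric sketch of the final formula above. The CPT values are made-up placeholders; the slide only tells us that E has the two independent causes C and C2.

```python
p_c2 = 0.4                          # prior P(c2)
p_e_given = {(True, True): 0.9,     # P(e | C, C2), indexed by (C, C2)
             (True, False): 0.6,
             (False, True): 0.5,
             (False, False): 0.1}

# P(e|c) = P(e|c,c2) P(c2) + P(e|c,¬c2) P(¬c2)
p_e_given_c = p_e_given[(True, True)] * p_c2 + p_e_given[(True, False)] * (1 - p_c2)
print(p_e_given_c)  # 0.9*0.4 + 0.6*0.6 = 0.72
```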

  46. Computing Conditional Probabilities: Diagnostic (Bottom-Up) Inference
  Compute P(c|e), the conditional probability of cause C=c given effect E=e as evidence; assume an arc exists from C to E.
  Idea: convert to causal inference using Bayes' rule.
  1. Use Bayes' rule: P(c|e) = P(e|c) P(c) / P(e)
  2. Compute P(e|c) using the causal inference method
  3. Look up the value of P(c) in the Bayesian net
  4. Use normalization to avoid computing P(e)
  • this requires also computing the unnormalized value for P(¬c|e), using steps 1-3 above
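
A numeric sketch of diagnostic inference via Bayes' rule plus normalization, continuing the made-up numbers from the causal-inference sketch above.

```python
p_c = 0.3                                # prior P(c), a placeholder value
p_e_given_c = 0.72                       # from the causal-inference step above
p_e_given_notc = 0.5 * 0.4 + 0.1 * 0.6   # P(e|¬c) = P(e|¬c,c2)P(c2) + P(e|¬c,¬c2)P(¬c2) = 0.26

# Unnormalized posteriors, then normalize instead of computing P(e) directly.
score_c = p_e_given_c * p_c              # proportional to P(c|e)
score_notc = p_e_given_notc * (1 - p_c)  # proportional to P(¬c|e)
p_c_given_e = score_c / (score_c + score_notc)
print(p_c_given_e)                       # 0.216 / (0.216 + 0.182) ≈ 0.543
```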

  47. Summary: the Good News
  • Bayesian nets are the bread and butter of the AI-uncertainty community (like resolution for AI-logic)
  • Bayesian nets are a compact representation:
  • they don't require exponential storage to hold all of the information in the full joint probability distribution (FJPD) table
  • they are a decomposed representation of the FJPD table
  • the conditional probability distribution tables in non-root nodes are only exponential in the maximum number of parents of any node
  • Bayesian nets are fast at computing joint probabilities P(V1, ..., Vk), i.e. the prior probability of V1, ..., Vk
  • computing the probability of an atomic event can be done in time linear in the number of nodes in the net

  48. Summary: the Bad News
  • Conditional probabilities can also be computed: P(Q|E1, ..., Ek), the posterior probability of query Q given multiple evidence E1, ..., Ek
  • this requires enumerating all of the matching entries, which takes time exponential in the number of variables
  • in special cases it can be done faster, in at most polynomial time, e.g. in linear time for polytrees (nets structured like trees)
  • In general, inference in Bayesian networks (BNs) is NP-hard, but BNs are well studied, so there exist many efficient exact solution methods as well as a variety of approximation techniques.
