Reasoning Under Uncertainty

Reasoning Under Uncertainty Radu Marinescu 4C @ University College Cork

Why uncertainty? • Uncertainty in medical diagnosis • Diseases produce symptoms • In diagnosis, observed symptoms => disease ID • Uncertainties • Symptoms may not occur • Symptoms may not be reported • Diagnostic tests are not perfect • False positive, false negative • How do we estimate confidence? • P(disease | symptoms, tests) = ?

Why uncertainty? • Uncertainty in medical decision-making • Physicians, patients must decide on treatments • Treatments may not be successful • Treatments may have unpleasant side effects • Choosing treatments • Weigh risks of adverse outcomes • People are BAD at reasoning intuitively about probabilities • Provide systematic analysis

Outline • Probabilistic modeling with joint distributions • Conditional independence and factorization • Belief (or Bayesian) networks • Example networks and software • Inference in belief networks • Exact inference • Variable elimination, join-tree clustering, AND/OR search • Approximate inference • Mini-clustering, belief propagation, sampling

Bibliography • Judea Pearl. "Probabilistic Reasoning in Intelligent Systems", 1988 • Stuart Russell & Peter Norvig. "Artificial Intelligence: A Modern Approach", 2002 (Ch. 13-17) • Kevin Murphy. "A Brief Introduction to Graphical Models and Bayesian Networks". http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html • Rina Dechter. "Bucket Elimination: A Unifying Framework for Probabilistic Inference". http://www.ics.uci.edu/~csp/R48a.ps • Rina Dechter. "Mini-Buckets: A General Scheme for Approximating Inference". http://www.ics.uci.edu/~csp/r62a.pdf • Rina Dechter & Robert Mateescu. "AND/OR Search Spaces for Graphical Models". http://www.ics.uci.edu/~csp/r126.pdf

Reasoning under uncertainty • A problem domain is modeled by a list of (discrete) random variables: X1, X2, …, Xn • Knowledge about the problem is represented by a joint probability distribution: P(X1, X2, …, Xn)

Example • Alarm (Pearl88) • Story: In Los Angeles, burglary and earthquake are common. They both can trigger an alarm. In case of alarm, two neighbors John and Mary may call 911 • Problem: estimate the probability of a burglary based on who has or has not called • Variables: • Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M) • Knowledge required by the probabilistic approach in order to solve this problem: P(B, E, A, J, M)

Joint probability distribution Defines probabilities for all possible value assignments to the variables in the set

Inference with joint probability distribution • What is the probability of burglary given that Mary called, P(B=y | M=y)? • Compute marginal probability: P(B, M) = ∑E,A,J P(B, E, A, J, M) • Compute answer (reasoning by conditioning): P(B=y | M=y) = P(B=y, M=y) / P(M=y)
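The two steps above (marginalize, then condition) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the numbers used to fill the 32-entry table are hypothetical, and the helper names `bern`, `marginal` and `conditional` are invented for this sketch.

```python
from itertools import product

VARS = ("B", "E", "A", "J", "M")

# Hypothetical numbers, used only to generate a valid 32-entry joint table.
P_B = 0.001                      # P(B=y)
P_E = 0.002                      # P(E=y)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=y | B, E)
P_J = {True: 0.90, False: 0.05}  # P(J=y | A)
P_M = {True: 0.70, False: 0.01}  # P(M=y | A)

def bern(p_true, value):
    """Probability that a binary variable takes `value` when P(true) = p_true."""
    return p_true if value else 1.0 - p_true

# The full joint: one number per assignment of (B, E, A, J, M).
joint = {}
for b, e, a, j, m in product([True, False], repeat=5):
    joint[(b, e, a, j, m)] = (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
                              * bern(P_J[a], j) * bern(P_M[a], m))

def marginal(joint, **fixed):
    """Sum the joint over every assignment consistent with `fixed`."""
    total = 0.0
    for assignment, p in joint.items():
        row = dict(zip(VARS, assignment))
        if all(row[name] == value for name, value in fixed.items()):
            total += p
    return total

def conditional(joint, query, evidence):
    """P(query | evidence) = P(query, evidence) / P(evidence)."""
    return marginal(joint, **query, **evidence) / marginal(joint, **evidence)

p = conditional(joint, {"B": True}, {"M": True})
print(f"P(B=y | M=y) = {p:.4f}")
```

Every query is answered by scanning the whole table, which is why this approach scales exponentially with the number of variables.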

Advantages • Probability theory well-established and well understood • In theory, can perform arbitrary inference among the variables given a joint probability. This is because the joint probability contains information of all aspects of the relationships among the variables • Diagnostic inference: • From effects to causes • Example: P(B=y | M=y) • Predictive inference: • From causes to effects • Example: P(M=y | B=y) • Combining evidence: P(B=y | J=y, M=y, E=n) • All inference sanctioned by probability theory and hence has clear semantics

Difficulty: complexity in model construction and inference • In Alarm example: • 32 numbers needed (parameters) • Quite unnatural to assess • P(B=y, E=y, A=y, J=y, M=y) • Computing P(B=y | M=y) takes 29 additions • In general, • P(X1, X2, …, Xn) needs at least 2^n numbers to specify the joint probability distribution • Knowledge acquisition difficult (complex, unnatural) • Exponential storage and inference

Outline • Probabilistic modeling with joint distributions • Conditional independence and factorization • Belief networks • Example networks and software • Inference in belief networks • Exact inference • Approximate inference • Miscellaneous • Mixed networks, influence diagrams, etc.

Chain rule and factorization • Overcome the problem of exponential size by exploiting conditional independencies • The chain rule of probability: P(X1, X2, …, Xn) = P(X1) P(X2 | X1) … P(Xn | X1, …, Xn-1) = ∏i P(Xi | X1, …, Xi-1) • No gains yet. The number of parameters required by the factors is still O(2^n)
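As a worked instance, the chain rule applied to the five alarm variables in the order B, E, A, J, M reads as follows. The variable order is a choice made for this illustration; any order gives a valid expansion.

```latex
\begin{align*}
P(B, E, A, J, M) = P(B)\, P(E \mid B)\, P(A \mid B, E)\, P(J \mid B, E, A)\, P(M \mid B, E, A, J)
\end{align*}
```

Counting every table entry for binary variables, the five factors need 2 + 4 + 8 + 16 + 32 = 62 numbers, so the expansion alone saves nothing over the 32-entry joint table.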

Conditional independence • A random variable X is conditionally independent of a set of random variables Y given a set of random variables Z if • P(X | Y, Z) = P(X | Z) • Intuitively: • Y tells us nothing more about X than we know by knowing Z • As far as X is concerned, we can ignore Y if we know Z

Conditional independence • About P(Xi | X1, …, Xi-1): • Domain knowledge usually allows one to identify a subset pa(Xi) ⊆ {X1, …, Xi-1} such that • Given pa(Xi), Xi is independent of all variables in {X1, …, Xi-1} \ pa(Xi), i.e. P(Xi | X1, …, Xi-1) = P(Xi | pa(Xi)) • Then P(X1, X2, …, Xn) = ∏i P(Xi | pa(Xi)) • Joint distribution factorized! • The number of parameters might have been substantially reduced

Example continued • pa(B) = {}, pa(E) = {}, pa(A) = {B,E}, pa(J) = {A}, pa(M) = {A} • Conditional probability tables (CPT)

Example continued • Model size reduced from 32 to 2+2+4+4+8=20 • Model construction easier • Fewer parameters to assess • Parameters are more natural to assess • e.g., P(B=y), P(J=y | A=y), P(A=y | B=y, E=y), etc. • Inference easier. Will see this later.
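The count 2+2+4+4+8=20 follows directly from the parent sets. A small sketch, assuming binary variables and counting every CPT entry (the function name `cpt_entries` is invented here):

```python
# Parent sets of the alarm example: pa(B) = {}, pa(E) = {}, pa(A) = {B, E},
# pa(J) = {A}, pa(M) = {A}.
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

def cpt_entries(parents, arity=2):
    """Entries in each table P(X | pa(X)): arity ** (1 + number of parents)."""
    return {x: arity ** (1 + len(pa)) for x, pa in parents.items()}

sizes = cpt_entries(parents)
factored = sum(sizes.values())     # total entries over all CPTs
full_joint = 2 ** len(parents)     # entries in the unfactored joint table
print(sizes, factored, full_joint)
```

The saving grows quickly with network size, because each table is exponential only in the number of parents of one node rather than in the total number of variables.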

Outline • Probabilistic modeling with joint distributions • Conditional Independence and factorization • Belief networks • Example networks and software • Inference in belief networks • Exact inference • Approximate inference

From factorization to belief networks • Graphically represent the conditional independency relationships: • Construct a directed graph by drawing an arc from Xj to Xi iff Xj ∈ pa(Xi) • Also attach the CPT P(Xi | pa(Xi)) to node Xi [Figure: the alarm network B → A ← E, A → J, A → M, with P(B), P(E), P(A|B,E), P(J|A), P(M|A) attached to nodes B, E, A, J, M]

Formal definition • A belief network is: • A directed acyclic graph (DAG), where: • Each node represents a random variable • And is associated with the conditional probability of the node given its parents • Represents the joint probability distribution: P(X1, X2, …, Xn) = ∏i P(Xi | pa(Xi)) • A variable is conditionally independent of its non-descendants given its parents

Independences in belief networks • 3 basic independence structures • 1: chain: Burglary → Alarm → JohnCalls • 2: common descendants: Burglary → Alarm ← Earthquake • 3: common ancestors: JohnCalls ← Alarm → MaryCalls

Independences in belief networks • Burglary → Alarm → JohnCalls • 1. JohnCalls is independent of Burglary given Alarm

Independences in belief networks • Burglary → Alarm ← Earthquake • 2. Burglary is independent of Earthquake not knowing Alarm. Burglary and Earthquake become dependent given Alarm!!
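The effect in case 2 (often called "explaining away") can be checked numerically. The sketch below uses hypothetical probabilities for the Burglary, Earthquake, Alarm fragment; the numbers and the helper names `joint` and `prob` are assumptions made for illustration only.

```python
from itertools import product

# Hypothetical numbers for the Burglary -> Alarm <- Earthquake fragment.
P_B = 0.001   # P(B=y)
P_E = 0.002   # P(E=y)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A=y | B, E)

def joint(b, e, a):
    """P(B=b, E=e, A=a) = P(b) P(e) P(a | b, e)."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    return pb * pe * pa

def prob(**fixed):
    """Marginal probability of the partial assignment given in `fixed`."""
    total = 0.0
    for b, e, a in product([True, False], repeat=3):
        row = {"b": b, "e": e, "a": a}
        if all(row[k] == v for k, v in fixed.items()):
            total += joint(b, e, a)
    return total

p_b = prob(b=True)
p_b_given_e = prob(b=True, e=True) / prob(e=True)          # Alarm unknown
p_b_given_a = prob(b=True, a=True) / prob(a=True)          # Alarm observed
p_b_given_a_e = prob(b=True, a=True, e=True) / prob(a=True, e=True)

print(p_b, p_b_given_e)            # equal: B and E are marginally independent
print(p_b_given_a, p_b_given_a_e)  # differ: B and E are dependent given A
```

With Alarm unobserved, learning about Earthquake leaves the belief in Burglary unchanged. Once Alarm is observed, an earthquake accounts for the alarm and the belief in Burglary drops.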

Independences in belief networks • JohnCalls ← Alarm → MaryCalls • 3. MaryCalls is independent of JohnCalls given Alarm.

Independences in belief networks • BN models many conditional independence relations relating distant variables and sets, which are defined in terms of the graphical criterion called d-separation • d-separation = conditional independence • Let X, Y and Z be three sets of nodes • If X and Y are d-separated by Z, then X and Y are conditionally independent given Z: P(X|Y, Z) = P(X|Z) • d-separation in the graph: • A is d-separated from B given C if every undirected path between them is blocked • Path blocking • 3 cases that expand on three basic independence structures

Undirected path blocking • With “linear” substructure: X → Z → Y, blocked if Z is in C • With “wedge” substructure (common ancestors): X ← Z → Y, blocked if Z is in C • With “vee” substructure (common descendants): X → Z ← Y, blocked if neither Z nor any of its descendants is in C

Example • Network: 1 → 2, 1 → 3, 2 → 4, 3 → 4, 4 → 5 • X = {2} and Y = {3} are d-separated by Z = {1} • path 2 ← 1 → 3 is blocked by 1 ∈ Z • path 2 → 4 ← 3 is blocked because 4 and all its descendants are outside Z • X = {2} and Y = {3} are not d-separated by Z = {1,5} • path 2 ← 1 → 3 is blocked by 1 ∈ Z • path 2 → 4 ← 3 is activated because 5 (which is a descendant of 4) is in Z • learning the value of consequence 5 renders 5’s causes 2 and 3 dependent
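The example can be checked mechanically. The sketch below does not enumerate paths; it uses an equivalent, standard test that is swapped in here for brevity: restrict the graph to the ancestors of X, Y and Z, moralize it, delete Z, and check whether X can still reach Y. The edge list 1 → 2, 1 → 3, 2 → 4, 3 → 4, 4 → 5 is read off the example, and the function names are invented for this sketch.

```python
from collections import deque

def ancestors_of(nodes, parents):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, ()))
    return seen

def d_separated(xs, ys, zs, parents):
    """True if xs and ys are d-separated by zs in the DAG given by `parents`."""
    keep = ancestors_of(set(xs) | set(ys) | set(zs), parents)
    # Moralize the ancestral subgraph: link each node to its parents,
    # link co-parents to each other, and drop edge directions.
    nbrs = {n: set() for n in keep}
    for child in keep:
        pa = list(parents.get(child, ()))
        for p in pa:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for i, p in enumerate(pa):
            for q in pa[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)
    # Breadth-first search from xs that never steps onto a node in zs.
    blocked = set(zs)
    frontier = deque(set(xs) - blocked)
    seen = set(frontier)
    while frontier:
        n = frontier.popleft()
        if n in ys:
            return False
        for m in nbrs[n] - blocked - seen:
            seen.add(m)
            frontier.append(m)
    return True

# parents[n] lists the direct causes of node n.
parents = {1: [], 2: [1], 3: [1], 4: [2, 3], 5: [4]}
print(d_separated({2}, {3}, {1}, parents))
print(d_separated({2}, {3}, {1, 5}, parents))
```

Conditioning on node 5 pulls node 4 into the ancestral set, and moralization then connects its parents 2 and 3 directly, which is the graph-level counterpart of the activated "vee" path.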

I-mapness • Given a probability distribution P on a set of variables {X1, …, Xn}, a belief network B representing P is a minimal I-map (Pearl88) • I-mapness: every d-separation condition displayed in B corresponds to a valid conditional independence relationship in P • Minimal: none of the arrows in B can be deleted without destroying its I-mapness

Example network • The “alarm” network: Monitoring Intensive-Care Patients • 37 variables, 509 parameters (instead of 2^37) [Figure: graph of the 37 network nodes (MINVOLSET, KINKEDTUBE, PULMEMBOLUS, …, BP)]

Software • GeNIe (University of Pittsburgh) - free • http://genie.sis.pitt.edu • SamIam (UCLA) - free • http://reasoning.cs.ucla.edu/SamIam/ • Hugin - commercial • http://www.hugin.com • Netica - commercial • http://www.norsys.com • UCI Lab – free but no GUI • http://graphmod.ics.uci.edu/

GeNIe screenshot

Applications • Belief networks are used in: • Genetic linkage analysis • Speech recognition • Medical diagnosis • Probabilistic error correcting coding • Monitoring and diagnosis in distributed systems • Troubleshooting (Microsoft) • …

Outline • Probabilistic modeling with joint distributions • Conditional independence and factorization • Belief networks • Inference in belief networks • Exact inference • Approximate inference

Exact inference • Variable elimination (inference) • Bucket elimination • Bucket-Tree elimination • Cluster-Tree elimination • Conditioning (search) • VE+C hybrid • AND/OR search (tree, graph)

Belief updating • Network variables: Smoking, Lung cancer, Bronchitis, Dyspnoea, X-ray • Query: P(Lung cancer = yes | Smoking = no, Dyspnoea = yes) = ?

Probabilistic inference tasks • Belief updating • Most probable explanation (MPE) • Maximum a posteriori hypothesis (MAP)
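In symbols, the three tasks are commonly stated as below. The notation is an assumption made here, not taken from the slides: e is the evidence, x ranges over assignments to all unobserved variables, a over assignments to a chosen subset of hypothesis variables, and y over assignments to the remaining unobserved variables.

```latex
\begin{align*}
\text{Belief updating:}\quad & P(X_i \mid e) \\
\text{MPE:}\quad & x^{*} = \arg\max_{x} P(x, e) \\
\text{MAP:}\quad & a^{*} = \arg\max_{a} \sum_{y} P(a, y, e)
\end{align*}
```

MPE maximizes over every unobserved variable at once, while MAP maximizes over the hypothesis variables only and sums out the rest.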

The bucket operation • ELIMINATION: multiply (*) and sum (∑) • bucket(B): { P(E|B,C), P(D|A,B), P(B|A) } • λB(A,C,D,E) = ∑B P(B|A) * P(D|A,B) * P(E|B,C) • OBSERVED BUCKET: • bucket(B): { P(E|B,C), P(D|A,B), P(B|A), B=1 } • λB(A) = P(B=1|A) • λB(A,D) = P(D|A,B=1) • λB(E,C) = P(E|B=1,C)

Multiplying functions

Summing out a variable
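The two operations behind a bucket (multiplying functions and summing out a variable) can be sketched as follows. The representation is an assumption of this sketch: a factor is a pair `(scope, table)` over binary variables with values 0 and 1, and the example numbers are hypothetical.

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two factors over the union of their scopes."""
    f_vars, f_tab = f
    g_vars, g_tab = g
    out_vars = f_vars + tuple(v for v in g_vars if v not in f_vars)
    out_tab = {}
    for values in product((0, 1), repeat=len(out_vars)):
        row = dict(zip(out_vars, values))
        out_tab[values] = (f_tab[tuple(row[v] for v in f_vars)]
                           * g_tab[tuple(row[v] for v in g_vars)])
    return out_vars, out_tab

def sum_out(f, var):
    """Eliminate `var` from factor f by summing over its values."""
    f_vars, f_tab = f
    i = f_vars.index(var)
    out_vars = f_vars[:i] + f_vars[i + 1:]
    out_tab = {}
    for values, p in f_tab.items():
        key = values[:i] + values[i + 1:]
        out_tab[key] = out_tab.get(key, 0.0) + p
    return out_vars, out_tab

# Hypothetical tables: P(A) and P(B | A), keyed by (a,) and (a, b).
p_a = (("A",), {(0,): 0.6, (1,): 0.4})
p_b_given_a = (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1,
                            (1, 0): 0.2, (1, 1): 0.8})

p_ab = multiply(p_a, p_b_given_a)   # P(A, B)
p_b = sum_out(p_ab, "A")            # P(B) = sum over A of P(A) P(B | A)
print(p_b)
```

The product has one entry per assignment of the combined scope, so the cost of a bucket is exponential in the number of distinct variables it mentions.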

Bucket elimination • Elimination operator: ∑∏ • Bucket B: P(E|B,C), P(D|A,B), P(B|A) • Bucket C: P(C|A), λB(A,D,C,E) • Bucket D: λC(A,D,E) • Bucket E: E=0, λD(A,E) • Bucket A: P(A), λE(A) • Result: P(A, E=0) • w* = 4: “induced width” (max clique size)
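The bucket schedule for this network (process B, C, D, E, then A, with evidence E=0) can be sketched end to end. This is a simplified illustration under stated assumptions: variables are binary, all CPT numbers are hypothetical, and the helper names (`cpt`, `multiply`, `sum_out`, `restrict`, `bucket_elimination`) are invented for this sketch.

```python
from itertools import product

def cpt(child, parents, p1):
    """Factor for P(child | parents), built from P(child=1 | parents)."""
    table = {}
    for pa in product((0, 1), repeat=len(parents)):
        table[pa + (1,)] = p1[pa]
        table[pa + (0,)] = 1.0 - p1[pa]
    return tuple(parents) + (child,), table

def multiply(f, g):
    """Pointwise product over the union of the two scopes."""
    out_vars = f[0] + tuple(v for v in g[0] if v not in f[0])
    out_tab = {}
    for values in product((0, 1), repeat=len(out_vars)):
        row = dict(zip(out_vars, values))
        out_tab[values] = (f[1][tuple(row[v] for v in f[0])]
                           * g[1][tuple(row[v] for v in g[0])])
    return out_vars, out_tab

def sum_out(f, var):
    """Sum a factor over all values of `var`."""
    i = f[0].index(var)
    out_tab = {}
    for k, p in f[1].items():
        key = k[:i] + k[i + 1:]
        out_tab[key] = out_tab.get(key, 0.0) + p
    return f[0][:i] + f[0][i + 1:], out_tab

def restrict(f, var, value):
    """Fix an observed variable to its value and drop it from the scope."""
    i = f[0].index(var)
    return (f[0][:i] + f[0][i + 1:],
            {k[:i] + k[i + 1:]: p for k, p in f[1].items() if k[i] == value})

def bucket_elimination(factors, order, evidence):
    """Process the buckets of order[:-1]; return a factor over order[-1]."""
    buckets = {v: [] for v in order}

    def place(f, start):
        # A factor goes into the bucket of the first remaining variable in its scope.
        for v in order[start:]:
            if v in f[0]:
                buckets[v].append(f)
                return
        buckets[order[-1]].append(f)   # constant factor: keep for the final product

    for f in factors:
        place(f, 0)
    for i, v in enumerate(order[:-1]):
        if not buckets[v]:
            continue
        combined = buckets[v][0]
        for f in buckets[v][1:]:
            combined = multiply(combined, f)
        if v in evidence:
            message = restrict(combined, v, evidence[v])   # observed bucket
        else:
            message = sum_out(combined, v)                 # multiply, then sum
        place(message, i + 1)
    result = buckets[order[-1]][0]
    for f in buckets[order[-1]][1:]:
        result = multiply(result, f)
    return result

# Hypothetical CPTs for P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C).
factors = [
    cpt("A", (), {(): 0.6}),
    cpt("B", ("A",), {(0,): 0.2, (1,): 0.7}),
    cpt("C", ("A",), {(0,): 0.5, (1,): 0.1}),
    cpt("D", ("A", "B"), {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}),
    cpt("E", ("B", "C"), {(0, 0): 0.3, (0, 1): 0.8, (1, 0): 0.5, (1, 1): 0.95}),
]
scope, table = bucket_elimination(factors, ["B", "C", "D", "E", "A"], {"E": 0})
print(scope, table)   # a function of A holding P(A, E=0)
```

Dividing the two entries of the result by their sum would give the posterior P(A | E=0). The largest intermediate table is the one built in bucket B, which mentions all five variables.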

Induced graph • Factors: P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C) • Induced width of the ordering: w*(d) = max width of the nodes [Figure: the graph over A, B, C, D, E and its induced graph along the ordering]

Complexity of elimination • w*(d): induced width of the moral graph along ordering d • Example on the “moral” graph over A, B, C, D, E: w*(d1) = 4, w*(d2) = 2 [Figure: the moral graph and the two orderings d1 and d2]

Finding small induced-width orderings • NP-complete • A tree has induced width of ? • Greedy algorithms: • Min-width • Min induced-width • Max-cardinality • Min-fill (generally considered the best) • Anytime min-width (via Branch-and-Bound)
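A sketch of how the induced width of an elimination order can be computed, together with a greedy min-fill heuristic. Assumptions made here: the moral graph is that of the five-variable network P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C), orders are written in the sequence in which variables are eliminated, the second order is chosen for illustration, and ties in min-fill are broken alphabetically.

```python
from itertools import combinations

def build_neighbors(edges):
    """Adjacency sets of an undirected graph given as a list of edges."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs

def eliminate(nbrs, v):
    """Remove v, connect its remaining neighbors, and return how many it had."""
    ns = nbrs.pop(v, set())
    for a, b in combinations(ns, 2):
        nbrs[a].add(b)
        nbrs[b].add(a)
    for n in ns:
        nbrs[n].discard(v)
    return len(ns)

def induced_width(edges, elimination_order):
    """Largest neighbor set met while eliminating in the given order."""
    nbrs = build_neighbors(edges)
    return max(eliminate(nbrs, v) for v in elimination_order)

def min_fill_order(edges):
    """Greedy min-fill: always eliminate the node that adds the fewest edges."""
    nbrs = build_neighbors(edges)
    order = []
    while nbrs:
        def fill(v):
            return sum(1 for a, b in combinations(nbrs[v], 2)
                       if b not in nbrs[a])
        v = min(sorted(nbrs), key=fill)
        order.append(v)
        eliminate(nbrs, v)
    return order

# Moral graph: parent-child edges plus the B-C edge "marrying" E's parents.
moral_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"),
               ("B", "E"), ("C", "E"), ("B", "C")]

print(induced_width(moral_edges, ["B", "C", "D", "E", "A"]))
print(induced_width(moral_edges, ["E", "D", "C", "B", "A"]))
order = min_fill_order(moral_edges)
print(order, induced_width(moral_edges, order))
```

Eliminating B first touches all four other variables at once, whereas starting from the leaves E and D never involves more than two neighbors, which is the difference between the widths 4 and 2.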

MPE: Most Probable Explanation • Network variables: Smoking, Lung Cancer, Bronchitis, Dyspnoea, X-ray

Applications • Probabilistic decoding • A stream of bits is transmitted across a noisy channel and the problem is to recover the transmitted stream given the observed output and parity check bits • u0, …, u4: transmitted bits • x0, …, x4: parity check bits • y0u, …, y4u: received bits (observed) • y0x, …, y4x: received parity check bits (observed) [Figure: coding network over these variables]

Applications • Medical diagnosis • Given some observed symptoms, determine the most likely subset of diseases that may explain the symptoms [Figure: network relating Disease1, …, Disease7 to Symptom1, …, Symptom6]

Applications • Genetic linkage analysis • Given the genotype information of a pedigree, infer the maximum likelihood haplotype configuration (maternal and paternal) of the unobserved individuals (Fishelson & Geiger, 2002) [Figure: pedigree network for individuals 1, 2 and 3 at Locus 1 and Locus 2, with variables of the form Lijm, Lijf, Xij, Sijm, Sijf; some individuals are genotyped]
