A Tutorial on Inference and Learning in Bayesian Networks

##### Presentation Transcript

1. A Tutorial on Inference and Learning in Bayesian Networks • Irina Rish, Moninder Singh • IBM T.J. Watson Research Center • rish,moninder@us.ibm.com

2. “Road map” • Introduction: Bayesian networks • What are BNs: representation, types, etc • Why use BNs: Applications (classes) of BNs • Information sources, software, etc • Probabilistic inference • Exact inference • Approximate inference • Learning Bayesian Networks • Learning parameters • Learning graph structure • Summary

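3. Bayesian Networks • Conditional independencies give an efficient representation • Nodes and their local distributions: Visit to Asia P(A), Smoking P(S), Tuberculosis P(T|A), Lung Cancer P(L|S), Bronchitis P(B|S), Chest X-ray P(C|T,L), Dyspnoea P(D|T,L,B) • P(A, S, T, L, B, C, D) = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B) [Lauritzen & Spiegelhalter, 95] • CPD for Dyspnoea, P(D|T,L,B) (first rows shown):

   | T | L | B | D=0 | D=1 |
   |---|---|---|-----|-----|
   | 0 | 0 | 0 | 0.1 | 0.9 |
   | 0 | 0 | 1 | 0.7 | 0.3 |
   | 0 | 1 | 0 | 0.8 | 0.2 |
   | 0 | 1 | 1 | 0.9 | 0.1 |
   | ... | | | | |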

4. Bayesian Networks • Structured, graphical representation of probabilistic relationships between several random variables • Explicit representation of conditional independencies • Missing arcs encode conditional independence • Efficient representation of joint pdf • Allows arbitrary queries to be answered P (lung cancer=yes | smoking=no, dyspnoea=yes ) = ?

5. Example: Printer Troubleshooting (Microsoft Windows 95) [Heckerman, 95] • Network nodes: Application Output OK; Print Spooling On; Spool Process OK; Spooled Data OK; GDI Data Input OK; Local Disk Space Adequate; Uncorrupted Driver; Correct Driver; GDI Data Output OK; Correct Driver Settings; Correct Printer Selected; Print Data OK; Network Up; Net/Local Printing; Correct Local Port; Correct Printer Path; Net Path OK; PC to Printer Transport OK; Local Path OK; Local Cable Connected; Net Cable Connected; Paper Loaded; Printer On and Online; Printer Data OK; Printer Memory Adequate; Print Output OK

6. Example: Microsoft Pregnancy and Child Care [Heckerman, 95]

7. Example: Microsoft Pregnancy and Child Care [Heckerman, 95]

8. Independence Assumptions • Tail-to-tail: Lung Cancer ← Smoking → Bronchitis • Head-to-tail: Visit to Asia → Tuberculosis → Chest X-ray • Head-to-head: Lung Cancer → Dyspnoea ← Bronchitis

9. Independence Assumptions • Nodes X and Y are d-connected by nodes in Z along a trail from X to Y if • every head-to-head node along the trail is in Z or has a descendant in Z • every other node along the trail is not in Z • Nodes X and Y are d-separated by nodes in Z if they are not d-connected by Z along any trail from X to Y • If nodes X and Y are d-separated by Z, then X and Y are conditionally independent given Z
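
The d-separation test can be sketched in a few lines. This version does not walk trails directly; it swaps in the equivalent check on the moralized ancestral graph (keep only X, Y, Z and their ancestors, marry co-parents, drop directions, delete Z, and test reachability). The parent sets are read off the factorization of the Asia network given earlier; the function names are hypothetical.

```python
from itertools import combinations

# Parent sets of the Asia network: A=Visit to Asia, S=Smoking, T=Tuberculosis,
# L=Lung Cancer, B=Bronchitis, C=Chest X-ray, D=Dyspnoea.
PARENTS = {
    "A": [], "S": [],
    "T": ["A"], "L": ["S"], "B": ["S"],
    "C": ["T", "L"], "D": ["T", "L", "B"],
}

def ancestors(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen

def d_separated(x, y, z, parents=PARENTS):
    """True if x and y are d-separated by the set z (x, y not in z)."""
    keep = ancestors({x, y} | set(z), parents)
    # Moralize the ancestral subgraph: link each node to its parents
    # and "marry" every pair of parents that share a child.
    adj = {n: set() for n in keep}
    for child in keep:
        ps = parents[child]
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Delete the conditioning nodes and test whether y is reachable from x.
    blocked = set(z)
    seen, stack = set(), [x]
    while stack:
        n = stack.pop()
        if n == y:
            return False
        if n in seen or n in blocked:
            continue
        seen.add(n)
        stack.extend(adj[n] - blocked)
    return True

print(d_separated("A", "S", []))     # True: every trail passes a head-to-head node
print(d_separated("A", "S", ["D"]))  # False: observing Dyspnoea opens T -> D <- L
print(d_separated("L", "B", ["S"]))  # True: Smoking blocks the tail-to-tail trail
```

The second call illustrates the head-to-head rule: Visit to Asia and Smoking are marginally independent but become dependent once their common descendant Dyspnoea is observed.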

10. Independence Assumptions • A variable (node) is conditionally independent of its non-descendants given its parents • Network nodes: Visit to Asia, Smoking, Lung Cancer, Bronchitis, Tuberculosis, Chest X-ray, Dyspnoea

11. Independence Assumptions • Cancer is independent of Diet given Exposure to Toxins and Smoking [Breese & Koller, 97] • Network nodes: Age, Gender, Exposure to Toxins, Smoking, Cancer, Diet, Serum Calcium, Lung Tumor

12. Independence Assumptions • What this means is that the joint pdf can be represented as a product of local distributions • Chain rule: P(A,S,T,L,B,C,D) = P(A) · P(S|A) · P(T|A,S) · P(L|A,S,T) · P(B|A,S,T,L) · P(C|A,S,T,L,B) · P(D|A,S,T,L,B,C) • Applying the network's independencies: = P(A) · P(S) · P(T|A) · P(L|S) · P(B|S) · P(C|T,L) · P(D|T,L,B) • Network nodes: Visit to Asia, Smoking, Lung Cancer, Bronchitis, Tuberculosis, Chest X-ray, Dyspnoea

13. Independence Assumptions • Thus, the General Product rule for Bayesian Networks is $P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$, where $Pa(X_i)$ is the set of parents of $X_i$
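
As a small illustration of the product rule, the sketch below evaluates the joint of a three-node fragment S → L, S → B (Smoking, Lung Cancer, Bronchitis) as a product of local tables. All probabilities are made-up placeholders, and the names `parents`, `cpt` and `joint` are hypothetical.

```python
from itertools import product

# Hypothetical fragment of the Asia network: S -> L and S -> B.
parents = {"S": (), "L": ("S",), "B": ("S",)}
# cpt[X][parent values] = P(X=1 | parents); the numbers are invented.
cpt = {
    "S": {(): 0.3},
    "L": {(0,): 0.01, (1,): 0.10},
    "B": {(0,): 0.20, (1,): 0.60},
}

def joint(assignment):
    """P(x1,...,xn) as the product of P(xi | pa(xi))."""
    p = 1.0
    for var, pa in parents.items():
        p1 = cpt[var][tuple(assignment[q] for q in pa)]
        p *= p1 if assignment[var] == 1 else 1.0 - p1
    return p

# The local tables define a proper joint: all 2^3 entries sum to one.
total = sum(joint(dict(zip(parents, vals)))
            for vals in product((0, 1), repeat=len(parents)))
print(joint({"S": 1, "L": 0, "B": 1}))  # 0.3 * 0.9 * 0.6
print(total)
```

The fragment needs 1 + 2 + 2 = 5 numbers instead of the 2^3 - 1 = 7 of an unstructured joint table; the saving grows quickly with the number of variables when parent sets stay small.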

14. The Knowledge Acquisition Task • Variables: • collectively exhaustive, mutually exclusive values • clarity test: value should be knowable in principle • Structure • if data available, can be learned • constructed by hand (using “expert” knowledge) • variable ordering matters: causal knowledge usually simplifies • Probabilities • can be learned from data • second decimal usually does not matter; relative probs • sensitivity analysis

15. The Knowledge Acquisition Task • Variable Order is Important [figure: networks over Fuel, TurnOver, Gauge, Start, Battery built from two different variable orderings] • Causal Knowledge Simplifies Construction [figure: network over Fuel, Battery, Gauge, TurnOver, Start]

16. The Knowledge Acquisition Task • Naive Bayesian Classifiers [Duda&Hart; Langley 92] • Selective Naive Bayesian Classifiers [Langley & Sage 94] • Conditional Trees [Geiger 92; Friedman et al 97]

17. The Knowledge Acquisition Task Selective Bayesian Networks [Singh & Provan, 95;96]

18. What are BNs useful for? • Diagnosis: P(cause|symptom)=? • Prediction: P(symptom|cause)=? • Classification: P(class|data) • Decision-making (given a cost function) • Data mining: induce best model from data • Application areas: Text Classification, Medicine, Bio-informatics, Speech recognition, Computer troubleshooting, Stock market

19. What are BNs useful for? [figure: panels titled Predictive Inference, Diagnostic Reasoning, and Decision Making - Max. Expected Utility; node labels: Cause, Effect, Known Predisposing Factors, Unknown but important, Imperfect Observations, Decision, Value]

20. What are BNs useful for? • Value of Information [figure: troubleshooting loop with labels Salient Observations, Assignment of Belief (Fault 1, Fault 2, Fault 3, ...), Halt? (Yes: Act Now!; No: Next Best Observation (Value of Information)), New Obs.; plot of Expected Utility against Probability of fault “i” for Action 1, Action 2 and Do nothing]

21. Why use BNs? • Explicit management of uncertainty • Modularity implies maintainability • Better, flexible and robust decision making - MEU, VOI • Can be used to answer arbitrary queries - multiple fault problems • Easy to incorporate prior knowledge • Easy to understand

22. Application Examples • Intellipath • commercial version of Pathfinder • lymph-node diseases (60), 100 findings • APRI system developed at AT&T Bell Labs • learns & uses Bayesian networks from data to identify customers liable to default on bill payments • NASA Vista system • predict failures in propulsion systems • considers time criticality & suggests highest utility action • dynamically decide what information to show

23. Application Examples • Answer Wizard in MS Office 95/ MS Project • Bayesian network based free-text help facility • uses naive Bayesian classifiers • Office Assistant in MS Office 97 • Extension of Answer wizard • uses naïve Bayesian networks • help based on past experience (keyboard/mouse use) and task user is doing currently • This is the “smiley face” you get in your MS Office applications

24. Application Examples • Microsoft Pregnancy and Child-Care • Available on MSN in Health section • Frequently occurring children’s symptoms are linked to expert modules that repeatedly ask parents relevant questions • Asks next best question based on provided information • Presents articles that are deemed relevant based on information provided

25. Application Examples • Printer troubleshooting • HP bought 40% stake in HUGIN. Developing printer troubleshooters for HP printers • Microsoft has 70+ online troubleshooters on their web site • use Bayesian networks - multiple faults models, incorporate utilities • Fax machine troubleshooting • Ricoh uses Bayesian network based troubleshooters at call centers • Enabled Ricoh to answer twice the number of calls in half the time

26. Application Examples

27. Application Examples

28. Application Examples

29. Online/print resources on BNs • Conferences & Journals • UAI, ICML, AAAI, AISTATS, KDD • MLJ, DM&KD, JAIR, IEEE KDD, IJAR, IEEE PAMI • Books and Papers • Bayesian Networks without Tears by Eugene Charniak. AI Magazine: Winter 1991. • Probabilistic Reasoning in Intelligent Systems by Judea Pearl. Morgan Kaufmann: 1988. • Probabilistic Reasoning in Expert Systems by Richard Neapolitan. Wiley: 1990. • CACM special issue on Real-world applications of BNs, March 1995

30. Online/Print Resources on BNs • Wealth of online information at www.auai.org Links to • Electronic proceedings for UAI conferences • Other sites with information on BNs and reasoning under uncertainty • Several tutorials and important articles • Research groups & companies working in this area • Other societies, mailing lists and conferences

31. Publicly available s/w for BNs • List of BN software maintained by Russell Almond at bayes.stat.washington.edu/almond/belief.html • several free packages: generally research only • commercial packages: most powerful (& expensive) is HUGIN; others include Netica and Dxpress • we are working on developing a Java based BN toolkit here at Watson - will also work within ABLE

32. “Road map” • Introduction: Bayesian networks • What are BNs: representation, types, etc • Why use BNs: Applications (classes) of BNs • Information sources, software, etc • Probabilistic inference • Exact inference • Approximate inference • Learning Bayesian Networks • Learning parameters • Learning graph structure • Summary

33. Probabilistic Inference Tasks • Belief updating • Finding most probable explanation (MPE) • Finding maximum a posteriori (MAP) hypothesis • Finding maximum-expected-utility (MEU) decision
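
The formulas for these four tasks did not survive in the transcript. One common way to write them, given here as an assumption about notation rather than a reproduction of the slide, uses $X$ for the set of all variables, $e$ for the evidence, $A \subseteq X$ for the hypothesis variables, $D \subseteq X$ for the decision variables, and $U$ for a utility function:

```latex
\begin{align*}
\text{Belief updating:} \quad & \mathrm{BEL}(X_i) = P(X_i = x_i \mid e) \\
\text{MPE:} \quad & x^{*} = \arg\max_{x} P(x, e) \\
\text{MAP:} \quad & a^{*} = \arg\max_{a} \sum_{X \setminus A} P(x, e) \\
\text{MEU:} \quad & d^{*} = \arg\max_{d} \sum_{X \setminus D} P(x, e)\, U(x)
\end{align*}
```

Belief updating only sums, MPE only maximizes, and MAP and MEU mix the two operations.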

34. Belief Updating • P (lung cancer=yes | smoking=no, dyspnoea=yes ) = ? • Network nodes: Smoking, Lung Cancer, Bronchitis, X-ray, Dyspnoea
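
A query like this can be answered, for a network this small, by summing the joint over every unobserved variable. The sketch below does exactly that. It assumes the structure S → L, S → B, L → X, (L, B) → D for the five nodes named on the slide, and every probability in it is a made-up placeholder.

```python
from itertools import product

# Assumed structure; S=Smoking, L=Lung Cancer, B=Bronchitis, X=X-ray, D=Dyspnoea.
parents = {"S": (), "L": ("S",), "B": ("S",), "X": ("L",), "D": ("L", "B")}
# cpt[V][parent values] = P(V=1 | parents); all numbers are invented.
cpt = {
    "S": {(): 0.3},
    "L": {(0,): 0.01, (1,): 0.10},
    "B": {(0,): 0.20, (1,): 0.60},
    "X": {(0,): 0.05, (1,): 0.90},
    "D": {(0, 0): 0.10, (0, 1): 0.70, (1, 0): 0.80, (1, 1): 0.90},
}

def joint(world):
    """Joint probability of a full assignment via the product rule."""
    p = 1.0
    for var, pa in parents.items():
        p1 = cpt[var][tuple(world[q] for q in pa)]
        p *= p1 if world[var] == 1 else 1.0 - p1
    return p

def posterior(query, evidence):
    """P(query=1 | evidence) by brute-force enumeration of hidden variables."""
    hidden = [v for v in parents if v not in evidence]
    weight = {0: 0.0, 1: 0.0}
    for vals in product((0, 1), repeat=len(hidden)):
        world = {**evidence, **dict(zip(hidden, vals))}
        weight[world[query]] += joint(world)
    return weight[1] / (weight[0] + weight[1])

# P(lung cancer=yes | smoking=no, dyspnoea=yes) under the placeholder numbers.
print(posterior("L", {"S": 0, "D": 1}))
```

Enumeration is exponential in the number of hidden variables, which is what motivates elimination-style algorithms.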

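35. Variable Elimination • Belief updating: P(X|evidence)=? • Example: network over A, B, C, D, E (shown with its “moral” graph), query P(a|e=0) • $P(a \mid e=0) \propto P(a, e=0) = \sum_{e=0,d,c,b} P(a)P(b \mid a)P(c \mid a)P(d \mid b,a)P(e \mid b,c) = P(a) \sum_{e=0} \sum_{d} \sum_{c} P(c \mid a) \sum_{b} P(b \mid a)P(d \mid b,a)P(e \mid b,c)$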

36. Bucket elimination: Algorithm elim-bel (Dechter 1996) • Elimination operator: summation over the bucket's variable • bucket B: P(b|a) P(d|b,a) P(e|b,c) • bucket C: P(c|a) • bucket D: (initially empty) • bucket E: e=0 • bucket A: P(a) • Output: P(a|e=0) • W*=4, the “induced width” (max clique size)
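
The bucket scheme can be sketched compactly. The code below is an illustrative implementation, not Dechter's own: factors are (scope, table) pairs over binary variables, CPT entries are random placeholders, the evidence e=0 is encoded as an indicator factor, and buckets are processed in the order B, C, D, E with A as the query. All function names are hypothetical. The result is checked against brute-force enumeration.

```python
import random
from itertools import product

random.seed(0)

def random_cpt(child, pars):
    """Random binary CPT P(child | pars) as a factor (scope, table)."""
    scope = tuple(pars) + (child,)
    table = {}
    for pv in product((0, 1), repeat=len(pars)):
        p1 = random.random()
        table[pv + (0,)] = 1.0 - p1
        table[pv + (1,)] = p1
    return scope, table

def multiply_and_sum_out(bucket, var):
    """Elimination operator: multiply the bucket's functions, sum out var."""
    scope = tuple(sorted({v for s, _ in bucket for v in s} - {var}))
    table = {}
    for vals in product((0, 1), repeat=len(scope)):
        env = dict(zip(scope, vals))
        total = 0.0
        for x in (0, 1):
            env[var] = x
            term = 1.0
            for s, t in bucket:
                term *= t[tuple(env[v] for v in s)]
            total += term
        table[vals] = total
    return scope, table

def elim_bel(factors, order):
    """Process buckets in order; the last variable in order is the query."""
    factors = list(factors)
    for var in order[:-1]:
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        if bucket:
            factors.append(multiply_and_sum_out(bucket, var))
    # Remaining factors mention only the query (or nothing): multiply, normalize.
    unnorm = []
    for x in (0, 1):
        p = 1.0
        for s, t in factors:
            p *= t[(x,) * len(s)]
        unnorm.append(p)
    z = sum(unnorm)
    return [u / z for u in unnorm]

def brute_force(factors, query):
    """Reference answer: enumerate every full assignment."""
    names = sorted({v for s, _ in factors for v in s})
    unnorm = [0.0, 0.0]
    for vals in product((0, 1), repeat=len(names)):
        env = dict(zip(names, vals))
        p = 1.0
        for s, t in factors:
            p *= t[tuple(env[v] for v in s)]
        unnorm[env[query]] += p
    z = sum(unnorm)
    return [u / z for u in unnorm]

factors = [
    random_cpt("a", []), random_cpt("b", ["a"]), random_cpt("c", ["a"]),
    random_cpt("d", ["b", "a"]), random_cpt("e", ["b", "c"]),
    (("e",), {(0,): 1.0, (1,): 0.0}),  # evidence e=0 as an indicator factor
]
posterior = elim_bel(factors, ["b", "c", "d", "e", "a"])
reference = brute_force(factors, "a")
print(posterior, reference)
```

With this ordering, processing bucket B creates a function over the four variables a, c, d, e, which is where the width of 4 quoted on the slide comes from.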

37. Finding MPE: Algorithm elim-mpe (Dechter 1996) • Elimination operator: maximization over the bucket's variable • bucket B: P(b|a) P(d|b,a) P(e|b,c) • bucket C: P(c|a) • bucket D: (initially empty) • bucket E: e=0 • bucket A: P(a) • Output: MPE • W*=4, the “induced width” (max clique size)

38. Generating the MPE-tuple • B: P(b|a) P(d|b,a) P(e|b,c) • C: P(c|a) • D: (initially empty) • E: e=0 • A: P(a)
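
A rough sketch of the two passes, under the same assumptions as any toy version (binary variables, random placeholder CPTs, hypothetical function names): the forward pass replaces summation with maximization in each bucket, and the backward pass walks the buckets in reverse, giving each variable the value that maximizes the product of the functions in its bucket given the values already chosen.

```python
import random
from itertools import product

random.seed(1)

def random_cpt(child, pars):
    """Random binary CPT P(child | pars) as a factor (scope, table)."""
    scope = tuple(pars) + (child,)
    table = {}
    for pv in product((0, 1), repeat=len(pars)):
        p1 = random.random()
        table[pv + (0,)] = 1.0 - p1
        table[pv + (1,)] = p1
    return scope, table

def value(factors, env):
    """Product of the given factors evaluated at the assignment env."""
    p = 1.0
    for s, t in factors:
        p *= t[tuple(env[v] for v in s)]
    return p

def elim_mpe(factors, order):
    factors = list(factors)
    buckets = []  # (variable, functions in its bucket), kept for the backward pass
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        buckets.append((var, bucket))
        scope = tuple(sorted({v for s, _ in bucket for v in s} - {var}))
        table = {}
        for vals in product((0, 1), repeat=len(scope)):
            env = dict(zip(scope, vals))
            table[vals] = max(value(bucket, {**env, var: x}) for x in (0, 1))
        factors.append((scope, table))
    mpe_value = value(factors, {})  # only constant factors remain
    # Backward pass: later variables in the order are assigned first.
    assignment = {}
    for var, bucket in reversed(buckets):
        assignment[var] = max(
            (0, 1), key=lambda x: value(bucket, {**assignment, var: x}))
    return mpe_value, assignment

factors = [
    random_cpt("a", []), random_cpt("b", ["a"]), random_cpt("c", ["a"]),
    random_cpt("d", ["b", "a"]), random_cpt("e", ["b", "c"]),
    (("e",), {(0,): 1.0, (1,): 0.0}),  # evidence e=0 as an indicator factor
]
mpe_value, mpe_tuple = elim_mpe(factors, ["b", "c", "d", "e", "a"])
print(mpe_value, mpe_tuple)
```

The returned value is the maximum of P(a,b,c,d,e=0) over all assignments, and `mpe_tuple` is an assignment that attains it.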

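39. Complexity of inference • The effect of the ordering: [figure: “moral” graph over A, B, C, D, E and its induced graphs along two orderings, listed top to bottom as B, C, D, E, A and E, D, C, B, A]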
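
The effect of the ordering can be checked numerically. The sketch below computes the induced width of an elimination order, taken here to mean the largest number of neighbours a variable has at the moment it is eliminated (after earlier eliminations have connected their neighbours). The moral graph edges are derived from the factorization P(a)P(b|a)P(c|a)P(d|b,a)P(e|b,c), and the two orders are read from the figure residue, so treat both as assumptions.

```python
def induced_width(edges, order):
    """Largest number of neighbours a variable has when it is eliminated."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    width = 0
    for var in order:
        nbrs = adj.pop(var)
        width = max(width, len(nbrs))
        for n in nbrs:
            adj[n].discard(var)
            adj[n] |= nbrs - {n}  # connect the remaining neighbours
    return width

# Moral graph: parent-child edges plus B-C, since B and C share the child E.
MORAL_EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"),
               ("B", "E"), ("C", "E"), ("B", "C")]

print(induced_width(MORAL_EDGES, ["B", "C", "D", "E", "A"]))  # 4
print(induced_width(MORAL_EDGES, ["E", "D", "C", "B", "A"]))  # 2
```

Eliminating B first ties together all four of its neighbours, while eliminating E first never creates a function over more than two other variables.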

40. Other tasks and algorithms • MAP and MEU tasks: • Similar bucket-elimination algorithms - elim-map, elim-meu (Dechter 1996) • Elimination operation: either summation or maximization • Restriction on variable ordering: summation must precede maximization (i.e. hypothesis or decision variables are eliminated last) • Other inference algorithms: • Join-tree clustering • Pearl’s poly-tree propagation • Conditioning, etc.

41. Relationship with join-tree clustering • A cluster is a set of buckets (a “super-bucket”) • Clusters shown in the figure: BCE, ADB, ABC

42. Relationship with Pearl’s belief propagation in poly-trees • Figure labels: “Causal support”, “Diagnostic support” • Pearl’s belief propagation for a single-root query corresponds to elim-bel using topological ordering and super-buckets for families • Elim-bel, elim-mpe, and elim-map are linear for poly-trees.

43. “Road map” • Introduction: Bayesian networks • Probabilistic inference • Exact inference • Approximate inference • Learning Bayesian Networks • Learning parameters • Learning graph structure • Summary

44. Inference is NP-hard => approximations • Approximations: • Local inference • Stochastic simulations • Variational approximations • etc. [figure: network over S, C, B, X, D and a simplified network over C, B, X, D]

45. Local Inference Idea

46. Bucket-elimination approximation: “mini-buckets” • Local inference idea: bound the size of recorded dependencies • Computation in a bucket is time and space exponential in the number of variables involved • Therefore, partition the functions in a bucket into “mini-buckets”, each over a smaller number of variables

47. Mini-bucket approximation: MPE task Split a bucket into mini-buckets =>bound complexity
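
Why splitting a bucket yields a bound can be seen on a two-function example. In the sketch below, a bucket for variable b holds two hypothetical non-negative functions f(a,b) and g(b,c) with random placeholder values. Processing the full bucket gives a function over (a, c); maximizing each mini-bucket separately gives two smaller functions whose product can only overestimate it, since the two maximizations are free to pick different values of b.

```python
import random
from itertools import product

random.seed(0)
# Two hypothetical non-negative functions that share the bucket variable b.
f = {(a, b): random.random() for a, b in product((0, 1), repeat=2)}  # f(a, b)
g = {(b, c): random.random() for b, c in product((0, 1), repeat=2)}  # g(b, c)

def exact(a, c):
    """Full bucket: max over b of f(a,b) * g(b,c), a function of (a, c)."""
    return max(f[a, b] * g[b, c] for b in (0, 1))

def mini_bucket(a, c):
    """Two mini-buckets: maximize f and g over b independently."""
    return max(f[a, b] for b in (0, 1)) * max(g[b, c] for b in (0, 1))

for a, c in product((0, 1), repeat=2):
    print((a, c), exact(a, c), mini_bucket(a, c))
```

For the MPE task this gives an upper bound on the exact value; the mini-bucket functions have one argument each instead of two, which is the complexity saving.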

48. Approx-mpe(i) • Input: i – max number of variables allowed in a mini-bucket • Output: [lower bound (P of a sub-optimal solution), upper bound] Example: approx-mpe(3) versus elim-mpe

49. Properties of approx-mpe(i) • Complexity: O(exp(2i)) time and O(exp(i)) space. • Accuracy: determined by upper/lower (U/L) bound. • As i increases, both accuracy and complexity increase. • Possible use of mini-bucket approximations: • As anytime algorithms (Dechter and Rish, 1997) • As heuristics in best-first search (Kask and Dechter, 1999) • Other tasks: similar mini-bucket approximations for: belief updating, MAP and MEU (Dechter and Rish, 1997)

50. Anytime Approximation