
Case Study: Causal Discovery Methods Using Causal Probabilistic Networks


Presentation Transcript


  1. Case Study: Causal Discovery Methods Using Causal Probabilistic Networks. AMIA 2003, Machine Learning Tutorial. Constantin F. Aliferis & Ioannis Tsamardinos, Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University

  2. Problem Definition • State-of-the-art algorithms for learning Bayesian networks from data do not scale beyond a few hundred variables • To improve on such algorithms, the novel Max-Min Bayesian Network (MMBN) algorithm is evaluated in terms of quality of learning and scalability • MMBN identifies the existence of edges (i.e., direct causal relations); the edges are not oriented in this study

  3. Data • To evaluate such an algorithm with purely computational methods, the real causal structure must be known • The only practical way to guarantee this is simulation • Generated Bayesian networks • Generated data from the joint probability distributions of the networks • MMBN was fed the data and attempted to reconstruct the generating network
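To make the data-generation step concrete, here is a minimal sketch of ancestral (forward) sampling from a Bayesian network. The representation is hypothetical, not the authors' code: `parents[node]` lists a node's parents and `cpt[node][parent_values]` gives the distribution over the node's discrete values.

```python
import random

def topological_order(parents):
    """Order nodes so that every node comes after its parents."""
    order, seen = [], set()
    def visit(n):
        if n not in seen:
            seen.add(n)
            for p in parents[n]:
                visit(p)
            order.append(n)
    for n in parents:
        visit(n)
    return order

def forward_sample(parents, cpt, n_samples, seed=0):
    """Ancestral sampling: draw each node conditional on the values
    already drawn for its parents, in topological order."""
    rng = random.Random(seed)
    order = topological_order(parents)
    data = []
    for _ in range(n_samples):
        sample = {}
        for node in order:
            key = tuple(sample[p] for p in parents[node])
            probs = cpt[node][key]  # distribution over the node's values
            sample[node] = rng.choices(range(len(probs)), weights=probs)[0]
        data.append(sample)
    return data

# Example: X -> Y with binary values
# parents = {"X": [], "Y": ["X"]}
# cpt = {"X": {(): [0.5, 0.5]}, "Y": {(0,): [0.9, 0.1], (1,): [0.2, 0.8]}}
# data = forward_sample(parents, cpt, 1000)
```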

  4. Simulated Bayesian Networks • Needed networks that resemble real processes as closely as possible • Needed a way to vary the size of the networks to test scalability • Identified a network used in a real decision support system (the ALARM network) • Designed a method (tiling) to generate larger networks from the original ALARM that preserve its structural and probabilistic properties

  5. The ALARM network • A model built for medical diagnosis • 8 diagnosis variables • 16 findings • 13 intermediate nodes • 37 variables in total • Discrete variables with values 0-3

  6. Generating Large Real-Like Networks • Problem: generate large random BNs that “look like” networks used in real decision support systems • Idea: • Tile together copies of real networks used in practice • Randomly add a few edges to inter-connect the tiles • Adjust the conditional probability tables introduced by the new edges so that J(N) = J(N′), where J(N) is the original joint distribution of each tile and J(N′) is the marginal joint distribution over that tile's variables in the tiled network (in other words, the tiles retain their original joints) • The last step requires solving a system of quadratic equations
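The structural part of the tiling idea can be sketched as follows. This is a rough illustration using networkx; the function name and parameters are mine, and the CPT adjustment via the quadratic system described above is omitted.

```python
import random
import networkx as nx

def tile_network(original: nx.DiGraph, n_tiles: int,
                 n_cross_edges: int, seed: int = 0) -> nx.DiGraph:
    """Tile copies of a BN structure and sparsely interconnect them.
    Only the graph structure is handled; adjusting the new CPTs so
    each tile keeps its original joint is left out of this sketch."""
    rng = random.Random(seed)
    tiled = nx.DiGraph()
    for t in range(n_tiles):
        # Copy each tile with relabeled nodes, e.g. "LVFailure_3".
        tiled.add_edges_from((f"{u}_{t}", f"{v}_{t}")
                             for u, v in original.edges())
    nodes = list(tiled.nodes())
    added = 0
    while added < n_cross_edges:
        u, v = rng.sample(nodes, 2)
        if u.rsplit("_", 1)[1] == v.rsplit("_", 1)[1]:
            continue  # only connect nodes that lie in different tiles
        if tiled.has_edge(u, v):
            continue
        tiled.add_edge(u, v)
        if nx.is_directed_acyclic_graph(tiled):
            added += 1
        else:
            tiled.remove_edge(u, v)  # keep the overall graph a DAG
    return tiled
```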

  7. Input-Output • (Figure: a table of discrete samples, e.g. rows such as “0 1 3 2 0 1 …”, one row per instance and one column per variable, is the input to MMBN; the output is the reconstructed network)

  8. Measure of Performance • Sensitivity: ratio of true edges returned over the total number of true edges • Specificity: ratio of truly absent edges returned as missing over the total number of truly absent edges • Each one is easy to optimize individually • Sensitivity: return all edges • Specificity: return no edges • Combined measure: Euclidean distance from perfect sensitivity and specificity, d = sqrt((1 − sensitivity)² + (1 − specificity)²) • Why not area under the ROC curve? • (some algorithms had no parameters to vary in a way that produces an ROC curve)
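A small sketch of the combined measure, assuming undirected skeleton edges represented as frozensets of node pairs (the function name and representation are illustrative):

```python
from itertools import combinations
from math import sqrt

def sens_spec_distance(nodes, true_edges, learned_edges):
    """Euclidean distance from the point of perfect sensitivity and
    specificity; edges are undirected, stored as frozensets."""
    all_pairs = {frozenset(p) for p in combinations(nodes, 2)}
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    sensitivity = len(true_edges & learned_edges) / len(true_edges)
    non_edges = all_pairs - true_edges          # truly absent edges
    specificity = len(non_edges - learned_edges) / len(non_edges)
    return sqrt((1 - sensitivity) ** 2 + (1 - specificity) ** 2)
```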

  9. Experiment 1: MMBN vs. Other State-of-the-Art Algorithms on ALARM • 1,000 randomly generated samples from the joint distribution of ALARM were fed to each algorithm • Other algorithms: Sparse Candidate with the Mutual Information and Scoring heuristics (k = 10), TPDA, and PC • All algorithms completed in under a couple of minutes on a 2.4 GHz Pentium Xeon

  10. Experiment 2: • Tiled 270 copies of the original ALARM and randomly interconnected them with the tiling algorithm (approximately 10,000 variables) • Randomly sampled 1,000 training instances from the joint distribution of the network • Observation: specificity improves as the number of variables increases. Why? • False positives (which reduce specificity) accumulate more slowly than true negatives: the number of truly absent edges grows roughly quadratically with the number of variables, so specificity's denominator outpaces the false positives

  11. Experiment 3: Reconstructing the Local Neighborhood • What if the number of variables is extremely large (e.g., millions)? • Modified the algorithm to reconstruct the network in an inside-out fashion (breadth-first search) starting from a target node of interest • The algorithm now returns the network within a radius of d edges from the target T
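A schematic of this inside-out reconstruction, where `neighbors_of` stands in for a local discovery call that estimates a node's parents and children; the name and signature are illustrative, not the authors' API.

```python
from collections import deque

def local_skeleton(target, neighbors_of, radius):
    """Reconstruct the skeleton within `radius` edges of `target` by
    breadth-first expansion."""
    edges, depth = set(), {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if depth[node] == radius:
            continue  # do not expand past the requested radius
        for nb in neighbors_of(node):
            edges.add(frozenset((node, nb)))
            if nb not in depth:
                # A node missed at this level is never expanded, which
                # is why sensitivity drops with radius (see slide 13).
                depth[nb] = depth[node] + 1
                queue.append(nb)
    return edges
```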

  12. Experiment 3: Results • Sampled 1,000 instances from the 10,000-variable tiled ALARM • Reconstructed the neighborhood of radius 1, 2, 3, and 4 around every node in turn as the target • Calculated the average performance metrics

  13. Experiment 3: Discussion • Sensitivity drops quickly with radius: a node missed at the previous level is never expanded, so the nodes beyond it are missed as well • (Figure: textured nodes are the true positives; example shown for depth d = 2 and a node with average sensitivity and specificity)

  14. Discovering the Markov Blanket with Max-Min Markov Blanket • Performance measure: Euclidean distance from perfect sensitivity and specificity in discovering the members of MB(T) • Sensitivity: percentage of true positives (i.e., members of MB(T)) identified as positives • Specificity: percentage of true negatives (i.e., non-members of MB(T)) identified as negatives • Algorithms compared: PC [Spirtes, Scheines, Glymour], Koller-Sahami, Grow-Shrink [Margaritis, Thrun], Incremental Association Markov Blanket [Tsamardinos, Aliferis]
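For scoring, the true MB(T) can be read directly off the generating network in simulation: it consists of T's parents, T's children, and the children's other parents (spouses). A minimal sketch assuming a networkx DiGraph:

```python
import networkx as nx

def markov_blanket(dag: nx.DiGraph, target):
    """MB(T) read off a known DAG: T's parents, T's children, and the
    children's other parents (spouses)."""
    parents = set(dag.predecessors(target))
    children = set(dag.successors(target))
    spouses = {p for c in children for p in dag.predecessors(c)}
    return (parents | children | spouses) - {target}
```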

  15. Datasets • Small networks • ALARM, 37 vars • Hailfinder, 56 vars • Pigs, 441 vars • Insurance, 27 vars • Win95Pts, 76 vars • Large networks (tiled versions) • ALARM-5K (5,000 vars) • Hailfinder-5K • Pigs-5K • All variables act as targets in the small networks, 10 in the large networks

  16. Discovering the Markov Blanket

  17. Discovering the Markov Blanket

  18. Discovering the Markov Blanket • A distance of 0.1 corresponds to sensitivity and specificity of roughly 93% • A distance of 0.2 corresponds to sensitivity and specificity of roughly 86% • Average distance of MMMB with 5,000 samples: 0.1 • Average distance of MMMB with 500 samples: 0.2 • Example: distance = 0.16, ALARM-5K, sample size 5,000 • (Figure: example reconstruction showing nodes 3617 “Shunt”, 3608 “PVSat”, 3607 “SaO2”, 3620 “InsuffAnes”, 3604 “ArtCO2”, 3603 “Catechol”, and 3605 “TPR”, with 406 “LVFailure” marked as a false positive and one node marked as a false negative)

  19. Reconstructing the Full Bayesian Network • Algorithm Max-Min Hill Climbing (MMHC): • Similar to MMBN but also orients the edges • Compared against Sparse Candidate (a similar idea of constraining the search), the most prominent BN learning algorithm that scales to hundreds of variables • Measures of comparison: • BDeu score (log of the probability of the BN given the data) • Number of structural errors: wrong additions, deletions, or reversals of edges
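Schematically, MMHC's second phase is a greedy hill-climbing search over DAGs in which edge additions are restricted to pairs admitted by the skeleton phase. The sketch below assumes a `candidates` map from the skeleton phase and a network `score` function (e.g., BDeu); both interfaces are hypothetical stand-ins, not the authors' implementation.

```python
def is_dag(nodes, edges):
    """Kahn's algorithm: acyclic iff every node can be scheduled."""
    indeg = {n: 0 for n in nodes}
    out = {n: [] for n in nodes}
    for u, v in edges:
        indeg[v] += 1
        out[u].append(v)
    stack = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while stack:
        n = stack.pop()
        seen += 1
        for v in out[n]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return seen == len(nodes)

def hill_climb(nodes, candidates, score, max_iter=1000):
    """Greedy search over DAG edge sets, with additions restricted to
    candidate neighbors found by the skeleton phase (the MMHC idea)."""
    edges = set()
    best = score(edges)
    for _ in range(max_iter):
        moves = []
        for x in nodes:
            for y in candidates[x]:
                if (x, y) in edges:
                    moves.append(edges - {(x, y)})               # delete x->y
                    moves.append((edges - {(x, y)}) | {(y, x)})  # reverse
                elif (y, x) not in edges:
                    moves.append(edges | {(x, y)})               # add x->y
        legal = [(score(m), m) for m in moves if is_dag(nodes, m)]
        if not legal:
            break
        top_score, top_edges = max(legal, key=lambda t: t[0])
        if top_score <= best:
            break  # local optimum: no move improves the score
        best, edges = top_score, top_edges
    return edges
```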

  20. Bayesian Networks

  21. MMHC versus Sparse Candidate • (Figure: results for 500 and 5,000 samples)

  22. MMHC versus Sparse Candidate

  23. MMHC versus Sparse Candidate

  24. MMHC versus Sparse Candidate

  25. MMHC versus Sparse Candidate

  26. Conclusions • “In our view, inferring complete causal models (i.e., causal Bayesian Networks) is essentially impossible in large-scale data mining applications with thousands of variables”, Silverstein, Brin, Motwani, Ullman 2000 • It is possible, using ordinary hardware and a few days of computation time • We can learn the full network, just the skeleton of the full network, the Markov blanket, or the neighborhood around a target variable • Quality of reconstruction is quite satisfactory • Dramatically reduces the number of manipulations experimentalists have to perform • Learning local models is possible, but sensitivity drops significantly with distance from the target
