Graphical Causal Models

  1. Graphical Causal Models Clark Glymour Carnegie Mellon University Florida Institute for Human and Machine Cognition

  2. Outline • Part I: Goals and the Miracle of d-separation • Part II: Statistical/Machine Learning Search and Discovery Methods for Causal Relations • Part III: A Bevy of Causal Analysis Problems

  3. I. Brains, Trains, and Automobiles: Cognitive Neuroscience as Reverse Auto Mechanics Idea: Like autos, like trains, like computers, brains have parts. The parts influence one another to produce a behavior. The parts can have roles in multiple behaviors. Big parts have littler parts.

  4. I. Goals of the Automobile Hypothesis • Overall goals: • Identify the parts critical to behaviors of interest. • Figure out how they influence one another, in what timing sequences. • Imaging goals • Identify relatively BIG parts (ROIs). • Figure out how they influence one another, with what timing sequences, in producing behaviors of interest.

  5. I. Goal: From Data to Mechanisms [Figure: from a multivariate time series (A, B, C, D) to causal relations among neurally localized variables (X, Y, Z, W)]

  6. I. Graphical Causal Models: the Abstract Structure of Influences [Figure: causal structure of a braking system: push brake → fluid level in master cylinder → fluid in wheel cylinder / fluid in caliper → friction of shoe against wheel / friction of pads against rotor → vehicle deceleration] This system is deterministic (we hope).

  7. I. Philosophical Objections • “Cause” is a vague, metaphysical notion. • Response: Compare “probability.” • “Probability” has a mathematical structure. “Causation” does not. • Response: See Spirtes, et al., Causation, Prediction and Search, 1993, 2000; Pearl, Causality, 2000. Listen to Pearl’s lecture this afternoon. • The real causes are at the synaptic level, so talk of ROIs as causes is nonsense. “…for many this rhetoric represents a category error…because causal [sic] is an attribute of the state equation.” (Friston, et al, 2007, 602.) • Response: So, do you think “smoking causes cancer” is nonsense? “Human activities cause global temperature increases” is nonsense? “Turning the ignition key causes the car to start” is nonsense?

  8. I. The Abstract Structure of Influences This system is not deterministic. Linear causal models (SEMs) specify a directed graphical structure: MedFGl(b) := a·CING(b) + e1; STG(b) := b·CING(b) + e2; IPL(b) := c·STG(b) + d·CING(b) + e3; with e1, e2, e3 jointly independent. But so does any functional form of the influences: MedFGl(b) := f(CING(b)) + e1; STG(b) := g(CING(b)) + e2; IPL(b) := h(STG(b), CING(b)) + e3; e1, e2, e3 jointly independent. S. Hanson, et al., 2008. Regions: Middle Occipital Gyrus (MOG), Inferior Parietal Lobule (IPL), Middle Frontal Gyrus (MFG), and Inferior Frontal Gyrus (IFG).
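
A minimal numpy sketch of the linear SEM above; the coefficients and sample size are made up, not estimates from Hanson et al. (2008).

```python
# Minimal simulation of the linear SEM on this slide; coefficients and
# sample size are illustrative, not estimates from Hanson et al. (2008).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
a, b, c, d = 0.8, 0.5, 0.7, 0.3

e1, e2, e3 = rng.normal(size=(3, n))        # jointly independent disturbances
cing = rng.normal(size=n)                   # exogenous "root" variable
medfg = a * cing + e1                       # MedFGl(b) := a*CING(b) + e1
stg   = b * cing + e2                       # STG(b)    := b*CING(b) + e2
ipl   = c * stg + d * cing + e3             # IPL(b)    := c*STG(b) + d*CING(b) + e3

# The graph implies MedFGl(b) _||_ {STG(b), IPL(b)} | CING(b) (next slide):
# the partial correlation of medfg with stg given cing should be ~0.
resid_m = medfg - np.polyfit(cing, medfg, 1)[0] * cing
resid_s = stg - np.polyfit(cing, stg, 1)[0] * cing
print(np.corrcoef(resid_m, resid_s)[0, 1])  # close to 0
```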

  9. I. So What? 1. The directed graph codes the conditional independence relations implied by the model: MedFGl(b) ⊥ {STG(b), IPL(b)} | CING(b). 2. (Almost) all of our tests of models are tests of implications of their conditional independence claims. So what is the code?

  10. I. d-separation Is the Code! For the graph below: X ⊥ {Z, W} | Y; X ⊥ W | Z; NOT X ⊥ W | R; NOT X ⊥ W | S; NOT X ⊥ W | {Y, Z, R}; NOT X ⊥ W | {Y, Z, S}. Conditioning on a variable in a directed path between X and W blocks the association produced by that path. Conditioning on a variable that is a common descendant of X and W creates a path that produces an association between X and W. [Figure: graph over X, Y, Z, W, R, S] J. Pearl, 1988. What about systems with cycles? d-separation characterizes the conditional independence relations in all such linear systems. P. Spirtes, 1996.

  11. I. How To Determine If Variables A and Z Are Independent Conditional on a Set Q of Variables • Consider each sequence p of edge-adjacent variables (each edge in either direction), without self-intersections, terminating in A and Z. • A collider on p is a variable N on p such that the variables M, O adjacent to N on p each have edges directed into N: M -> N <- O. • Sequence (path) p creates a dependency between A and Z conditional on Q if and only if: • No non-collider on p is in Q, and • Every collider on p is in Q or has a descendant in Q (a directed path from the collider to a member of Q).
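
A minimal sketch of this path-based test, using networkx for the graph bookkeeping; the function name and the example graph (the chain-plus-collider graph from the previous slide, without S) are illustrative.

```python
# A direct sketch of the path-based test described above, using networkx
# for graph bookkeeping. Function name and example graph are illustrative.
import networkx as nx

def d_connected(G, a, z, Q):
    """True if some path between a and z is active (unblocked) given the set Q."""
    Q = set(Q)
    for path in nx.all_simple_paths(G.to_undirected(), a, z):
        blocked = False
        for m, n, o in zip(path, path[1:], path[2:]):
            if G.has_edge(m, n) and G.has_edge(o, n):       # collider m -> n <- o
                # A collider blocks unless it, or one of its descendants, is in Q.
                if n not in Q and not (nx.descendants(G, n) & Q):
                    blocked = True
                    break
            elif n in Q:                                     # conditioned non-collider blocks
                blocked = True
                break
        if not blocked:
            return True
    return False

# Chain X -> Y -> Z -> W, plus a collider X -> R <- W.
G = nx.DiGraph([("X", "Y"), ("Y", "Z"), ("Z", "W"), ("X", "R"), ("W", "R")])
print(d_connected(G, "X", "W", {"Y"}))        # False: Y blocks the chain, collider R blocks the other path
print(d_connected(G, "X", "W", {"Y", "R"}))   # True: conditioning on collider R opens X -> R <- W
```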

  12. II. So, What Can We Do With It? • Exploit d-separation in conjunction with distribution assumptions to estimate graphical causal structure from sample data. • Understand when data analysis and measurement methods distort conditional independence relations in target systems. • Wrong conditional independence relations => wrong d-separation relations => wrong causal structure.

  13. II. Simple Illustration (PC) Consequences: X ⊥ Z; {X, Z} ⊥ W | Y. Method: • Spirtes, Glymour, & Scheines (1993). Causation, Prediction, & Search, Springer Lecture Notes in Statistics.
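
The generating graph is not shown here; assuming the collider structure X → Y ← Z with Y → W, which implies exactly these two consequences, the sketch below runs the kind of vanishing-(partial-)correlation tests a PC-style search performs.

```python
# Sketch of the independence tests a PC-style search runs, assuming the
# graph X -> Y <- Z, Y -> W (which implies the consequences on this slide).
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 0.9 * x + 0.9 * z + rng.normal(size=n)
w = 0.8 * y + rng.normal(size=n)

def pcorr(a, b, c):
    """Partial correlation of a and b given c (linear residuals)."""
    ra = a - np.polyfit(c, a, 1)[0] * c
    rb = b - np.polyfit(c, b, 1)[0] * c
    return np.corrcoef(ra, rb)[0, 1]

print(np.corrcoef(x, z)[0, 1])   # ~0 : X _||_ Z
print(pcorr(x, w, y))            # ~0 : X _||_ W | Y
print(pcorr(z, w, y))            # ~0 : Z _||_ W | Y
print(pcorr(x, z, y))            # clearly nonzero: conditioning on the collider Y induces dependence
```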

  14. II. Bayesian Search: Greedy Equivalence Search (GES) From data to the model with the highest posterior probability: start with the empty graph; add or change the edge that most increases fit; iterate. Chickering and Meek, Uncertainty in Artificial Intelligence Proceedings, 2003.
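
GES proper searches over Markov equivalence classes with forward and backward phases; as a rough illustration of the score-based idea only, here is a toy greedy forward search over DAGs with a Gaussian BIC score. All names and the toy data are illustrative.

```python
# A much-simplified sketch of score-based greedy search: forward edge
# additions scored by Gaussian BIC over DAGs. GES proper (Chickering & Meek)
# searches over Markov equivalence classes; this shows only the idea.
import numpy as np

def node_bic(data, v, parents):
    """BIC contribution of node v regressed on its parents (lower is better)."""
    n = data.shape[0]
    X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
    resid = data[:, v] - X @ np.linalg.lstsq(X, data[:, v], rcond=None)[0]
    return n * np.log(float(resid @ resid) / n) + (len(parents) + 2) * np.log(n)

def has_path(parents_of, src, dst):
    """True if a directed path src -> ... -> dst already exists."""
    children = {v: [w for w in parents_of if v in parents_of[w]] for v in parents_of}
    stack, seen = [src], set()
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(children[v])
    return False

def greedy_forward(data):
    d = data.shape[1]
    parents_of = {v: [] for v in range(d)}
    score = {v: node_bic(data, v, []) for v in range(d)}
    while True:
        best = None
        for a in range(d):
            for b in range(d):
                # Skip self-loops, existing edges, and cycle-creating additions.
                if a == b or a in parents_of[b] or has_path(parents_of, b, a):
                    continue
                new = node_bic(data, b, parents_of[b] + [a])
                gain = score[b] - new
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, a, b, new)
        if best is None:
            return parents_of
        _, a, b, new = best
        parents_of[b].append(a)
        score[b] = new

# Toy data from the chain X0 -> X1 -> X2.  Edges inside the same Markov
# equivalence class may come out in either orientation (cf. slide 19).
rng = np.random.default_rng(2)
x0 = rng.normal(size=5000)
x1 = 0.8 * x0 + rng.normal(size=5000)
x2 = 0.8 * x1 + rng.normal(size=5000)
print(greedy_forward(np.column_stack([x0, x1, x2])))
```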

  15. II. With Unknown, Unrecorded Confounders: FCI [Figure: true graph over X, Y, Z, W with an unrecorded variable; from data, FCI recovers a partial graph over X, Y, Z, W] A consistent estimator under i.i.d. sampling (Spirtes, et al., Causation, Prediction and Search), but in other cases it is often uninformative.

  16. II. Overlapping Databases: ION The ION algorithm recovers the full graph from databases that measure overlapping sets of variables, but in other cases it often generates a number of alternative models. Danks, Tillman and Glymour, NIPS, 2008.

  17. II. Time Series (Structural VAR) • Basic idea: PC or GES style search on “relative” time-slices • Additive, non-linear model of climate teleconnections (5 ocean indices; 563-month series) • Chu & Glymour, 2008, Journal of Machine Learning Research
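
A toy numpy sketch of the structural-VAR idea on synthetic data (not the climate indices from Chu & Glymour): regress each series on the previous time slice; a PC- or GES-style search can then be applied to the lagged coefficients or residuals.

```python
# Toy sketch of the structural-VAR idea: regress each series on the
# previous time slice. Data are synthetic, not the 5 ocean indices.
import numpy as np

rng = np.random.default_rng(3)
T = 2000
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.6 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal()   # X drives Y with one lag

# One "relative time slice": regress (x_t, y_t) on (x_{t-1}, y_{t-1}).
lagged = np.column_stack([np.ones(T - 1), x[:-1], y[:-1]])
for name, series in [("x_t", x[1:]), ("y_t", y[1:])]:
    coef, *_ = np.linalg.lstsq(lagged, series, rcond=None)
    print(name, "on [1, x_{t-1}, y_{t-1}]:", np.round(coef, 2))
# Expect x_t to load only on x_{t-1}, and y_t on both x_{t-1} and y_{t-1}.
```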

  18. II. Discovering Latent Variables [Figure: latent variables T1, T2, T3, each with a cluster of measured indicators among M1–M12] Cluster the M’s using a heuristic or Build Pure Clusters (Silva, et al., JMLR, 2006); apply GES. Applicable to time series?

  19. II. Limits of PC and GES • With i.i.d. samples and correct distribution families, PC and GES give correct information almost surely in the large-sample limit (assuming no unrecorded common causes). • Works with “random effects” for linear models. • But does not give all the information we want: often cannot determine the directions of influences! • Can post-process with an exhaustive test of all orientations (a heuristic). • Adjacencies are more reliable than directions of edges. [Figure: several graphs that predict the same independencies; all are d-separation equivalent]

  20. II. Breaking Down d-separation Equivalence: LiNGAM Linear equations (reduced form): X = eX; Y = aX·X + eY; Z = bX·X + bY·Y + eZ. The disturbance terms must be non-Gaussian. Then the structure is discoverable by LiNGAM (ICA + algebra)! Shimizu, et al. (2006), Journal of Machine Learning Research.
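
Why non-Gaussian disturbances break the tie, in a minimal numpy sketch: regressing in the true direction leaves a residual independent of the regressor, while the reverse regression does not. The squared-residual correlation below is only a crude stand-in for a proper independence test; full LiNGAM uses ICA plus a permutation/pruning step (Shimizu et al.).

```python
# With uniform (non-Gaussian) disturbances, the residual from the wrong
# regression direction stays dependent on the regressor, so the direction
# of influence is identifiable. The squared-residual correlation is only a
# crude dependence check, used here for illustration.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x = rng.uniform(-1, 1, n)              # non-Gaussian disturbance
y = 2.0 * x + rng.uniform(-1, 1, n)    # true direction: X -> Y

def resid(dep, reg):
    return dep - np.polyfit(reg, dep, 1)[0] * reg

def sq_dependence(a, b):
    """Correlation of squares: ~0 for independent, nonzero for merely uncorrelated."""
    return np.corrcoef(a**2, b**2)[0, 1]

print(sq_dependence(resid(y, x), x))   # ~0   : residual of Y~X is independent of X
print(sq_dependence(resid(x, y), y))   # clearly nonzero: residual of X~Y still depends on Y
```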

  21. II. Feedback Systems Two methods: • Modified LiNGAM: Lacerda, Spirtes, & Hoyer (2008). Discovering cyclic causal models by independent component analysis. UAI. • Conditional independencies: Richardson & Spirtes (1999). Discovery of linear cyclic models.

  22. II. Missed Opportunities? • None of the machine learning/statistical methods in II. have been used with imaging data. Instead: • Trial and error guessing and data fitting • Regression • Granger Causality for time series. • Exhaustive testing of all linear models. • How come? • Unfamiliarity • The machine learning/statistical methods respect what it is possible to learn (in the large sample limit), which is often less than researchers want to conclude.

  23. III. Simple Possible Errors • Pooling data from different subjects: • If X and Y are independent in population P1 and in population P2, but have different probability distributions in the two populations, then X and Y are usually not independent in P1 ∪ P2 (G. Yule, 1904). • Pooling data from different time points in an fMRI series: • If the series is not stationary, data are being pooled as above. • One can remove trends, but that does not guarantee stationarity.
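
Yule's point is easy to reproduce; the populations and sample sizes below are made up.

```python
# X _||_ Y within each population, but pooling two populations with
# different means induces a dependence. Synthetic data.
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
# Population 1: X, Y independent around (0, 0); population 2: independent around (3, 3).
x = np.concatenate([rng.normal(0, 1, n), rng.normal(3, 1, n)])
y = np.concatenate([rng.normal(0, 1, n), rng.normal(3, 1, n)])
print(np.corrcoef(x[:n], y[:n])[0, 1])   # ~0 within population 1
print(np.corrcoef(x[n:], y[n:])[0, 1])   # ~0 within population 2
print(np.corrcoef(x, y)[0, 1])           # clearly > 0 in the pooled sample
```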

  24. III. Eliminating Opportunities • Removing autocorrelation by regression interferes with discovering feedback between variables. • Data manipulations that tend to make variables Gaussian • Spatial smoothing • Variables defined by principal components or averages over ROIs eliminate or reduce the possibility of taking advantage of LiNGAM algorithms.

  25. III. Simple Limitations • Testing all models (e.g., with LISREL chi-square) is a consistent search method for linear, Gaussian models (folk theorem). • But it is not feasible except for very small numbers of variables; e.g., for 8 variables there are 3^28 = 22,876,792,454,961 directed graphs.
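
The count can be reproduced directly, assuming it refers to directed graphs with at most one edge per unordered pair of variables (absent, one direction, or the other).

```python
# 3 choices (absent, ->, <-) for each of the C(8, 2) = 28 variable pairs.
from math import comb
print(3 ** comb(8, 2))   # 22876792454961
```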

  26. III. Not So Simple Possible Errors: Variables Defined on ROIs as Proxies for Latent Variables [Figure: latent chain X -> Y -> Z, measured with error by proxies A, B, C] X is independent of Z conditional on Y. But unless B is a perfect measure of Y, A is not independent of C conditional on B. So if A, B, and C are taken as “proxies” for X, Y, and Z, a regression of C on A and B will find, correctly, that X has an indirect influence on Z through Y, but also, incorrectly, that X has in addition a direct influence on Z not through Y.
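
A synthetic sketch of this point, assuming a latent chain X → Y → Z with independent measurement error on each proxy; coefficients and noise levels are made up.

```python
# X _||_ Z | Y holds for the latent chain, but the noisy proxies A, B, C
# do not satisfy A _||_ C | B. Synthetic data.
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)
a = x + 0.7 * rng.normal(size=n)     # imperfect measures (proxies)
b = y + 0.7 * rng.normal(size=n)
c = z + 0.7 * rng.normal(size=n)

def pcorr(u, v, w):
    ru = u - np.polyfit(w, u, 1)[0] * w
    rv = v - np.polyfit(w, v, 1)[0] * w
    return np.corrcoef(ru, rv)[0, 1]

print(pcorr(x, z, y))   # ~0 : the latent chain is blocked by Y
print(pcorr(a, c, b))   # nonzero : conditioning on the proxy B does not block the path
```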

  27. III. Not So Obvious Errors: Regression • Lots of forms: linear, polynomial, logistic, etc. • All have the following features: • Prior separation of variables into outcome, Y, and a set S of possible causes, A, B, C, etc. of Y. • Regression estimate of the influence of A on Y is a measure of the association of A and Y conditional on all other variables in S. • Regression for causal effects always attempts to estimate the direct (relative to other variables in S) influence of A on Y.

  28. III. Regression to Estimate Causal Influence • Let V = {X, Y, T}, where: Y is the measured outcome; X = {X1, X2, …, Xn} are the measured regressors; T = {T1, …, Tk} are latent common causes of pairs in X ∪ {Y}. • Let the true causal model over V be a structural equation model in which each V ∈ V is a linear combination of its direct causes plus independent, Gaussian noise.

  29. III. Regression to Estimate Causal Influence Consider the regression equation: Y = β0 + β1X1 + β2X2 + … + βnXn. Let the OLS estimate β̂i be the estimated causal influence of Xi on Y. That is, hypothetically holding X\Xi experimentally constant, β̂i is an estimate of the change in E(Y) that would result from an intervention that changes Xi by 1 unit. Let the real causal influence of Xi on Y be bi. When is the OLS estimate β̂i a consistent estimate of bi?

  30. III. Regression Will Be “Inconsistent” When • 1. There is an unrecorded common cause L of Y and Xi: L -> Xi, L -> Y. • If Xi and Y are the only measured variables, PC, GES and FCI cannot determine whether the influence is from Xi to Y, from an unmeasured common cause, or both. LiNGAM can, if the disturbances are non-Gaussian.
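
A quick synthetic check of case 1, with made-up coefficients: the OLS slope converges to a biased value, not the structural effect.

```python
# An unrecorded common cause L of X and Y makes the OLS coefficient of X
# inconsistent for the causal effect. Synthetic data.
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
L = rng.normal(size=n)                       # unrecorded common cause
x = 0.8 * L + rng.normal(size=n)
y = 0.5 * x + 0.8 * L + rng.normal(size=n)   # true causal effect of X on Y is 0.5

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b[1])   # converges to ~0.89, not the causal effect 0.5
```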

  31. Regression will be “inconsistent” when: • 2. Cause and effect are confused (the influence actually runs from Y to Xi): “…one region, with a long haemodynamic latency, could cause a neuronal response in another that was expressed, haemodynamically, before the source.” (Friston, et al., 2007, 602). LiNGAM does not make this error. • 3. And that error can lead to others: regression then concludes that Xk is a cause of Y. FCI, etc., do not make these errors.

  32. Bad Regression Example 1  0  2  0 X 3  0 X Multiple Regression Result PC, GES, FCI get these kinds of cases right.

  33. Regression Consistency If • Xi is d-separated from Y conditional on X\Xi in the true graph after removing Xi -> Y, and • X contains no descendant of Y, then: β̂i is a consistent estimate of bi.

  34. III. Granger Causality Idea: a stationary time series X is a Granger cause of Y iff {…, Xt-1} together with {…, Yt-1} predicts Yt better than {…, Yt-1} alone. Obvious generalizations: • Non-Gaussian time series. • Multiple time series (essentially the time-series version of multiple regression): X is a Granger cause of Y iff Yt is not independent of …, Xt-1 conditional on the covariates …, Zt-1. Less obvious generalizations: • Non-linear time series (finding conditional independence tests is touchy). C. Granger, Econometrica, 1969.
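
A minimal numpy/scipy sketch of this nested-regression comparison on synthetic lag-1 series; the data, lag length, and F-test form are illustrative, not from the talk.

```python
# Does adding lags of X reduce the prediction error of Y_t beyond Y's own
# lags? A simple F-test on nested lag-1 regressions; synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
T = 3000
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.6 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal()   # X Granger-causes Y

def rss(design, target):
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    r = target - design @ beta
    return float(r @ r)

target = y[1:]
restricted   = np.column_stack([np.ones(T - 1), y[:-1]])          # Y's own past only
unrestricted = np.column_stack([np.ones(T - 1), y[:-1], x[:-1]])  # plus X's past

rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
q = 1                                   # one restriction: the coefficient on x[t-1]
dof = (T - 1) - unrestricted.shape[1]
F = ((rss_r - rss_u) / q) / (rss_u / dof)
print(F, stats.f.sf(F, q, dof))         # large F, tiny p-value: X "Granger-causes" Y
```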

  35. GC All Over the Place • Goebel, R., Roebroeck, A., Kim, D., and Formisano, E. (2003). Investigating directed cortical interactions in time-resolved fMRI data using vector autoregressive modeling and Granger causality mapping. Magnetic Resonance Imaging, 21: 125-161. • Chen, Y., Bressler, S.L., Knuth, K.H., Truccolo, W.A., and Ding, M.Z. (2006). Stochastic modeling of neurobiological time series: power, coherence, Granger causality, and separation of evoked responses from ongoing activity. Chaos, 16, 26-113. • Brovelli, A., Ding, M.Z., Ledberg, A., Chen, Y.H., Nakamura, R., and Bressler, S.L. (2004). Beta oscillations in a large-scale sensorimotor cortical network: directional influences revealed by Granger causality. Proc. Natl. Acad. Sci. U.S.A., 101: 9849–9854. • Deshpande, G., Hu, Stilla, R., and Sathian, K. (2008). Effective connectivity during haptic perception: A study using Granger causality analysis of functional magnetic resonance imaging data. NeuroImage, 40: 1807-1814.

  36. III. Problems with GC • fMRI series with multiple conditions are not stationary (this may not always be serious). • GC can produce causal errors when there is measurement error or an unmeasured confounding series. • Open research problem: find a consistent method to identify unrecorded common causes of time series, akin to Silva, et al., JMLR, 2006 for equilibrium data; Glymour and Spirtes, J. of Econometrics, 1988.

  37. III. If Xt records an event occurring later than Yt+1, X may be mistakenly taken to be a cause of Y. (Friston, 2007, again.) • This is a problem for regression; • Not a problem if PC, FCI, GES or LiNGAM are used in estimating the “Structural VAR” because they do not require a separation of variables into outcome and potential cause, or a time ordering of variables.

  38. III. Granger Causality and Mechanisms • Neural signals occur faster than the fMRI sampling rate; what is going on in between? [Figure: the true process unrolled at an unobserved, faster time scale (X1…X4, Y1…Y4, Z1…Z4, W1…W4); the Granger causes estimated among the observed W, X, Y, Z include spurious edges]

  39. III. Analysis of Residuals • Regress on X1, Y1, Z1, W1 and apply PC, etc., to the residuals. [Figure: the same unrolled process as on the previous slide, with the intermediate time steps unobserved] Swanson and Granger, JASA; Demiralp and Hoover (2003), Oxford Bulletin of Economics and Statistics.

  40. Conclusion • Causal inference from imaging data is about as hard as it gets; • Conventional statistical procedures are radically insufficient tools; • Lots of unused potentially relevant, principled, tools in the Machine Learning literature; • Measurement methods and data transformations can alter the probability distributions in destructive ways; • Graphical causal models are the best available tool for thinking about the statistical constraints that causal hypotheses imply.

  41. Things There Aren’t: Magic Wands Pixie Dust

  42. If You Forget Everything Else in This Talk, Remember This: • P. Spirtes, et al., Causation, Prediction and Search, Springer Lecture Notes in Statistics (1993); 2nd edition, MIT Press, 2000. • J. Pearl, Causality, Oxford, 2000. • Uncertainty in Artificial Intelligence annual conference proceedings. • Journal of Machine Learning Research. • Peter Spirtes’ webpage. • Judea Pearl’s webpage. • The TETRAD webpage.
