# Description of our work

#### What we mean by ``causal inference''

The detection of statistical dependences is a core problem of statistics and machine learning. It plays a crucial role in association studies, and it generalizes the prediction tasks commonly studied by machine learning. In recent years, machine learning methods have enabled us to perform rather accurate prediction, often based on large training sets, for complex nonlinear problems that not long ago would have appeared completely random.
However, in many situations we would actually prefer a *causal* model to a purely predictive one; i.e., a model that might tell us that a specific variable (say, whether or not a person smokes) is not just statistically associated with a disease, but causally responsible for it. The most convincing way to argue that such a causal link exists is to understand the mechanism by which it acts. A slightly weaker, albeit still generally accepted, method is to perform a controlled intervention study.
The weakest methods at present are those that rely solely on observational data. They are the weakest in terms of the reliability of the statements they lead to; however, they are the strongest in terms of broad applicability, covering also problems where mechanistic modeling or interventions are impossible.
The work of the group ``causal inference'' at our Lab mainly focuses on discovering causal relations from observational data alone, even though it is still controversial to what extent this is possible. Some would say it is not possible at all (without assumptions, this is certainly true); some would say it is generically possible only up to ``Markov equivalence classes'', meaning that one can only distinguish between causal graphs that impose different conditional statistical independences.
#### Existing approaches

One of the first researchers to postulate links between causality and statistical dependence was Reichenbach, who stated already in the 1950s that a statistical dependence between the occurrences of two events A and B indicates that at least one of three alternatives is true: A causes B, B causes A, or there is a third event C causing both A and B. Moreover, he postulated that in the third case, A and B are conditionally statistically independent given C. In this way, causal hypotheses lead to statistically testable implications. This was generalized to causal hypotheses among n variables by Pearl and by Spirtes, Glymour, and Scheines, who postulated the causal Markov condition: every variable is conditionally independent of its non-effects, given its direct causes.
Causal inference based on this framework suffers from the following weaknesses: (1) non-parametric statistical independence tests are challenging, in particular conditional ones; (2) many causal hypotheses impose the same set of conditional independences and are thus indistinguishable by this approach (``Markov equivalence''); (3) the framework relies on the assumption that the observations are generated by i.i.d. sampling from a fixed joint distribution. Causal inference in real life, by contrast, often relies on data with time-series structure, or even on non-statistical data (when causal relations between single objects need to be inferred, for instance, from their similarities).
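Both Reichenbach's screening-off condition and the Markov-equivalence problem (weakness 2) can be illustrated with a small linear-Gaussian simulation. The sketch below is ours (coefficients, names, and sample size are arbitrary), and it uses partial correlation as a crude stand-in for a general conditional independence test:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def partial_corr(a, b, given):
    """Correlation of a and b after linearly regressing out `given`
    (a rough stand-in for a conditional independence test)."""
    ra = a - np.polyval(np.polyfit(given, a, 1), given)
    rb = b - np.polyval(np.polyfit(given, b, 1), given)
    return np.corrcoef(ra, rb)[0, 1]

# Common cause (Reichenbach's third case): C -> A and C -> B.
c = rng.normal(size=n)
a = 0.8 * c + rng.normal(size=n)
b = 0.8 * c + rng.normal(size=n)
marginal = np.corrcoef(a, b)[0, 1]   # clearly nonzero
screened = partial_corr(a, b, c)     # vanishes given C

# A chain A -> C -> B imposes exactly the same conditional
# independence (A _||_ B | C): the two graphs are Markov equivalent,
# so no independence test can tell them apart.
a2 = rng.normal(size=n)
c2 = 0.8 * a2 + rng.normal(size=n)
b2 = 0.8 * c2 + rng.normal(size=n)
chain = partial_corr(a2, b2, c2)

print(marginal, screened, chain)
```

The marginal correlation under the common cause is large, while both conditional quantities vanish up to sampling noise, which is precisely why the fork and the chain cannot be distinguished by this approach.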
#### Our contributions

##### Improving and generalizing the independence-based approach

Overcoming the above-mentioned limitations of conventional independence-based causal inference not only requires the development of appropriate methods in machine learning and statistics (for instance, by further developing kernel methods for conditional independence testing, which we are doing); it also raises the question of postulating causal inference principles whose applicability goes beyond existing ones. To this end, we have explored to what extent causal inference can also be based on non-statistical notions of conditional independence between individual objects (for instance, concluding plagiarism from observed similarities in product design is not based on any *statistical* dependence). We postulate causal Markov conditions that connect causality not to statistical dependence but to other forms of dependence, measured in terms of non-standard information measures.
We have axiomatically defined conditions for suitable information measures
and have shown that conditional dependences with respect
to such measures can be represented by directed acyclic graphs (DAGs) in
the same way as Bayesian networks represent statistical conditional
dependences.
To what extent one should then give a causal semantics to such DAGs is a profound question. We have shown that the causal Markov condition with respect to general notions of independence is justified if one restricts attention to a class of causal mechanisms that is adapted to the respective information measure. This result can be seen as a generalization of a theorem that derives the causal Markov condition from non-linear structural equations.
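The plagiarism example can be made quantitative without any statistics: compression-based similarity (the normalized compression distance of Cilibrasi and Vitányi) uses compressed length as a computable proxy for algorithmic information shared by two individual objects. The sketch below is only illustrative; the strings and thresholds are our own:

```python
import os
import zlib

def clen(s: bytes) -> int:
    """Compressed length: a crude, computable proxy for Kolmogorov complexity."""
    return len(zlib.compress(s, 9))

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance:
    near 0 for near-copies, near 1 for unrelated inputs."""
    ca, cb, cab = clen(a), clen(b), clen(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

original = (
    b"Causal discovery from purely observational data is a delicate business. "
    b"Statistical dependence alone does not reveal the direction of an arrow, "
    b"and hidden common causes can mimic direct influence between variables. "
    b"Nevertheless, assumptions about the mechanisms can make one direction stand out."
)
# A superficially reworded copy, and an unrelated (incompressible) string.
plagiarized = original.replace(b"delicate", b"tricky").replace(b"arrow", b"edge")
unrelated = os.urandom(len(original))

print(ncd(original, plagiarized))  # small: the strings share most of their structure
print(ncd(original, unrelated))    # close to 1: nothing in common
```

The reworded copy compresses well together with the original, so its distance is small; the random string shares no structure with it, so the distance is close to one.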
##### Foundations of causal inference principles from algorithmic information theory

An interesting example of such a measure is algorithmic information
(``Kolmogorov complexity''). The corresponding causal Markov condition
can be derived from an ``algorithmic model'' where every causal
mechanism is simulated by
a program computing every observation from its direct causes.
Postulating that this is always the case in nature is in fact only a slight reformulation of the Church-Turing principle, since *uncomputable* mechanisms would provide models of computation that cannot be simulated by a Turing machine.
Believing in the generality of the algorithmic model, we have used the
*Algorithmic Markov Condition* as a basis for justifying other
causal inference principles. An important implication is that the causal
hypothesis X --> Y is only acceptable if the shortest description of
P(X,Y) is given by separate descriptions of P(X) and P(Y|X).
This kind of algorithmic independence between P(X) and P(Y|X)
cannot be tested empirically, but we have developed several different
causal inference methods that use testable independence criteria between input and mechanism.
Interestingly, there are also methods to infer whether X --> Y or Y --> X if X and Y are related by a *deterministic invertible* relation Y=f(X), i.e., X=f^{-1}(Y). This is because, in the first case, peaks of P(X) should not correlate with the determinant of the Jacobian of f, while peaks of P(Y) typically correlate with large values of the Jacobian determinant of f^{-1}.
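A minimal sketch of this idea is a slope-based score in the spirit of information-geometric causal inference: after rescaling both variables to [0, 1], the average log-slope is estimated by finite differences, and the direction with the smaller score is preferred. The function name, rescaling, and finite-difference estimator below are our own illustrative choices:

```python
import numpy as np

def slope_score(x, y):
    """Empirical estimate of E[log |f'(X)|] from sorted samples,
    after rescaling both variables to [0, 1]."""
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    dx, dy = np.diff(xs), np.diff(ys)
    keep = (dx > 0) & (np.abs(dy) > 0)   # guard against zero gaps
    return np.mean(np.log(np.abs(dy[keep] / dx[keep])))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 2000)
y = x ** 3                        # deterministic, invertible on [0, 1]

score_xy = slope_score(x, y)      # estimates E[log f'(X)]
score_yx = slope_score(y, x)      # estimates E[log (f^{-1})'(Y)]
# Infer the direction with the smaller score.
direction = "X->Y" if score_xy < score_yx else "Y->X"
print(score_xy, score_yx, direction)
```

With X uniform and f non-linear, the forward score is negative (by Jensen's inequality) and the backward score is its mirror image, so the true direction wins.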
##### Semi-parametric approaches

Another natural approach to inferring whether X --> Y or Y --> X is to prefer the direction that admits the simpler model, given a suitable formalization of simplicity. For some of our inference methods we have shown how to justify this via the algorithmic independence of P(cause) and P(effect|cause): if P(X,Y) admits a simple model P(X)P(Y|X), the causal direction cannot be Y --> X, because that would require non-generic adjustments between P(Y) and P(X|Y) when expressing P(X,Y) as P(Y)P(X|Y).
An interesting example where this argument holds is the class of ``additive noise models'': if the effect is a function of the cause up to an additive error term that is statistically independent of the cause, then there typically exists no additive noise model for the backward direction, mapping the effect to the cause. While we do not claim that every causal mechanism in nature is of this simple form, it is unlikely that such a model would hold in the *wrong* direction if the correct direction required a more complex model. Apart from their encouraging performance on real-world data, additive noise models also suggest an interesting shift of paradigm in causal modeling: the criterion for whether a causal model is good is not whether the error is *small*. Instead, the question is whether the error term is *independent* of the hypothetical cause. Exploring the relevance of this paradigm in a more general context is a challenge for further research.
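A rough sketch of the resulting procedure follows, with polynomial regression standing in for a nonparametric regressor and a simple biased HSIC statistic (Gaussian kernels, median-heuristic bandwidth) standing in for a proper independence test; all names and parameter choices are ours:

```python
import numpy as np

def hsic(a, b):
    """Biased HSIC statistic with Gaussian kernels (median heuristic):
    a simple kernel measure of statistical dependence between a and b."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        bw = np.median(d2[d2 > 0])          # median-heuristic bandwidth
        return np.exp(-d2 / bw)
    K, L = gram(a), gram(b)
    n = len(a)
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    return np.trace(H @ K @ H @ L) / n ** 2

def residual_dependence(cause, effect, deg=5):
    """Regress effect on cause, return HSIC(cause, residuals)."""
    fit = np.polyval(np.polyfit(cause, effect, deg), cause)
    return hsic(cause, effect - fit)

rng = np.random.default_rng(1)
n = 400
x = rng.uniform(-1, 1, n)
y = x + x ** 3 + 0.1 * rng.normal(size=n)   # additive noise model X -> Y

forward = residual_dependence(x, y)    # residuals ~ independent of x
backward = residual_dependence(y, x)   # no additive noise model this way
direction = "X->Y" if forward < backward else "Y->X"
print(forward, backward, direction)
```

Note that both regressions may leave *small* residuals; the asymmetry shows up only in whether the residuals are *independent* of the input, which is exactly the shift of paradigm described above.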
##### Bayesian approaches

In a first step towards more general classes of causal models, we have developed a non-parametric Bayesian method.
It is based on a non-linear structural equation Y=f(X,E), where Y is the effect, X is the cause, and E is an unobserved noise term. While this model does not restrict the set of possible joint distributions, it can assign different posterior probabilities to the two directions if one chooses appropriate priors for f and for the distributions of E and X.
Using a Gaussian process prior, the method
tends to prefer the causal direction admitting a decomposition into
``smoother'' terms (measured using a suitable covariance kernel).
##### Evaluating causal inference in the simplest case

To test the performance of all these causal inference methods, we have
built a database containing real-world variable pairs (``cause-effect
pairs'') where one is known to influence the other (even though
additional ``confounding factors'', i.e., latent common causes, cannot be excluded).
##### Unobserved common causes

However, in real-world scenarios there are usually more variables involved, some of which may be altogether unobserved. We then face the problem of detecting confounders, i.e., inferring whether observed dependences between X and Y
are due to a direct causal influence between X and Y, or due to an unobserved common cause. Ideally, one would even like to identify the confounder, say up to an invertible reparameterization.
Our first approach to this difficult task is also based on additive noise models, which makes the problem solvable at least in the low-noise limit. Mathematically, the problem can then be formulated as blind deconvolution with appropriate prior knowledge. For this task, too, more general model classes based on Gaussian processes can be utilized, linking it to GP latent variable models.
We expect that causal inference will have practical implications for scientific inference problems where interventions are not feasible. It touches statistics, econometrics, and philosophy, and it constitutes one of the most exciting fields for conceptual basic research in machine learning today.