# Description of our work

#### What we mean by ``causal inference''

The detection of statistical dependences is a core problem of statistics and machine learning. It plays a crucial role in association studies, and it generalizes the prediction tasks commonly studied by machine learning. In recent years, machine learning methods have enabled us to perform rather accurate prediction, often based on large training sets, for complex nonlinear problems that not long ago would have appeared completely random.
However, in many situations we would actually prefer a *causal* model to a purely predictive one; i.e., a model that might tell us that a specific variable (say, whether or not a person smokes) is not just statistically associated with a disease, but causally responsible for it. The most convincing way to argue that such a causal link exists is to understand the mechanism by which it acts. A slightly weaker, albeit still generally accepted, method is to perform a controlled intervention study.
The weakest methods at present are those that rely solely on observational data. They are the weakest in terms of the reliability of the statements they lead to; however, they are the strongest in terms of broad applicability, covering also problems where mechanistic modeling or interventions are impossible.
The work of the group ``causal inference'' at our Lab mainly focuses on discovering causal relations from observational data alone, even though it is still controversial to what extent this is possible. Some would say it is not possible at all (without assumptions, this is certainly true); some would say it is generically possible only up to ``Markov equivalence classes'', meaning that one can only distinguish between causal graphs that impose different conditional statistical independences.
#### Existing approaches

One of the first researchers to postulate links between causality and statistical dependence was Reichenbach, who stated already in the 1950s that a statistical dependence between the occurrences of two events A and B indicates that at least one of three alternatives is true: A causes B, B causes A, or there is a third event C causing both A and B. Moreover, he postulated that in the third case, A and B are conditionally statistically independent given C. In this way, causal hypotheses lead to statistically testable implications. This was generalized to causal hypotheses among n variables by Pearl and by Spirtes, Glymour, and Scheines, who postulated the causal Markov condition: every variable is conditionally independent of its non-effects, given its direct causes.
Causal inference based on this framework suffers from the following weaknesses: (1) non-parametric statistical independence tests are challenging, in particular conditional ones; (2) many causal hypotheses impose the same set of conditional independences and are thus indistinguishable by this approach (``Markov equivalence''); (3) the framework relies on the assumption that the observations are generated by i.i.d. sampling from a fixed joint distribution. Causal inference in real life, by contrast, often relies on data with time-series structure, or even on non-statistical data (when causal relations between single objects need to be inferred, for instance, from their similarities).
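Both Reichenbach's screening-off condition and the Markov-equivalence problem (weakness 2) can be illustrated with a small linear-Gaussian simulation. The sketch below is ours (coefficients, names, and sample size are arbitrary), and it uses partial correlation as a crude stand-in for a general conditional independence test:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def partial_corr(a, b, given):
    """Correlation of a and b after linearly regressing out `given`
    (a rough stand-in for a conditional independence test)."""
    ra = a - np.polyval(np.polyfit(given, a, 1), given)
    rb = b - np.polyval(np.polyfit(given, b, 1), given)
    return np.corrcoef(ra, rb)[0, 1]

# Common cause (Reichenbach's third case): C -> A and C -> B.
c = rng.normal(size=n)
a = 0.8 * c + rng.normal(size=n)
b = 0.8 * c + rng.normal(size=n)
marginal = np.corrcoef(a, b)[0, 1]   # clearly nonzero
screened = partial_corr(a, b, c)     # vanishes given C

# A chain A -> C -> B imposes exactly the same conditional
# independence (A _||_ B | C): the two graphs are Markov equivalent,
# so no independence test can tell them apart.
a2 = rng.normal(size=n)
c2 = 0.8 * a2 + rng.normal(size=n)
b2 = 0.8 * c2 + rng.normal(size=n)
chain = partial_corr(a2, b2, c2)

print(marginal, screened, chain)
```

The marginal correlation under the common cause is large, while both conditional quantities vanish up to sampling noise, which is precisely why the fork and the chain cannot be distinguished by this approach.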
#### Our contributions

##### Improving and generalizing the independence-based approach

Overcoming the above-mentioned limitations of conventional independence-based causal inference not only requires the development of appropriate methods in machine learning and statistics (for instance, by further developing kernel methods for conditional independence testing, which we are doing); it also raises the question of postulating causal inference principles whose applicability goes beyond existing ones. To this end, we have explored to what extent causal inference can also be based on non-statistical notions of conditional independence between individual objects (for instance, concluding plagiarism from observed similarities in product design is not based on any *statistical* dependence). We postulate causal Markov conditions that connect causality not to statistical dependence but to other forms of dependence, measured in terms of non-standard information measures.
We have axiomatically defined conditions for suitable information measures
and have shown that conditional dependences with respect
to such measures can be represented by directed acyclic graphs (DAGs) in
the same way as Bayesian networks represent statistical conditional
dependences.
To what extent one should then give a causal semantics to such DAGs is a profound question. We have shown that the causal Markov condition with respect to general notions of independence is justified if one restricts attention to a class of causal mechanisms that is adapted to the respective information measure. This result can be seen as a generalization of a theorem that derives the causal Markov condition from non-linear structural equations.
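The plagiarism example can be made quantitative without any statistics: compression-based similarity (the normalized compression distance of Cilibrasi and Vitányi) uses compressed length as a computable proxy for algorithmic information shared by two individual objects. The sketch below is only illustrative; the strings and thresholds are our own:

```python
import os
import zlib

def clen(s: bytes) -> int:
    """Compressed length: a crude, computable proxy for Kolmogorov complexity."""
    return len(zlib.compress(s, 9))

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance:
    near 0 for near-copies, near 1 for unrelated inputs."""
    ca, cb, cab = clen(a), clen(b), clen(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

original = (
    b"Causal discovery from purely observational data is a delicate business. "
    b"Statistical dependence alone does not reveal the direction of an arrow, "
    b"and hidden common causes can mimic direct influence between variables. "
    b"Nevertheless, assumptions about the mechanisms can make one direction stand out."
)
# A superficially reworded copy, and an unrelated (incompressible) string.
plagiarized = original.replace(b"delicate", b"tricky").replace(b"arrow", b"edge")
unrelated = os.urandom(len(original))

print(ncd(original, plagiarized))  # small: the strings share most of their structure
print(ncd(original, unrelated))    # close to 1: nothing in common
```

The reworded copy compresses well together with the original, so its distance is small; the random string shares no structure with it, so the distance is close to one.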
##### Foundations of causal inference principles from algorithmic information theory

An interesting example of such a measure is algorithmic information
(``Kolmogorov complexity''). The corresponding causal Markov condition
can be derived from an ``algorithmic model'' where every causal
mechanism is simulated by
a program computing every observation from its direct causes.
Postulating that this is always the case in nature is in fact only a slight reformulation of the Church-Turing principle, since *uncomputable* mechanisms would provide models of computation that cannot be simulated by a Turing machine.
Believing in the generality of the algorithmic model, we have used the
*Algorithmic Markov Condition* as a basis for justifying other
causal inference principles. An important implication is that the causal
hypothesis X --> Y is only acceptable if the shortest description of
P(X,Y) is given by separate descriptions of P(X) and P(Y|X).
This kind of algorithmic independence between P(X) and P(Y|X)
cannot be tested empirically, but we have developed several different
causal inference methods that use testable independence criteria between input and mechanism.
Interestingly, there are also methods to infer whether X --> Y or Y --> X if X and Y are related by a *deterministic invertible* relation Y=f(X), i.e., X=f^{-1}(Y). This is because, in the first case, peaks of P(X) should not correlate with the determinant of the Jacobian of f, while peaks of P(Y) typically correlate with large values of the Jacobian determinant of f^{-1}.
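A minimal sketch of this idea is a slope-based score in the spirit of information-geometric causal inference: after rescaling both variables to [0, 1], the average log-slope is estimated by finite differences, and the direction with the smaller score is preferred. The function name, rescaling, and finite-difference estimator below are our own illustrative choices:

```python
import numpy as np

def slope_score(x, y):
    """Empirical estimate of E[log |f'(X)|] from sorted samples,
    after rescaling both variables to [0, 1]."""
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    dx, dy = np.diff(xs), np.diff(ys)
    keep = (dx > 0) & (np.abs(dy) > 0)   # guard against zero gaps
    return np.mean(np.log(np.abs(dy[keep] / dx[keep])))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 2000)
y = x ** 3                        # deterministic, invertible on [0, 1]

score_xy = slope_score(x, y)      # estimates E[log f'(X)]
score_yx = slope_score(y, x)      # estimates E[log (f^{-1})'(Y)]
# Infer the direction with the smaller score.
direction = "X->Y" if score_xy < score_yx else "Y->X"
print(score_xy, score_yx, direction)
```

With X uniform and f non-linear, the forward score is negative (by Jensen's inequality) and the backward score is its mirror image, so the true direction wins.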
##### Semi-parametric approaches

Another natural approach to inferring whether X --> Y or Y --> X is to prefer the direction that admits the simpler model, given a suitable formalization of simplicity. For some of our inference methods we have shown how to justify this via the algorithmic independence of P(cause) and P(effect|cause): if P(X,Y) admits a simple model P(X)P(Y|X), the causal direction cannot be Y --> X, because that would require non-generic adjustments between P(Y) and P(X|Y) when expressing P(X,Y) as P(Y)P(X|Y).
An interesting example where this argument holds is the class of ``additive noise models'': if the effect is a function of the cause up to an additive error term that is statistically independent of the cause, then there typically exists no additive noise model for the backward direction, mapping the effect to the cause. While we do not claim that every causal mechanism in nature is of this simple form, it is unlikely that such a model would hold in the *wrong* direction if the correct direction required a more complex model. Apart from their encouraging performance on real-world data, additive noise models also suggest an interesting shift of paradigm in causal modeling: the criterion for whether a causal model is good is not whether the error is *small*. Instead, the question is whether the error term is *independent* of the hypothetical cause. Exploring the relevance of this paradigm in a more general context is a challenge for further research.
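A rough sketch of the resulting procedure follows, with polynomial regression standing in for a nonparametric regressor and a simple biased HSIC statistic (Gaussian kernels, median-heuristic bandwidth) standing in for a proper independence test; all names and parameter choices are ours:

```python
import numpy as np

def hsic(a, b):
    """Biased HSIC statistic with Gaussian kernels (median heuristic):
    a simple kernel measure of statistical dependence between a and b."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        bw = np.median(d2[d2 > 0])          # median-heuristic bandwidth
        return np.exp(-d2 / bw)
    K, L = gram(a), gram(b)
    n = len(a)
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    return np.trace(H @ K @ H @ L) / n ** 2

def residual_dependence(cause, effect, deg=5):
    """Regress effect on cause, return HSIC(cause, residuals)."""
    fit = np.polyval(np.polyfit(cause, effect, deg), cause)
    return hsic(cause, effect - fit)

rng = np.random.default_rng(1)
n = 400
x = rng.uniform(-1, 1, n)
y = x + x ** 3 + 0.1 * rng.normal(size=n)   # additive noise model X -> Y

forward = residual_dependence(x, y)    # residuals ~ independent of x
backward = residual_dependence(y, x)   # no additive noise model this way
direction = "X->Y" if forward < backward else "Y->X"
print(forward, backward, direction)
```

Note that both regressions may leave *small* residuals; the asymmetry shows up only in whether the residuals are *independent* of the input, which is exactly the shift of paradigm described above.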
##### Bayesian approaches

In a first step towards more general classes of causal models, we have developed a non-parametric Bayesian method.
It is based on a non-linear structural equation Y=f(X,E), where Y is the effect, X is the cause, and E is an unobserved noise term. While this model does not restrict the set of possible joint distributions, it can assign different posterior probabilities to the two directions if one chooses appropriate priors for f and for the distributions of E and X.
Using a Gaussian process prior, the method
tends to prefer the causal direction admitting a decomposition into
``smoother'' terms (measured using a suitable covariance kernel).
##### Evaluating causal inference in the simplest case

To test the performance of all these causal inference methods, we have
built a database containing real-world variable pairs (``cause-effect
pairs'') where one is known to influence the other (even though
additional ``confounding factors'', i.e., latent common causes, cannot be excluded).
##### Unobserved common causes

However, in real-world scenarios there are usually more variables involved, some of which may be altogether unobserved. We then face the problem of detecting confounders, i.e., inferring whether observed dependences between X and Y
are due to a direct causal influence between X and Y, or due to an unobserved common cause. Ideally, one would even like to identify the confounder, say up to an invertible reparameterization.
Our first approach to this difficult task is also based on additive noise models, which makes the problem solvable at least in the low-noise limit. Mathematically, the problem can then be formulated as blind deconvolution with appropriate prior knowledge. For this task, too, more general model classes based on Gaussian processes can be utilized, linking it to GP latent variable models.
We expect that causal inference will have practical implications for scientific inference problems where interventions are not feasible. It touches statistics, econometrics, and philosophy, and it constitutes one of the most exciting fields for conceptual basic research in machine learning today.