Shai Ben-David - University of Waterloo
Is Learning Possible without Prior
Dec 8, 2011 10:00 - 10:40
Ingo Steinwart - University of Stuttgart
Statistical Analysis of SVMs
Dec 8, 2011 10:40 - 11:20
Since their invention by Vladimir Vapnik and his co-workers in
the early nineties, SVMs have attracted a lot of research activities from various
communities. While at the beginning this research mostly focused on generalization
bounds, the last decade witnessed a shift towards oracle inequalities and learning
rates. In this talk I will discuss some of the latter developments, in particular in
view of least squares and quantile regression, binary classification, and anomaly detection.
Volodya Vovk - Royal Holloway, University of London
Kernel Ridge Regression
Dec 8, 2011 11:20 - 12:00
Kernel ridge regression (KRR) is a simplified version of support vector regression.
The main formula of KRR is identical to a formula in kriging, a Bayesian method widely used in geostatistics.
But KRR has performance guarantees that have nothing to do with the assumptions needed for kriging.
I will discuss two kinds of such performance guarantees: those not requiring any stochastic assumptions
and those depending only on the iid assumption.
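The "main formula" of KRR referred to above can be sketched in a few lines; the RBF kernel, bandwidth, and ridge parameter below are illustrative choices, not taken from the talk:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def krr_fit_predict(X, y, X_new, lam=0.1, gamma=1.0):
    """Kernel ridge regression: f(x) = k(x)^T (K + lam*I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return rbf_kernel(X_new, X, gamma) @ alpha

# Tiny sanity check: with a very small ridge the fit nearly interpolates.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 4.0])
pred = krr_fit_predict(X, y, X, lam=1e-6)
```

The same linear system appears in kriging, where the solution is read as a Bayesian posterior mean; the abstract's point is that KRR's guarantees do not rest on those Bayesian assumptions.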
Manfred Opper - TU Berlin
Assessing the Quality of Approximate Inference for Bayesian Kernel Machines
Dec 8, 2011 13:30 - 14:10
Models with Gaussian process priors over latent functions can be understood as Bayesian versions of kernel
machines. Unfortunately, except for the case of regression with Gaussian noise, these models do not allow for exact inference.
Efficient approximation techniques such as the expectation propagation (EP) algorithm have been developed to overcome this problem.
Empirical comparisons with extensive Monte Carlo inference on a variety of benchmark data sets for Gaussian process classifiers
have shown that EP can yield excellent approximations. However, such a positive result may not hold in general. In this talk we
show how the error of the EP approximation for Gaussian process models can be expressed analytically in terms of a series expansion.
Low order terms of the expansion can be used to get a practical estimate of the quality of EP.
Joint work with Ulrich Paquet (Microsoft Cambridge) and Ole Winther (TU Kopenhagen).
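The one tractable case mentioned in the abstract, regression with Gaussian noise, admits exact posterior inference in closed form; a minimal sketch (the kernel, length-scale, and noise level are illustrative choices):

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel with length-scale ell."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ell**2)

def gp_posterior(X, y, X_star, noise=0.1, ell=1.0):
    """Exact GP regression posterior under Gaussian noise:
    mean = Ks^T (K + s^2 I)^{-1} y,  cov = Kss - Ks^T (K + s^2 I)^{-1} Ks."""
    K = rbf(X, X, ell) + noise**2 * np.eye(len(X))
    Ks = rbf(X, X_star, ell)
    Kss = rbf(X_star, X_star, ell)
    L = np.linalg.cholesky(K)                      # stable solve via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    cov = Kss - v.T @ v
    return mean, cov

# Symmetric toy data: the posterior mean at the midpoint is zero.
X = np.array([[0.0], [1.0]])
y = np.array([1.0, -1.0])
mean, cov = gp_posterior(X, y, np.array([[0.5]]))
```

For non-Gaussian likelihoods (e.g. classification) no such closed form exists, which is where EP and the error expansion discussed in the talk come in.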
Andreas Christmann - Universität Bayreuth
On Stability Properties of Support Vector Machines
Dec 8, 2011 14:10 - 14:50
Support Vector Machines (SVMs) play an important role in modern statistical learning theory.
The original SVM approach by Boser, Guyon and Vapnik (1992) was derived from the generalized
portrait algorithm invented by Vapnik and Lerner (1963). The books by Vapnik (1982, 1995, 1998)
and later on by Cristianini and Shawe-Taylor (2000) and Scholkopf and Smola (2002) had a large
impact on the development of SVMs and their success in various fields of application. It is well
known that SVMs have nice numerical properties and that they are the solution of a
mathematical problem that is well-posed in Hadamard's sense.
The talk will briefly summarize some further properties of SVMs which show that they also behave well
from the viewpoint of statistical stability with respect to the unknown underlying distribution.
Ulrike von Luxburg - University of Hamburg
Random Walk Distances on
Dec 8, 2011 14:50 - 15:30
László Györfi - Budapest University of Technology and Economics
Prediction of Stationary Time Series
Dec 8, 2011 16:00 - 16:40
We present simple procedures for the prediction of a real-valued time series
with side information. For squared loss, the prediction algorithms are based on
a machine learning combination of several simple predictors. We show that if
the sequence is a realization of a stationary and ergodic random process, then
the average of squared errors converges, almost surely, to that of the optimum,
given by the Bayes predictor. We offer an analogous result for the prediction
of stationary Gaussian processes, and present an open problem. These prediction
strategies have some consequences for 0-1 loss (the pattern recognition problem
for time series).
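The "machine learning combination of several simple predictors" can be illustrated by exponentially weighted averaging under squared loss; this toy sketch (constant experts, illustrative learning rate) is not Györfi's actual construction, only the aggregation idea:

```python
import numpy as np

def ewa_forecast(experts_preds, outcomes, eta=0.5):
    """Exponentially weighted average of expert predictions under squared loss.
    experts_preds[t, i] is expert i's prediction at time t."""
    T, n = experts_preds.shape
    cum_loss = np.zeros(n)
    forecasts = np.empty(T)
    for t in range(T):
        w = np.exp(-eta * cum_loss)
        w /= w.sum()
        forecasts[t] = w @ experts_preds[t]            # aggregated prediction
        cum_loss += (experts_preds[t] - outcomes[t]) ** 2
    return forecasts

# Two constant "experts"; the sequence favors the second one.
rng = np.random.default_rng(0)
y = 1.0 + 0.1 * rng.standard_normal(200)
preds = np.column_stack([np.zeros(200), np.ones(200)])
f = ewa_forecast(preds, y)
```

The weights concentrate on whichever expert has the smallest cumulative loss, so the average squared error of the combination tracks that of the best expert.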
Peter Bühlmann - ETH
Dec 8, 2011 16:40 -
Understanding cause-effect relationships between variables is of
interest in many fields of science. It is
desirable to extract causal information from observational data
obtained by observing a system of interest without subjecting it to
interventions (i.e. without randomized experiments). When no or little
information about (causal) influence diagrams is assumed, the problem in its full
generality is ill-posed. However, we will discuss how sparse graphical
modeling and intervention calculus can be used for quantifying
useful bounds for causal effects, even for the high-dimensional, sparse
case where the number of variables can greatly exceed sample size.
Besides methodology, computation, and theory,
we will illustrate validation of the method with gene intervention
experiments in yeast (Saccharomyces cerevisiae) and in Arabidopsis.
Leon Bottou - Microsoft Research
About the origins of the VC
Dec 8, 2011 17:20 - 18:00
Klaus-Robert Mueller - TU Berlin
15 years of Kernel-based
Dec 8, 2011 18:00 - 19:00
Bernhard Schölkopf - MPI for Intelligent Systems
Inference of Cause and
Dec 9, 2011 10:00 - 10:40
Alexandr Tsybakov - Université Paris 6
Optimal Exponential Bounds for the Accuracy of
Dec 9, 2011 10:40 -
Bob Williamson - NICTA
Theory of Loss Functions
Dec 9, 2011 11:20 -
The decision-theoretic approach to statistics and machine learning is built upon
the idea of a loss function which measures the accuracy of a prediction. In most work (including in
Vapnik’s books) really only three loss functions are considered. But there are many other possibilities.
In the talk I will focus on proper losses for probability estimation, and present some old and new
results that demonstrate the richness of loss functions and the significance of their study.
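A proper loss is one whose expected value is minimized by reporting the true probability. A quick numerical check for two classical proper losses (log loss and the Brier/squared loss) against an improper one (absolute loss); the grid and the value of p are illustrative:

```python
import numpy as np

def expected_loss(loss, q, p):
    """Expected loss of predicting q when the true positive-class prob is p."""
    return p * loss(q, 1) + (1 - p) * loss(q, 0)

log_loss   = lambda q, y: -np.log(q) if y == 1 else -np.log(1 - q)
brier_loss = lambda q, y: (q - y) ** 2
abs_loss   = lambda q, y: abs(q - y)   # improper: minimized at q in {0, 1}

p = 0.3
grid = np.linspace(0.01, 0.99, 99)
argmin = lambda loss: grid[np.argmin([expected_loss(loss, q, p) for q in grid])]
```

For the proper losses the minimizer of the expected loss sits at q = p = 0.3; the absolute loss instead pushes the report to an extreme, which is one concrete way the choice of loss matters.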
Alex Smola - Yahoo! Research
The Mean Trick
Dec 9, 2011 13:30 -
In this talk I will give an overview of the mean trick, that is, the use
of Hilbert Space embeddings for expectation operators. It allows one to unify a large number of techniques ranging
from two-sample tests to visualization, feature extraction, and graphical models.
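The simplest instance of this unification is the kernel two-sample test: each sample is embedded as the mean of its kernel features, and the embeddings are compared via the maximum mean discrepancy (MMD). A minimal sketch (RBF kernel; the bandwidth and sample sizes are illustrative):

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Squared MMD between samples X and Y with an RBF kernel:
    ||mu_X - mu_Y||^2 = mean k(x,x') + mean k(y,y') - 2 mean k(x,y)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Two samples from the same distribution vs. a mean-shifted one.
rng = np.random.default_rng(1)
same  = mmd2(rng.standard_normal((200, 2)), rng.standard_normal((200, 2)))
shift = mmd2(rng.standard_normal((200, 2)), rng.standard_normal((200, 2)) + 2.0)
```

When the distributions match, the estimate is near zero; a distribution shift shows up as a clearly larger value.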
Vladimir Vapnik - NEC Laboratories America, Inc.
Dec 9, 2011 14:10 -
Larry Jackel - North-C Tecnologies, Inc.
Machine Learning Applications at Bell Labs: Before and After the
Arrival of Vladimir Vapnik
Dec 9, 2011 19:00 - 19:40
Olivier Chapelle - Yahoo! Research
Click Modeling for Display
Dec 10, 2011 10:40 -
Naftali Tishby - The Hebrew University
Dec 10, 2011 10:00 -
A fundamental problem of learning theory is finding simple functions
that capture the relevant information in empirical data with respect to a hypothesis class or
parametric distributions. Such functions were termed "minimal sufficient statistics" in
parametric inference and are known to exist, with fixed dimensionality, only for
distribution families of exponential form. A principled information theoretic
generalization of minimum sufficient statistics was proposed by the information bottleneck
method (IB), based on the data processing inequality for mutual information: extract
variables that minimize the mutual information between the sample and the statistics, while
constraining the mutual information between the statistics and the relevant variables
(e.g., the distribution parameters). This optimization problem is in general non-convex, and
its optimal solutions can be obtained only locally, by an alternating projections algorithm.
The IB problem was shown, however, to be efficiently globally solvable for the special case
of multivariate Gaussian variables (GIB). In this case it provides an information theoretic
generalization of Canonical Correlation Analysis (CCA) and established interesting
connections between CCA, channels with side information, and approximate minimal sufficient
statistics with continuous tradeoff between the accuracy and complexity. In this talk I
will describe a recent extension of the GIB, using Vapnik's kernel trick, that makes the IB
practical for any data to which kernels can be applied. This new version of the IB
corresponds to an information theoretic Kernel-CCA, and makes the IB algorithm, and the
systematic calculation of information curves (the optimal tradeoff between complexity
and accuracy of empirical data), completely practical even for very large data sets.
Based on joint work with Nori Jacoby.
Masashi Sugiyama - Tokyo Institute of Technology
Density Ratio Estimation: A New Versatile Tool for Machine Learning
Dec 10, 2011 11:20 - 12:00
In statistical machine learning, avoiding density estimation is
essential since it is often more difficult than solving a target
machine learning problem itself. This is often referred to as
"Vapnik's principle", and the support vector machine is one of the
successful examples of this principle. Following this spirit, we
recently introduced a new machine learning framework based on the
ratio of two probability density functions. This density-ratio
formulation includes various important machine learning tasks such as
non-stationarity adaptation, outlier detection, feature selection,
clustering, and conditional density estimation. By directly estimating the density
ratio without going through density estimation, all of these tasks can be solved
effectively and efficiently in a unified manner.
In this talk, I will give an overview of recent advances in the theory,
algorithms, and applications of density ratio estimation.
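The idea of estimating the ratio p(x)/q(x) directly can be sketched with a least-squares fit of the ratio in a kernel basis. This is a uLSIF-style toy, not Sugiyama's exact algorithm; the centers, bandwidth, and regularization are illustrative choices:

```python
import numpy as np

def ratio_fit(Xp, Xq, centers, sigma=1.0, lam=0.1):
    """Least-squares fit of r(x) = p(x)/q(x) as sum_l a_l * k(x, c_l),
    minimizing 0.5 E_q[r^2] - E_p[r] plus an L2 penalty on a."""
    def k(A):
        sq = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))
    Phi_q, Phi_p = k(Xq), k(Xp)
    H = Phi_q.T @ Phi_q / len(Xq)        # approximates E_q[k k^T]
    h = Phi_p.mean(axis=0)               # approximates E_p[k]
    a = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return lambda X: np.maximum(k(X) @ a, 0.0)   # ratios are nonnegative

# p = N(1, 1), q = N(0, 1): the true ratio grows with x.
rng = np.random.default_rng(2)
Xp = rng.standard_normal((500, 1)) + 1.0
Xq = rng.standard_normal((500, 1))
centers = np.linspace(-3, 4, 20).reshape(-1, 1)
r = ratio_fit(Xp, Xq, centers)
```

No density estimate of p or q is ever formed, which is exactly the Vapnik-principle point the abstract makes.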
Koji Tsuda - National Institute of Advanced Industrial Science and Technology
Fast Graph Search with Succinct Data Structures
Dec 10, 2011 13:30 - 14:10
In the last 10-15 years there has been a great increase of interest in
space-efficient (succinct) data structures that are compressed down to
the information-theoretic lower bound. Compared to naive pointer-based
data structures, the memory usage can be up to 20-30 fold smaller. I will
briefly present the basics of succinct data structures and our recent work
on indexing 25 million chemical graphs for in-memory similarity search.
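A core primitive behind such indexes is rank over a bitvector using only a small amount of extra space. This simplified sketch keeps one level of precomputed block counts (real succinct structures use two levels plus lookup tables to reach constant time and o(n) extra bits):

```python
class RankBitVector:
    """Bitvector with O(n/block) extra counters supporting rank1 queries.
    rank1(i) = number of 1 bits in positions [0, i)."""

    def __init__(self, bits, block=64):
        self.bits = bits
        self.block = block
        # Cumulative popcount at each block boundary.
        self.super = [0]
        for i in range(0, len(bits), block):
            self.super.append(self.super[-1] + sum(bits[i:i + block]))

    def rank1(self, i):
        b, r = divmod(i, self.block)
        # Block prefix from the counters, plus a scan within one block.
        return self.super[b] + sum(self.bits[b * self.block : b * self.block + r])

# 300 bits with 4 ones per 6-bit pattern.
bv = RankBitVector([1, 0, 1, 1, 0, 1] * 50, block=8)
```

Rank (and its inverse, select) is what lets compressed graph and string indexes navigate without pointers.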
Gunnar Raetsch - Friedrich Miescher Laboratory
Dec 10, 2011 14:10 -
Olivier Bousquet - Google
Dec 10, 2011 14:50 -
Andre Elisseeff - Nhumi Technologies
Two Statistical Challenges in Medical Informatics
Dec 10, 2011 16:00 -
Application developers in medical informatics are regularly faced with
uncertainty. Medical data is noisy and requires statistical tricks to
extract relevant information. Data visualization can also be uncertain: it
sometimes relies on statistics to decide what the users want to see. This
presentation will address some of the tricks used in medical software to
handle such noisy situations. We will introduce two applications where
noise, bias and uncertainty are currently handled by simple statistical
tests and where machine learning approaches could also be applied.
Joaquin Quiñonero Candela - Microsoft Research
Click Prediction in Computational
Dec 10, 2011 16:30 -
Mingmin Chi - Fudan University
Chinese Stock Mining via Topic Models
Dec 10, 2011 17:00 - 17:30
Currently, there are more than 2,000 stocks in the Chinese stock market, and usually
four new ones have their IPO each week. Officially, stocks are divided into different sectors under four
systems: China's Securities Regulatory Commission, e.g., forestry, fishery and agricultural industries;
Concepts, e.g., new energy, internet of things, etc.; Regions, e.g., Shanghai, Beijing, Tibet, etc.; and
Industry, e.g., real estate, steel & iron, auto, etc. In addition, a large number of financial and related
political news articles are released every day. Different pieces of news can be connected to the related
sectors or stocks; however, there are usually no explicit words or terms which point to those sectors or
stocks. How to automatically dig out the related stocks from the large volume of news articles is a highly
challenging task for retail and institutional investors. In this work, we propose using topic models to
automatically generate the "topics" (or sectors) which are implicitly related to the associated stocks.
Preliminary results are shown in the experiments, and two new topic models are also given for further
investigation.
Matthew Blaschko - Ecole Centrale Paris
Ranking and Structured Output
Dec 10, 2011 17:30 -