Discriminative learning algorithms are typically trained from large collections of vectorial training examples. In many classical learning problems,
however, it is arguably more appropriate to represent training data not as individual data points, but as probability distributions. There are, in fact,
multiple reasons why probability measures may be preferable.
Firstly, uncertain or missing data naturally arise in many applications. For example, gene expression data obtained from microarray experiments are known to be very noisy due to various sources of variability. In order to reduce uncertainty, and to allow for estimates of confidence levels, experiments are often replicated. Unfortunately, replicating microarray experiments is often infeasible due to cost constraints, as well as the limited amount of available mRNA. To cope with experimental uncertainty given a limited amount of data, it is natural to represent each array as a probability distribution designed to approximate the variability of gene expressions across slides.
Probability distributions may be equally appropriate given an abundance of training data. In data-rich disciplines such as neuroinformatics, climate informatics, and astronomy, a high-throughput experiment can easily generate a huge amount of data, leading to significant computational challenges in both time and space. Instead of scaling up one's learning algorithms, one can scale down one's dataset by constructing a smaller collection of distributions, each representing a group of similar samples. Besides computational efficiency, aggregate statistics can potentially incorporate higher-level information that represents the collective behavior of multiple data points. In this research, we are developing kernel-based learning algorithms that are well-suited to probability distributions.
The support measure machine (SMM) generalizes the well-known support vector machine (SVM) to the space of probability distributions. That is, the training samples are not restricted to vectorial data, but may also be groups of vectorial data points, or even whole probability distributions.
The LIBSMM implementation is an extension of LIBSVM. Click the link below to download the current version of LIBSMM.
Note that the beta version of LIBSMM is under development. Please feel free to contact us if you experience any problems with the code.
The current version of LIBSMM works on Linux. It may work on other operating systems as well, but this is not guaranteed. We are working on extending LIBSMM. To install LIBSMM, type make all on the command line. If you have any problems using the code, please feel free to contact firstname.lastname@example.org.
Since LIBSMM is an extension of the LIBSVM implementation, the original copyright of LIBSVM is retained. Please read the COPYRIGHT notice before using LIBSMM.
After installing LIBSMM on your machine, the software should be straightforward to use for those who are familiar with LIBSVM. Nevertheless, there are a few things you need to keep in mind.
There are three ways to provide input to LIBSMM.
Given individual samples, we perform a standard SVM, which can also be thought of as an SMM on the Dirac measures defined over the training samples.
In this case, we follow the standard input format of LIBSVM.
The format of training and testing data file is:
(label) (index1):(value1) (index2):(value2) ...
.
.
.
Each line contains an instance and is ended by a '\n' character. For classification, (label) is an integer indicating the class label (multi-class is supported). For regression, (label) is the target value, which can be any real number. The pair (index):(value) gives a feature (attribute) value: (index) is an integer starting from 1 and (value) is a real number. The only exception is the precomputed kernel, where (index) starts from 0; see the section on precomputed kernels. Indices must be in ASCENDING order. Labels in the testing file are only used to calculate accuracy or errors. If they are unknown, just fill the first column with any numbers.
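To make the format above concrete, here is a minimal sketch of a helper that emits lines in this format. The function to_input_line is hypothetical (not part of LIBSMM or LIBSVM); it assumes features are given as a dictionary mapping 1-based indices to real values.

```python
# Hypothetical helper, not part of LIBSMM: format one training instance
# as a "(label) (index1):(value1) (index2):(value2) ..." input line.
def to_input_line(label, features):
    """features: dict mapping 1-based feature index -> real value."""
    # Indices must appear in ASCENDING order, hence the sort.
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"

# Two instances of a binary classification problem:
print(to_input_line(1, {1: 0.5, 3: -1.2}))   # -> 1 1:0.5 3:-1.2
print(to_input_line(-1, {2: 2.0}))           # -> -1 2:2.0
```

Note that sparse features are allowed: indices with zero values may simply be omitted, as in the LIBSVM format.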
LIBSMM can work directly with probability distributions. The current version of LIBSMM only supports the Gaussian distribution, which is parametrized by a mean vector and a covariance matrix. The format of the training and testing data files is:
(label) (index1):(m_value_1) (index2):(m_value_2) ...
(label) (index1):(C_value_11) (index2):(C_value_12) ...
(label) (index1):(C_value_21) (index2):(C_value_22) ...
.
.
.
(label) (index1):(C_value_D1) (index2):(C_value_D2) ...
.
.
.
Each input distribution is composed of two parts: a mean vector and a covariance matrix. The first line contains an instance representing the mean vector. The following lines contain the rows of the covariance matrix; the number of rows corresponds to the dimensionality of the input space. The label on the first line is used as the label for the whole distribution.
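The block structure above can be sketched with a small hypothetical helper (not part of LIBSMM) that formats one Gaussian input, i.e. a mean-vector line followed by one line per covariance row, all carrying the distribution's label.

```python
# Hypothetical helper, not part of LIBSMM: format one Gaussian input
# block (mean vector line + D covariance rows, D = input dimension).
def gaussian_block(label, mean, cov):
    fmt = lambda vec: " ".join(f"{i}:{v}" for i, v in enumerate(vec, start=1))
    lines = [f"{label} {fmt(mean)}"]            # first line: the mean vector
    lines += [f"{label} {fmt(row)}" for row in cov]  # then the covariance rows
    return "\n".join(lines)

# A 2-D Gaussian labeled +1, with unit variances and 0.2 covariance:
print(gaussian_block(1, [0.0, 1.0], [[1.0, 0.2], [0.2, 1.0]]))
# -> 1 1:0.0 2:1.0
#    1 1:1.0 2:0.2
#    1 1:0.2 2:1.0
```

Concatenating such blocks, one per distribution, yields a complete training file for the -i 1 input type.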
In many real-world applications, we do not know the true distributions underlying the data-generating process. Instead, we have i.i.d. samples drawn from those distributions. In this case, we are working with the empirical distributions associated with the sample sets. These empirical distributions become more representative of the true distributions as the number of samples increases. The input format of LIBSMM in this case is similar to the first input format, where we have individual samples.
(label) (group_index) (index1):(value1) (index2):(value2) ...
.
.
.
In addition to the (label) and (index):(value) pairs, we need to specify the group to which each sample belongs. Group indices must start from 1 and be in ASCENDING order. The samples from the same distribution must be put together in a contiguous block. The label of the first sample in the group is used as the label for the whole distribution.
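As an illustration, a small hypothetical helper (not part of LIBSMM) that formats the i.i.d. samples of one empirical distribution, tagging each sample with the same group index:

```python
# Hypothetical helper, not part of LIBSMM: format the samples of one
# empirical distribution as "(label) (group_index) (index):(value) ..." lines.
def empirical_lines(label, group_index, samples):
    out = []
    for sample in samples:
        pairs = " ".join(f"{i}:{v}" for i, v in enumerate(sample, start=1))
        out.append(f"{label} {group_index} {pairs}")
    return "\n".join(out)

# Three 2-D samples drawn from group 1, labeled +1:
print(empirical_lines(1, 1, [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1]]))
# -> 1 1 1:0.1 2:0.2
#    1 1 1:0.0 2:0.3
#    1 1 1:0.2 2:0.1
```

Writing the groups out consecutively, with group indices 1, 2, 3, ..., keeps each distribution's samples in the required contiguous block.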
Once you have prepared the training set according to the specified format, there are two executable programs that you can use to train an SMM and make a prediction.
The smm-train program reads the training set file and outputs the SMM model, which will be used to make predictions.
Usage: smm-train [options] training_set_file [model_file]
options:
-f mode : set learning mode (default 0)
	0 -- SMM
	1 -- SVM
-i input_type : set type of input (default 0)
	0 -- empirical samples
	1 -- distributions (Gaussian)
-s svm_type : set type of SVM (default 0)
	0 -- C-SVC
	1 -- nu-SVC
	2 -- one-class SVM
	3 -- epsilon-SVR
	4 -- nu-SVR
-t kernel_type : set type of kernel function [embedding kernel for SMM] (default 2)
	0 -- linear: u'*v
	1 -- polynomial: (gamma*u'*v + coef0)^degree
	2 -- radial basis function: exp(-gamma*|u-v|^2)
	3 -- sigmoid: tanh(gamma*u'*v + coef0)
	4 -- precomputed kernel (kernel values in training_set_file)
-l level-2_kernel : set type of level-2 kernel function (default 0)
	0 -- linear: <P,Q>
	1 -- polynomial: (gamma2*<P,Q> + coef0)^degree2
	2 -- radial basis function: exp(-gamma2*|P-Q|^2)
	3 -- sigmoid: tanh(gamma2*<P,Q> + coef0)
-d degree : set degree in kernel function (default 3)
-j degree2 : set degree in level-2 kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/num_features)
-k gamma2 : set gamma2 in level-2 kernel function (default 1/num_features)
-r coef0 : set coef0 in kernel function (default 0)
-o coef02 : set coef0 in level-2 kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)
-W weight_file : use the weighted instances in the weight_file
-v n : n-fold cross validation mode
-q : quiet mode (no outputs)
The smm-predict program reads the test file and the SMM model file output by smm-train, and makes a prediction.
Usage: smm-predict [options] test_file model_file output_file
options:
-b probability_estimates : whether to predict probability estimates, 0 or 1 (default 0); for one-class SMM only 0 is supported
You will find two example input files, namely smm_distribution_toy.dat and smm_empirical_toy.dat, in the svm-toy folder. To train an SMM on these small datasets, type
./smm-train -f 0 -i 1 -t 2 -l 0 smm_distribution_toy.dat
./smm-train -f 0 -i 0 -t 2 -l 0 smm_empirical_toy.dat
for the two different input types. In the above examples, we use a linear level-2 kernel and a Gaussian RBF embedding kernel. After training, you will get the model files smm_distribution_toy.dat.model and smm_empirical_toy.dat.model. To make a prediction using these models, type
./smm-predict smm_distribution_toy.dat smm_distribution_toy.dat.model output_distribution.txt
./smm-predict smm_empirical_toy.dat smm_empirical_toy.dat.model output_empirical.txt
In these examples, we make a prediction on the training set. It is also possible to make a prediction on an unseen dataset. In that case, the input type of the test set must match that of the training set used to construct the model.
These people have made considerable contributions to this project:
Krikamol Muandet (Max Planck Institute for Intelligent Systems)
Kenji Fukumizu (The Institute of Statistical Mathematics)
Francesco Dinuzzo (Max Planck Institute for Intelligent Systems)
Bernhard Schölkopf (Max Planck Institute for Intelligent Systems)
If you would like to contribute to the project in any way, please do not hesitate to contact us. We would love to hear from you.
If you have questions, please feel free to send us an email. The contact address is below. Comments and suggestions are always welcome!
Department of Empirical Inference,
Max Planck Institute for Intelligent Systems,
Spemannstrasse 38, 72076 Tübingen, Germany
Telephone: +49-(0)7071 601 554