Abstract
The problem of feature selection can be defined as identifying a minimal subset of the
input variables that is optimally predictive of an outcome variable of interest. Feature
selection is a common component in supervised machine learning pipelines, and an
essential tool when the goal of the analysis is knowledge discovery. The latter is
especially important in domains such as molecular biology and life sciences, where a
researcher is interested in understanding the problem under study and not necessarily
in the resulting predictive model.
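Formally, one common way to cast this problem (one of several formalizations compatible with the description above, given here for concreteness rather than as the thesis's exact definition) is as finding a minimum-size subset $\mathbf{S}$ of the variables $\mathbf{V}$ that renders all remaining variables conditionally independent of the outcome $Y$:

\[
\mathbf{S}^{*} \in \operatorname*{arg\,min}_{\mathbf{S} \subseteq \mathbf{V}} |\mathbf{S}| \quad \text{subject to} \quad Y \perp (\mathbf{V} \setminus \mathbf{S}) \mid \mathbf{S} .
\]

Under suitable assumptions (e.g., faithfulness of the distribution to a causal model), this minimal set is unique and coincides with the Markov blanket of $Y$.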
Feature selection is challenging: it has been shown to be NP-hard, and thus most
approaches rely on approximations to solve it efficiently. There exist many different
approaches to the feature selection problem, trading off generality (e.g., what types
of data and outcomes they can handle), computational cost, and theoretical
properties of optimality. Stepwise selection methods (e.g., forward-backward
selection) are quite general and optimal for a large class of distributions, but
computationally expensive. Sparsity-based methods (e.g., LASSO) come with strong
theoretical guarantees and are computationally efficient for some problems (e.g.,
classification and regression) but slow for others (e.g., time-course data).
Information theoretic approaches are computationally fast, but less general (they
handle only discrete data) and come with weaker theoretical guarantees. Another
challenge is to scale
feature selection methods to very large datasets, which may contain millions of
samples and variables. Existing approaches are either too slow or yield poor
predictive performance. Finally, most methods do not deal with the presence of
multiple solutions to the feature selection problem, which often exist in real data.
For example, it has been shown that molecular data often contain multiple solutions,
possibly due to the redundancy present in the underlying biological system. Therefore,
although identifying a single solution is adequate for predictive purposes, it is not
sufficient when the focus is on knowledge discovery. Worse, reporting a single
solution and claiming that there are no other solutions is misleading.
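To make the stepwise scheme concrete, the following is a minimal, illustrative sketch of generic forward-backward selection in Python. The cross-validated score used as the selection criterion, the tolerance `tol`, and the function names are hypothetical stand-ins; the actual criteria and optimizations studied in this thesis differ.

```python
# Illustrative sketch of generic forward-backward selection.
# The scoring criterion (cross-validated accuracy) is a stand-in for
# the conditional-independence-based criteria used in the thesis.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_backward_selection(X, y, tol=1e-3):
    def score(features):
        if not features:
            return 0.0
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X[:, features], y, cv=5).mean()

    selected = []
    # Forward phase: greedily add the feature with the largest score
    # gain, until no candidate improves the score by more than tol.
    while True:
        base = score(selected)
        gains = {j: score(selected + [j]) - base
                 for j in range(X.shape[1]) if j not in selected}
        if not gains or max(gains.values()) <= tol:
            break
        selected.append(max(gains, key=gains.get))

    # Backward phase: drop any feature whose removal does not reduce
    # the score by more than tol (i.e., it has become redundant).
    for j in list(selected):
        rest = [k for k in selected if k != j]
        if score(rest) >= score(selected) - tol:
            selected = rest
    return selected
```

The backward phase is what distinguishes this family from pure forward selection: features that were useful early on can be pruned once later additions make them redundant.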
In this thesis, we focus on forward-backward selection and propose several extensions
to tackle the above challenges. Forward-backward selection was chosen because of
its theoretical properties and generality, as it is applicable to different predictive tasks
and data types. We provide a unified view of several classes of feature selection
methods, such as sparsity-based, information theoretic, statistical and causal-based
methods, and show that they are instances or approximations of stepwise selection
methods. This allows techniques (such as the ones proposed in this thesis) to be
transferred across the different approaches to feature selection. Then,
we propose a heuristic inspired by causal modeling to speed up the forward-backward
selection algorithm, while preserving its theoretical properties. In experiments we
show that this leads to a speed-up of 1-2 orders of magnitude over the standard
forward-backward selection algorithm, while retaining its predictive performance.
Afterwards, we extend the algorithm to Big Data settings, enabling it to scale to data
with tens of millions of samples and variables.
In a comparison with alternative methods from the same algorithmic family, we show
that the proposed method significantly outperforms all competitors in terms of
running time, and is the only method able to terminate on all datasets, without
sacrificing predictive performance. Furthermore, in a comparison with information
theoretic methods we show that, although computationally slower, it produces
significantly better predictive models. Finally, we turn to the
multiple feature selection problem. We show that the existing taxonomy of features
(e.g., into strongly relevant, weakly relevant, and irrelevant ones) is misleading
when multiple solutions are present, and propose an alternative taxonomy that takes
multiplicity into account. Then, we consider several definitions of statistical
equivalence of feature sets, as well as methods to test for such equivalence.
Afterwards, we propose a general strategy to extend forward-backward
type methods for identifying multiple, statistically equivalent solutions. We provide
conditions under which it identifies all equivalent solutions. In a comparison with
the only alternative method offering the same theoretical guarantees, we show that
it produces similar results while being computationally faster.
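As a rough illustration of what testing the "statistical equivalence" of two feature sets might look like, the following hypothetical sketch compares per-sample out-of-fold log-likelihoods of models built on each set using a paired test. This is a naive stand-in, not one of the definitions or tests proposed in the thesis.

```python
# Naive illustration of testing "predictive equivalence" of two
# feature sets S1 and S2 (hypothetical helper, not the thesis's test).
import numpy as np
from scipy.stats import wilcoxon
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict


def predictively_equivalent(X, y, S1, S2, alpha=0.05):
    y = np.asarray(y)  # assumed integer-encoded as 0..K-1

    def per_sample_loglik(features):
        proba = cross_val_predict(
            LogisticRegression(max_iter=1000), X[:, features], y,
            cv=5, method="predict_proba")
        # Log-probability assigned to the true class of each sample.
        return np.log(proba[np.arange(len(y)), y] + 1e-12)

    ll1, ll2 = per_sample_loglik(S1), per_sample_loglik(S2)
    if np.allclose(ll1, ll2):
        return True
    # Paired test on per-sample log-likelihoods; failing to reject the
    # null of "no difference" is (naively) taken as equivalence here.
    _, p_value = wilcoxon(ll1, ll2)
    return p_value > alpha
```

Note that failing to reject the null of "no difference" is not, strictly speaking, evidence of equivalence; a principled equivalence test bounds the difference instead. The sketch is included only to make the notion concrete.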