Abstract |
In recent years, data has grown exponentially in both the number of instances and the
number of features, bringing its scale to the level of terabytes. Data at this scale
arises in many machine learning applications, such as information retrieval, text
categorization, and image retrieval. Although such volumes of data are now common,
classical machine learning algorithms cannot handle them.
Feature selection is an important task in machine learning; its goal is to select the
most informative features in a dataset. Feature selection is effective in reducing
dimensionality, removing irrelevant features, increasing the performance of a learner, and
improving our understanding of the data. As the volume of data increases, however, the
usability of classical feature selection algorithms deteriorates significantly.
To address scalability problems, the Map-Reduce model has been proposed. With this model,
data can be processed in parallel on a cluster, so machine learning algorithms can be
adapted to process terabytes of data.
This thesis is concerned with the implementation of a feature selection algorithm for big
data. More specifically, we use the Map-Reduce model to parallelize the Max-Min Parents
and Children (MMPC) algorithm so that it can handle big data. MMPC heuristically uses
independence tests to find dependencies among the features. We show how two independence
tests that can handle categorical and continuous features can be used with the Map-Reduce
model. Finally, we also present a method that allows MMPC to be used with any
independence test under the Map-Reduce model.
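
To illustrate how an independence test can fit the Map-Reduce model, the following is a minimal sketch and not the implementation developed in the thesis. It assumes a G² test on two categorical columns, a choice made only for illustration, since the abstract does not name the tests used. The cluster is simulated by a list of in-memory partitions, and all function names are hypothetical.

```python
import math
from collections import Counter
from functools import reduce


def map_counts(partition, i, j):
    """Map step: count co-occurrences of columns i and j in one partition."""
    return Counter((row[i], row[j]) for row in partition)


def reduce_counts(a, b):
    """Reduce step: merge two partial contingency tables by adding counts."""
    return a + b


def contingency_table(partitions, i, j):
    """Combine the per-partition counts into one global contingency table."""
    partial_tables = (map_counts(p, i, j) for p in partitions)
    return reduce(reduce_counts, partial_tables, Counter())


def g2_statistic(table):
    """G^2 statistic computed from the merged contingency table."""
    n = sum(table.values())
    row_totals, col_totals = Counter(), Counter()
    for (x, y), count in table.items():
        row_totals[x] += count
        col_totals[y] += count
    # Only observed cells are stored, so every count here is positive.
    return 2.0 * sum(
        count * math.log(count * n / (row_totals[x] * col_totals[y]))
        for (x, y), count in table.items()
    )


# Two partitions standing in for data blocks held on two cluster nodes.
partitions = [[(0, 0), (0, 0)], [(1, 1), (1, 1)]]
statistic = g2_statistic(contingency_table(partitions, 0, 1))
```

Only the small per-partition count tables are passed between the map and reduce steps, so the cost of the final statistic does not depend on the number of instances. A test on continuous features would need different sufficient statistics.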
To evaluate our algorithm, we experimented with datasets containing different numbers of
instances and features. The experimental evaluation showed that our algorithm scales well
on these datasets as the number of instances and the number of nodes in the cluster vary.
Moreover, the performance of the algorithm is comparable to that of other feature
selection algorithms.
|