E-Locus - Institutional Repository of the University of Crete - Implementing feature selection algorithms for big data

Identifier

000402832

Title

Implementing feature selection algorithms for big data

Alternative Title

Υλοποίηση αλγορίθμων επιλογής μεταβλητών για μεγάλο όγκο δεδομένων (Implementing variable selection algorithms for large volumes of data)

Author

Τζιράκης, Παναγιώτης

Thesis advisor

Τσαμαρδίνος, Ιωάννης

Reviewer

Χριστοφίδης, Βασίλειος
Χρυσάνθης, Παναγιώτης

Abstract

In recent years, data have grown exponentially in both the number of instances and the number of features, bringing their scale to the level of terabytes. Data of this size arise in many machine learning applications, such as information retrieval, text categorization, and image retrieval. Although such volumes are now common, classical machine learning algorithms cannot handle them. Feature selection, an important task in machine learning, aims to select the most informative features in a dataset. It is effective in reducing dimensionality, removing irrelevant features, increasing the performance of a learner, and improving our understanding of the data. As the volume of data increases, the usability of classical feature selection algorithms deteriorates significantly. To solve such scalability problems, the Map-Reduce model has been proposed. With this model, data can be processed in parallel on a cluster, so machine learning algorithms can be adapted to process terabytes of data. This thesis is concerned with the implementation of a feature selection algorithm for big data. In particular, we use the Map-Reduce model to parallelize the Max-Min Parents and Children (MMPC) algorithm so that it can handle big data. MMPC heuristically uses independence tests to find dependencies among the features. We show how two independence tests, which can handle categorical and continuous features, can be used with the Map-Reduce model. We also use a method that allows MMPC to be used with any independence test under the Map-Reduce model. To evaluate our algorithm, we experimented with datasets containing different numbers of instances and features. The experimental evaluation showed that our algorithm scales well on these datasets when varying the number of instances and the number of nodes in the cluster. Moreover, the performance of the algorithm is comparable to that of other feature selection algorithms.

Language

English

Issue date

2015-11-20

Collection

School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses

Type of Work--Post-graduate theses
