
Identifier 000402832
Title Implementing feature selection algorithms for big data
Alternative Title Υλοποίηση αλγορίθμων επιλογής μεταβλητών για μεγάλο όγκο δεδομένων
Author Τζιράκης, Παναγιώτης
Thesis advisor Τσαμαρδίνος, Ιωάννης
Reviewer Χριστοφίδης, Βασίλειος
Χρυσάνθης, Παναγιώτης
Abstract In recent years, data has grown exponentially in both the number of instances and the number of features, bringing its scale to the level of terabytes. Data of this size arises in many machine learning applications, such as information retrieval, text categorization, and image retrieval. Although such volumes are now common, classical machine learning algorithms cannot handle them. Feature selection, a central task in machine learning, aims to select the most informative features in a dataset. It is effective in reducing dimensionality, removing irrelevant features, increasing the performance of a learner, and improving our understanding of the data. As the volume of data increases, the usability of classical feature selection algorithms deteriorates significantly. The Map-Reduce model has been proposed to solve such scalability problems. With this model, data can be processed in parallel on a cluster, so machine learning algorithms can be adapted to process terabytes of data. This thesis is concerned with the implementation of a feature selection algorithm for big data. In particular, we use the Map-Reduce model to parallelize the Max-Min Parents and Children (MMPC) algorithm so that it can handle big data. MMPC heuristically uses independence tests to find dependencies among the features. We show how two independence tests that can handle categorical and continuous features can be used with the Map-Reduce model. We also present a method that allows MMPC to be used with any independence test under the Map-Reduce model. To evaluate our algorithm, we experimented with datasets containing different numbers of instances and features. The experimental evaluation showed that our algorithm scales well on these datasets when varying the number of instances and the number of nodes in the cluster. Moreover, the performance of the algorithm is comparable to that of other feature selection algorithms.
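As a rough illustration of how an independence test on categorical features might be expressed in the Map-Reduce style mentioned in the abstract, the sketch below counts value pairs per data partition (map), sums the partial tables (reduce), and computes a G² statistic from the merged table. It is not the thesis implementation: the abstract does not name the tests used, so the choice of G² and the names `map_partition`, `reduce_counts`, and `g2_statistic` are assumptions made only for this example, and plain Python lists stand in for cluster partitions.

```python
from collections import Counter
from functools import reduce
from math import log

def map_partition(rows, x, y):
    """Map step (hypothetical helper): count (x value, y value) pairs in one partition."""
    return Counter((row[x], row[y]) for row in rows)

def reduce_counts(a, b):
    """Reduce step (hypothetical helper): merge two partial contingency tables."""
    return a + b

def g2_statistic(table):
    """G^2 = 2 * sum(O * ln(O / E)) over the non-empty cells of the table."""
    n = sum(table.values())
    x_tot, y_tot = Counter(), Counter()
    for (xv, yv), c in table.items():
        x_tot[xv] += c
        y_tot[yv] += c
    # Expected count under independence: E = x_tot * y_tot / n
    return 2.0 * sum(
        c * log(c * n / (x_tot[xv] * y_tot[yv]))
        for (xv, yv), c in table.items()
    )

# Toy data split into two partitions, standing in for blocks on a cluster.
partitions = [
    [(0, 0), (0, 0), (1, 1)],
    [(1, 1), (0, 1), (1, 0)],
]
partials = [map_partition(p, 0, 1) for p in partitions]
table = reduce(reduce_counts, partials, Counter())
g2 = g2_statistic(table)
```

Because the contingency counts are additive, the merged table is the same as one computed over the unpartitioned data, which is why a test of this kind lends itself to parallel counting.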
Language English
Issue date 2015-11-20
Collection   School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses
  Type of Work--Post-graduate theses
