Identifier 000421849
Title Automated machine learning on high-dimensional biological data
Alternative Title Αυτοματοποιημένη μηχανική μάθηση σε βιολογικά δεδομένα υψηλών διαστάσεων
Author Παπαδοβασιλάκης, Ζαχαρίας
Thesis advisor Τσαμαρδινός, Ιωάννης
Reviewer Ποϊράζη, Παναγιώτα
Ποταμιάς, Γεώργιος
Abstract Since their first application, Genome-Wide Association Studies (GWAS) have evolved significantly and provided useful insight into medical diagnostics. Their main aim is to establish a connection between a variety of traits, such as human diseases or protein concentration levels, and the genetic background (usually via point mutations) of a given species. Problems of this type suffer from several issues, primarily caused by high dimensionality (millions of Single Nucleotide Polymorphisms), low sample size, the need for multiple testing correction, and the need to account for population structure. In this thesis, we address current GWAS methodological issues using a feature selection method termed generalized Orthogonal Matching Pursuit (gOMP). gOMP offers several advantageous characteristics: a) computational efficiency and scalability in the number of features, b) adaptability to any type of outcome variable (e.g. binary, continuous, time-to-event), and c) simplicity of implementation. gOMP can also be fully integrated into the JADBio™ automated machine learning pipeline, which ensures methodological correctness in terms of a proper model-building procedure and unbiased estimation of predictive performance. In addition, we extend gOMP's functionality by a) parallelizing its operation feature-wise and b) identifying features that are statistically equivalent to the already selected ones. Regarding equivalent features, we argue that the multiple solutions produced are able to capture and correct for the underlying population structure. To evaluate gOMP's performance, we extensively compare it with QTCAT over a series of simulated datasets. Additionally, we apply gOMP to real human-disease datasets. The results show gOMP to be a highly efficient method for genomic datasets in terms of performance, retrieval of associated features, and computational cost.
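Illustrative note: the greedy, residual-driven selection that Orthogonal Matching Pursuit is built on can be sketched as follows. This is a minimal sketch of plain OMP for a continuous outcome only, not the thesis's gOMP implementation. The function name `omp_select`, the parameters `max_features` and `min_gain`, and the stopping rule (a threshold on the relative reduction of the residual sum of squares) are all assumptions made for illustration. The generalization to other outcome types, the feature-wise parallelization, and the detection of statistically equivalent features described in the abstract are not shown.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def omp_select(columns, y, max_features=5, min_gain=0.01):
    """Greedy OMP-style forward selection for a continuous outcome.

    columns: list of feature columns, each a list of floats.
    min_gain: hypothetical stopping threshold on the fraction of the
              total sum of squares explained by the next feature.
    Returns the indices of the selected features, in selection order.
    """
    cols = [center(c) for c in columns]
    residual = center(y)
    total = dot(residual, residual)
    basis = []      # orthonormal basis spanning the selected columns
    selected = []
    while len(selected) < max_features and total > 0:
        # Score every unselected feature by its normalized
        # absolute inner product with the current residual.
        best_j, best_score = None, 0.0
        for j, c in enumerate(cols):
            norm = math.sqrt(dot(c, c))
            if j in selected or norm == 0:
                continue
            score = abs(dot(c, residual)) / norm
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            break
        # Orthogonalize the chosen column against the current basis.
        q = cols[best_j][:]
        for b in basis:
            proj = dot(b, q)
            q = [a - proj * bb for a, bb in zip(q, b)]
        q_norm = math.sqrt(dot(q, q))
        if q_norm < 1e-12:
            break
        q = [a / q_norm for a in q]
        # Project the residual onto the new direction and remove it,
        # which keeps the residual orthogonal to all selected columns.
        coef = dot(q, residual)
        gain = coef * coef / total
        if gain < min_gain:
            break
        basis.append(q)
        selected.append(best_j)
        residual = [r - coef * a for r, a in zip(residual, q)]
    return selected


if __name__ == "__main__":
    # Synthetic data (an assumption for the demo): 100 samples,
    # 20 features, outcome driven by features 2 and 7 only.
    random.seed(0)
    n, p = 100, 20
    X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y = [3 * X[2][i] - 2 * X[7][i] + random.gauss(0, 0.1) for i in range(n)]
    print(omp_select(X, y))
```

Scoring each candidate feature is independent of the others, which is the kind of step that lends itself to feature-wise parallelization.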
Language English
Subject High-dimensionality
Issue date 2019-03-27
Collection   School/Department--School of Medicine--Department of Medicine--Post-graduate theses
  Type of Work--Post-graduate theses