Abstract
Since their first application, Genome-Wide Association Studies (GWAS) have evolved
significantly and provided useful insight into medical diagnostics. Their main aim
is to establish a connection between a variety of traits, such as human diseases or
protein concentration levels, and the genetic background (usually via point mutations)
of a given species. Problems of this type suffer from several issues, primarily
caused by high dimensionality (millions of Single Nucleotide Polymorphisms), low
sample size, the need for multiple-testing correction, and the need to account for
population structure.
In this thesis, we address current GWAS methodological issues utilizing a feature
selection method, termed generalized Orthogonal Matching Pursuit (gOMP).
gOMP offers a variety of advantageous characteristics, such as a) computational
efficiency and scalability in the number of features, b) adaptability to any type of
outcome variable (e.g. binary, continuous, time-to-event, etc.) and c) simplicity of
implementation. gOMP can also be fully integrated into JAD Bio™'s
automated machine learning pipeline, which ensures methodological correctness through
a proper model-building procedure and unbiased estimation of predictive performance.
In addition, we extend gOMP’s functionality by a) parallelizing its
operation feature-wise and b) identifying features that are statistically equivalent
to the already selected ones. Regarding equivalent features, we argue that the resulting
multiple solutions are able to capture and correct for the underlying population
structure. To evaluate gOMP’s performance, we extensively compare it
with QTCAT over a series of simulated datasets. Additionally, we apply gOMP
to real human-disease datasets. Overall, gOMP proves to be a highly efficient
method for genomic datasets in terms of predictive performance, retrieval of associated
features, and computational cost.