E-Locus - Institutional Repository of the University of Crete - Automated machine learning on high-dimensional biological data

Identifier

000421849

Title

Automated machine learning on high-dimensional biological data

Alternative Title

Αυτοματοποιημένη μηχανική μάθηση σε βιολογικά δεδομένα υψηλών διαστάσεων

Author

Παπαδοβασιλάκης, Ζαχαρίας

Thesis advisor

Τσαμαρδινός, Ιωάννης

Reviewer

Ποϊράζη, Παναγιώτα
Ποταμιάς, Γεώργιος

Abstract

Since their first application, Genome-Wide Association Studies (GWAS) have evolved significantly and provided useful insight into medical diagnostics. Their main aim is to establish a connection between a variety of traits, such as human diseases or protein concentration levels, and the genetic background (usually via point mutations) of a given species. Problems of this type suffer from several issues, primarily caused by high dimensionality (millions of Single Nucleotide Polymorphisms), low sample size, the need for multiple-testing correction, and the need to account for population structure. In this thesis, we address current GWAS methodological issues using a feature selection method termed generalized Orthogonal Matching Pursuit (gOMP). gOMP offers several advantageous characteristics: a) computational efficiency and scalability in the number of features, b) adaptability to any type of outcome variable (e.g. binary, continuous, time-to-event) and c) simplicity of implementation. gOMP can also be fully integrated into JAD Bio™’s automated machine learning pipeline, which ensures methodological correctness in terms of a proper model-building procedure and unbiased estimation of predictive performance. In addition, we extend gOMP’s functionality by a) parallelizing its operation feature-wise and b) identifying features that are statistically equivalent to those already selected. Regarding equivalent features, we argue that the resulting multiple solutions are able to capture and correct for the underlying population structure. To evaluate gOMP’s performance, we compare it extensively with QTCAT over a series of simulated datasets. Additionally, we apply gOMP to real human-disease datasets. Overall, gOMP proves to be a highly efficient method for genomic datasets in terms of performance, retrieval of associated features and computational cost.

Language

English

Subject

High-dimensionality

Issue date

2019-03-27

Collection

School/Department--School of Medicine--Department of Medicine--Post-graduate theses

Type of Work--Post-graduate theses
