Abstract |
Microbial communities are a cornerstone for most ecosystem types. To elucidate the
mechanisms governing such assemblages, it is fundamental to identify the taxa present
(who) and the processes that occur (what) in the various environments (where). Thanks
to a series of technological breakthroughs, vast amounts of data from all
levels of biological organization have been accumulated over the last
decades. In this context, microbial ecology studies now rely on bioinformatics
methods and analyses. As a result, a great number of challenges have arisen, both from the
biologist's and the computer scientist's point of view; one of the most prominent
being: "what shall we do with all these pieces of information?". The paradigm of
Systems Biology addresses this challenge by moving from reductionism to more holistic
approaches that attempt to interpret how the properties of a system emerge.
The aim of this PhD was to enhance microbiome data analysis by developing software
that addresses ongoing computational challenges in the study of microbial communities
and, in addition, to exploit such state-of-the-art methods to study microbial assemblages in
extreme environments. To this end, the Tristomo marsh on Karpathos island (Greece) was
chosen as a case study.
Environmental DNA (eDNA) and metabarcoding have been widely used to estimate biodiversity
(the who) and the structure of communities. Vast amounts of sequencing data
targeting certain marker genes, depending on the taxonomic group of interest, have become
available thanks to High Throughput Sequencing technologies. However, the bioinformatics
analysis of such data requires multiple steps and parameter settings, as well as increased
computing resources. Workflows along with computing infrastructures ease this need to a
great extent; in this context, a Pipeline for Environmental DNA Metabarcoding Analysis
(PEMA) was developed (Chapter 2.1). However, eDNA metabarcoding has limitations
too. The cytochrome c oxidase subunit I (COI) gene is a commonly used marker gene,
especially in studies targeting eukaryotic taxa. It is well known that in COI studies a great
number of the derived Operational Taxonomic Units (OTUs) get no taxonomic hits. The
presence of pseudogenes and of non-eukaryotic taxa among the amplicon data, combined with
the absence of the latter from the most commonly used reference databases,
explains this phenomenon to a great extent. To identify such cases, the Dark mAtteR iNvestigator
(DARN) software was developed; DARN makes use of a COI-oriented tree of life to
provide further insight into such known unknown sequences (Chapter 2.2).
Amplicon and shotgun metagenomics approaches, along with the rest of the omics
technologies, have led to vast amounts of data and metadata, recording the who, the
what and the where. To enable optimal accessibility and usage of this information, a
great number of databases, ontologies and community standards have been developed.
By exploiting data integration techniques to bring such bits of information together, as
well as text mining methods to retrieve knowledge "hidden" among the billions of text
lines in already published literature, the PREGO knowledge-base returns thousands of
potential what - where - who associations (Chapter 3).
The driving question, though, is how the different microbial taxa secure their persistence
as part of a community. Metabolic interactions among the various taxa play a
decisive role in the composition of such assemblages. Genome-scale metabolic networks
(GEMs) enable the inference of such interactions. Random sampling on the flux space
of such metabolic models provides a representation of the flux values a model can attain
under various conditions. However, flux sampling is challenging from a computational
point of view, especially as the dimension of a metabolic model increases. To address
such challenges, a Python library called dingo was developed using a Multiphase Monte
Carlo Sampling algorithm (Chapter 4).
Finally, sediment and microbial mat samples as well as microbial aggregates from a
hypersaline marsh in Tristomo bay (Karpathos, Greece) were analyzed. Both amplicon
(16S rRNA) and shotgun sequencing data were used to characterize the microbial structure
of the communities, and environmental parameters (e.g. salinity, oxygen concentration)
were measured at the sampling sites. Key functions supporting life in such environments
were identified and metagenome-assembled genomes (MAGs) of novel species
were built (Chapter 5).
Similar to microbial communities, bioinformatics methods tend to build assemblages,
whereas "living" on their own is quite rare. The methods developed during this PhD project,
combined with state-of-the-art methods, are expected to build a framework that enables
moving from the community to the species level and then back again to the
community level. Such a framework is described for the study of microbial interactions in
real-world communities.
|