Abstract
Cellular commitment and differentiation in multicellular organisms depend on the concerted
action of transcription factors and epigenetic modifications in regulating differential patterns
of gene expression. Understanding the molecular basis of such complex regulatory events
has been greatly facilitated in recent years by the advent of next-generation sequencing
(NGS) technologies for the high-throughput, high-resolution, genome-wide characterization
of a multitude of transcriptional and epigenetic regulatory factors in various cellular and
developmental models. Extracting biological meaning from these data has been a
major challenge in the application of NGS technologies to the study of gene regulation.
Our main research interest is to elucidate the transcriptional regulatory events underlying
hematopoiesis by specifically focusing on red blood cell differentiation (erythropoiesis). To
this end, the work described here entails the development of a computational approach
for analyzing and integrating a large number of comprehensive NGS datasets of multiple
genomic characteristics (transcription factor binding, epigenetic modifications, etc.) in
murine and human hematopoiesis. Our computational analysis relies on the combination of
supervised (Random Forest regression modeling) and unsupervised (hierarchical clustering)
machine learning approaches to produce highly structured gene-wide distribution patterns
of chromatin features in different hematopoietic cell populations.
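
As an illustration of the unsupervised step only, the following is a minimal sketch of hierarchical clustering applied to gene-wide signal profiles. It assumes average linkage with Euclidean distance, a choice not specified above, and it uses hypothetical gene names, bin definitions and signal values. It does not reproduce the Random Forest modeling or the actual analysis pipeline.

```python
from itertools import combinations
from math import dist


def average_linkage_clusters(profiles, k):
    """Agglomerative clustering (average linkage, Euclidean distance).

    profiles: dict mapping a gene name to a list of binned signal values.
    Returns k clusters, each a sorted list of gene names.
    """
    clusters = [[name] for name in profiles]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Find and merge the closest pair of clusters (i < j always holds).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy profiles: signal in [upstream, TSS, gene body, downstream] bins.
toy_profiles = {
    "geneA": [0.1, 0.9, 0.2, 0.1],
    "geneB": [0.2, 0.8, 0.3, 0.1],
    "geneC": [0.1, 0.1, 0.8, 0.7],
    "geneD": [0.0, 0.2, 0.9, 0.6],
}
clusters = average_linkage_clusters(toy_profiles, k=2)
print(clusters)
```

In this toy example the promoter-centered profiles (geneA, geneB) separate from the gene-body-centered ones (geneC, geneD). With real data, an established library implementation would be preferable to this quadratic-time sketch.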
We first applied this approach to characterize the genome-wide occupancy profiles of
the master erythroid transcription factor GATA1, which we obtained in mouse fetal liver
erythropoiesis (Papadopoulos et al., 2013). Integration of GATA1 occupancy profiles with
available genome-wide transcription factor and epigenetic profiles in fetal liver erythroid cells
showed that GATA1 binding preferentially associates with specific epigenetic modifications,
such as H4K16Ac and H3K27Ac or H3K4me2. Furthermore, we were able to classify GATA1
target genes into three distinct clusters, each associated with specific epigenetic signatures and
functional characteristics, thus suggesting distinct GATA1-associated regulatory mechanisms.
Next, we applied our computational approach to available genomic data to
investigate the differential transcriptional and epigenetic events underlying the specification
of the erythroid and megakaryocytic lineages, which derive from a common progenitor. We
identified a large group (~1000) of genes with active promoter marks in hematopoietic
stem cells (LSK cells), which become specifically inactive in erythroid cells but not in
megakaryocytes. Comparison of DNase hypersensitivity profiles available for all erythroid
differentiation stages indicated that inactivation of these promoters initiates before the
stage of early erythroid commitment (CD71+/Ter119- cells), thus representing an early
step of the erythroid specification process. By comparing expression profiles of
erythroid-megakaryocytic progenitors (MEPs), erythroid cells and megakaryocytes, we also identified
erythroid-specific epigenetic modifiers that may serve as candidates for regulating the erasure of
this epigenetic signature in erythroid cells.
We also focused on the genome-wide occupancies of transcription factors with essential
functions in both erythroid and megakaryocytic differentiation, such as GATA1, GATA2,
TAL1 and LDB1. By analyzing genome-wide occupancies in LSK (HSCs), Ter119+ (erythroid)
and CD41+ (megakaryocytic) cells, we found, firstly, that GATA1 binding patterns in
erythroid and megakaryocytic cells appear to be largely distinct. Secondly, we found that the
erythroid-specific GATA1 binding profile is closely reflected by the TAL1 and LDB1 binding
profiles in LSK cells, thus showing upstream specification of erythroid GATA1 binding by
TAL1/LDB1.
Finally, we developed Ariadne (aegeas.imbb.forth.gr/Ariadne/), a comprehensive web-based
tool for comparing gene-wide relational profiles of multiple NGS datasets analyzed
using our computational approach, and for visualizing primary sequencing data within
single gene loci.