L. Grassi et al.
Introduction
Cell isolation, RNA extraction and library construction
The protocols used for cell isolation, RNA extraction and library construction are described in the Online Supplementary Material.
Bioinformatic analysis
An overview of the bioinformatic pipeline is shown in Online Supplementary Figure S1. To analyze the expression of known genes and transcripts, we trimmed reads with Trim Galore (v0.3.7; parameters “-q 15 -s 3 --length 30 -e 0.05”) and aligned them to Ensembl v75 of the human transcriptome with Bowtie14 (v1.0.1; parameters “-a --best --strata -S -m 100 -X 500 --chunkmbs 256 --nofw --fr”). Small RNA-sequencing reads were also trimmed with Trim Galore (v0.3.7; parameters “-f fastq -e 0.05 -q 15 -O 3”) and aligned to the human mature microRNA (miRNA) sequences in miRBase (v21) with RapMap (v0.4.0) using the parameters “quasimap -c -s -z 0.9”. We used MMSEQ12 and MMDIFF13 (v1.0.10; default parameters) to estimate gene, transcript and miRNA expression levels, and to identify features that were differentially expressed across cell types. This choice of methodology allowed us to obtain regularized transcript- and gene-level posterior estimates of expression and the corresponding measures of posterior uncertainty, which could then be accounted for in the modeling of differential expression. For guided transcriptome assembly, we used STAR (v2.4.1c; parameters “--runThreadN 8 --outStd SAM --outSAMtype BAM Unsorted --outSAMstrandField intronMotif”) to align trimmed reads to build GRCh37 of the reference human genome. We sorted the BAM files by coordinate and indexed them with samtools (v1.3.1).15 We performed guided transcriptome assembly for each sample using StringTie7 (v1.3.4; parameters “-p 8 --rf -G Ensembl_75.gtf -v -l BPSTRG”).
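As a rough illustration of the sorting, indexing and per-sample assembly steps described above, the following Python sketch builds the corresponding command lines. Only the StringTie option string is taken from the text. The helper name `assembly_commands`, the input and output file names, and the use of `-o` to name output files are illustrative assumptions and are not details of the authors' pipeline.

```python
import shlex

# StringTie options as quoted in the text (guided by the Ensembl 75 annotation).
STRINGTIE_PARAMS = "-p 8 --rf -G Ensembl_75.gtf -v -l BPSTRG"


def assembly_commands(sample, unsorted_bam):
    """Return argv lists that sort, index and assemble one sample.

    `sample` and `unsorted_bam` are hypothetical names supplied by the caller;
    the derived output file names below are likewise assumptions.
    """
    sorted_bam = f"{sample}.sorted.bam"  # hypothetical output name
    sample_gtf = f"{sample}.gtf"         # hypothetical output name
    return [
        # Coordinate sort is the default ordering of `samtools sort`.
        ["samtools", "sort", "-o", sorted_bam, unsorted_bam],
        ["samtools", "index", sorted_bam],
        # Guided assembly with the options quoted in the text.
        ["stringtie", sorted_bam, *shlex.split(STRINGTIE_PARAMS), "-o", sample_gtf],
    ]


if __name__ == "__main__":
    for argv in assembly_commands("sample1", "sample1.bam"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

The function only constructs the commands. To run them, each list could be passed to `subprocess.run(argv, check=True)` on a system where samtools and StringTie are installed.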
We also used StringTie to combine these transcriptomes into a single merged transcriptome, which we then compared to the annotations in Ensembl 75 using Gffcompare.16 We identified intergenic transcripts and filtered out those overlapping known transcripts annotated in Gencode (v19)17 and UCSC (hg19)18 using the GenomicRanges package.19 We assessed the protein-coding potential of the novel intergenic multi-exonic transcripts using the Coding-Potential Assessment Tool (CPAT) (v1.2.4).20 We chose CPAT because of its superior accuracy relative to competing methods.20 Transcripts with a coding potential >0.364 were classified as protein-coding and the remainder as non-coding, in accordance with the human-specific guidance in the CPAT manual (http://rna-cpat.sourceforge.net/). We estimated the expression levels of novel genes and transcripts using MMSEQ, as described above for known genes and transcripts. We computed the expression specificity parameter Tau21 to compare the cell type specificities of novel genes, known long non-coding RNA (lncRNA) and known protein-coding genes. We used the Bioconductor R package “phastCons100way.UCSC.hg19”22 to obtain sequence conservation scores of novel genes, known lncRNA and known protein-coding genes. A detailed description of the computational methods used to identify circRNA, compare their sequences to known sequences and quantify expression levels is given in the Online Supplementary Material.
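The two numerical criteria in this paragraph can be illustrated with a short Python sketch. It assumes the commonly used definition of Tau, in which each expression value is divided by the maximum across cell types and the complements are averaged over n - 1 cell types. The text does not state the expression scale or how all-zero profiles were handled, so both choices below are assumptions, as are the function names.

```python
CPAT_HUMAN_CUTOFF = 0.364  # human-specific cutoff quoted in the text


def tau(expression):
    """Tau specificity index for one gene across cell types.

    0 means uniform expression; 1 means expression in a single cell type.
    Assumes non-negative expression values (negative values are clipped to 0).
    """
    values = [max(float(x), 0.0) for x in expression]
    n = len(values)
    if n < 2:
        raise ValueError("Tau needs expression values from at least two cell types")
    x_max = max(values)
    if x_max == 0.0:
        return 0.0  # assumption: an unexpressed gene is treated as non-specific
    return sum(1.0 - x / x_max for x in values) / (n - 1)


def is_protein_coding(coding_potential, cutoff=CPAT_HUMAN_CUTOFF):
    """Classify a transcript from its CPAT coding potential (strictly above cutoff)."""
    return coding_potential > cutoff


if __name__ == "__main__":
    print(tau([5, 5, 5, 5]))        # uniform expression
    print(tau([0, 0, 10]))          # expressed in a single cell type
    print(is_protein_coding(0.12))  # below the cutoff
```

With this definition, a gene expressed equally in every cell type scores 0 and a gene detected in only one cell type scores 1, which is the sense in which the novel transcripts are described as highly cell type-specific.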
Data availability
All data used in this manuscript are available from the European Genome-phenome Archive (EGA) (https://www.ebi.ac.uk/ega/dacs/EGAC00001000135I). The dataset identities are listed in Online Supplementary Table S1. Links to the datasets at EGA are also available from the BLUEPRINT data access portal (http://dcc.blueprint-epigenome.eu/#/datasets).
Knowledge of the transcriptional programs underpinning the diverse functions of hematopoietic cells is essential for understanding how and when these functions are performed and for resolving the molecular bases of hematologic diseases. Thanks to its accessibility, blood is the tissue of choice for the implementation of novel assays in primary samples. Indeed, several studies aiming to characterize gene expression profiles in the post-genome era have been performed on increasingly purified primary hematopoietic cell populations.1-3 These studies used expression arrays and thus required prior specification of the sequences to be interrogated. The probed sequences were often derived from the analysis of a very limited number of tissues and cell types,4 despite the early discovery that transcription is widespread throughout the human genome.5 The introduction of high-throughput nucleic acid sequencing technologies6 has improved the assembly of the human genome and the annotation of transcriptomes therein, and has enabled a more comprehensive analysis of gene expression using transcriptomic assembly approaches.7 The BLUEPRINT consortium8 was established to characterize the epigenetic state and transcriptional profile of different types of hematopoietic cells. Reference datasets for DNA methylation, histone modifications and gene expression were generated from highly purified cell populations using state-of-the-art technologies, in accordance with quality standards set by the International Human Epigenome Consortium.9 RNA-sequencing data from over 270 samples encompassing 55 cell types have been made publicly available (http://dcc.blueprint-epigenome.eu). A subset of these data has been described previously.10,11 Here, we present the analysis of 90 total RNA samples obtained from cord and adult peripheral blood, each consisting of one of 27 mature cell types, and 32 small RNA samples, each consisting of one of 11 mature cell types.
We used a Bayesian differential expression analysis approach12,13 to determine changes in the expression levels of genes and transcripts at lineage commitment stages and to identify cell type-specific transcriptional signatures. We performed guided transcriptome reconstruction7 using total RNA-sequencing reads, identifying 645 multi-exonic transcripts originating from 400 intergenic novel genes. The majority of the novel transcripts had low protein-coding potential and high cell type specificity. Additionally, we identified 55,187 circular RNA (circRNA), which also displayed high cell type specificity, highlighting the potential role of non-coding transcripts in hematopoiesis. To enable exploration and reuse of the data by the biomedical community, we developed a web interface for plotting expression patterns of genes and transcripts and downloading normalized expression data (https://blueprint.haem.cam.ac.uk/bloodatlas/).
Methods
Ethical approval
Samples were obtained from National Health Service Blood and Transplant blood donors and from cord blood donations at Cambridge University Hospitals, following informed consent. Ethical approval was obtained for A Blueprint of Blood Cells (REC East of England 12/EE/0040).
haematologica | 2021; 106(10)