Single cell RNA-seq Analysis of Lymphocytes – How far it differs from Bulk RNA-seq Analysis?

In this post, we present a pipeline for scRNA-seq analysis of lymphocytes (B-cells, T-cells and NK-cells) using the “census of immune cells” dataset from the Human Cell Atlas project. We will walk through the analysis step-by-step and mention the main differences between scRNA-seq and bulk RNA-seq.

1. Quality Control

High quality libraries are considered as a prerequisite to scRNA-seq analysis to ensure accurate results and avoid misleading outcomes. Low-quality libraries in scRNA-seq data are usually due to cell damage and could be detected by identifying cells with small library size or few expressed genes.

This dataset contains 33500 gene expressions for 380,000 cells, generated using the 10X Genomics technology. We may use the terms “features” and “expressed genes” interchangeably, also the terms “library” and “cell” could be also used interchangeably. We started with per cell QC, the library size is the first quality metric and defined as the total sum of counts across all gene expressions for each cell. Knowing that, cells with small library size are of low quality as they signal RNA damage (Figure 1 – Left). The number of features per cell is the second quality metric and defined as the total number of genes for each cell with non-zero gene expression counts (Figure 1 – Right).

Figure 1. Distribution of per cell QC metrics in the HCA bone marrow dataset. Each point represents a cell and is colored by whether the cell was discarded due small library size (Left) or few expressed genes (Right).

After that, we performed per feature QC. This quality metric calculates for each feature, the percentage of cells with non-zero gene expression counts. We discarded features that were not expressed in any cell (Figure 2).

Figure 2. Distribution of per feature QC metrics in the HCA bone marrow dataset. Each point represents a feature and is colored by whether the feature was detected.

2. Cell Classification

The goal of this post is to analyze the differences in gene expression profiles between lymphocytes (B-cells, T-cells and NK-cells) and identify top marker genes for each cell type. Therefore, we needed to classify and label the cells in the HCA bone marrow dataset with their cell types to extract lymphocytes only for downstream analysis.

We used an automated approach (Aran et al. 2019) to compare our gene expression profiles for each cell to a reference dataset that has already been accurately annotated by domain experts (Mabbott et al. 2013). Table 1. shows the labeled cell counts per individual donor.

MantonBM1MantonBM2MantonBM3MantonBM4MantonBM5MantonBM6MantonBM7MantonBM8
B cells616411612346244446598420358454397
T cells1896519130193951225313729196302237422867
NK cells17656595408747812274241641302227
Chondrocytes21021011
CMP5381144338788698325601029
DC13523816232
Endothelial cells41612321
Erythroblast101351038617313310750809718
Fibroblasts00000001
GMP78217677015001492868505948
Hepatocytes00010000
HSC G-CSF15916541461271059284
HSC CD34+130112017747
Macrophage 34525335450196
MEP7562015099231462674415692
Monocyte472743589484116368741539048156673
Myelocyte 5263078104661335
Pro Myelocyte 3074816941455018091163
Neutrophils 1356753509318
Osteoblasts40440701
Platelets73101573857188158
Pre B cells CD34- 12664049362606197410956251088
Pro B cells CD34+2336130524218095867832382838
Smooth muscle cells53104731
Tissue stem cells65953393426199
Table 1. Number of classified cells in each cell type (row) from each donor sample (column).

An interesting visualization (Figure 3) provides more insights about cell populations of each donor. For instance, we may notice slightly higher count of B cells for the second donor “MantonBM2” and higher counts of T cells for the last two donors “MantonBM7” and “MantonBM8”.

Figure 3. Heatmap of number of cells in each cell type (row) from each donor sample (column).

3. Dimensionality Reduction and Graph-based Clustering

scRNA-seq analysis compares large number of cells based on their transcriptomic profiles. One analytical approach involves clustering to identify cells with similar transcriptomic profiles then performing differential gene expression analysis between clusters. Given that, each individual gene represents a dimension of the transcriptomic data, dimensionality reduction of features followed by graph-based clustering may come in as a handy solution prior differential gene expression analysis.

Initially, we computed the log-transformed normalized expression values from the expression counts, followed by dimensionality reduction using the uniform manifold approximation and projection (UMAP) method (McInnes, Healy, and Melville 2018). Finally, a “two-step”  clustering, k-means was initially used to obtain representative centroids (step-1) that are subjected to a nearest-neighbor graph-based community detection method (step-2).

We identified 15 clusters that seem to be interestingly separated when colored by cell type (Figure 4); clusters 2, 9, 13, 14 belong to B cells, clusters 1, 3, 4, 5, 7, 8, 10, 11, 12 belong to T cells, and cluster 6 belong to NK cells. Cluster 15 lies between B and T cells and contains a population of B cells and small amount of T cells.

Figure 4. UMAP two dimensional plot of the lymphocyte cells after graph-based clustering. Each point represents a cell and is colored according to the cell type.

We selected clusters 2, 9, 13 as representative clusters of B cells, clusters 4, 7, 8 as representative clusters of T cells, and cluster 6 as a representative cluster of NK cells. We merged and identified the centroid of these representative clusters then selected 10,000 nearest neighbors to each centroid using k-nearest neighbor algorithm. The main goal was to obtain three equal sized and more compact clusters of each cell type and free from ambiguous cells (Figure 5). These 3 clusters can be used for downstream analysis in order to identify marker genes differentiating B cells, T cells, and NK cells.

Figure 5. UMAP two dimensional plot of the selected lymphocyte cells. Each point represents a cell and is colored according to the cell type.

4. Differential Gene Expression Analysis

We tested for differential gene expressions between pairs of clusters using pairwise t-tests to identify marker genes for B cells, T cells and NK cells. We selected the top 15 marker genes from each pairwise comparison as a brief example in order to visualize and interpret the results (Figure 6).

Figure 6. Heatmap of log2-fold changes for the top marker genes (rows) of clusters of B cells, T cells, and NK cells (columns).

The first observation that top marker genes show similar expression variations in T cells and NK cells but expression levels are different. The top two genes that clearly distinguish NK cells from T cells and B cells are NKG7 and GNLY; NKG7 is a protein coding gene known as Natural Killer Cell Granule Protein 7 and GNLY is a member of the saposin-like protein family (SAPLIP) and is located in natural killer cells, Innate Immune Systems is among its related pathways. On the other hand, TRAC, CD3D and CD3E distinguish T cells from NK cells and B cells; TRAC is a protein coding gene known as T Cell Receptor Alpha Constant, CD3D and CD3E are protein coding genes that are part of the T-cell receptor/CD3 complex.

B cells top marker genes show different expression patterns from T cells and NK cells. Immunoglobulin genes were expressed as B cells marker genes such as IGKC (Immunoglobulin Kappa Constant) and IGHM (Immunoglobulin Heavy Constant Mu). The most noticeable marker genes were CD74 and HLA class II beta chain paralogs (HLA-DRA, HLA-DRB1, HLA-DPB1, and HLA-DQB1); HLA class II molecules play an important role in the immune response by presenting peptides on the surface of various antigen presenting cells such as B lymphocytes while CD74 is a protein coding gene that regulates antigen presentation for immune responses.

5. Analysis of scRNA-seq data versus bulk RNA-seq data

scRNA-seq generates the transcriptomic profile of each cell in a tissue sample and captures its heterogeneity at the single cell level while bulk-RNA-seq averages gene expression across a population of cells in a tissue sample. Bulk RNA-seq is useful in differential gene expression analysis between samples of the same tissue in different study conditions such case-control studies. The main advantage of scRNA-seq is the ability to analyze phenotypes at single cell resolution, it permits defining biological questions when transcriptomic cell-specific changes needs to be profiled and clearly understood such as: identification of cell types within heterogeneous tissue sample, quantification of heterogeneous cell responses in different study conditions, estimation of stochasticity and distribution of gene expressions in homogenous cells.

Published by Ahmed Tawfik

Founder of Tawftics

Leave a comment