scRNA-Seq integration, cluster annotation, and gene set analysis

repository link

Background and Objective

scRNA-Seq data were generated from 8 patients with high-grade glioma (Yuan et al., 2018). In that study, the datasets were explored separately and the findings were pooled into a unified conclusion. Here, we integrate the datasets and correct for batch effects using either mutual nearest neighbors (MNN) or Seurat’s anchor-based variant of MNN. We then use the combined dataset to identify cell types, marker genes, and differentially enriched gene sets (see the cell type annotation and gene set enrichment sections below).

Batch correction for integrating 8 scRNA-Seq datasets

We will explore 2 workflows:

  1. Mutual nearest neighbors from the batchelor package, using SingleCellExperiment objects and the corresponding workflow (code)
  2. Anchors from Seurat (code)
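
To make the idea behind both workflows concrete, here is a minimal sketch of how mutual nearest neighbor pairs can be found between two batches. It is a conceptual illustration in plain Python, not the batchelor or Seurat implementation; the function names (`nearest`, `mutual_nearest_pairs`, `mean_shift`) and the toy coordinates are made up for this example. Real MNN correction works in a high-dimensional expression or reduced-dimension space and applies locally smoothed correction vectors, whereas this sketch uses a single average shift.

```python
import math


def nearest(query, reference, k):
    """Indices of the k points in `reference` closest to `query` (Euclidean)."""
    order = sorted(range(len(reference)),
                   key=lambda j: math.dist(query, reference[j]))
    return set(order[:k])


def mutual_nearest_pairs(batch_a, batch_b, k=1):
    """Pairs (i, j) where cell i of A and cell j of B are in each
    other's k nearest neighbors across batches."""
    a_to_b = [nearest(cell, batch_b, k) for cell in batch_a]
    b_to_a = [nearest(cell, batch_a, k) for cell in batch_b]
    return sorted((i, j)
                  for i, nbrs in enumerate(a_to_b)
                  for j in nbrs
                  if i in b_to_a[j])


def mean_shift(batch_a, batch_b, pairs):
    """Average B-minus-A difference over the MNN pairs.
    Simplification: one global vector instead of per-cell smoothed vectors."""
    dims = len(batch_a[0])
    return tuple(
        sum(batch_b[j][d] - batch_a[i][d] for i, j in pairs) / len(pairs)
        for d in range(dims)
    )


# Toy data (hypothetical): batch B is batch A shifted by (1, 1),
# plus one cell type that is absent from batch A.
batch_a = [(0.0, 0.0), (5.0, 5.0)]
batch_b = [(1.0, 1.0), (6.0, 6.0), (20.0, 20.0)]

pairs = mutual_nearest_pairs(batch_a, batch_b, k=1)
shift = mean_shift(batch_a, batch_b, pairs)
print(pairs)  # [(0, 0), (1, 1)]
print(shift)  # (1.0, 1.0)
```

Note that the third cell of batch B, which has no counterpart in batch A, forms no mutual pair and therefore does not distort the estimated batch shift. This is the property that makes MNN-style approaches tolerant of cell types that appear in only some batches.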

Results from integrating 8 scRNA-Seq datasets

Uncorrected
uncorrected-batch-umap

Mutual nearest neighbors (Haghverdi et al., 2018)
MNN-integrated-umap

Seurat anchors (Stuart et al., 2019)
seurat-integrated-umap

Seurat Clustering

seurat-cluster_number-umap

Cell type annotation using CellMarker

CellMarker (paper) is a manually curated database built from over 100k papers. It includes >13k cell markers of 467 cell types spanning 158 tissues (db link). The database can be downloaded as flat files and compared against the top markers of each cluster. Here, we used the human cell markers file (Human_cell_markers.txt), which assigns single genes (and their respective proteins) to a corresponding cell and tissue type. An alternative is the single cell markers file (Single_cell_markers.txt), which is derived from scRNA-Seq data and assigns multiple markers per cell type.
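
The comparison between cluster markers and the database can be sketched as a simple lookup and vote. The snippet below is a minimal illustration, not the code used for this analysis: the function name `annotate_clusters`, the gene names, and the cell type labels are all placeholders, and the gene-to-cell-type dictionary is assumed to have been built beforehand from the downloaded marker file.

```python
from collections import Counter


def annotate_clusters(cluster_markers, marker_to_cell_types):
    """Label each cluster with the cell type supported by the most top markers.

    cluster_markers: {cluster id: [top marker genes]}
    marker_to_cell_types: {gene: [cell types listing that gene as a marker]}
    """
    labels = {}
    for cluster, genes in cluster_markers.items():
        votes = Counter()
        for gene in genes:
            # Genes absent from the database simply contribute no votes.
            votes.update(marker_to_cell_types.get(gene, []))
        labels[cluster] = votes.most_common(1)[0][0] if votes else "unassigned"
    return labels


# Placeholder data (hypothetical genes and labels, for illustration only).
marker_to_cell_types = {
    "GENE_A": ["cell type X"],
    "GENE_B": ["cell type X", "cell type Y"],
    "GENE_C": ["cell type Y"],
}
cluster_markers = {
    1: ["GENE_A", "GENE_B", "GENE_Z"],
    2: ["GENE_C", "GENE_B"],
    3: ["GENE_Q"],
}

labels = annotate_clusters(cluster_markers, marker_to_cell_types)
print(labels)
```

A majority vote is only one possible scoring rule. In practice it is worth inspecting all candidate labels for a cluster, since a single marker can be shared by several cell types.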

Using the CellMarker database, clusters 1, 5, 7, 9, 10 and 11 were characterized by cell labels such as “Cancer stem cell” and/or “mesenchymal progenitor cell”. Interestingly, these cancer subpopulations are enriched for different combinations of markers, such as CD44, CXCR4, ICAM1, CD24, MCAM, ABCG2 and FLT1/VEGFR1.

cancer_markers_dotplot

How are these markers reflected in the literature? CD44 is expressed at high levels on tumor cells at the periphery (1). CD24+ tumor cells are known to be highly proliferative, migratory and invasive (2). CXCR4 is expressed in a subset of glioma cells with enhanced tumorigenicity (3). ABCG2 is a drug transporter that is overexpressed in glioma stem cells compared with astrocytes. Increased ABCG2 expression reduces the buildup of chemotherapy drugs, which suggests that this subpopulation (cluster 10) may be more resistant to chemotherapy (4). Cluster 10 was also enriched for FLT1/VEGFR1 and PECAM1 expression, suggesting that this subpopulation is angiogenic (5).

Gene set enrichment analysis using AUCell

As another approach, we tested for gene set enrichment (using Gene Ontology annotations) with AUCell. In AUCell, genes are ranked by expression within each cell. Next, for each gene set, a recovery curve is computed along each cell’s ranking, restricted to the top of the ranking (the top 5% of expressed genes). The area under this curve (AUC) indicates the proportion of genes in a gene set that are highly expressed in a given cell.
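
The scoring step described above can be sketched for a single cell as follows. This is a simplified illustration in plain Python, not AUCell's actual implementation: the function name `auc_score`, the normalization by the best achievable curve, and the toy expression values are assumptions made for this example, and ties are kept in input order rather than handled as AUCell does.

```python
def auc_score(expression, gene_set, top_fraction=0.05):
    """Area under the recovery curve of `gene_set` within the top-ranked
    genes of one cell, scaled to [0, 1].

    expression: {gene: expression value} for a single cell
    gene_set: set of gene names
    top_fraction: share of the ranking that is considered
    """
    # Rank genes from highest to lowest expression.
    ranked = sorted(expression, key=expression.get, reverse=True)
    max_rank = max(1, round(top_fraction * len(ranked)))

    hits = 0
    area = 0
    for gene in ranked[:max_rank]:
        if gene in gene_set:
            hits += 1
        area += hits  # height of the recovery curve at this rank

    # Best case: every recoverable gene of the set sits at the very top.
    n = min(len(gene_set), max_rank)
    best = sum(min(rank, n) for rank in range(1, max_rank + 1))
    return area / best if best else 0.0


# Toy cell with 20 genes: g0 is the most expressed, g19 the least.
cell = {f"g{i}": 20 - i for i in range(20)}

# With top_fraction=0.25, only the 5 highest-ranked genes are considered.
score_mixed = auc_score(cell, {"g0", "g2", "g19"}, top_fraction=0.25)
score_top = auc_score(cell, {"g0", "g1"}, top_fraction=0.25)
score_low = auc_score(cell, {"g19"}, top_fraction=0.25)
print(score_mixed, score_top, score_low)
```

In this toy cell, a gene set whose members all sit at the top of the ranking scores 1, and a set whose members fall outside the considered window scores 0. Gene sets with members spread through the window land in between, which is what allows cells to be compared by how actively they express a given set.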

AUCell_gene_sets

Interestingly, several gene sets of biological relevance were significantly enriched. Previous work has identified IL-8 in the glioblastoma secretome, where it stimulates pro-angiogenic signals and endothelial permeability (6). This is further supported by the enrichment of the blood vessel remodeling gene set. In addition, we identified 2 immune populations: T cells and antigen-presenting cells (likely microglia). Microglia are well-known drivers of brain tumor pathobiology and can make up one third of the tumor (7). These results suggest that gene set enrichment can be used to identify aberrant pathways in tumor subpopulations and to study heterogeneity in cell types and pathway usage.