Identifying cancer populations from scRNA-seq

3 minute read

Hackathon team: Sara Grimm, Jason Wang, Miko Liu, Matt Bernstein repository link

Background and Objective

Previous work by Matt Bernstein (and others) explored scRNA-seq data of 8 high-grade glioma tumor samples from “Single-cell transcriptome analysis of lineage diversity in high-grade glioma” by Yuan et al (PMID: 30041684) paper. See their repository for their results.

Our primary objective is to extend their analysis by developing a method to stratify cells in a given scRNA-seq dataset according to malignancy status. For this stratification we rely primarily on a CNV (copy number variation) metric.

We also hoped to (a) identify gene markers or gene sets correlated with malignancy status, and (b) associate specific cell types from clustered scRNA-seq data with malignancy status.

Workflow

Workflow Steps and Code Bits

Step 1: Seurat

For each tumor sample, run Seurat to extract the following: normalized read counts (SCT) per gene per cell, assigned cluster per cell, UMAP coordinates, average expression per gene per cluster, and marker genes per cluster. The R workflow using Seurat can be found at working_data/seurat_simple/{sampleID}/protocol-{sampleID}.Rtxt

Step 2: Process inferCNV data.

(Thanks to Matt B for providing inferCNV outputs for each of the 8 tumor samples using InferCNV.) From the ~.observations.txt and ~.references.txt files output from inferCNV, the summed and average |locusScore-medianScore| were calculated over all available loci to provide aggregate CNV metrics per cell.

  id = samples[S];
  txtfile=paste(id, ".aggr_medianDelta_per_cell.txt", sep="");
  out=paste("cellID", "cellGroup", "aggregate", "locusCt", "median", sep="\t");
  write.table(out, file=txtfile, sep="\t", quote=F, append=FALSE, row.names=F, col.names=F);
  infileR=paste(id, ".infercnv.references.txt", sep="");
  dataR=read.delim(infileR, header=TRUE, row.names=1);
  nC=ncol(dataR);  # cells
  nL=nrow(dataR);  # loci
  medianR=median(unlist(dataR));
  for (N in 1:nC) {
    dd=dataR[,N];
    delta=abs(dd-medianR);
    deltaSum=sum(delta);
    thiscell=colnames(dataR)[N];
    out=paste(thiscell, "reference", deltaSum, nL, medianR, sep="\t");
    write.table(out, file=txtfile, sep="\t", quote=F, append=TRUE, row.names=F, col.names=F);
  }
  infileT=paste(id, ".infercnv.observations.txt", sep="");
  dataT=read.delim(infileT, header=TRUE, row.names=1);
  nC=ncol(dataT);  # cells
  nL=nrow(dataT);  # loci
  medianT=median(unlist(dataT));
  for (N in 1:nC) {
    dd=dataT[,N];
    delta=abs(dd-medianT);
    deltaSum=sum(delta);
    thiscell=colnames(dataT)[N];
    out=paste(thiscell, "sample", deltaSum, nL, medianT, sep="\t");
    write.table(out, file=txtfile, sep="\t", quote=F, append=TRUE, row.names=F, col.names=F);
  }

Then the distribution of the average |locusScore-medianScore| results were plotted to see if the reference and tumor samples were different. (They are!) After repeating over the 8 tumors, we noted that the distribution of these scores in the reference cells are consistently below 0.02, so we are using this as the threshold for what we are confident(???) are non-malignant cells. These plots can be viewed in working_data/inferCNV_distr.

  id = samples[S];
  infile=paste(id, ".aggr_medianDelta_per_cell.txt", sep="");
  data=read.delim(infile, header=TRUE);
  ddR=subset(data, (data$cellGroup == "reference"));
  ddS=subset(data, (data$cellGroup == "sample"));
  loci=ddR[1,4];
  xmin=0; xmax=0.08;
  pngfile=paste(id, ".avg_medianDelta_per_cell.distr.png", sep="");
  png(pngfile, h=500, w=500, res=120);
  maintitle=paste(id, ": average |score-median|\n(", loci, " inferCNV loci)", sep="");
  plot(density(ddR$aggregate/loci), lwd=2, col="black", main=maintitle, xlab="", xlim=c(xmin,xmax), ylim=c(0,yy[S]));
  lines(density(ddS$aggregate/loci), lwd=2, col="limegreen");
  legend("topright", c("reference","sample"), col=c("black","limegreen"), pch=20);
  dev.off();

Step 3: Combine Seurat cell cluster and inferCNV data.

We then integrated the cell cluster assingments from Seurat with the aggregate inferCNV score we calculated, then visualized via UMAP to look at relative location of low/high CNV cells compared to cells with high expression of proliferation marker genes.

Screenshot Examples and Output

yellow: non-malignant
blue: malignant

PJ018 UMAP-byRankICNV UMAP_proliferation_PJ018

PJ030 UMAP-byRankICNV UMAP_proliferation_PJ030

Additional Visualizations

Annotation of cell clusters using PlangaoDB

Cells with aggregate inferCNV score in range of reference (assumed non-malignant (red)) cells.

Expression of selected markers

biomarker_expression-SOX2 biomarker_expression-OLIG2

Future Directions

Moving forward, we would like to add gene set enrichment analysis to use enriched gene sets consistent with the caner phenotype, including proliferation and developmental pathways. This would provide a more rigorous statistical score for each cluster as opposed to the proliferation score currently used.
Furthermore, testing other tools for annotating gene clusters with their respective cell types would aid in identifying non-malignant clusters.
Examine specific genes for abnormal CNV from the inferred CNV data (TP53, MDM2, RTK, RAS, PI3K, RB, CDK4)
Use identified cancer cells on the scRNA-Seq dataset integrating all 8 donors to identify an improved boundary (such as using supporter vector machines) rather than the principal components 1 & 2 defined in the paper.

Share on

Twitter Facebook Google+ LinkedIn

Jason Wang

Identifying cancer populations from scRNA-seq

Background and Objective

Workflow

Workflow Steps and Code Bits

Screenshot Examples and Output

Additional Visualizations

Annotation of cell clusters using PlangaoDB

Cells with aggregate inferCNV score in range of reference (assumed non-malignant (red)) cells.

Expression of selected markers

Future Directions

Share on

You May Also Enjoy

Multiomics data integration and prediction with mixomics

Streamlit app for image classification and model evaluation with load testing on GCP

AWS AutoML model generation with scheduled retraining via AWS Cloudwatch and Lambda

PySpark in AWS for predicting small molecule pIC50