Identifying cancer populations from scRNA-seq
Hackathon team: Sara Grimm, Jason Wang, Miko Liu, Matt Bernstein repository link
Background and Objective
Previous work by Matt Bernstein (and others) explored scRNA-seq data of 8 high-grade glioma tumor samples from “Single-cell transcriptome analysis of lineage diversity in high-grade glioma” by Yuan et al (PMID: 30041684) paper. See their repository for their results.
Our primary objective is to extend their analysis by developing a method to stratify cells in a given scRNA-seq dataset according to malignancy status. For this stratification we rely primarily on a CNV (copy number variation) metric.
We also hoped to (a) identify gene markers or gene sets correlated with malignancy status, and (b) associate specific cell types from clustered scRNA-seq data with malignancy status.
Workflow
Workflow Steps and Code Bits
Step 1: Seurat
For each tumor sample, run Seurat to extract the following: normalized read counts (SCT) per gene per cell, assigned cluster per cell, UMAP coordinates, average expression per gene per cluster, and marker genes per cluster. The R workflow using Seurat can be found at working_data/seurat_simple/{sampleID}/protocol-{sampleID}.Rtxt
Step 2: Process inferCNV data.
(Thanks to Matt B for providing inferCNV outputs for each of the 8 tumor samples using InferCNV.) From the ~.observations.txt and ~.references.txt files output from inferCNV, the summed and average |locusScore-medianScore| were calculated over all available loci to provide aggregate CNV metrics per cell.
id = samples[S];
txtfile=paste(id, ".aggr_medianDelta_per_cell.txt", sep="");
out=paste("cellID", "cellGroup", "aggregate", "locusCt", "median", sep="\t");
write.table(out, file=txtfile, sep="\t", quote=F, append=FALSE, row.names=F, col.names=F);
infileR=paste(id, ".infercnv.references.txt", sep="");
dataR=read.delim(infileR, header=TRUE, row.names=1);
nC=ncol(dataR); # cells
nL=nrow(dataR); # loci
medianR=median(unlist(dataR));
for (N in 1:nC) {
dd=dataR[,N];
delta=abs(dd-medianR);
deltaSum=sum(delta);
thiscell=colnames(dataR)[N];
out=paste(thiscell, "reference", deltaSum, nL, medianR, sep="\t");
write.table(out, file=txtfile, sep="\t", quote=F, append=TRUE, row.names=F, col.names=F);
}
infileT=paste(id, ".infercnv.observations.txt", sep="");
dataT=read.delim(infileT, header=TRUE, row.names=1);
nC=ncol(dataT); # cells
nL=nrow(dataT); # loci
medianT=median(unlist(dataT));
for (N in 1:nC) {
dd=dataT[,N];
delta=abs(dd-medianT);
deltaSum=sum(delta);
thiscell=colnames(dataT)[N];
out=paste(thiscell, "sample", deltaSum, nL, medianT, sep="\t");
write.table(out, file=txtfile, sep="\t", quote=F, append=TRUE, row.names=F, col.names=F);
}
Then the distribution of the average |locusScore-medianScore| results were plotted to see if the reference and tumor samples were different. (They are!) After repeating over the 8 tumors, we noted that the distribution of these scores in the reference cells are consistently below 0.02, so we are using this as the threshold for what we are confident(???) are non-malignant cells. These plots can be viewed in working_data/inferCNV_distr.
id = samples[S];
infile=paste(id, ".aggr_medianDelta_per_cell.txt", sep="");
data=read.delim(infile, header=TRUE);
ddR=subset(data, (data$cellGroup == "reference"));
ddS=subset(data, (data$cellGroup == "sample"));
loci=ddR[1,4];
xmin=0; xmax=0.08;
pngfile=paste(id, ".avg_medianDelta_per_cell.distr.png", sep="");
png(pngfile, h=500, w=500, res=120);
maintitle=paste(id, ": average |score-median|\n(", loci, " inferCNV loci)", sep="");
plot(density(ddR$aggregate/loci), lwd=2, col="black", main=maintitle, xlab="", xlim=c(xmin,xmax), ylim=c(0,yy[S]));
lines(density(ddS$aggregate/loci), lwd=2, col="limegreen");
legend("topright", c("reference","sample"), col=c("black","limegreen"), pch=20);
dev.off();
Step 3: Combine Seurat cell cluster and inferCNV data.
We then integrated the cell cluster assingments from Seurat with the aggregate inferCNV score we calculated, then visualized via UMAP to look at relative location of low/high CNV cells compared to cells with high expression of proliferation marker genes.
Screenshot Examples and Output
yellow: non-malignant
blue: malignant
Additional Visualizations
Annotation of cell clusters using PlangaoDB
Cells with aggregate inferCNV score in range of reference (assumed non-malignant (red)) cells.
Expression of selected markers
Future Directions
- Moving forward, we would like to add gene set enrichment analysis to use enriched gene sets consistent with the caner phenotype, including proliferation and developmental pathways. This would provide a more rigorous statistical score for each cluster as opposed to the proliferation score currently used.
- Furthermore, testing other tools for annotating gene clusters with their respective cell types would aid in identifying non-malignant clusters.
- Examine specific genes for abnormal CNV from the inferred CNV data (TP53, MDM2, RTK, RAS, PI3K, RB, CDK4)
- Use identified cancer cells on the scRNA-Seq dataset integrating all 8 donors to identify an improved boundary (such as using supporter vector machines) rather than the principal components 1 & 2 defined in the paper.