Multiomics data integration and prediction with mixomics

4 minute read

Integrating RNA, phospho-proteomic, and proteomic data with PLS in Mixomics
pkg=c("mixOmics", "dplyr")

lapply(pkg, require, character.only = TRUE)
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
setwd("/Users/jason/Documents/git_repos/cancer_omics")

Given proteomics, rna, and phospho-proteomics datda, can you predict phosphorylation response to EGF treatment?

We have initial state for each of these 3 data typles. The goal is to predict the phosphorylation of MEKK4 and IdU after 1 hour of EGF treatment

Load proteomics and RNA data:

rna=read.csv("/Users/jason/Documents/git_repos/cancer_omics/rnaseqexononly_marcotte.csv")
prot= read.csv("/Users/jason/Documents/git_repos/cancer_omics/proteomics_consolidated.csv")

#set cell_line to row_names
rownames(rna)= rna$X
rownames(prot)= prot$X

rna=select(rna, -"X")
prot=select(prot, -"X")


#order by rownames
rna=rna[order(row.names(rna)),]
prot=prot[order(row.names(prot)),]

Load phospho-proteomics data:

phosphoprot=read.csv("/Users/jason/Documents/git_repos/cancer_omics/median_phospho_data.csv")
phosphoprot =phosphoprot[ order(phosphoprot$cell_line), ] #sort by cell_line name
#basal=phospho[phospho$treatment=="full",]

#control/baseline phospho data used for prediction
phospho=filter(phosphoprot, treatment=="EGF", time==0) #baseline phospho data
rownames(phospho)= phospho$cell_line #set cell_line to rowname
phospho=select(phospho, -c("treatment","time","cell_line","p.MKK4","p.GSK3b")) #drop unneeded & predictor columns


##predictor column
Y=filter(phosphoprot, treatment=="EGF", time==60)  #predicted column, EGF phoshorylation result 1hr after treatment
rownames(Y)= Y$cell_line

Y= select(Y,c('p.MKK4','p.GSK3b')) #drop other columns


#comparing repsonses for each phosphoprotein
boxplot(phospho, use.columns=TRUE, xlab="phospho-protein", ylab="phosphorylation level", main="Phoshporylation distributions") #skip non-numeric columns

Train-test split on data. Creating input X matrix and dependent matrix Y

#final list has 59 cell lines, split off 10% (6 lines from test set validation)
cell_inter=intersect(rownames(rna), rownames(prot))
cell_inter= intersect(cell_inter, rownames(phospho))

incommon=length(cell_inter)
#train/test: 90/10 split based on cell lines
train= head(cell_inter, incommon-6)
test= tail(cell_inter, 6)


rna_train=rna[rownames(rna) %in% train,]  %>% as.matrix()
prot_train=prot[rownames(prot) %in% train,]  %>% as.matrix()
phos_train=phospho[rownames(phospho) %in% train,] %>% as.matrix()

Ytrain=Y[rownames(Y) %in% train,] %>% as.matrix() #2 columns


rna_test=rna[rownames(rna) %in% test,] #%>% as.matrix()
prot_test=prot[rownames(prot) %in% test,] # %>% as.matrix()
phos_test=phospho[rownames(phospho) %in% test,] # %>% as.matrix()
Ytest=Y[rownames(Y) %in% test,] %>% as.matrix()


#mixomics expects list of matrices (with each matrix containing a datatype)
X <- list(RNA = rna_train ,
          phospho = phos_train,
          protein = prot_train)

Fit regression model.

#reduce complexity, remove if you have low sample number
#keepx: number of (top) variables to select from the first two components of PCA
#list.keepX <- list(RNA = c(20, 20), phospho = c(8,8), protein = c(5, 5))

#features are centered and scaled by default
MyResult.diablo <- block.pls(X, Ytrain)
## Warning in cor(A[[k]], variates.A[[k]]): the standard deviation is zero

## Warning in cor(A[[k]], variates.A[[k]]): the standard deviation is zero
loadings1=selectVar(MyResult.diablo, comp = 1) #loadings for component 1
loadings2=selectVar(MyResult.diablo, comp = 2) #loadings for component 2

#loadings$RNA$value # will have loadinpgs for RNA block

#raw data can be pulled from :
#MyResult.diablo$X


#number of components used for each block:
#MyResult.diablo$ncomp

#variates
#MyResult.diablo$variates

#weighting on each components
#MyResult.diablo$weights

Plot where individual cell lines fit on new components in each data type.

#reduce complexity, remove if you have low sample number
#keepx: number of (top) variables to select from the first two components of PCA
#list.keepX <- list(RNA = c(20, 20), phospho = c(8,8), protein = c(5, 5))

#features are centered and scaled by default
MyResult.diablo <- block.pls(X, Ytrain)
## Warning in cor(A[[k]], variates.A[[k]]): the standard deviation is zero

## Warning in cor(A[[k]], variates.A[[k]]): the standard deviation is zero
plotIndiv(MyResult.diablo, ind.names = FALSE)

Documentation on how the “predict” function works https://rdrr.io/cran/mixOmics/man/predict.html

Generating predictions based on the pls model and evaluating the predictions.

#assembling test data
Xtest <- list(RNA = rna_test ,
          phospho = phos_test,
          protein = prot_test)

#generating predictions and extracting data
ypredict=predict(MyResult.diablo, newdata = Xtest)


#missing values in proteomic test data lead to NaN prediction from that block
#instead of averaging from all 3 datatypes, average from the RNA and phospho block

#grab dim2 columns and average
avg_pred=(ypredict$predict$RNA[,,2] +ypredict$predict$phospho[,,2])/2

#phosphorylation predictions aftr 1hr EGF treatment for 6 cell lines"
print(avg_pred)
##            p.MKK4  p.GSK3b
## T47D     3.552404 3.602751
## UACC3199 4.147216 3.684957
## UACC812  3.596225 3.201227
## UACC893  3.266918 3.536389
## ZR751    3.655112 3.495032
## ZR7530   3.680280 3.386271
#calculate RMSE
rmse=colMeans((avg_pred- Ytest)^2) %>% sqrt()
print(rmse)
##    p.MKK4   p.GSK3b 
## 0.4893550 0.2828894

Weighting of each feature in each loading.

plotVar(MyResult.diablo, overlap= F) # too many features to visualize

Top features for each loading in component 1 and 2

#component 1 loading
plotLoadings(MyResult.diablo,comp=1, contrib = "max")

#component 2 loading
plotLoadings(MyResult.diablo,comp=2, contrib = "max")

How do the top loadings match expectations? The phospho-proteomic data has the largest loadings by magnitude and largest proportion of explained variance. The top loadings for the phospho-proteome align with expectations for both MEKK4 and p.GSK3b.

  • MEKK4 is mitogen activated protein kinase, and is expected to be phosphoylated with mitogen treatment (such as EGF). Within the network of MEKK4 signaling are: other MAPK/MEKK/ERK signaling and TGF-beta/SMAD signaling. We find the components of the loadings to be significant and in the same direction as p.MKK4.
#correlation with downregulation of p.mkk4
head(loadings1$phospho$value,15)
##              value.var
## p.STAT3     -0.4669730
## Ki.67       -0.4034507
## p.MKK3.MKK6 -0.3238726
## CyclinB     -0.2813139
## p.p38       -0.2793142
## p.AMPK      -0.2603531
## p.SMAD23    -0.2192462
## p.MEK        0.1803140
## p.STAT1     -0.1742372
## p.MAPKAPK2   0.1476032
## p.4EBP1      0.1432966
## p.JNK       -0.1420797
## cleavedCas  -0.1367148
## p.PDPK1     -0.1240463
## p.ERK       -0.1137064

With component 2, we can better see correlations with p-GSK3b. We see a similar trend with known targets beta catenin and 4EBP1.

#correlation with downregulation of p.mkk4
head(loadings2$phospho$value,15)
##              value.var
## p.4EBP1     -0.4440809
## p.MAPKAPK2  -0.3852125
## Ki.67       -0.3238550
## CyclinB     -0.3192085
## p.MEK        0.2697003
## IdU         -0.2340423
## p.MKK3.MKK6  0.1771517
## GAPDH       -0.1707746
## p.CREB      -0.1647700
## p.FAK       -0.1510662
## p.PLCg2     -0.1425628
## b.CATENIN   -0.1421492
## p.STAT1     -0.1373693
## p.p53       -0.1320146
## p.SMAD23     0.1256009

Resources:

https://mixomicsteam.github.io/Bookdown/ https://www.bioconductor.org/packages/release/bioc/vignettes/mixOmics/inst/doc/vignette.html man page: https://www.bioconductor.org/packages/devel/bioc/manuals/mixOmics/man/mixOmics.pdf https://rstudio-pubs-static.s3.amazonaws.com/376158_5a59d491abfd4266acaa05773bc1751a.html

Updated: