Abstract
Tutorial on how to use the RNAseqNet package to infer networks from RNA-seq expression datasets with or without an auxiliary dataset.The R package RNAseqNet
can be used to infer networks from RNA-seq expression data. The count data are given as a \(n \times p\) matrix in which \(n\) is the number of individuals and \(p\) the number of genes. This matrix is denoted by \(\mathbf{X}\) in the sequel.
Eventually, the RNA-seq dataset is complemented with an \(n' \times d\) matrix, \(\mathbf{Y}\) which can be used to impute missing individuals in \(\mathbf{X}\) as described in [Imbert et al., 2018]1.
Two datasets are available in the package: lung
and thyroid
with \(n = 221\) rows and respectively 100 and 50 columns. The raw data were downloaded from https://gtexportal.org/. The TMM normalisation of RNA-seq expression was performed with the R package edgeR
.
Data are loaded with:
data(lung)
boxplot(log2(lung + 1), las = 3, cex.names = 0.5)
data(thyroid)
boxplot(log2(thyroid + 1), las = 3, cex.names = 0.5)
Network inference from RNA-seq data is performed with the Poisson GLM model described in [Allen and Liu, 2012]2. The inference can be performed with the function GLMnetwork
as follows:
lambdas <- 4 * 10^(seq(0, -2, length = 10))
ref_lung <- GLMnetwork(lung, lambdas = lambdas)
The entry path
of ref_lung
contains length(lambdas)
= 10 matrices with estimated coefficients. Each matrix is a square matrix with ncol(lung)
= 100 rows and columns.
The choice of the most appropriate value for \(\lambda\) can be performed with the StARS criterion of [Liu et al., 2010]3, which is implemented in the function stabilitySelection
. The argument B
is used to specify the number of re-sampling used to compute the stability criterion:
set.seed(11051608)
stability_lung <- stabilitySelection(lung, lambdas = lambdas, B = 50)
plot(stability_lung)
The entry best
of stability_lung
is the index of the chosen \(\lambda\) in lambdas
. Here, the value \(\lambda=\) lambdas[stability_lung$best]
= 0.3097055 is chosen.
The corresponding set of estimated coefficients, is in ref_lung$path[[stability_lung$best]]
and can be transformed into a network with the function GLMnetToGraph
:
lung_refnet <- GLMnetToGraph(ref_lung$path[[stability_lung$best]])
print(lung_refnet)
## IGRAPH 488850b UN-- 100 454 --
## + attr: name (v/c)
## + edges from 488850b (vertex names):
## [1] MT-CO1--MT-ND4 MT-CO1--SFTPA2 MT-CO1--hsa-mir-6723
## [4] MT-CO1--A2M MT-CO1--MT-CO2 MT-CO1--MT-CO3
## [7] MT-CO1--MT-RNR2 MT-CO1--MT-ATP6 MT-CO1--MT-ND1
## [10] MT-CO1--MTND2P28 MT-CO1--MT-ND6 MT-CO1--SAT1
## [13] MT-CO1--HSPB1 MT-CO1--MTND4P12 MT-CO1--HLA-DRA
## [16] MT-CO1--MTND1P23 SFTPB --SFTPA2 SFTPB --FN1
## [19] SFTPB --B2M SFTPB --NEAT1 SFTPB --SLC34A2
## [22] SFTPB --PGC SFTPB --CTSD SFTPB --TSC22D3
## + ... omitted several edges
set.seed(1243)
plot(lung_refnet, vertex.size = 5, vertex.color = "orange",
vertex.frame.color = "orange", vertex.label.cex = 0.5,
vertex.label.color = "black")
In this section, we artificially remove some of the observations in lung
to create missing individuals (as compared to those in thyroid
):
set.seed(1717)
nobs <- nrow(lung)
miss_ind <- sample(1:nobs, round(0.2 * nobs), replace = FALSE)
lung[miss_ind, ] <- NA
lung <- na.omit(lung)
boxplot(log2(lung + 1), las = 3, cex.names = 0.5)
The method described in [Imbert et al., 2018] is thus used to infer a network for lung
expression data, imputing missing individuals from the information provided between gene expressions by the thyroid
dataset.
The first step of the method is to choose a relevant value for the donor list parameter, \(\sigma\). This is done computing \(V_{\textrm{intra}}\), the intra-variability in donor pool, for various values of \(\sigma\). An elbow rule is thus used to choose an appropriate value:
sigmalist <- 1:5
sigma_stats <- chooseSigma(lung, thyroid, sigmalist)
p <- ggplot(sigma_stats, aes(x = sigma, y = varintra)) + geom_point() +
geom_line() + theme_bw() +
ggtitle(expression("Evolution of intra-pool homogeneity versus" ~ sigma)) +
xlab(expression(sigma)) + ylab(expression(V[intra])) +
theme(title = element_text(size = 10))
print(p)
Here, \(\sigma = 2\) is chosen. Finally, hd-MI is processed with the chosen \(\sigma\), a list of regularization parameters \(\lambda\) that are selected with with the StARS criterion (from B = 10
subsamples) in m = 100
replicates of the inference, all performed on a different imputed dataset. The function imputedGLMnetwork
is the one implementing the full method. The distribution of edge frequency among the m = 100
inferred network is obtained with the function plot
applied to the result of this function.
set.seed(16051244)
lung_hdmi <- imputedGLMnetwork(lung, thyroid, sigma = 2, lambdas = lambdas,
m = 100, B = 10)
plot(lung_hdmi)
Finally, the final graph is extracted using the function GLMnetToGraph
on the result of the function ``imputedGLMnetwork``` and providing a threshold for edge frequency prediction.
lung_net <- GLMnetToGraph(lung_hdmi, threshold = 0.9)
lung_net
## IGRAPH a9aad3a UN-- 100 132 --
## + attr: name (v/c)
## + edges from a9aad3a (vertex names):
## [1] MT-CO1 --MT-ND4 MT-CO1 --MT-CO2
## [3] MT-CO1 --MT-CO3 MT-CO1 --MT-RNR2
## [5] MT-CO1 --MT-ND1 SFTPB --SLC34A2
## [7] SFTPB --PGC SFTPB --NAPSA
## [9] SFTPB --ABCA3 SFTPC --SFTPA1
## [11] MT-ND4 --MT-CYB MT-ND4 --MT-ND4L
## [13] SFTPA2 --SFTPA1 SFTPA2 --NAPSA
## [15] hsa-mir-6723--MTATP6P1 hsa-mir-6723--RP5-857K21.11
## + ... omitted several edges
set.seed(1605)
plot(lung_net, vertex.size = 5, vertex.color = "orange",
vertex.frame.color = "orange", vertex.label.cex = 0.5,
vertex.label.color = "black")
Here is the output of sessionInfo()
on the system on which this document was compiled:
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RNAseqNet_0.1.3 ggplot2_3.2.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.8 purrr_0.3.2
## [4] splines_3.6.3 lattice_0.20-40 hot.deck_1.1-1
## [7] colorspace_1.4-1 generics_0.0.2 htmltools_0.3.6
## [10] yaml_2.2.0 pan_1.6 survival_3.1-8
## [13] rlang_0.4.0 jomo_2.6-8 pillar_1.4.2
## [16] nloptr_1.2.1 glue_1.3.1 withr_2.1.2
## [19] foreach_1.4.4 stringr_1.4.0 munsell_0.5.0
## [22] gtable_0.3.0 codetools_0.2-16 evaluate_0.14
## [25] labeling_0.3 knitr_1.23 parallel_3.6.3
## [28] broom_0.5.2 Rcpp_1.0.1 scales_1.0.0
## [31] backports_1.1.4 lme4_1.1-21 digest_0.6.20
## [34] stringi_1.4.3 dplyr_0.8.3 grid_3.6.3
## [37] tools_3.6.3 magrittr_1.5 lazyeval_0.2.2
## [40] glmnet_2.0-18 tibble_2.1.3 mice_3.6.0
## [43] crayon_1.3.4 tidyr_0.8.3 pkgconfig_2.0.2
## [46] MASS_7.3-51.5 Matrix_1.2-18 assertthat_0.2.1
## [49] minqa_1.2.4 rmarkdown_1.14 PoiClaClu_1.0.2.1
## [52] iterators_1.0.10 rpart_4.1-15 mitml_0.3-7
## [55] R6_2.4.0 boot_1.3-24 igraph_1.2.4.1
## [58] nnet_7.3-12 nlme_3.1-144 compiler_3.6.3
Imbert, A., Valsesia, A., Le Gall, C., Armenise, C., Lefebvre, G., Gourraud, P.A., Viguerie, N. and Villa-Vialaneix, N. (2018) Multiple hot-deck imputation for network inference from RNA sequencing data. Bioinformatics. DOI: http://dx.doi.org/10.1093/bioinformatics/btx819↩
Allen, G. and Liu, Z. (2012) A log-linear model for inferring genetic networks from high-throughput sequencing data. In Proceedings of IEEE International Conference on Bioinformatics and Biomedecine (BIBM)↩
Liu, H., Roeber, K. and Wasserman, L. (2010) Stability approach to regularization selection (StARS) for high dimensional graphical models. In Proceedings of Neural Information Processing Systems (NIPS 2010), 23, 1432-1440, Vancouver, Canada.↩