With the R package TopKlists, different ranked lists can be integrated given a common set of item identifiers. This show case illustrates how to integrate ranked lists derived from different miRNA abundance measurement techniques. The sample data is based on data available on GEO (Series GSE51508, Pubmed ID:24445778):
For more detailed information about the functions provided by TopKLists, please have a look at the package Vignette and the reference manual.
install.packages("TopKLists")
Once TopKlists has been installed, it can be loaded.
library(TopKLists)
res_51501 = read.table("data/GSE51501_results.txt", header=TRUE, quote="\"", stringsAsFactors=FALSE) res_51504 = read.table("data/GSE51504_results.txt", header=TRUE, quote="\"", stringsAsFactors=FALSE) res_51507 = read.table("data/GSE51507_results.txt", header=TRUE, quote="\"", stringsAsFactors=FALSE)
Next, missing values are removed.
res_51507 = res_51507[complete.cases(res_51507),] res_51501 = res_51501[complete.cases(res_51501),] res_51504 = res_51504[complete.cases(res_51504),]
As the final analysis should only consider items that are common to all three lists, we filter for these entries.
common = intersect(intersect(res_51501$mirname, res_51507$mirname), res_51504$mirname) length(common)
## [1] 531
We see that there are 531 common miRNAs left for further analysis. In the next step, each data set is reduced to contain the common items only. Then, we filter for the common entries and create a data.frame containing the common items ordered by their respective p-values.
res_51507_common = res_51507[res_51507$mirname %in% common,] res_51501_common = res_51501[res_51501$mirname %in% common,] res_51504_common = res_51504[res_51504$mirname %in% common,] data_common = data.frame(HTS = res_51507_common$mirname, BeadChip = res_51501_common$mirname, NanoString = res_51504_common$mirname, stringsAsFactors=FALSE) dim(data_common)
## [1] 531 3
head(data_common)
## HTS BeadChip NanoString ## 1 hsa-miR-143 hsa-miR-576-5p hsa-miR-199a-5p ## 2 hsa-miR-451 hsa-miR-490-5p hsa-miR-451 ## 3 hsa-miR-223 hsa-miR-139-5p hsa-miR-144 ## 4 hsa-miR-144 hsa-miR-223 hsa-miR-223 ## 5 hsa-miR-199a-5p hsa-miR-1233 hsa-miR-143 ## 6 hsa-miR-133a hsa-miR-1284 hsa-miR-150
TopKListsGUI(data_common)
a=deltaplot(data_common,subset.lists=300,deltas=1:100)
An appropriate range of ν-values is that which has little impact on the estimate of maxK. Its appropriateness can be again tested using the GUI. A general recommendation is a value between 2 and 10 for small datasets and one in the order of tens for larger datasets (10 is used as default). The selected value of ν=22 satisfies this requirement. These settings resulted in maxK=12, at which the input lists were cut (see screenshot).
The TopKListsGUI Summary Table already includes the aggregated miRNAs (tagged by YES) as a result of the Cross Entropy Monte Carlo (CEMC) function (tuning parameters set to default). To be able to obtain the optimized rank order of the miRNAs, and to compare the CEMC results when different distance measures are applied (Kendall's tau is the default method and is used in the GUI; Spearman's footrule is another option), one must execute the CEMC function from the console.
First, we estimate the overall list length maxK using the function j0.multi (for the selection of delta and nu see above),
set.seed(123) res = j0.multi(data_common, d=40, v=22)
The value of maxK is saved in res$maxK. Then we obtain partial lists by truncating the original full lists at that length and save them in a list object.
l1 = as.character(data_common[1:res$maxK,1]) l2 = as.character(data_common[1:res$maxK,2]) l3 = as.character(data_common[1:res$maxK,3]) input=list(l1,l2,l3)
Next, we need to define the underlying space (see the Vignette for an explanation); in this case as the union of all miRNAs present in all three lists.
common=unique(unlist(input)) space=list(common,common,common)
Finally, we can run the CEMC function for the two different distance measures. Because the union of all miRNAs only comprises 28 items, one has to increase the number of samples, N, in each iteration in the order of a few thousand (the other tuning parameters are set to their default values).
outCEMC.kendall = CEMC(input, space, dm = "k") outCEMC.spearman = CEMC(input, space, dm = "s")
print(sessionInfo())
## R version 3.0.2 (2013-09-25) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## ## locale: ## [1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252 ## [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C ## [5] LC_TIME=German_Austria.1252 ## ## attached base packages: ## [1] parallel grid stats graphics grDevices utils datasets ## [8] methods base ## ## other attached packages: ## [1] gtools_3.4.1 bioDist_1.34.0 KernSmooth_2.23-13 ## [4] Biobase_2.22.0 BiocGenerics_0.8.0 GA_2.1 ## [7] permute_0.8-3 matrixcalc_1.0-3 MASS_7.3-35 ## [10] vcd_1.3-2 mc2d_0.1-15 mvtnorm_1.0-0 ## [13] gWidgets2RGtk2_1.0-2 memoise_0.2.1 gWidgetsRGtk2_0.0-82 ## [16] gWidgets_0.0-54 gWidgets2_1.0-5 digest_0.6.4 ## [19] RGtk2_2.20.31 TopKLists_1.0.3 knitr_1.7 ## ## loaded via a namespace (and not attached): ## [1] acepack_1.3-3.3 bitops_1.0-6 caTools_1.17.1 ## [4] cluster_1.15.3 colorspace_1.2-4 evaluate_0.5.5 ## [7] foreign_0.8-61 formatR_1.0 Formula_1.1-2 ## [10] gdata_2.13.3 gplots_2.14.2 highr_0.4 ## [13] Hmisc_3.14-5 lattice_0.20-29 latticeExtra_0.6-26 ## [16] nnet_7.3-8 RColorBrewer_1.0-5 rpart_4.1-8 ## [19] splines_3.0.2 stringr_0.6.2 survival_2.37-7 ## [22] tools_3.0.2
Last update: Mon Nov 10 16:25:51 2014