TopKLists show case for integrating miRNA measurements

With the R package TopKlists, different ranked lists can be integrated given a common set of item identifiers. This show case illustrates how to integrate ranked lists derived from different miRNA abundance measurement techniques. The sample data is based on data available on GEO (Series GSE51508, Pubmed ID:24445778):

For more detailed information about the functions provided by TopKLists, please have a look at the package Vignette and the reference manual.

Preparing the separate data sets

The raw datasets GSE51501 (BeadChip) and GSE51504 (NanoString) were normalized with the function 'normalize.quantiles' from the Bioconductor package 'preprocessCore'. Principal component analysis (PCA) was then used to verify, that the samples cluster into cell line and xenograft groups. Illumina probe names were aligned with Illumina miRNA names using the GEO annotation file GPL8179. For the final significance analysis, function 'SAM' from the R package 'samr' was used. The p-values were obtained using the function 'samr.pvalues.from.perms' (also 'samr' package). Differential expression analysis was applied to the raw counts from GSE51507 (HTS), with function 'DESeq' from the Bioconductor package 'DESeq2'.
The preprocessed files (space-separated and quoted) are available from the data directory (GSE51501, GSE51504, GSE51507). Note, that the list entries are ordered by their respective p-values.

Initiating the analysis

To run this example you have to load the TopKlists package, which is available on CRAN and can be installed using the following command. Note that for making use of the GUI, several dependencies must be met.
    install.packages("TopKLists")

Once TopKlists has been installed, it can be loaded.

    library(TopKLists)

Loading and filtering the data

First, the sample data for the three sets is loaded. Note, that each of the three lists contains the miRNA identifier and the p-values, which was used for ranking the items.
The compressed show case data can be downloaded here (right click and "save as").
    res_51501 = read.table("data/GSE51501_results.txt", header=TRUE, quote="\"", stringsAsFactors=FALSE)
    res_51504 = read.table("data/GSE51504_results.txt", header=TRUE, quote="\"", stringsAsFactors=FALSE)
    res_51507 = read.table("data/GSE51507_results.txt", header=TRUE, quote="\"", stringsAsFactors=FALSE)

Next, missing values are removed.

    res_51507 = res_51507[complete.cases(res_51507),]
    res_51501 = res_51501[complete.cases(res_51501),]
    res_51504 = res_51504[complete.cases(res_51504),]

As the final analysis should only consider items that are common to all three lists, we filter for these entries.

    common = intersect(intersect(res_51501$mirname, res_51507$mirname), res_51504$mirname)
    length(common)
## [1] 531

We see that there are 531 common miRNAs left for further analysis. In the next step, each data set is reduced to contain the common items only. Then, we filter for the common entries and create a data.frame containing the common items ordered by their respective p-values.

    res_51507_common = res_51507[res_51507$mirname %in% common,]
    res_51501_common = res_51501[res_51501$mirname %in% common,]
    res_51504_common = res_51504[res_51504$mirname %in% common,]
    data_common = data.frame(HTS = res_51507_common$mirname, BeadChip = res_51501_common$mirname, NanoString = res_51504_common$mirname, stringsAsFactors=FALSE)
    dim(data_common)
## [1] 531   3
    head(data_common)
##               HTS       BeadChip      NanoString
## 1     hsa-miR-143 hsa-miR-576-5p hsa-miR-199a-5p
## 2     hsa-miR-451 hsa-miR-490-5p     hsa-miR-451
## 3     hsa-miR-223 hsa-miR-139-5p     hsa-miR-144
## 4     hsa-miR-144    hsa-miR-223     hsa-miR-223
## 5 hsa-miR-199a-5p   hsa-miR-1233     hsa-miR-143
## 6    hsa-miR-133a   hsa-miR-1284     hsa-miR-150

Starting the GUI

Finally, we can run the GUI in order to integrate the three lists, by passing the combined data.frame lists as an input parameter.
    TopKListsGUI(data_common)
The iterative inference algorithm requires the specification of two tuning parameters, the distance δ and the pilot sample size ν; this is because unique results do not exist for the top-k list length problem. The selection of both parameters is described in detail in the TopKLists Vignette. The deltaplot is a helpful tool which helps to identify a suitable range of δ-values. Adequate values are found where the rate of the deltaplot's decrease begins to slow noticeably. The user is encouraged to use the GUI to explore the results for the suggested δ-values and to select the most appropriate one, while keeping it as small as possible.
        a=deltaplot(data_common,subset.lists=300,deltas=1:100)
For three input lists the graphical output (see below) is one deltaplot for each pair of lists, in our case 6 plots. We can see that the pairwise deltaplots indicate values in the range of 18 to 80 and averaging over all plots is a good compromise. When we calculate the truncation index via the GUI, the results are almost the same for δ-values between 40 and 70. Therefore we choose δ=40.

An appropriate range of ν-values is that which has little impact on the estimate of maxK. Its appropriateness can be again tested using the GUI. A general recommendation is a value between 2 and 10 for small datasets and one in the order of tens for larger datasets (10 is used as default). The selected value of ν=22 satisfies this requirement. These settings resulted in maxK=12, at which the input lists were cut (see screenshot).

Calculation of a consolidated aggregate list

The TopKListsGUI Summary Table already includes the aggregated miRNAs (tagged by YES) as a result of the Cross Entropy Monte Carlo (CEMC) function (tuning parameters set to default). To be able to obtain the optimized rank order of the miRNAs, and to compare the CEMC results when different distance measures are applied (Kendall's tau is the default method and is used in the GUI; Spearman's footrule is another option), one must execute the CEMC function from the console.

First, we estimate the overall list length maxK using the function j0.multi (for the selection of delta and nu see above),

        set.seed(123)
        res = j0.multi(data_common, d=40, v=22)

The value of maxK is saved in res$maxK. Then we obtain partial lists by truncating the original full lists at that length and save them in a list object.

        l1 =  as.character(data_common[1:res$maxK,1])
        l2 =  as.character(data_common[1:res$maxK,2])
        l3 =  as.character(data_common[1:res$maxK,3])
        input=list(l1,l2,l3)

Next, we need to define the underlying space (see the Vignette for an explanation); in this case as the union of all miRNAs present in all three lists.

        common=unique(unlist(input))
        space=list(common,common,common)

Finally, we can run the CEMC function for the two different distance measures. Because the union of all miRNAs only comprises 28 items, one has to increase the number of samples, N, in each iteration in the order of a few thousand (the other tuning parameters are set to their default values).

        outCEMC.kendall = CEMC(input, space, dm = "k")
        outCEMC.spearman = CEMC(input, space, dm = "s")

Sessioninfo

    print(sessionInfo())
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252   
## [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Austria.1252    
## 
## attached base packages:
## [1] parallel  grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] gtools_3.4.1         bioDist_1.34.0       KernSmooth_2.23-13  
##  [4] Biobase_2.22.0       BiocGenerics_0.8.0   GA_2.1              
##  [7] permute_0.8-3        matrixcalc_1.0-3     MASS_7.3-35         
## [10] vcd_1.3-2            mc2d_0.1-15          mvtnorm_1.0-0       
## [13] gWidgets2RGtk2_1.0-2 memoise_0.2.1        gWidgetsRGtk2_0.0-82
## [16] gWidgets_0.0-54      gWidgets2_1.0-5      digest_0.6.4        
## [19] RGtk2_2.20.31        TopKLists_1.0.3      knitr_1.7           
## 
## loaded via a namespace (and not attached):
##  [1] acepack_1.3-3.3     bitops_1.0-6        caTools_1.17.1     
##  [4] cluster_1.15.3      colorspace_1.2-4    evaluate_0.5.5     
##  [7] foreign_0.8-61      formatR_1.0         Formula_1.1-2      
## [10] gdata_2.13.3        gplots_2.14.2       highr_0.4          
## [13] Hmisc_3.14-5        lattice_0.20-29     latticeExtra_0.6-26
## [16] nnet_7.3-8          RColorBrewer_1.0-5  rpart_4.1-8        
## [19] splines_3.0.2       stringr_0.6.2       survival_2.37-7    
## [22] tools_3.0.2

Last update: Mon Nov 10 16:25:51 2014