Index of /data

 Name                                                     Last modified   Size

Enriched Benchmark Data
 Readme.txt                                               28/04/2017
 enriched.string.10.net.rda                               28/04/2017      113.8M
 enriched.string.10.net.tsv                               28/04/2017      195M
 MeSH.associations.2.200.positives.04.17.rda              28/04/2017      247K

Novel candidate disease-genes on enriched benchmark data
 gene2disco.raw.scores.terms.without.positives.04.17.rda  28/04/2017      8.7M
 gene2disease.novel.assoc.terms.without.positives.tsv     28/04/2017      25.9K
 gene2disco.raw.scores.terms.1.200.positives.04.17.rda    30/04/2017      70M
 gene2disease.novel.assoc.terms.1.200.positives.tsv       30/04/2017      39.7K

The enriched benchmark data include:
1) Gene associations of MeSH disease terms with 2-200 known disease-genes, downloaded from the CTD database (last update 04.17);
2) The STRING gene-gene interaction network (v.10), enriched with other sources of pairwise gene-gene interactions.

Data format is described in the file Readme.txt.

Putative gene-disease associations inferred by the Gene2DiSCo algorithm are divided into two blocks:

1) Diseases with no annotated genes.
2) Diseases with 1-200 known disease-genes.

Code

Installation

We provide the source code of Gene2DiSCo along with an R script to run the algorithm on users' private data. Here is the code archive, which contains two R files and one folder:

gene2disco.R: R code implementing the Gene2DiSCo algorithm and the necessary routines.

main.R: R script to test the Gene2DiSCo algorithm on the enriched benchmark data and/or on users' data in a cross validation setting.

negative_selection: folder containing the Python source code required by Gene2DiSCo to perform the fuzzy clustering.

To use the Python code, the environment variable $PYTHONPATH must point to the current folder (the one containing the code archive files). On UNIX systems this can be done by opening a shell in that folder and typing

export PYTHONPATH=.
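
As an alternative to the shell command, the variable can also be set from within R. The lines below are a minimal sketch, assuming that the working directory of the R session is the folder containing gene2disco.R and negative_selection; we have not verified at which point rPython reads the variable, so it is presumably safest to set it before loading that package.

```r
# Assumption: the R working directory is the folder that contains
# gene2disco.R and the negative_selection folder.
Sys.setenv(PYTHONPATH = getwd())

# Show the value now visible to this R session and to the processes it starts.
Sys.getenv("PYTHONPATH")
```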

Moreover, the following Python modules need to be installed: scipy and numpy.

The R code requires the R packages precrec, PerfMeas and rPython. They can be installed by opening an R console and typing

> install.packages("precrec", dependencies=TRUE);
> install.packages("PerfMeas", dependencies=TRUE);
> install.packages("rPython", dependencies=TRUE);

The three commands above also install any R packages on which the requested packages depend. Depending on the host system, these dependencies may in turn require the installation of further system libraries.
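
To avoid reinstalling packages that are already present, one can first check which of them are missing. The helper missing_packages below is a hypothetical name introduced only for this sketch; it is not part of the Gene2DiSCo code.

```r
# Hypothetical helper: return the names of the packages that cannot be loaded.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(c("precrec", "PerfMeas", "rPython"))

# Install only what is missing, and only when run from an interactive console.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install, dependencies = TRUE)
}
```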

Usage of Gene2DiSCo

Here we show the usage of the algorithm step by step. First we need to include the Gene2DiSCo source code with the following command:

> source("gene2disco.R")

Then we load the three required R libraries (precrec and PerfMeas are used to compute the performance):

> library(precrec)
> library(PerfMeas)
> library(rPython)

Now we are ready to load the input data:

> load("MeSH.associations.2.200.positives.04.17.rda"); # Y	
> load("enriched.string.10.net.rda"); # W	  	  
> load("ancestors.2.200.positives.04.17.rda"); # ancestors

The first command loads the matrix Y of gene-disease associations, while the second one loads the gene pairwise connection matrix W. Finally, the last command loads into memory the matrix ancestors, needed by Gene2DiSCo to filter out descendant diseases in the hierarchy.
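
Since load() restores objects under the names they were saved with, it can be useful to check which objects a file actually contains. The following self-contained sketch uses a temporary file and a made-up matrix toy_Y (both are assumptions for illustration, not the benchmark data) to show that load() returns the names of the restored objects.

```r
# Made-up 2 x 2 matrix standing in for a gene-disease association matrix.
toy_Y <- matrix(0, nrow = 2, ncol = 2,
                dimnames = list(c("g1", "g2"), c("d1", "d2")))

# Save it to a temporary .rda file.
path <- tempfile(fileext = ".rda")
save(toy_Y, file = path)

# Load into a separate environment so that nothing in the workspace is overwritten.
env <- new.env()
loaded_names <- load(path, envir = env)

print(loaded_names)   # names of the objects stored in the file
dim(env$toy_Y)        # dimensions of the restored matrix
```

The same pattern applies to the benchmark files: the value returned by load() tells which object names (for instance Y, W or ancestors, as indicated in the comments above) have been created.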

The following lines are optional, and ensure that the same indices in the matrices Y and W correspond to the same genes:

> genes <- rownames(Y);
> W <- W[genes,genes];
> rm(genes); gc()
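
The effect of this alignment can be seen on a small example. The matrices toy_Y and toy_W below are made-up data, named differently from Y and W so that running the sketch does not overwrite the loaded benchmark data.

```r
# Made-up data: 3 genes x 2 diseases, and a 4 x 4 network in a different gene order.
toy_Y <- matrix(c(1, 0, 0, 1, 0, 0), nrow = 3,
                dimnames = list(c("g1", "g2", "g3"), c("d1", "d2")))
net_genes <- c("g3", "g1", "g4", "g2")
toy_W <- matrix(0, nrow = 4, ncol = 4, dimnames = list(net_genes, net_genes))
toy_W["g1", "g2"] <- toy_W["g2", "g1"] <- 0.8

# Every gene in toy_Y must be present in the network, otherwise indexing fails.
stopifnot(all(rownames(toy_Y) %in% rownames(toy_W)))

# Reorder (and restrict) the network to the genes of toy_Y.
genes <- rownames(toy_Y)
toy_W <- toy_W[genes, genes]

# Rows and columns of toy_W now follow the row order of toy_Y.
identical(rownames(toy_W), rownames(toy_Y))
```

Note that genes present only in the network (g4 in this example) are dropped by the indexing.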

We can now run the cross-validation procedure to evaluate the generalization capabilities of Gene2DiSCo:

The command above runs a 5-fold cross validation on the data loaded before, using the L3 distance in the fuzzy clustering algorithm and the 0.5-quantile of the empirical distribution of memberships as the threshold to discard negatives during training. out is an R list with several fields (see the function description embedded in the file). Among them, out$scores is the matrix of the inferred scores (rows are genes, columns are diseases), whereas out$auc and out$auprc are vectors containing, for each disease, the area under the ROC curve and under the precision-recall curve, respectively. out$pxr is the matrix containing the precisions at given recall values, with diseases on the rows and recall values on the columns. The following lines print on the screen the results averaged across diseases:

> cat("AUC = ", mean(out$auc), "\n");
> cat("AUPRC = ", mean(out$auprc), "\n");
> cat("PXR = ", colMeans(out$pxr), "\n");
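
Besides the averages, the per-disease fields can be inspected directly. The sketch below uses a made-up list toy_out with the structure described above; the values are invented, and we assume that the score matrix carries gene and disease names and that the result vectors are named by disease, which should be checked on the real output.

```r
# Invented toy results (not real Gene2DiSCo output).
toy_out <- list(
  scores = matrix(c(0.9, 0.1, 0.4, 0.2, 0.7, 0.3), nrow = 3,
                  dimnames = list(c("g1", "g2", "g3"), c("d1", "d2"))),
  auc   = c(d1 = 0.80, d2 = 0.65),
  auprc = c(d1 = 0.40, d2 = 0.25)
)

# Genes ranked by decreasing score for disease d1.
ranked <- sort(toy_out$scores[, "d1"], decreasing = TRUE)
names(ranked)

# Diseases ordered from the highest to the lowest AUPRC.
best_first <- names(sort(toy_out$auprc, decreasing = TRUE))
best_first
```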

Note that, due to the large number of genes and diseases, the cross-validation procedure may take a long time. To give an idea of the remaining time, the code prints a dot for every 5% completed.
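
For readers unfamiliar with the two settings mentioned in the usage description (5 folds and the 0.5-quantile threshold), the following generic illustration shows how a 5-fold partition of the genes and a quantile threshold can be computed in R. It is only a sketch on random data, not the Gene2DiSCo implementation; the membership values are hypothetical, and how the threshold is then used to discard negatives is defined in gene2disco.R.

```r
set.seed(1)        # reproducible toy example
n_genes <- 20
k <- 5

# Assign each gene to one of k folds of equal size (20 / 5 = 4 genes per fold).
folds <- sample(rep(seq_len(k), length.out = n_genes))
table(folds)

# Hypothetical membership values in [0, 1], one per gene.
memberships <- runif(n_genes)

# 0.5-quantile (the median) of the empirical distribution of memberships.
threshold <- quantile(memberships, probs = 0.5, names = FALSE)
threshold
```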