Index of /data

Name                                                      Last modified   Size

Enriched Benchmark Data

Readme.txt                                                28/04/2017
enriched.string.10.net.rda                                28/04/2017      113.8M
enriched.string.10.net.tsv                                28/04/2017      195M
MeSH.associations.2.200.positives.04.17.rda               28/04/2017      247K

Novel candidate disease-genes on enriched benchmark data

gene2disco.raw.scores.terms.without.positives.04.17.rda   28/04/2017      8.7M
gene2disease.novel.assoc.terms.without.positives.tsv      28/04/2017      25.9K
gene2disco.raw.scores.terms.1.200.positives.04.17.rda     30/04/2017      70M
gene2disease.novel.assoc.terms.1.200.positives.tsv        30/04/2017      39.7K




The enriched benchmark data include:
1) Gene associations of MeSH disease terms with 2-200 known disease-genes, downloaded from the CTD database (last update 04.17);
2) The STRING gene-gene interaction network (v.10), enriched with other gene-gene pairwise interaction sources.

The data format is described in the file Readme.txt.

Putative gene-disease associations inferred by the Gene2DiSCo algorithm are divided into two blocks:



Code

Installation

We provide the source code of Gene2DiSCo along with an R script to run the algorithm on users' private data. Here is the code archive, containing three files:
To use the Python code, the environment variable $PYTHONPATH must be set to the current folder. On UNIX systems, this can be done by opening a shell in the current folder and typing
Moreover, the following Python modules need to be installed: scipy, numpy.
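
Before running the algorithm, it can help to verify that both requirements above are met. The following is a minimal sketch, not part of the Gene2DiSCo archive: the function names `missing_modules` and `folder_on_pythonpath` are hypothetical helpers introduced here for illustration, and the check uses only the Python standard library.

```python
import importlib.util
import os

# The two modules named in the installation notes.
REQUIRED_MODULES = ["scipy", "numpy"]


def missing_modules(names):
    """Return the names in `names` that this interpreter cannot locate."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def folder_on_pythonpath(folder):
    """Return True if `folder` is one of the entries of $PYTHONPATH."""
    entries = os.environ.get("PYTHONPATH", "").split(os.pathsep)
    target = os.path.abspath(folder)
    return any(e and os.path.abspath(e) == target for e in entries)


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python modules:", ", ".join(missing))
    if not folder_on_pythonpath(os.getcwd()):
        print("The current folder is not listed in $PYTHONPATH")
```

Run from the folder that contains the extracted code, the script prints nothing when the environment is ready.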

The R code requires the installation of the R packages precrec, PerfMeas and rPython. They can be installed by opening an R console and typing

 

The two commands above will also install any R packages on which the packages being installed depend. Depending on the host system, these dependencies may in turn require the installation of further system libraries.


Usage of Gene2DiSCo
Here we show the usage of the algorithm step by step. First we need to include the Gene2DiSCo source code with the following command:
 
Then we load the two R libraries needed to compute the performance:

Now we are ready to load the input data:

The first command loads the matrix Y of gene-disease associations, while the second one loads the gene pairwise connection matrix W. Finally, the last command loads into memory the matrix ancestors, which Gene2DiSCo needs to filter out descendant diseases in the hierarchy.

The following lines are optional; they ensure that the same indices in the matrices Y and W correspond to the same genes:
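
The idea behind this alignment can be sketched as follows. This is a minimal Python illustration on made-up toy data, not the R lines shipped with the package: the function name `align_to_common_genes` and the gene labels are hypothetical, and the sketch assumes that alignment means keeping only the genes present in both matrices, in one shared order.

```python
def align_to_common_genes(y_genes, y_rows, w_genes, w_matrix):
    """Reorder the rows of Y and the rows/columns of W onto one gene list.

    y_genes, w_genes: gene names labelling the rows of Y and of W.
    y_rows: rows of Y (one list of disease labels per gene).
    w_matrix: square gene-by-gene connection matrix as a list of lists.
    """
    common = sorted(set(y_genes) & set(w_genes))
    y_pos = {g: i for i, g in enumerate(y_genes)}
    w_pos = {g: i for i, g in enumerate(w_genes)}
    y_aligned = [y_rows[y_pos[g]] for g in common]
    w_aligned = [[w_matrix[w_pos[a]][w_pos[b]] for b in common] for a in common]
    return common, y_aligned, w_aligned


# Toy data: Y and W list their genes in different orders, and each
# contains one gene the other lacks.
y_genes = ["geneB", "geneA", "geneC"]
y_rows = [[0, 1], [1, 0], [1, 1]]
w_genes = ["geneA", "geneB", "geneD"]
w_matrix = [[0.0, 0.8, 0.1],
            [0.8, 0.0, 0.3],
            [0.1, 0.3, 0.0]]

genes, Y, W = align_to_common_genes(y_genes, y_rows, w_genes, w_matrix)
print(genes)  # ['geneA', 'geneB']
```

After the call, row i of Y and row/column i of W refer to the same gene.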
 
We can now run the cross-validation procedure to evaluate the generalization capabilities of Gene2DiSCo:
 

The command above runs a 5-fold CV on the data loaded before, using the L3 distance in the fuzzy clustering algorithm and the 0.5-quantile of the empirical distribution of memberships as the threshold to discard negatives during training. out is an R list including several fields (see the function description embedded in the file). Among them, out$scores is the matrix of the inferred scores (rows are genes, columns are diseases), whereas out$auc and out$auprc are the vectors containing, for each disease, the area under the ROC curve and under the precision-recall curve respectively. out$pxr is the matrix containing the precisions at given recall values, with diseases on the rows and recall values on the columns. The following lines can be used to print on the screen the obtained results averaged across diseases:
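
To make the three reported measures concrete, here is a minimal Python sketch that computes them per disease on a made-up score matrix and averages them across diseases. It is an illustration only and not the package's code: the function names are hypothetical, the area under the precision-recall curve is approximated by average precision (the R packages may interpolate differently), and every disease is assumed to have at least one positive and one negative gene.

```python
from statistics import mean


def ranked_labels(scores, labels):
    """Labels sorted by decreasing score (ties keep their input order)."""
    return [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]


def auc_roc(scores, labels):
    """Area under the ROC curve: fraction of positive/negative pairs
    ranked correctly, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Step-wise approximation of the area under the precision-recall curve."""
    tp, total = 0, 0.0
    for k, y in enumerate(ranked_labels(scores, labels), start=1):
        if y == 1:
            tp += 1
            total += tp / k
    return total / tp


def precision_at_recall(scores, labels, recall):
    """Best precision over all cut-offs whose recall is at least `recall`."""
    n_pos = sum(labels)
    tp, best = 0, 0.0
    for k, y in enumerate(ranked_labels(scores, labels), start=1):
        tp += y
        if tp / n_pos >= recall:
            best = max(best, tp / k)
    return best


# Toy data: rows are genes, columns are diseases.
scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
labels = [[1, 0], [0, 1], [1, 0], [0, 1]]

# Transpose so that each item is one disease (one column).
per_disease = list(zip(zip(*scores), zip(*labels)))
mean_auc = mean(auc_roc(s, y) for s, y in per_disease)
mean_auprc = mean(average_precision(s, y) for s, y in per_disease)
print("mean AUC:", mean_auc)
print("mean AUPRC (average precision):", mean_auprc)
```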

  
Note that, due to the large number of genes and diseases, the cross-validation run may take a long time. To give an idea of the remaining time, the code prints a dot for each 5% completed.
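
A progress indicator of this kind can be sketched in a few lines. The Python snippet below is an illustration under stated assumptions, not the package's implementation: `run_with_progress` and its parameters are hypothetical names, and the loop body stands in for the per-disease work.

```python
import io
import sys


def run_with_progress(n_tasks, task, out=sys.stdout, step_percent=5):
    """Call task(i) for i in 0..n_tasks-1, writing one dot to `out`
    each time another `step_percent` of the tasks has completed."""
    next_mark = step_percent
    for i in range(n_tasks):
        task(i)
        # Integer comparison avoids floating-point rounding at the marks.
        while (i + 1) * 100 >= next_mark * n_tasks:
            out.write(".")
            out.flush()
            next_mark += step_percent
    out.write("\n")


# Usage: capture the dots in a buffer instead of the terminal.
buffer = io.StringIO()
run_with_progress(40, lambda i: None, out=buffer)
dots = buffer.getvalue().count(".")
print(dots)  # 20
```

With the default step of 5%, a complete run always emits 20 dots, regardless of the number of tasks.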