Index of /data
Name                                                       Last modified  Size

Enriched Benchmark Data
  Readme.txt                                               28/04/2017
  enriched.string.10.net.rda                               28/04/2017     113.8M
  enriched.string.10.net.tsv                               28/04/2017     195M
  MeSH.associations.2.200.positives.04.17.rda              28/04/2017     247K

Novel candidate disease-genes on enriched benchmark data
  gene2disco.raw.scores.terms.without.positives.04.17.rda  28/04/2017     8.7M
  gene2disease.novel.assoc.terms.without.positives.tsv     28/04/2017     25.9K
  gene2disco.raw.scores.terms.1.200.positives.04.17.rda    30/04/2017     70M
  gene2disease.novel.assoc.terms.1.200.positives.tsv       30/04/2017     39.7K

The enriched benchmark data include:
1) Gene associations of MeSH disease terms with 2-200 known disease-genes, downloaded from the CTD database (last update 04.17);
2) The STRING gene-gene interaction network (v.10), enriched with other sources of pairwise gene-gene interactions.
Data format is described in the file Readme.txt.
Putative gene-disease associations inferred by the Gene2DiSCo algorithm are divided into two blocks:
- Diseases with no annotated genes.
- Diseases with 1-200 known disease-genes.
Code
Installation
- We provide the source code of Gene2DiSCo along with an R script to run the algorithm on users' private data. The code archive contains three items:
- gene2disco.R:
R code implementing the Gene2DiSCo algorithm and the necessary routines.
- main.R:
R script to test the Gene2DiSCo algorithm on the enriched benchmark data and/or on users' data in a cross-validation setting.
- negative_selection:
Folder containing the Python source code required by Gene2DiSCo to perform the fuzzy clustering.
To use the Python code, the environment variable $PYTHONPATH must be set to the current folder. On UNIX systems this can be done by opening a shell in the current folder and setting the variable there.
Moreover, the following Python modules need to be installed: scipy, numpy.
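The two Python prerequisites above can be checked before running anything else. The following is a minimal sketch, not part of the Gene2DiSCo archive: the helper name `check_python_setup` is hypothetical, and the check relies only on the fact that entries of $PYTHONPATH are added to `sys.path` when the interpreter starts.

```python
import importlib.util
import os
import sys


def check_python_setup(folder, modules=("numpy", "scipy")):
    """Report whether `folder` is on the import path and which modules are installed.

    Hypothetical helper for illustration; it is not shipped with Gene2DiSCo.
    """
    folder = os.path.abspath(folder)
    # An empty entry in sys.path stands for the current directory.
    on_path = any(os.path.abspath(p or os.curdir) == folder for p in sys.path)
    # find_spec returns None when a top-level module cannot be located.
    installed = {m: importlib.util.find_spec(m) is not None for m in modules}
    return {"folder_on_path": on_path, "modules": installed}


# Run from the folder that contains negative_selection.
print(check_python_setup(os.getcwd()))
```

If `folder_on_path` is false or a module is reported as missing, the corresponding setup step still needs to be carried out.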
The R code requires the R packages precrec, PerfMeas and rPython, which can be installed from an R console.
Installing these packages also installs any R packages they depend on. Depending on the host system, such dependencies may require the installation of further system libraries.
Usage of Gene2DiSCo
- Here we show the usage of the algorithm step by step. First we need to include the Gene2DiSCo source code by sourcing the file gene2disco.R in the R session.
- Then we load the two R libraries needed to compute the performance, precrec and PerfMeas.
- Now we are ready to load the input data, which takes three commands.
- The first command loads the matrix Y of gene-disease associations, while the second one loads the gene pairwise connection matrix W. Finally, the last command loads into memory the matrix ancestors, which Gene2DiSCo needs to filter out descendant diseases in the hierarchy.
- An optional further step ensures that the same indexes in the matrices Y and W correspond to the same genes.
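The idea behind this alignment can be illustrated independently of the R code. The Python sketch below is only an illustration under stated assumptions: it assumes that both matrices come with a list of gene identifiers, and that restricting both to the shared genes in a common order is acceptable. The function name `align_by_gene` and the toy values are invented for the example.

```python
def align_by_gene(y_genes, Y, w_genes, W):
    """Restrict Y and W to their shared genes, in the same (sorted) order.

    Y is a list of rows (genes x diseases); W is a square list of rows
    (genes x genes). Hypothetical helper, not part of Gene2DiSCo.
    """
    common = sorted(set(y_genes) & set(w_genes))
    y_index = {g: i for i, g in enumerate(y_genes)}
    w_index = {g: i for i, g in enumerate(w_genes)}
    Y_aligned = [list(Y[y_index[g]]) for g in common]
    # W has genes on both axes, so rows and columns are reordered together.
    W_aligned = [[W[w_index[a]][w_index[b]] for b in common] for a in common]
    return common, Y_aligned, W_aligned


# Toy data: the two matrices list their genes in different orders.
y_genes = ["G2", "G1", "G3"]
Y = [[0, 1], [1, 0], [1, 1]]
w_genes = ["G1", "G2", "G4"]
W = [[0.0, 0.5, 0.1], [0.5, 0.0, 0.3], [0.1, 0.3, 0.0]]

genes, Y_al, W_al = align_by_gene(y_genes, Y, w_genes, W)
print(genes)
print(Y_al)
print(W_al)
```

After alignment, row i of Y and row and column i of W all refer to the same gene.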
- We can now run the cross-validation procedure to evaluate the generalization capabilities of Gene2DiSCo.
In this example, the procedure runs a 5-fold CV on the data loaded before, using the L3 distance in the fuzzy clustering algorithm and the 0.5-quantile of the empirical distribution of memberships as the threshold to discard negatives during training. The result, out, is an R list including different fields (see the function description embedded in the file), among which out$scores is the matrix of the inferred scores (rows are genes, columns are diseases), whereas out$auc and out$auprc are the vectors containing the area under the ROC curve and under the precision-recall curve, respectively, for each disease. out$pxr is the matrix containing the precisions at given recall values, with diseases on the rows and recall values on the columns. The results averaged across diseases can then be printed on the screen.
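To make the per-disease measures and their average concrete, here is a small standalone Python sketch. It is not the Gene2DiSCo code and does not use precrec or PerfMeas: it computes the area under the ROC curve as the fraction of correctly ordered positive/negative pairs, one value per disease column, and then averages across diseases. The function names and the toy scores and labels are invented for illustration.

```python
from statistics import mean


def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # A pair counts 1 if the positive scores higher, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_per_disease(score_matrix, label_matrix):
    """Rows are genes, columns are diseases; returns one AUC per column."""
    n_diseases = len(score_matrix[0])
    return [
        auc([row[j] for row in score_matrix], [row[j] for row in label_matrix])
        for j in range(n_diseases)
    ]


# Toy data: 4 genes x 2 diseases (values invented for illustration only).
scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
labels = [[1, 0], [0, 1], [1, 0], [0, 0]]

per_disease = auc_per_disease(scores, labels)
print("AUC per disease:", per_disease)
print("Average AUC:", mean(per_disease))
```

The pairwise count is quadratic in the number of genes, so it suits small examples only; a rank-based computation would be preferable on data of the size described here.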
- Note that, due to the large number of genes and diseases, the code above may take time. To give an idea of the remaining time, the code prints a dot for each 5% completed.
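This kind of progress indicator is easy to reproduce in one's own scripts. The Python sketch below shows one possible way to emit a dot for every 5% of the work completed; it is an assumption about the general technique, not a transcription of how Gene2DiSCo implements it, and the name `run_with_progress` is hypothetical.

```python
import sys


def run_with_progress(items, work, step_percent=5, out=sys.stdout):
    """Apply `work` to each item, writing one dot per `step_percent` completed."""
    total = len(items)
    results = []
    dots_printed = 0
    for done, item in enumerate(items, start=1):
        results.append(work(item))
        # Number of dots that should be visible after `done` items.
        dots_due = (100 * done // total) // step_percent
        if dots_due > dots_printed:
            out.write("." * (dots_due - dots_printed))
            out.flush()
            dots_printed = dots_due
    out.write("\n")
    return results


# Example: square 40 numbers; 20 dots are written in total.
squares = run_with_progress(list(range(40)), lambda x: x * x)
```

Tracking the number of dots already written, instead of testing for exact multiples of 5%, keeps the output at 20 dots even when the number of items is not divisible by 20.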