The COSNet R package
A step-by-step application of COSNet to protein function prediction
- In this section we show how to use COSNet to predict the functions of yeast (S. cerevisiae) proteins, using Gene Ontology (GO) annotations as protein labels. The example goes through the following steps:
- Loading protein network.
- Loading protein GO annotations.
- Selecting the fold to be predicted.
- Running COSNet.
- Testing COSNet through a cross validation procedure.
Loading protein network
- An example protein network can be downloaded from http://frasca.di.unimi.it/cosnetdata/u.sum.yeast.txt and read into the R object W.
- W is a named matrix of dimension 5775 x 5775. For each position (i, j), it contains a similarity index in [0,1] between proteins i and j. For details about the construction of the network, please visit http://frasca.di.unimi.it/cosnetdata/readme.txt.
- The row and column names of W are the yeast protein identifiers. They can be read with the R functions rownames and colnames (the row names are referred to below as namesR), for instance to display the first 5 elements of each.
- To avoid self-loops in the network, the matrix has a null diagonal. Moreover, the matrix must be symmetric to ensure the convergence of the Hopfield network dynamics.
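- As a minimal sketch of the reading and checking steps described above, the following base R code writes a small toy matrix to a temporary file and reads it back as a named matrix. The protein identifiers, the name namesC and the whitespace-delimited file layout (a header row plus a row-name column) are assumptions made for illustration; the layout of the real file is described in the readme linked above.

```r
# Toy stand-in for the downloaded file: a small symmetric similarity matrix
# with hypothetical protein identifiers, written to a temporary text file.
ids <- c("YAL001C", "YAL002W", "YAL003W", "YAL005C", "YAL007C", "YAL008W")
set.seed(1)
M <- matrix(runif(36), nrow = 6, dimnames = list(ids, ids))
M <- round((M + t(M)) / 2, 3)   # symmetric, values still in [0,1]
diag(M) <- 0                    # no self-loops
f <- tempfile(fileext = ".txt")
write.table(M, file = f, quote = FALSE)

# Read the file back as a named numeric matrix. The header row has one field
# fewer than the data rows, so read.table uses the first column as row names.
W <- as.matrix(read.table(f, check.names = FALSE))

# Row and column names (protein identifiers), first 5 elements of each.
namesR <- rownames(W)
namesC <- colnames(W)   # hypothetical name, by analogy with namesR
namesR[1:5]
namesC[1:5]

# Properties required of the network: null diagonal and symmetry.
stopifnot(all(diag(W) == 0))
stopifnot(isSymmetric(unname(W)))
```

For the real data, the temporary file path would be replaced by the URL given above, provided the file follows the assumed layout.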
- Alternatively, we can construct a toy similarity matrix: a square matrix of dimension 10 with values drawn uniformly from the [0,1] interval. We then set the diagonal to 0, assign row and column names, and make the matrix symmetric.
- In particular, the last step sums the matrix W2 with its transpose and then divides each element of the result by 2, so that the values remain in the [0,1] interval.
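- The toy construction described above can be sketched in base R as follows. The seed and the names p1, ..., p10 are illustrative assumptions, not taken from the original listing.

```r
set.seed(1)                          # only to make the toy example repeatable
n2 <- 10
# Square matrix of dimension 10 with values uniformly generated in [0,1].
W2 <- matrix(runif(n2 * n2), nrow = n2)
# Null diagonal: no self-loops.
diag(W2) <- 0
# Hypothetical row and column names.
rownames(W2) <- colnames(W2) <- paste0("p", 1:n2)
# Sum W2 with its transpose and divide by 2: the result is symmetric
# and its entries are still in [0,1].
W2 <- (W2 + t(W2)) / 2
```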
Loading protein GO annotations
- The labels we use in this example for the yeast proteins are the GO functions, which can be downloaded at http://frasca.di.unimi.it/cosnetdata/GO.ann.yeast.28.03.13.3_300.txt. For details about the construction of the label matrix, again visit http://frasca.di.unimi.it/cosnetdata/readme.txt.
- Y is a named 5775 x 3469
0/1 matrix, containing the annotations of the considered yeast proteins
for 3469 GO functions, belonging to all the branches (BP, MF, CC) of
the Gene Ontology DAG. To obtain highly unbalanced classes that are not too
generic, we have selected those GO functions with at least 1 and at
most 300 positives. A value of 1 in position (i, j) means that protein i is positive for the j-th
GO function; a value of 0 means it is not. In other words, each column of the matrix Y is
the labelling of a single class to be predicted. The row names are identical to namesR, whereas the column names are the selected GO terms.
- Each selected protein has at least one annotation for some
GO term, but since we have kept only the terms with 1-300 annotated
proteins, some proteins in Y may have no annotations (that is, the
corresponding row in Y is the null vector). Here, the negative examples for
a class are those proteins which are not annotated (positive) for
that class. Note that this does not exclude that some of
these proteins will become positive in the future, after novel studies.
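- To illustrate why filtering the terms can leave some proteins without annotations, here is a minimal base R sketch on a toy 0/1 matrix. The matrix Ytoy, its row and column names, and the toy bounds (1 to 3 positives instead of 1 to 300) are assumptions chosen only to keep the example small.

```r
# Toy annotation matrix: 4 proteins x 4 hypothetical terms.
# Every protein has at least one annotation (termA).
Ytoy <- matrix(c(1, 0, 0, 0,
                 1, 1, 0, 0,
                 1, 0, 0, 0,
                 1, 0, 0, 0),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("p", 1:4),
                               c("termA", "termB", "termC", "termD")))

# Number of positives per term, and the terms within the toy bounds.
n.pos <- colSums(Ytoy)
keep <- n.pos >= 1 & n.pos <= 3
Yf <- Ytoy[, keep, drop = FALSE]

# Proteins whose row is the null vector after filtering.
null.rows <- rownames(Yf)[rowSums(Yf) == 0]
null.rows
```

In this toy case termA is dropped as too generic and termC, termD as empty, so three of the four proteins end up with a null row.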
- In this scenario, an unlabelled example can be any of the
negative (or non-positive) examples that researchers want to investigate
as candidates for the current class. To test COSNet, in sections Selecting the fold to be predicted and Running COSNet we hide a random subset of protein labels and predict them by using COSNet.
- For the labelling too, we can alternatively create a toy example.
- We generate 30 labels randomly drawn with
replacement from the vector [0, 1, 0, 0, 0], and use them to fill
a matrix Y2 with 10 rows (and hence 3 columns). The expected proportion of positives in each
column is therefore 0.2.
- Finally, we can assign row and column names.
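- A minimal base R sketch of the toy labelling described above. The seed and the row and column names are illustrative assumptions.

```r
set.seed(1)   # only to make the toy example repeatable
# 30 labels drawn with replacement from (0, 1, 0, 0, 0): each label is 1
# with probability 0.2. They fill a matrix with 10 rows, hence 3 columns.
Y2 <- matrix(sample(c(0, 1, 0, 0, 0), 30, replace = TRUE), nrow = 10)
# Hypothetical names: proteins p1..p10 and classes class1..class3.
rownames(Y2) <- paste0("p", 1:10)
colnames(Y2) <- paste0("class", 1:3)
```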
Selecting the fold to be predicted
- To evaluate the performance of COSNet, we randomly partition the proteins into 3 folds by using the package function find.division.strat,
which ensures roughly the same proportion of positives in each fold. To
do this, once the package is installed, we have to load it into
the R environment with the R command library.
- The function nrow provides
the number of rows of W, that is, the number of proteins we have
considered. As labelling we have considered the first column of matrix
Y, which has 71 positives out of 5775. Since COSNet needs a -1/0/1
labelling (-1 for negative, 0 for unlabelled and 1 for positive
instances), we first transform y into a -1/1 vector, then set to 0 the labels in the fold we want to predict.
- Here the labels in the first fold have been set to 0.
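- The labelling steps above can be sketched on toy data in base R. The function stratified.folds below is a hypothetical stand-in written for this sketch, not the package function find.division.strat: it only mimics the idea of spreading positives and negatives evenly over the folds. The toy vector y and the protein names are also assumptions.

```r
set.seed(1)
# Toy 0/1 labels for 30 proteins, 6 of them positive.
y <- c(rep(1, 6), rep(0, 24))
names(y) <- paste0("p", seq_along(y))

# Transform the 0/1 labels into -1/1 labels.
y[y == 0] <- -1

# Hypothetical stratified partition: shuffle each class separately and
# deal its elements round-robin to the k folds.
stratified.folds <- function(y, k) {
  fold.of <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    idx <- idx[sample.int(length(idx))]
    fold.of[idx] <- rep_len(seq_len(k), length(idx))
  }
  split(seq_along(y), fold.of)
}
folds <- stratified.folds(y, 3)

# Hide the labels of the first fold: 0 marks unlabelled instances.
labeling <- y
labeling[folds[[1]]] <- 0
table(labeling)
```

With 6 positives and 3 folds, each fold of this sketch receives exactly 2 positives and 8 negatives.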
Running COSNet
- We can now run COSNet to predict the hidden labels, storing the result in the object out.
- See the Reference Manual for details about the parameters.
- The inferred binary predictions are in the field pred of the object out, whereas the discriminant scores are in the field scores.
A discriminant score is a real value such that the higher the value,
the more likely the protein belongs to the considered class.
- We can now count, for example, the number of true and false
negatives and of true and false positives. We use the names of the
vector out$pred to look up the corresponding labels, so that we
compare predictions and labels on the same instances.
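- A minimal base R sketch of this counting step on toy data. The vector pred below is a hand-written stand-in for out$pred, and the -1/1 coding of the predictions is an assumption; the point is only the lookup of the true labels by name.

```r
# True -1/1 labels for 6 hypothetical proteins.
y <- c(p1 = 1, p2 = -1, p3 = -1, p4 = 1, p5 = -1, p6 = -1)

# Stand-in for out$pred: predictions for the hidden proteins only,
# deliberately in a different order from y.
pred <- c(p4 = 1, p2 = 1, p5 = -1, p1 = -1)

# Align the true labels to the predictions through the names.
true <- y[names(pred)]

TP <- sum(pred == 1  & true == 1)    # true positives
FP <- sum(pred == 1  & true == -1)   # false positives
TN <- sum(pred == -1 & true == -1)   # true negatives
FN <- sum(pred == -1 & true == 1)    # false negatives
c(TP = TP, FP = FP, TN = TN, FN = FN)
```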
- Please see the Reference Manual and the Bioconductor vignette for details about the other fields of the out object.
Testing COSNet through a cross validation procedure
- In this section we apply a cross validation procedure to
evaluate the generalization capabilities of COSNet. In particular, at
each step the labels of one fold are hidden and then predicted by
COSNet, by using the labels in the remaining folds. At the end of the
procedure, a label for the current class has been inferred for every protein.
We apply the cosnet.cross.validation function provided by the package to perform a 3-fold cross validation.
- We run the cross validation procedure
for both the unregularized and the regularized version of
COSNet. Since the function can perform the cross validation
on all the functional classes at the same time, the input labels are
expected to be a matrix. For this reason we convert the label vector by means
of the R function as.matrix. For the regularized version, we have set the cost
parameter to 0.0001, after tuning it on a small part of the training
set. We suggest tuning this parameter whenever the data set changes. See the Reference Manual for details about this parameter.
- The object CV contains both binary predictions and discriminant scores, in addition to the input labels.
It is now possible to compute any performance measure the user needs.
Here, as done before, we compute for instance the number of true
positives, true negatives, false positives and false negatives,
and display them.
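- As an example of a further performance measure, the sketch below derives precision, recall and F-score per class from the four counts. It works on two hand-written toy matrices of -1/1 labels and predictions (one column per class), which are assumptions; it does not rely on the field names of the object CV, which are documented in the Reference Manual.

```r
# Toy -1/1 labels and predictions for 8 proteins and 2 hypothetical classes.
labels <- cbind(class1 = c(1, 1, 1, -1, -1, -1, -1, -1),
                class2 = c(1, -1, -1, -1, 1, -1, -1, -1))
preds  <- cbind(class1 = c(1, 1, -1, 1, -1, -1, -1, -1),
                class2 = c(1, -1, -1, -1, -1, -1, -1, -1))

# Counts per class (column).
TP <- colSums(preds == 1  & labels == 1)
FP <- colSums(preds == 1  & labels == -1)
TN <- colSums(preds == -1 & labels == -1)
FN <- colSums(preds == -1 & labels == 1)

# Measures derived from the counts.
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
Fscore    <- 2 * TP / (2 * TP + FP + FN)

rbind(TP, FP, TN, FN, precision, recall, Fscore)
```

With highly unbalanced classes such as the ones used here, measures based on positives (precision, recall, F-score) are usually more informative than plain accuracy.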
- Finally, since the division into folds is performed randomly,
and the COSNet dynamics updates neurons in a randomly defined order
(see reference [1] in the section Overview of the COSNet R Package),
the predictions obtained may differ slightly from one execution to
another. Indeed, the results of the second unregularized execution,
contained in the object CV2, are slightly different from those
contained in CV.