A step-by-step application of COSNet to protein function prediction



In this section we show an example of using COSNet to predict the functions of yeast (S. cerevisiae) proteins, with the Gene Ontology (GO) annotations as protein labels. The example proceeds through the following steps:

- Loading protein network
- Loading protein GO annotations
- Selecting the fold to be predicted
- Running COSNet
- Testing COSNet through a cross validation procedure

Loading protein network

An example protein network can be downloaded from http://frasca.di.unimi.it/cosnetdata/u.sum.yeast.txt with the R command

    W <- as.matrix(read.table(file=paste(sep="", "http://frasca.di.unimi.it/cosnetdata/u.sum.yeast.txt"), sep=" "))

which loads into the R object W a named matrix of dimension 5775 x 5775. Each entry (i, j) of this matrix is a similarity index in [0,1] between proteins i and j. For details about the construction of the network, please visit http://frasca.di.unimi.it/cosnetdata/readme.txt.

The names of rows and columns, which are the yeast protein identifiers, can be read with the following commands

    namesR <- rownames(W)
    namesC <- colnames(W)
    namesR[1:5]
    namesC[1:5]

The last two instructions display the first 5 row names and the first 5 column names, respectively. To avoid self-loops in the network, the matrix has a null diagonal. Moreover, the matrix must be symmetric to ensure the convergence of the Hopfield network dynamics.

Alternatively, we can construct a toy similarity matrix as follows

    W2 <- matrix(runif(100), nrow = 10, ncol = 10)

which creates a square matrix of dimension 10 with values uniformly generated in the [0,1] interval. Then we set the diagonal to 0, assign row and column names, and ensure the matrix is symmetric:
    rownames(W2) <- colnames(W2) <- paste0("p", 1:10)
    diag(W2) <- 0
    W2 <- (W2+t(W2))/2

In particular, the last instruction sums the matrix W2 with its transpose and then divides each element by 2, which makes the matrix symmetric while keeping its values in the [0,1] interval.

Loading protein GO annotations

The labels we use in this example for the yeast proteins are the GO functions, which can be downloaded from http://frasca.di.unimi.it/cosnetdata/GO.ann.yeast.28.03.13.3_300.txt. For details about the construction of the label matrix, again visit http://frasca.di.unimi.it/cosnetdata/readme.txt.

    Y <- as.matrix(read.table(file=paste(sep="", "http://frasca.di.unimi.it/cosnetdata/GO.ann.yeast.28.03.13.3_300.txt"), sep=" "))

Y is a named 5775 x 3469 0/1 matrix containing the annotations of the considered yeast proteins for 3469 GO functions, belonging to all the branches (BP, MF, CC) of the Gene Ontology DAG. To obtain classes that are highly unbalanced but not too generic, we have selected the GO functions with at least 1 and at most 300 positives. The value 1 in position (i, j) means that the protein in row i is positive for the j-th GO function; the value 0 means it is not. In other words, each column of the matrix Y can be seen as the labelling of a single class to be predicted. The row names are identical to namesR, whereas the column names are the selected GO terms.

Each selected protein has at least one GO annotation, but since we have kept only the terms with 1-300 annotated proteins, some proteins in Y may have no annotations (that is, the corresponding row in Y is the null vector). Here, the negative examples for a class are the proteins which are not annotated (positive) for that class. Note that this does not exclude that some of these proteins will become positive in the future, after novel studies. In this scenario, an unlabelled example can be any negative (or non-positive) example that researchers want to investigate as a candidate for the current class.
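The effect of the term filter on the protein rows can be illustrated on a small example. The following is only a sketch under stated assumptions: the matrix Ytoy, its row and column names, and the scaled-down thresholds min_pos and max_pos are all hypothetical and chosen for illustration; they are not taken from the yeast data.

```r
# Hypothetical 6 x 4 annotation matrix: rows are proteins, columns are terms.
# Names and values are illustrative only (not real GO terms).
Ytoy <- matrix(c(1, 0, 0, 1,
                 1, 0, 0, 0,
                 1, 1, 0, 0,
                 1, 0, 0, 0,
                 1, 0, 0, 0,
                 0, 0, 0, 1),
               nrow = 6, byrow = TRUE,
               dimnames = list(paste0("p", 1:6), paste0("t", 1:4)))

# Number of positives per class (column)
n_pos <- colSums(Ytoy)

# Keep the classes whose number of positives lies in [min_pos, max_pos].
# The thresholds are scaled down (1 and 3) to suit the toy data.
min_pos <- 1
max_pos <- 3
Yfilt <- Ytoy[, n_pos >= min_pos & n_pos <= max_pos, drop = FALSE]

# Proteins left with no annotation after the filter (null rows)
null_rows <- rownames(Yfilt)[rowSums(Yfilt) == 0]

print(n_pos)
print(colnames(Yfilt))
print(null_rows)
```

In this toy matrix every protein has at least one annotation before filtering, yet some rows become null once the too-generic and the empty terms are dropped, which mirrors the situation described for Y.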
To test COSNet, in the sections Selecting the fold to be predicted and Running COSNet we hide a random subset of protein labels and predict it with COSNet.

For the labelling too, we can alternatively create a toy example as follows:

    Y2 <- matrix(sample(c(0, 1, 0, 0, 0), 30, replace=TRUE), nrow = 10)
    Y2
          [,1] [,2] [,3]
     [1,]    0    0    1
     [2,]    0    0    0
     [3,]    0    0    0
     [4,]    1    0    0
     [5,]    0    0    0
     [6,]    0    1    0
     [7,]    0    0    0
     [8,]    1    0    0
     [9,]    0    0    1
    [10,]    0    0    0

The first command generates 30 labels randomly drawn with replacement from the vector [0, 1, 0, 0, 0], and then uses them to fill a matrix Y2 with 10 rows. The expected proportion of positives in each column is therefore 0.2. Finally, we can assign row and column names.

    rownames(Y2) <- paste0("p", 1:10)
    colnames(Y2) <- paste0("c", 1:3)

Selecting the fold to be predicted

To evaluate the COSNet performance, we randomly partition the proteins into 3 folds by using the package function find.division.strat, which ensures approximately the same proportion of positives in each fold. To do this, after installing the package, we have to load it into the R environment with the R command library.

    library(COSNet)
    n <- nrow(W)
    y <- Y[, 1]
    folds <- find.division.strat(y, 1:n, 3)

The function nrow provides the number of rows of W, that is, the number of proteins we have considered. As labelling we have considered the first column of matrix Y, which has 71 positives out of 5775. Since COSNet needs a -1/0/1 labelling (-1 for negative, 0 for unlabelled and 1 for positive instances), we first transform y into a -1/1 vector, then set to 0 the labels of the fold we want to predict.

    y <- Y[, 1]
    y[y == 0] <- -1
    y[folds[[1]]] <- 0

We have set to 0 the labels in the first fold.

Running COSNet

We can now run COSNet to predict the hidden labels, both without regularization (cost=0) and with regularization (cost=0.0001):

    out <- COSNet(W, y, cost=0)
    outR <- COSNet(W, y, cost=0.0001)

We suggest consulting the Reference Manual for details about the parameters.
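The -1/0/1 labelling passed to COSNet above can be checked on toy data before working with a real network. The following is a minimal base-R sketch under stated assumptions: the label vector y01, the helper stratified_folds and the object folds_toy are hypothetical names introduced here, and the helper is only a simple stand-in that illustrates what a stratified partition does. It is not the implementation of find.division.strat.

```r
set.seed(1)  # only to make the shuffling reproducible

# Toy 0/1 labels for 12 hypothetical proteins p1..p12 (3 positives)
y01 <- c(1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0)
names(y01) <- paste0("p", 1:12)

# Hypothetical stand-in for a stratified split (NOT find.division.strat):
# shuffle positives and negatives separately, then deal each group
# round-robin into k folds so that positives are spread evenly.
stratified_folds <- function(labels, k) {
  pos <- which(labels == 1)
  neg <- which(labels != 1)
  pos <- pos[sample.int(length(pos))]
  neg <- neg[sample.int(length(neg))]
  fold_id <- integer(length(labels))
  fold_id[pos] <- rep_len(seq_len(k), length(pos))
  fold_id[neg] <- rep_len(seq_len(k), length(neg))
  split(seq_along(labels), fold_id)
}
folds_toy <- stratified_folds(y01, 3)

# Encode the labels: negatives become -1, positives stay 1,
# and the labels of the fold to be predicted are hidden as 0
y_enc <- y01
y_enc[y_enc == 0] <- -1
y_enc[folds_toy[[1]]] <- 0

# Counts of negative (-1), hidden (0) and positive (1) labels
print(table(y_enc))
```

A quick look at the table of encoded labels shows whether the hidden fold has the expected size and whether positives remain among the visible labels.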
We can read the inferred binary predictions in the field pred of the object out, whereas the discriminant scores are contained in the field scores. A discriminant score is a real value such that the higher the value, the more likely the protein belongs to the considered class.

    out$pred[1:5]
    YAL024C YAL010C YAL028W YAL002W YAL053W
         -1      -1      -1      -1      -1
    out$scores[1:5]
       YAL024C    YAL010C    YAL028W    YAL002W    YAL053W
    -0.2216668 -0.1652760 -0.1623203 -0.1654453 -0.1658599

We can now count, for example, the number of true and false positives and of true and false negatives. We use the names of the vector out$pred, stored in testN, to select the corresponding labels, which ensures we compare the same instances.

    testN <- names(out$pred)
    TP <- sum(out$pred == 1 & Y[testN, 1]==1)
    TP
    [1] 16
    FP <- sum(out$pred == 1 & Y[testN, 1]==0)
    FP
    [1] 0
    FN <- sum(out$pred == -1 & Y[testN, 1]==1)
    FN
    [1] 8
    TN <- sum(out$pred == -1 & Y[testN, 1]==0)
    TN
    [1] 1901

Please see the Reference Manual and the Bioconductor vignette for details about the other fields of the out object.

Testing COSNet through a cross validation procedure

In this section we apply a cross validation procedure to evaluate the generalization capabilities of COSNet. In particular, at each step the labels of one fold are hidden and then predicted by COSNet, using the labels in the remaining folds. At the end of the procedure, a label for the current class has been inferred for every protein. We apply the cosnet.cross.validation function provided by the package to perform a 3-fold cross validation:

    y <- as.matrix(Y[ ,1])
    y[y == 0] <- -1
    CV <- cosnet.cross.validation(y, W, nfolds = 3, cost = 0)
    CV2 <- cosnet.cross.validation(y, W, nfolds = 3, cost = 0)
    CVreg <- cosnet.cross.validation(y, W, nfolds = 3, cost = 0.0001)

The instructions above run the cross validation procedure for both the unregularized version of COSNet (twice, in CV and CV2) and the regularized version (CVreg). Since the function can perform the cross validation on all the functional classes at the same time, the input labels are expected to form a matrix.
For this reason we have converted the label vector to a matrix with the R function as.matrix. For the regularized version, we have set the cost parameter to 0.0001, after tuning it on a small part of the training set. We suggest tuning this parameter whenever the data set changes. See the Reference Manual for details about this parameter.

The object CV contains both binary predictions and discriminant scores, in addition to the input labels. It is now possible to compute any performance measure the user needs. Here, as done before, we compute for instance the number of true positives, true negatives, false positives and false negatives.

    TP <- sum(CV$predictions == 1 & CV$labels[, 1] == 1)
    TPreg <- sum(CVreg$predictions == 1 & CVreg$labels[, 1] == 1)
    FP <- sum(CV$predictions == 1 & CV$labels[, 1] == -1)
    FPreg <- sum(CVreg$predictions == 1 & CVreg$labels[, 1] == -1)
    FN <- sum(CV$predictions == -1 & CV$labels[, 1] == 1)
    FNreg <- sum(CVreg$predictions == -1 & CVreg$labels[, 1] == 1)
    TN <- sum(CV$predictions == -1 & CV$labels[, 1] == -1)
    TNreg <- sum(CVreg$predictions == -1 & CVreg$labels[, 1] == -1)

and display them

    cat(sep = "\t", TP, FP, FN, TN, "\n")
    47 21 24 5683
    cat(sep = "\t", TPreg, FPreg, FNreg, TNreg, "\n")
    47 12 24 5692

Finally, since the division into folds is performed randomly, and the COSNet dynamics updates neurons in a randomly defined order (see reference [1] in the section Overview of the COSNet R Package), the obtained predictions may differ slightly from one execution to another. Indeed, the results of the second unregularized execution, contained in the object CV2, are slightly different from those contained in CV.

    TP2 <- sum(CV2$predictions == 1 & CV2$labels[, 1] == 1)
    FP2 <- sum(CV2$predictions == 1 & CV2$labels[, 1] == -1)
    FN2 <- sum(CV2$predictions == -1 & CV2$labels[, 1] == 1)
    TN2 <- sum(CV2$predictions == -1 & CV2$labels[, 1] == -1)
    cat(sep = "\t", TP2, FP2, FN2, TN2, "\n")
    49 18 22 5686
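With classes this unbalanced, the large number of true negatives dominates the accuracy, so measures based on the positives are often more informative. As one possible follow-up, the sketch below derives precision, recall and F-score from the counts reported above. The helper prf is a hypothetical name introduced here for illustration and is not part of the COSNet package.

```r
# Hypothetical helper (not part of the COSNet package): precision,
# recall and F-score from confusion counts, returning 0 when a
# denominator is 0.
prf <- function(TP, FP, FN) {
  precision <- if (TP + FP > 0) TP / (TP + FP) else 0
  recall    <- if (TP + FN > 0) TP / (TP + FN) else 0
  f_score   <- if (precision + recall > 0) {
    2 * precision * recall / (precision + recall)
  } else {
    0
  }
  c(precision = precision, recall = recall, F = f_score)
}

# Counts copied from the three cross validation runs reported above
res <- rbind(CV    = prf(TP = 47, FP = 21, FN = 24),
             CVreg = prf(TP = 47, FP = 12, FN = 24),
             CV2   = prf(TP = 49, FP = 18, FN = 22))

print(round(res, 3))
```

On these counts the regularized run has the same recall as the first unregularized run but fewer false positives, so its precision and F-score are higher.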