Genes & Genetic Systems
Online ISSN : 1880-5779
Print ISSN : 1341-7568
ISSN-L : 1341-7568
Full papers
Protein-protein interaction prediction by combined analysis of genomic and conservation information
Abbasali Emamjomeh, Bahram Goliaei, Ali Torkamani, Reza Ebrahimpour, Nima Mohammadi, Ahmad Parsian
Journal, open access

2014, Volume 89, Issue 6, pp. 259-272

ABSTRACT

Protein-protein interactions (PPIs) are highly important because of their main role in cellular processes and biochemical pathways; therefore, PPIs can be very useful in the prediction of protein functions. Experimental techniques of PPI detection have certain drawbacks; hence computational methods can be used to complement wet-lab techniques. Such methods can be applied to PPI prediction as well as validation of experimental results. Computational algorithms can lead to many false PPI predictions, which in turn result in inadequate performance. We have developed a novel method based on combined analysis, entitled PPIccc. PPIccc combines three descriptors to predict PPIs: gene co-expression values, codon usage similarity, and conservation of surface residues between the protein products of a gene pair. Validation of results based on the Human Protein Reference Database (HPRD) indicated improvement of performance in our proposed method. The results also revealed that conservation of surface residues between proteins in combination with codon usage similarity of their related genes increases the performance of PPI prediction. This means that codon usage similarity and surface residues between proteins (only sequence-based features) can predict PPIs as well as PPIccc.

INTRODUCTION

Genome sequencing projects have already been completed for a large number of species and soon the complete genome sequence of many more species will be added to this list. These genomic data have led to the emergence of new insights into identifying functional processes in biological macromolecules, such as proteins. Systems biology, with particular emphasis on network reconstruction, plays a crucial role in the discovery of the biological functions of such types of macromolecules (Franzosa et al., 2009). Network reconstruction has different applications in the analysis of biological data, e.g. such techniques have been used for the analysis of gene expression data in co-expression networks (Torkamani et al., 2010), identification of certain trends in mutational data (Torkamani and Schork, 2009), recognition of differences between biological information and disease pathways across species (Miller et al., 2010), and identification of regulatory networks and transcriptional relationships (Wang et al., 2009). In all biological networks, correlation patterns are used to infer genetic relationships. The basic logic for reconstruction of these types of biological networks is that correlation patterns indicate common relationships among biological elements; therefore, a biological relationship can be inferred based on such correlations. Protein-protein interaction (PPI) is one of the most important biological relationships because of the significance of proteins in all living organisms. This importance is not only due to their individual activities but also because of their specific interactions with other proteins (Sharon et al., 2009). Indeed, some components of protein complexes cannot be used in the cell unless they are in contact with other components of the complex (Zhang et al., 2004).
Detection of PPIs has various applications in biology, for example in drug design (Wells and McClendon, 2007), mapping of signaling pathways in cells so as to better understand signal transduction in physiopathological processes (Pawson and Nash, 2000), prediction of PPIs between species to find therapeutic strategies (Emamjomeh et al., 2014) and prediction of protein functions (Hou and Chi, 2012). Therefore, PPI network reconstruction will be highly useful in gaining a better understanding of molecular mechanisms in cells (Theofilatos, 2011).

There are different in vitro techniques for detection of physical PPIs. Essentially, two main categories of in vitro techniques are used to recognize PPIs in the wet lab (Shoemaker and Panchenko, 2007b): low-throughput (e.g. X-ray crystallography, fluorescence resonance energy transfer, surface plasmon resonance and atomic force microscopy) and high-throughput (e.g. yeast two-hybrid system, affinity purification-mass spectroscopy, DNA microarray, protein chips, synthetic lethality and phage display) approaches. There is a gap between experimentally detected PPIs and real ones (Zahiri et al., 2013a). Moreover, there are some shortcomings in the results of PPI prediction using in vitro methods. For example, bias makes PPI predictions more inclined toward certain specific proteins such as globular proteins. Furthermore, in vitro methods usually recognize only permanent PPIs and therefore cannot detect all PPIs (Zahiri et al., 2013b). Generally, network reconstruction approaches are very successful in unveiling regulatory relationships and other interesting biological phenomena, but they may lead to a large number of false positive interactions (Mahdavi and Lin, 2007). It has been shown that the performance of such methods can be improved by certain computational methods (Rhodes et al., 2005), hence the incentive for the emergence of such methods to predict PPIs. The computational methods are regarded as a complement to the in vitro methods; in fact, a combination of experimental and computational methods can outperform either approach used alone, because of the reduction in the rate of false-positive generation (Mahdavi and Lin, 2007; Shoemaker and Panchenko, 2007a). There are different classes of PPI prediction methods:

1. Machine learning-based methods, including random forest (Chen and Liu, 2005), support vector machines (SVM) (Ben-Hur and Noble, 2005; Lo et al., 2005; Shen et al., 2007), naïve Bayes (Lu et al., 2005) and multilayer perceptron (MLP) (Keedwell and Narayanan, 2005). Different sequence-based and non-sequence-based features are used to train these methods.

2. Genomic context and structure of proteins-based methods; for example gene co-expression (Ideker et al., 2002), three-dimensional structural information (Aloy et al., 2004; Aloy and Russell, 2003), gene neighboring (Ideker et al., 2002), gene fusion (Enright et al., 1999) and phylogenetic relationships (Jothi et al., 2005).

3. Network topology-based methods (Chen et al., 2006; Liu et al., 2008).

4. Text mining or literature mining-based methods (Jaeger et al., 2008; Oyama et al., 2002).

In addition to the above-mentioned methods, codon usage can also predict PPI. In fact, codon usage similarity between two genes can be applied to recognize co-expressed genes in yeast (Jansen et al., 2003). Co-expressed genes also have similar synonymous codons in human and some other living organisms (Najafabadi et al., 2009). It has also been stated that codon usage of functionally and physically interacting proteins in a living organism is informative in predicting PPI (Najafabadi and Salavati, 2008). The evidence for this claim is that codon usage of interacting protein pairs differs significantly from that of randomly chosen ones (Zhou et al., 2012). This relationship may be due to function-specific codon usage, which is based on selective charging of the tRNA isoacceptors (Elf et al., 2003). The relationship has also been confirmed experimentally (Dittmar et al., 2005).

Physical PPI is created by surface amino acid residues between two proteins (Fraser et al., 2004), and not all residue combinations are equally acceptable as the contact residues in PPI (Lunt et al., 2010). Structural details of PPI have been conserved between homologous proteins in related species, and such surface residues are constrained at the interface between protein pairs (Lunt et al., 2010). Hence, PPIs can be predicted by the identification of mutually-constrained surface residues between interacting protein partners (Schug et al., 2009; Szurmant et al., 2008). This suggests that a PPI prediction approach based on mutually-constrained residues is possible (Procaccini et al., 2011); i.e. if two proteins share mutually-constrained residues across multiple species, then it can be inferred that those two proteins interact. It should be noted that mutually-constrained residues between interacting protein partners cannot be detected correctly using only one species (Morcos et al., 2011).

Reconstruction of biological networks can be achieved by different types of data. In our proposed method, we have integrated data related to different levels of the central dogma to reconstruct PPI networks (Chen et al., 2001). In this work, we have relied upon PPI prediction using similarity between two genes in the light of their gene co-expression values, codon usage and mutually-constrained surface residues between the protein products of those two genes. Indeed, gene pairs with a high degree of gene expression correlation and codon usage similarity, in addition to mutually-constrained residues across related species, can be excellent candidates for PPI prediction. Three descriptors, representing different properties of a protein that have been shown to be very important in protein function and PPI detection, were used to characterize the protein: genomic context information (codon usage similarity), transcription-based information (gene co-expression values), and structure-based information (conservation of surface residues between protein products of a gene pair). In addition to providing comprehensive information about the protein, these descriptors include the conservation of surface residues, which is a novel feature in the PPI prediction problem.

The aim of this work is PPI prediction where these three types of data are taken into account. Accordingly, this study has led to the development of a novel method called PPI prediction by integration of Co-expression, Codon usage and Conservation data (PPIccc, pronounced: PPI triple c), which can predict PPIs using integration of three descriptors (gene expression data, codon usage analysis and conserved regions of protein surface residues).

MATERIALS AND METHODS

It is expected that integration of gene expression values, codon usage similarity analysis and information related to evolutionarily conserved regions in the surface residues of proteins can be applicable to high-performance PPI prediction (Fig. 1A). This article relied upon three main steps: determination of gene co-expression on a pre-processed gene dataset, computation of codon usage-based gene similarity for the same pre-processed gene dataset, and calculation of mutual constraint between surface residues of their related proteins. At the final step, PPIs are predicted using five integrated and four non-integrated methods (nine methods in total). Then, performances of the nine methods are evaluated against a gold standard database. The Human Protein Reference Database or HPRD (Prasad et al., 2009) has been used as a reference database for validation of the results. We used the PPIs that were confirmed by at least two different experimental methods in this database (Fig. 1B). This subset includes 15,763 out of a total of 39,240 interactions; the total number of proteins was 5,632.

Fig. 1.

Overview of PPIccc method. (a) Usage of different levels of central dogma for PPI network reconstruction. Integration of codon usage similarity analysis with co-expressed genes (identified using microarray data) and mutually-constrained surface residues for each gene pair were applied to predict PPI. It finally leads to PPI network reconstruction. (b) HPRD database was applied for validation of results and discovery of accurate PPI.

Extracting and pre-processing of datasets

At the first step, the raw files (.CEL files) related to five melanoma datasets (namely, GSE8401, GSE22083, GSE12445, GSE12627 and GSE9118) consisting of 298 samples, were extracted from GEO/NCBI (Eskandarpour et al., 2009; Harlin et al., 2009; Muthusamy et al., 2006; Tock et al., 2011; Xu et al., 2008). These datasets were selected because of similarity in their experimental conditions. The microarray platform of the melanoma datasets was GPL96 or [HG-U133A] Affymetrix Human Genome U133A Array (Fig. 2A). Pre-processing steps must be performed to combine these datasets, remove statistical biases and then construct a corrected and combined dataset (Sims et al., 2008). The pre-processing phase consisted of three steps with regard to the raw files (.CEL files):

Fig. 2.

Pre-processing steps for microarray datasets. (a) Extracting of .CEL files related to 5 melanoma datasets. (b) Removing of bad probe-sets using cleaner1.03 (c) Data normalization and Presence/Absence calls (P/A calls) using MAS5.0 (d) Batch effect removal by ComBat. (e) Present/Absent genes after pre-processing steps.

a) Probe cleaning by Cleaner1.03 for the purpose of probe filtering and removal of low quality probe-sets (Alvarez et al., 2009) (Fig. 2B).

b) Data normalization (Hubbell et al., 2002) and Presence/Absence calls (P/A calls) for detection of present and absent genes (Warren, 2010) using MAS5.0 (Fig. 2C).

c) Removal of batch effects by ComBat (Johnson et al., 2007) (Fig. 2D). The final dataset consisted of expression values related to 7569 human genes (Fig. 2E).

Acquisition of co-expressed genes

The Algorithm for the Reconstruction of Accurate Cellular Networks or ARACNE (Margolin et al., 2006) was run on the pre-processed melanoma dataset as the input file, because it has a low false-positive rate. In this package, co-expressed genes are recognized after computation of mutual information (MI) between each gene pair and removal of the weakest interaction in each gene triplet. MI is a good metric for determination of expression similarity between two genes and is preferred to the Pearson correlation coefficient, since this coefficient can reveal only linear and direct relationships (Daub et al., 2004). Mutual information is calculated for X and Y variables as:

$$I(X;Y) = \sum_{i,j} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i)\,P(y_j)} \qquad (1)$$

where $P(x_i, y_j)$ is the joint probability, and $P(x_i)$ and $P(y_j)$ are the marginal probabilities, for the expression values of genes X and Y. The maximal information coefficient (MIC) has been suggested as another metric to determine expression similarity between two genes (Reshef et al., 2011). MIC can evaluate all types of functional relationships between two genes, and also gives similar scores for equally noisy relationships. We computed the MIC matrix between each gene pair by using the maximal information-based non-parametric exploration (MINE) software. This MIC matrix was also used as an input file for ARACNE. The ARACNE parameters were set as follows: kernel width using the accurate method = 0.15, MI threshold = 0.04 and DPI = 0.01. At the end of this step, we produced two output files for gene co-expression using ARACNE: a file consisting of co-expressed genes based on the MI and another file including co-expressed genes based on the MIC. At the final step, these two sorts of gene co-expression matrices were used for PPI prediction.
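As a minimal illustration of Eq. 1, the sketch below estimates MI from two expression profiles by equal-width binning. This is an assumption made for brevity: ARACNE itself uses a kernel-based estimator, and the function names and the default of four bins are hypothetical.

```python
import math
from collections import Counter

def discretize(values, bins):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=4):
    """Plug-in estimate of I(X;Y) in nats from paired expression values."""
    bx, by = discretize(x, bins), discretize(y, bins)
    n = len(x)
    joint = Counter(zip(bx, by))      # counts for P(x_i, y_j)
    px, py = Counter(bx), Counter(by)  # counts for P(x_i) and P(y_j)
    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        mi += p_ij * math.log(p_ij / ((px[i] / n) * (py[j] / n)))
    return mi

# Two identical profiles share all their information; with 4 equally
# filled bins the estimate equals ln(4).
profile = [1, 2, 3, 4, 5, 6, 7, 8]
mi_identical = mutual_information(profile, profile)
```

Binned estimates depend on the number of bins and on sample size, which is one reason kernel estimators are often preferred for continuous expression data.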

Computation of codon usage similarity between genes

This phase consisted of four steps (Fig. 3): a) extraction of coding region sequences for human genes; b) real-score computation for codon usage similarity of two genes using the Fisher exact test (FET) (Conniffe, 1991) and the Fisher combined probability test (Fisher’s method); c) final p-value calculation by iterated simulation of human gene sequences; and d) final comparison of codon usage similarity between human genes, i.e. the statistical significance test.

Fig. 3.

Overview of Computation of codon usage similarity between two genes. (a) Extraction of coding sequences of human genes. (b) Real score calculation for codon usage similarity of two genes using FET. (c) Iterated substitution of human genes' sequences. (d) Calculation of final p-value for comparison of two genes as codon usage. (e) Overall flowchart for computation of final p-value to compare two genes in the light of codon usage.

Extraction of coding sequences of human genes

Coding region sequences of human genes are extracted for codon usage similarity analysis of each gene pair. For this purpose, the whole genome of Homo sapiens and its related known gene file containing the entire information of human genes were downloaded from the genome browser at the University of California, Santa Cruz (UCSC) (Kent et al., 2002) (Fig. 3A). Afterwards, coding region sequences of human genes are trimmed based on known gene file information, e.g. chromosomal location, strand orientation, and number of exons and the start/end points of exons.
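The trimming step can be sketched as follows. This is a simplified illustration rather than the authors' script: it assumes UCSC-style 0-based, half-open coordinates (as in the known gene fields cdsStart, cdsEnd, exonStarts and exonEnds), and the function and argument names are hypothetical.

```python
# Translation table for complementing a DNA string.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def extract_cds(chrom_seq, strand, cds_start, cds_end, exon_starts, exon_ends):
    """Join the coding part of each exon; coordinates are 0-based, half-open."""
    pieces = []
    for start, end in zip(exon_starts, exon_ends):
        # Clip the exon to the coding region; skip exons that are fully UTR.
        s, e = max(start, cds_start), min(end, cds_end)
        if s < e:
            pieces.append(chrom_seq[s:e])
    cds = "".join(pieces)
    if strand == "-":
        # Genes on the minus strand are read as the reverse complement.
        cds = cds.translate(COMPLEMENT)[::-1]
    return cds

# Toy chromosome: two exons (0-8 and 15-20) separated by an intron.
toy_chrom = "CCATGGCTGTTTTAGTAAGC"
toy_cds = extract_cds(toy_chrom, "+", 2, 18, [0, 15], [8, 20])
```

In this toy case the result is the nine-base coding sequence ATGGCTTAA, with the untranslated flanks and the intron removed.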

Real score for codon usage similarity of two genes

FET is a statistical test for the analysis of contingency tables with small sample size, and can be used for the codon usage similarity test of two genes (Plotkin et al., 2004). We used FET to calculate the real score for codon usage similarity of two genes (Fig. 3B). In this test, a p-value is calculated separately for each amino acid from the absolute frequencies of its synonymous codons in the two genes, and finally the p-values for all amino acids of the two genes are combined by Fisher’s method as follows:

$$\text{Combined p-value} = -2 \sum_{i=1}^{k} \ln(p_i) \qquad (2)$$
where $p_i$ is the p-value for the ith amino acid and k is the total number of amino acids encoded by synonymous codons. The combined p-value is considered as the real score between two genes (Table 1).
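Table 1. Calculation of combined p-value using FET and Fisher’s method for comparison of two putative genes in the light of codon usage similarity

| Codon | Gene 1 | Gene 2 | Amino acid | P-value |
|-------|--------|--------|------------|---------|
| CUU | 12 | 8 | Leu | 0.071 |
| CUC | 7 | 12 | | |
| CUA | 3 | 14 | | |
| CUG | 7 | 10 | | |
| AUU | 4 | 7 | Ile | 0.049 |
| AUC | 16 | 7 | | |
| AUA | 15 | 24 | | |
| AAU | 10 | 25 | Asn | 4.00E-05 |
| AAC | 22 | 5 | | |

Combined P-value: $-2 \sum_{i=1}^{k} \ln(p_i) = 31.57$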
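As a worked check of Eq. 2, the following sketch combines the three per-amino-acid p-values listed in Table 1. The function name is hypothetical, and the per-amino-acid FET step itself is not reproduced here.

```python
import math

def fisher_combined_statistic(p_values):
    """Fisher's method: -2 times the sum of the natural logs of the p-values."""
    return -2.0 * sum(math.log(p) for p in p_values)

# Per-amino-acid p-values from Table 1 (Leu, Ile, Asn).
table1_p_values = [0.071, 0.049, 4.00e-05]
real_score = fisher_combined_statistic(table1_p_values)
print(round(real_score, 2))
```

The result agrees with the value reported in Table 1 to within rounding. Note that this quantity is a test statistic rather than a probability: under Fisher's method it is referred to a chi-squared distribution with 2k degrees of freedom, whereas here it is used directly as the real score that the simulated scores are compared against.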

Iterated substitution of human genes sequences

After computation of the real score for codon usage similarity of each gene pair, a large number of random scores were generated for codon usage similarity of each gene pair (Fig. 3C). To this end, we produced numerous putative DNA sequences for each gene by substitution of synonymous codons corresponding to each amino acid, so that the protein sequence related to each transcript remained unchanged. The parameter N, set to 10^6, is the number of random sequences that was considered sufficient for our sequence simulation. The combined p-value was calculated for each randomly generated gene sequence using FET and Fisher’s method (i.e. random scores). Subsequently, the number of times, n, that random scores were greater than the real score for each gene pair was counted (Fig. 3D). This process is depicted in a schematic description (Fig. 3E). The final p-value for each gene pair was calculated as follows (with a pseudo-count of 1):

$$\text{Final p-value} = \frac{1+n}{1+N} \qquad (3)$$

where N and n are as described above. Most of the implemented methods take advantage of parallelization techniques and were run on the high performance computing (HPC) cluster of the Iranian Institute of Research in Fundamental Sciences (Tehran), which comprises ~400 computational cores. Due to the enormous size of the data and limited resources, the tasks took approximately six months to complete.
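The simulation loop and Eq. 3 can be sketched as below. The sketch makes two assumptions that the text does not specify: each codon is replaced by a uniformly chosen synonymous codon, and the codon table is a deliberately tiny, hypothetical one covering only two amino acids. The scoring of each random sequence (FET plus Fisher's method) is left out.

```python
import random

# Hypothetical, deliberately small synonymous-codon groups (not the full code).
GROUPS = {"Asn": ["AAU", "AAC"], "Ile": ["AUU", "AUC", "AUA"]}
SYNONYMS = {codon: group for group in GROUPS.values() for codon in group}

def randomize_synonymous(codons, rng):
    """Replace each codon by a random synonymous one; the protein is unchanged."""
    return [rng.choice(SYNONYMS[c]) for c in codons]

def final_p_value(real_score, random_scores):
    """Empirical p-value with a pseudo-count of 1: (1 + n) / (1 + N)."""
    n = sum(1 for s in random_scores if s > real_score)
    return (1 + n) / (1 + len(random_scores))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = ["AAU", "AUA", "AAC", "AUC"]
simulated = randomize_synonymous(original, rng)

# With N = 4 random scores, two of which exceed the real score, n = 2.
p_example = final_p_value(3.0, [1.0, 2.0, 4.0, 5.0])  # (1 + 2) / (1 + 4) = 0.6
```

The pseudo-count keeps the estimate strictly positive, so the smallest attainable final p-value is 1 / (1 + N).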

Final comparison of codon usage similarity

Null and alternative hypotheses (H0 and H1) of the codon usage similarity test indicated codon usage similarity and non-similarity of two genes, respectively. If the final p-value > 0.05 (high p-values, accepting H0), it means that the two genes have similar codon usage, and if the final p-value ≤ 0.05 (low p-values, rejecting H0), we can conclude that the two genes have different codon usage.

Calculating mutually-constrained conservation

It is necessary to use homologous proteins across related species for such prediction. It should also be mentioned that the correct identification of mutually-constrained surface residues between two proteins is based on known protein interactions; however, we need to computationally predict the likely residues involved in PPI.

For this purpose, total homologous sequences of each human protein related to nine available animal species (Table 2) were obtained from the publicly available HomoloGene-NCBI database. Various steps were performed for the calculation of this stage (Fig. 4). These human protein sequences are the products of the pre-processed gene dataset in the previous step. The multiple sequence comparison by log-expectation (MUSCLE) algorithm (Edgar, 2004) was then run for the purpose of multiple sequence alignment (MSA) between each human protein sequence and its homologous sequences (Fig. 4A). On the other hand, the surface residues of the human proteins were detected by prediction of solvent accessibility from protein sequence using random forest method or RSARF (Fig. 4B) (Pugalenthi et al., 2012). Furthermore, conserved blocks of surface residues were detected between each human protein sequence and its homologous sequences for each MSA (Fig. 4C). Accordingly, all possible pairs of conserved blocks were concatenated (Weigt et al., 2009) for similarity analysis regarding surface residues of their proteins (Fig. 4D). Finally, direct coupling analysis (DCA) was carried out for all concatenated conserved blocks and direct coupling (DI) values were calculated for all positions in each of them (Fig. 4E). DI values were calculated as follows (Lunt et al., 2010):

$$DI_{ij} = \sum_{(A_i, A_j)} P_{ij}^{(dir)}(A_i, A_j) \ln \frac{P_{ij}^{(dir)}(A_i, A_j)}{f_i(A_i)\, f_j(A_j)} \qquad (4)$$
where $DI_{ij}$ is the DI value between the ith and jth positions of homologous protein sequences, $f_i(A_i)$ and $f_j(A_j)$ are the frequencies of residue A in the ith and jth positions of homologous protein sequences, and $P_{ij}^{(dir)}(A_i, A_j)$ is the direct pair distribution, which is related to two coupled variables with unique direct links.
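To make Eq. 4 concrete, the sketch below evaluates DI for one pair of positions, given a direct pair distribution and the single-site frequencies. It is a toy illustration only: the two-letter alphabet and the function name are assumptions, and the DCA inference that produces the direct pair distribution from an MSA is not shown.

```python
import math

def direct_information(p_dir, f_i, f_j):
    """Sum of P_dir(a, b) * ln(P_dir(a, b) / (f_i(a) * f_j(b))) over residue pairs."""
    di = 0.0
    for a, row in enumerate(p_dir):
        for b, p in enumerate(row):
            if p > 0.0:  # terms with zero probability contribute nothing
                di += p * math.log(p / (f_i[a] * f_j[b]))
    return di

freqs = [0.5, 0.5]  # toy single-site frequencies for a two-letter alphabet

# Independent positions: the pair distribution factorizes, so DI is 0.
di_independent = direct_information([[0.25, 0.25], [0.25, 0.25]], freqs, freqs)

# Perfectly coupled positions: DI reaches ln(2) for this toy alphabet.
di_coupled = direct_information([[0.5, 0.0], [0.0, 0.5]], freqs, freqs)
```

The two extremes show why high DI values are read as evidence of direct coupling between two positions.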
Table 2. Species used in Homologene database

| Scientific name | English name | NCBI taxonomy ID |
|-----------------|--------------|------------------|
| Mus musculus (M. musculus) | House mouse | 10090 |
| Rattus norvegicus (R. norvegicus) | Rat | 10116 |
| Danio rerio (D. rerio) | Zebrafish | 7955 |
| Gallus gallus (G. gallus) | Red junglefowl | 9031 |
| Macaca mulatta (M. mulatta) | Rhesus macaque | 9544 |
| Pan troglodytes (P. troglodytes) | Common chimpanzee | 9598 |
| Homo sapiens (H. sapiens) | Human | 9606 |
| Canis lupus familiaris (C. lupus) | Dog | 9615 |
| Bos taurus (B. taurus) | Cattle | 9913 |

Fig. 4.

Overview of the calculation of mutually-constrained conservation. (a) Finding homologous sequences of human proteins in related species using the Homologene/NCBI database. (b) Determination of surface residues in each human protein using the RSARF program. (c) Determination of conserved blocks in surface residues between each human protein and its homologous sequences in related species. (d) Concatenation of each pair of conserved surface-residue blocks related to two proteins. (e) Calculation of DI for each concatenated block using DCA.

The DI value measures the amount of direct coupling between two amino acid residues (positions i and j in a conserved block); it is the part of MI that arises from the direct coupling of two surface residues. We took the maximum DI value among all residue pairs of two proteins as the DI between the surface residue blocks of those two proteins. DI can be considered analogous to MI: high (significant) DI values, i.e. strong direct coupling between two residues of a protein pair, might be exploited to predict physical contact for that protein pair. To determine a statistical significance threshold for DI, a ROC curve was plotted, which visually illustrates the sensitivity/specificity trade-off at varied thresholds (Fawcett, 2006). The ROC curve was drawn from the true and false positive rates at DI thresholds ranging from 0 to 1 (0.01, 0.02, …, 0.99, 1). We then determined the optimum point of the ROC curve as the point at which the gradient of the curve equals 5% of its maximum gradient. This optimum point, which combines a high true positive rate with a low false positive rate, was used to set the threshold: the DI threshold corresponding to the optimum point of the ROC plot was taken as the statistical significance threshold of DI.
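The threshold sweep behind the ROC curve can be sketched as follows. The helper names and the four toy protein pairs are hypothetical, and the 5% gradient rule for picking the optimum point is not implemented here, since its exact numerical form is not given in the text.

```python
def pair_score(di_values):
    """DI of a protein pair: the maximum DI over all residue pairs."""
    return max(di_values)

def roc_points(scores, labels, thresholds):
    """Return (threshold, TPR, FPR); a pair is called positive if score >= threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        points.append((t, tp / positives, fp / negatives))
    return points

thresholds = [k / 100 for k in range(1, 101)]  # 0.01, 0.02, ..., 1.00

# Toy data: per-pair DI values and whether the pair is a known interaction.
scores = [pair_score(v) for v in ([0.9, 0.1], [0.3, 0.2], [0.15], [0.05, 0.01])]
labels = [1, 1, 0, 0]
curve = roc_points(scores, labels, thresholds)
```

Each entry of `curve` is one point of the ROC plot; any rule for choosing the operating point then works on this list.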

PPI prediction methods

Finally, we predicted PPIs using five methods involving different integration of gene co-expression data, codon usage similarity of genes and mutual constraint of surface residues for proteins. We also predicted PPIs using four non-integrated methods (nine methods in total). Below is a detailed description of the PPI prediction methods:

I. PPI prediction only by DI values calculated between conserved blocks of each protein pair, which includes protein pairs with significant DI between their conserved blocks (only high DI values).

II. PPI prediction only by MI-based ARACNE results, which includes protein pairs with high MI between expression values of their genes (only high MI values).

III. PPI prediction only by MIC-based ARACNE results, which includes protein pairs with high MIC between expression values of their genes (only high MIC values).

IV. PPI prediction only by codon usage similarity between their genes, which includes protein pairs with significant similarity of codon usage between their genes (high p-value of FET).

V. PPI prediction by integration of MI-based ARACNE results and mutual constraint of surface residues for proteins, which includes protein pairs with significant similarity in mutual constraint of their surface residues (high DI), and high MI between expression of their genes (high MI).

VI. PPI prediction by integration of MIC-based ARACNE results and mutual constraint of surface residues for proteins, which includes protein pairs with significant similarity in mutual constraint of their surface residues (high DI) and high MIC between expression of their genes (high MIC).

VII. PPI prediction by integration of codon usage similarity and mutual constraint of surface residues for proteins, which includes protein pairs with significant similarity in mutual constraint of their surface residues (high DI) and significant similarity of codon usage between their genes (high p-value of FET).

VIII. PPI prediction by integration of MI-based ARACNE results, codon usage similarity and mutual constraint of surface residues for proteins, which includes protein pairs with significant similarity in mutual constraint of their surface residues (high DI), significant similarity of codon usage between their genes (high p-value of FET) and high MI between their gene expressions (high MI).

IX. PPI prediction by integration of MIC-based ARACNE results, codon usage and mutual constraint of surface residues for proteins, which includes protein pairs with significant similarity in mutual constraint of their surface residues (high DI), significant similarity of codon usage between their genes (high p-value of FET) and high MIC between their gene expressions (high MIC).

At the end, the performances of these methods are evaluated, yielding the best method.
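The nine methods differ only in which descriptors a protein pair must satisfy. A minimal sketch of that logic is given below, assuming the criteria are combined with a logical AND, a DI threshold of 0.2 and a codon usage cut-off of p > 0.05; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PairEvidence:
    di: float          # maximum DI between conserved surface blocks
    fet_p: float       # final p-value of the codon usage similarity test
    coexpressed: bool  # edge retained by ARACNE (MI- or MIC-based)

DI_THRESHOLD = 0.2  # assumed significance threshold for DI
FET_ALPHA = 0.05    # p-values above this indicate similar codon usage

def predict(evidence, use_di=False, use_fet=False, use_coexpr=False):
    """Predict an interaction only if every selected descriptor supports it."""
    checks = []
    if use_di:
        checks.append(evidence.di >= DI_THRESHOLD)
    if use_fet:
        checks.append(evidence.fet_p > FET_ALPHA)
    if use_coexpr:
        checks.append(evidence.coexpressed)
    return bool(checks) and all(checks)

pair = PairEvidence(di=0.35, fet_p=0.40, coexpressed=False)
fet_di_call = predict(pair, use_di=True, use_fet=True)                   # method VII
ppiccc_call = predict(pair, use_di=True, use_fet=True, use_coexpr=True)  # method IX
```

For this toy pair, method VII predicts an interaction while method IX does not, because the pair is not co-expressed. Requiring more descriptors can only remove predictions, which trades recall for specificity.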

Validation of results

HPRD is a very useful and common database for evaluation of performance in PPI prediction methods. We used HPRD as a reliable gold standard database to validate and make a comparison between results of the above-mentioned nine different PPI prediction methods.

RESULTS

After three steps of pre-processing the gene expression dataset, four plots were produced by the ComBat software to ascertain whether the primary assumptions were correct (Fig. 5). As can be inferred from these plots, the primary assumptions held and hence the pre-processed dataset could be used in the next steps. In addition, we checked the quality of the codon usage similarity results using the Human Leukocyte Antigen (HLA) gene family located on chromosome 6. There were a few HLA genes in our dataset; these genes are good positive controls for a quality test of codon usage similarity. A pair of genes belonging to the HLA family is expected to have similar scores for the codon usage similarity test. Results of FET for the HLA genes in our dataset indicated high similarity between these genes, and therefore, FET results of appropriate quality are expected for other genes (Table 3). Having found the optimum point in the ROC curve, the DI statistical significance threshold was set to 0.2 as described previously (Fig. 6). Subsequently, each PPI prediction falls into one of the four possible outcomes described below, on which the validation and evaluation of PPIccc are based.

Fig. 5.

Checking for primary assumptions of distribution in pre-processed dataset using ComBat. (a) Additive batch parameters for all genes which have normal distribution. (b) Multiplicative batch parameters for all genes which have gamma distribution. (c) Q-Q plot for Additive batch parameters of all genes. (d) Q-Q plot for multiplicative parameters of all genes.

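Table 3. FET results for HLA gene family

| Gene 1 | Gene 2 | Similarity test result |
|--------|--------|------------------------|
| HLA-A | HLA-B | similar |
| HLA-A | HLA-C | similar |
| HLA-A | HLA-DMB | similar |
| HLA-A | HLA-E | similar |
| HLA-A | HLA-F | similar |
| HLA-A | HLA-G | similar |
| HLA-C | HLA-DMB | similar |
| HLA-DMB | HLA-E | similar |
| HLA-DMB | HLA-G | similar |
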
Fig. 6.

ROC plot for determination of DI statistical significance threshold. Arrow shows the optimum point of curve. Optimum point of a saturated curve is the point where gradient of curve in that point is equal to 5% maximum gradient of the curve. DI values ≥ 0.2 are significant.

• True Positive (TP): Number of correct PPI predictions (or correctly accepted PPI predictions).

• True Negative (TN): Number of correct predictions of non-interacted proteins (or correctly rejected PPI predictions).

• False Positive (FP): Number of non-interacted proteins which have been falsely predicted as interacted proteins (or incorrectly accepted PPIs).

• False Negative (FN): Number of interacted proteins which have been falsely predicted as non-interacted proteins (or incorrectly rejected PPIs).

Confusion matrix, which is a specific table that allows evaluation of performance of a method (Stehman, 1997), was computed for PPIccc and other possible methods (Table 4).

Table 4. Confusion matrix for nine methods

| Method of integration | Predicted PPI, actual PPI (TP) | Predicted PPI, actual non-PPI (FP) | Predicted non-PPI, actual PPI (FN) | Predicted non-PPI, actual non-PPI (TN) |
|-----------------------|-------|----------|------|----------|
| DI-only | 12701 | 10245714 | 3062 | 16088693 |
| MI-only | 15324 | 12543856 | 439 | 13790551 |
| MIC-only | 14789 | 11066274 | 974 | 15268133 |
| FET-only | 14906 | 11329717 | 857 | 15004690 |
| MI + DI | 14084 | 6988155 | 1679 | 19346252 |
| MIC + DI | 13906 | 6330248 | 1857 | 20004159 |
| FET + DI | 12582 | 5286265 | 3181 | 21048142 |
| MI + FET + DI | 12945 | 5756916 | 2818 | 20577491 |
| MIC + FET + DI (PPIccc) | 10984 | 4987265 | 4779 | 21347142 |

DI-only means PPI prediction only by DI values; MI-only means PPI prediction only by gene co-expression using MI-based ARACNE; MIC-only means PPI prediction only by gene co-expression using MIC-based ARACNE; FET-only means PPI prediction only by codon usage similarity; MI + DI means PPI prediction by integration of DI values and gene co-expression using MI-based ARACNE; MIC + DI means PPI prediction by integration of DI values and gene co-expression using MIC-based ARACNE; FET + DI means PPI prediction by integration of DI values and codon usage similarity; MI + FET + DI means PPI prediction by integration of DI values, gene co-expression using MI-based ARACNE and codon usage similarity; MIC + FET + DI means PPI prediction by integration of DI values, gene co-expression using MIC-based ARACNE and codon usage similarity.

Protein-protein interaction networks are very sparse: the number of interacting protein pairs is far smaller than the number of all possible protein pairs in a proteome. Hence, interaction data are highly imbalanced, which can bias the classification problem. In such situations, several performance measures should be used to assess the classification results. For imbalanced data, the F-measure provides more insight into the behavior of a classifier than other metrics such as accuracy (He and Garcia, 2009). Recall is the proportion of real PPIs that are correctly detected by a method, whereas specificity is the proportion of non-interacting pairs that are correctly recognized as negatives. These metrics are calculated as follows:

\[ \mathrm{Recall} = \frac{TP}{TP + FN}, \tag{5} \]

\[ \mathrm{Specificity} = \frac{TN}{TN + FP}, \tag{6} \]
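
As a minimal sketch of Eqs. 5 and 6, the snippet below recomputes recall and specificity from the Table 4 counts for two of the nine methods. The dictionary layout and the function names are illustrative choices made here and are not taken from the authors' scripts.

```python
# Counts copied from Table 4 (MI-only and PPIccc rows).
# The dictionary layout is an assumption made for this illustration.
TABLE4 = {
    "MI-only": {"TP": 15324, "FP": 12543856, "FN": 439, "TN": 13790551},
    "PPIccc": {"TP": 10984, "FP": 4987265, "FN": 4779, "TN": 21347142},
}


def recall(counts):
    """Fraction of known PPIs that are predicted as PPIs (Eq. 5)."""
    return counts["TP"] / (counts["TP"] + counts["FN"])


def specificity(counts):
    """Fraction of non-interacting pairs predicted as non-PPI (Eq. 6)."""
    return counts["TN"] / (counts["TN"] + counts["FP"])


for name, counts in TABLE4.items():
    print(f"{name}: recall={recall(counts):.2f}, "
          f"specificity={specificity(counts):.2f}")
```

Rounded to two decimals, the values agree with the recall and specificity reported for these two methods.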

The highest recall (0.97) and the highest specificity (0.81) were obtained with MI-only and PPIccc, respectively. Another important performance metric is accuracy, which denotes the closeness of the predicted results to the actual (true) results. We used accuracy for the final selection among the different methods of PPI prediction. PPIccc had the highest accuracy (0.81), but only marginally (Fig. 7). Because the dataset is dominated by non-interacting pairs, accuracy and specificity were almost the same for each method. Accuracy is calculated by the following formula:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{7} \]
Fig. 7. Comparison of accuracy between different methods. The highest accuracy is obtained with our proposed method (PPIccc).
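
The near-equality of accuracy and specificity can be seen by rewriting Eq. 7 in terms of Eqs. 5 and 6. The short derivation below is a sketch; the symbol \pi (the fraction of interacting pairs among all pairs) is introduced here for illustration and does not appear in the original text. Its value is taken from the counts in Table 4.

```latex
\begin{align*}
\pi &= \frac{TP + FN}{TP + TN + FP + FN}
     = \frac{15763}{26350170} \approx 6.0 \times 10^{-4} \\[4pt]
\mathrm{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}
     = \pi \cdot \mathrm{Recall} + (1 - \pi) \cdot \mathrm{Specificity} \\[4pt]
\mathrm{Accuracy}_{\mathrm{PPIccc}} &\approx 0.0006 \times 0.70 + 0.9994 \times 0.81 \approx 0.81
\end{align*}
```

Because \pi is so small, the recall term contributes almost nothing, so accuracy tracks specificity for every method in Table 4.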

Another metric is the geometric mean (GM) of the true positive rate (recall) and the true negative rate (specificity), which is high only when a method performs well on both classes. PPI prediction by the MIC + DI method had the highest GM (0.82). The F-measure, another metric for evaluating PPI prediction, was highest for PPIccc. The GM and F-measure are defined as follows, where precision is TP/(TP + FP):

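\[ \mathrm{GM} = \sqrt{\mathrm{Specificity} \times \mathrm{Recall}}, \tag{8} \]

\[ F\text{-measure} = \frac{2\,(\mathrm{Recall} \times \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}}, \tag{9} \]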
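
To illustrate Eqs. 8 and 9, the following sketch evaluates the GM and the F-measure for the MIC + DI row of Table 4. Precision is computed with its standard definition, TP/(TP + FP), which is an assumption in the sense that the text does not write it out; the variable names are likewise illustrative.

```python
from math import sqrt

# MIC + DI row of Table 4.
TP, FP, FN, TN = 13906, 6330248, 1857, 20004159

recall = TP / (TP + FN)
specificity = TN / (TN + FP)
# Standard definition of precision (assumed; not written out in the text).
precision = TP / (TP + FP)

# Eq. 8: geometric mean of the true positive and true negative rates.
gm = sqrt(specificity * recall)
# Eq. 9: harmonic mean of recall and precision.
f_measure = 2 * (recall * precision) / (recall + precision)

print(f"GM={gm:.2f}, precision={precision:.4f}, F-measure={f_measure:.3f}")
```

The F-measure is tiny for every method because precision is dominated by the millions of false positives, even when recall is high.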

The above metrics were computed to evaluate the performance of each method (Table 5). Based on the performance reported in this table, the combination of gene co-expression (using MIC-based ARACNE), codon usage similarity and DI values, together designated PPIccc, predicts PPIs with the highest specificity (0.81) and F-measure (0.005) of the nine methods.

Table 5. Performance metrics for nine methods of PPI prediction

| Method No. | Method of PPI prediction | Recall | Specificity | GM | F-measure |
| I | DI-only | 0.81 | 0.61 | 0.70 | 0.002 |
| II | MI-only | 0.97 | 0.52 | 0.71 | 0.002 |
| III | MIC-only | 0.94 | 0.58 | 0.74 | 0.003 |
| IV | FET-only | 0.95 | 0.57 | 0.73 | 0.003 |
| V | MI + DI | 0.89 | 0.73 | 0.81 | 0.004 |
| VI | MIC + DI | 0.88 | 0.76 | 0.82 | 0.004 |
| VII | FET + DI | 0.79 | 0.80 | 0.79 | 0.004 |
| VIII | MI + FET + DI | 0.82 | 0.78 | 0.80 | 0.004 |
| IX | MIC + FET + DI (PPIccc) | 0.70 | 0.81 | 0.75 | 0.005 |
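
Different metrics favor different methods. As a small sketch, the snippet below takes the values exactly as reported in Table 5 and selects the top-ranked method for each metric. The names `TABLE5`, `METRICS` and `best` are hypothetical and used only for this illustration.

```python
# Values as reported in Table 5: (recall, specificity, GM, F-measure).
TABLE5 = {
    "DI-only": (0.81, 0.61, 0.70, 0.002),
    "MI-only": (0.97, 0.52, 0.71, 0.002),
    "MIC-only": (0.94, 0.58, 0.74, 0.003),
    "FET-only": (0.95, 0.57, 0.73, 0.003),
    "MI + DI": (0.89, 0.73, 0.81, 0.004),
    "MIC + DI": (0.88, 0.76, 0.82, 0.004),
    "FET + DI": (0.79, 0.80, 0.79, 0.004),
    "MI + FET + DI": (0.82, 0.78, 0.80, 0.004),
    "MIC + FET + DI (PPIccc)": (0.70, 0.81, 0.75, 0.005),
}
METRICS = ("recall", "specificity", "GM", "F-measure")

# For each metric (column index), pick the method with the largest value.
best = {
    metric: max(TABLE5, key=lambda method: TABLE5[method][column])
    for column, metric in enumerate(METRICS)
}

for metric, method in best.items():
    print(f"highest {metric}: {method}")
```

On the reported values, MI-only ranks first for recall, MIC + DI for GM, and PPIccc for both specificity and F-measure.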

DISCUSSION

The function of a protein depends on its interactions with other proteins. PPI detection is of the utmost importance for understanding and elucidating the regulatory mechanisms of cellular processes such as DNA replication, transcription and metabolic pathways.

In this study, we proposed a new method, named PPIccc, which integrates gene co-expression (computed with MIC-based ARACNE), codon usage-based gene similarity and mutual constraint in surface residues of protein sequences for more accurate PPI prediction (accuracy = 0.81). This information can be obtained for any protein whose sequence, gene sequence and expression values are known. Integrating these types of information gives better predictive ability than non-integrative methods, which have a high rate of false positives (Yu et al., 2010). PPIccc uses MIC-based ARACNE to detect co-expressed genes, but MI-based ARACNE also produces acceptable results (accuracy = 0.78). Because MI-based ARACNE has a lower computational cost than MIC-based ARACNE, MI can be applied instead of MIC to detect gene co-expression when computational hardware is limited. Furthermore, PPIs can be predicted almost as well as with PPIccc (accuracy = 0.80) by considering only the sequence-based features, namely codon usage similarity and mutual constraint of protein surface residues.

As shown in Table 4, the proposed method produces many false positives (4987265), but it assigns about 81% of all possible protein pairs to the non-interacting class, (4779 + 21347142)/26350170 ≈ 0.81, and correctly rejects 81% of the truly non-interacting pairs (specificity = 0.81). Therefore, our method is useful for filtering out a considerable number of non-interacting pairs from all possible protein pairs, and it can substantially reduce the experimental cost of testing protein pairs for interaction.
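
The filtering argument can be made concrete with a short calculation on the PPIccc row of Table 4. This is a sketch only: the "enrichment" ratio (density of known PPIs among retained pairs divided by their density among all pairs) is a quantity introduced here for illustration and is not a metric reported by the authors.

```python
# PPIccc row of Table 4.
TP, FP, FN, TN = 10984, 4987265, 4779, 21347142
total = TP + FP + FN + TN

predicted_negative = FN + TN  # pairs filtered out as non-interacting
predicted_positive = TP + FP  # pairs left for experimental testing

filtered_fraction = predicted_negative / total
base_rate = (TP + FN) / total  # density of known PPIs among all pairs
hit_rate = TP / predicted_positive  # density of known PPIs among retained pairs
enrichment = hit_rate / base_rate  # illustrative ratio, not from the paper

print(f"pairs filtered out: {filtered_fraction:.1%}")
print(f"pairs left to test: {predicted_positive} of {total}")
print(f"enrichment of PPIs among retained pairs: {enrichment:.1f}x")
```

Under these counts, roughly four fifths of all pairs are discarded before any experiment, while about 70% of the known PPIs (recall = 0.70) remain in the retained set.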

The same conclusion can also be drawn from the accuracy of the DI-only, MIC-only and FET-only methods (0.61, 0.58 and 0.57, respectively). The DI-only and FET-only (sequence-based) methods perform almost equally well, and their accuracy is also similar to that of the MIC-only method. This is important because these sequence-based features are simple and inexpensive to extract, as they do not require costly experiments. Hence, if experimental gene co-expression data are limited, the sequence-based features can be useful in predicting PPIs.

CONCLUSION

The descriptors used to encode each protein are extracted from three different types of protein information. These descriptors are critical to protein function and PPI detection, and include genomic context information (codon usage similarity), expression-based information (gene co-expression values) and structure-based information (conservation of surface residues). Together they provide a comprehensive description of each protein; in addition, the conservation of surface residues is a novel feature in the PPI prediction problem.

Our results confirmed the high potential of combining gene co-expression information, codon usage similarity and mutual constraint in surface residues of proteins to improve PPI prediction. We proposed a more robust PPI prediction method, designated PPIccc, which integrates two different types of data: experimental data (gene co-expression) and sequence information (codon usage and protein blocks of surface residues). A combination of experimental and sequence information is expected to enhance the performance of PPI prediction because it captures various informative aspects of the data and decreases the number of false positives. Sequence-based information (codon usage similarity and mutual constraint of protein surface residues) can considerably improve PPI prediction when combined with gene expression values, and it remains useful for PPI prediction when gene expression values are not considered. Other biological networks, in particular tissue-specific networks, and other protein features in machine-learning schemes can be evaluated in future studies.

ACKNOWLEDGMENTS

We greatly appreciate all the people who collaborated with us in this project, especially Dr. Javad Zahiri and Dr. Ali Najafi for their kind assistance, and Mr. Vahid Ashrafiyan for his help in programming. We are also very grateful for the support provided by the administrator of the computational cluster at the Iranian Institute of Research in Fundamental Sciences (IPM).

SUPPLEMENTARY FILES

The code script for exon extraction from the whole human genome, coding sequence of human genes, MIC matrix, results of MI and MIC-based ARACNE, the code script of real score for the codon usage similarity test, codon usage similarity results using FET, DCA code script, gene expression file after pre-processing, RSARF results for the detection of surface residues of human protein sequences, used protein blocks for DCA, and DI scores are available at this address: http://cbp.ut.ac.ir/PPIcococo/.

REFERENCES
 
© 2014 by The Genetics Society of Japan