Protein-protein interaction prediction by combined analysis of genomic and conservation information

Abbasali Emamjomeh; Bahram Goliaei; Ali Torkamani; Reza Ebrahimpour; Nima Mohammadi; Ahmad Parsian

doi:10.1266/ggs.89.259

ABSTRACT

Protein-protein interactions (PPIs) are highly important because of their central role in cellular processes and biochemical pathways; knowledge of PPIs can therefore be very useful in the prediction of protein functions. Experimental techniques for PPI detection have certain drawbacks; hence, computational methods can be used to complement wet-lab techniques. Such methods can be applied to PPI prediction as well as to validation of experimental results. However, computational algorithms can produce many false PPI predictions, which in turn result in inadequate performance. We have developed a novel method based on combined analysis, entitled PPIccc. PPIccc combines three descriptors to predict PPIs: gene co-expression values, codon usage similarity, and conservation of surface residues between the protein products of a gene pair. Validation of the results against the Human Protein Reference Database (HPRD) indicated improved performance for our proposed method. The results also revealed that conservation of surface residues between proteins, in combination with codon usage similarity of their related genes, increases the performance of PPI prediction. This means that codon usage similarity and surface residue conservation between proteins (sequence-based features only) can predict PPIs as well as PPIccc.

INTRODUCTION

Genome sequencing projects have already been completed for a large number of species, and the complete genome sequences of many more species will soon be added to this list. These genomic data have led to new insights into functional processes in biological macromolecules such as proteins. Systems biology, with particular emphasis on network reconstruction, plays a crucial role in the discovery of the biological functions of such macromolecules (Franzosa et al., 2009). Network reconstruction has different applications in the analysis of biological data; for example, such techniques have been used for the analysis of gene expression data in co-expression networks (Torkamani et al., 2010), identification of certain trends in mutational data (Torkamani and Schork, 2009), recognition of differences between biological information and disease pathways across species (Miller et al., 2010), and identification of regulatory networks and transcriptional relationships (Wang et al., 2009). In all biological networks, correlation patterns are used to infer genetic relationships. The basic logic behind the reconstruction of these networks is that correlation patterns indicate common relationships among biological elements; therefore, a biological relationship can be inferred from such correlations. Protein-protein interaction (PPI) is one of the most important biological relationships because of the significance of proteins in all living organisms. This importance is due not only to the individual activities of proteins but also to their specific interactions with other proteins (Sharon et al., 2009). Indeed, some components of protein complexes cannot be used in the cell unless they are in contact with other components of the complex (Zhang et al., 2004).
Detection of PPIs has various applications in biology, for example in drug design (Wells and McClendon, 2007), mapping of signaling pathways in cells so as to better understand signal transduction in physiopathological processes (Pawson and Nash, 2000), prediction of PPIs between species to find therapeutic strategies (Emamjomeh et al., 2014) and prediction of protein functions (Hou and Chi, 2012). Therefore, PPI network reconstruction will be highly useful in gaining a better understanding of molecular mechanisms in cells (Theofilatos, 2011).

There are different in vitro techniques for the detection of physical PPIs. Essentially, two main categories of in vitro techniques are used to recognize PPIs in the wet lab (Shoemaker and Panchenko, 2007b): low-throughput (e.g. X-ray crystallography, fluorescence resonance energy transfer, surface plasmon resonance and atomic force microscopy) and high-throughput (e.g. yeast two-hybrid system, affinity purification-mass spectroscopy, DNA microarray, protein chips, synthetic lethality and phage display) approaches. There is a gap between experimentally detected PPIs and real ones (Zahiri et al., 2013a), and the results of in vitro methods have some further shortcomings. For example, bias makes the detected PPIs more inclined toward certain specific proteins such as globular proteins. Furthermore, in vitro methods can usually recognize only permanent PPIs and therefore cannot detect all PPIs (Zahiri et al., 2013b). Generally, network reconstruction approaches are very successful in unveiling regulatory relationships and other interesting biological phenomena, but they may lead to a large number of false-positive interactions (Mahdavi and Lin, 2007). It has been shown that the performance of such methods can be improved by computational methods (Rhodes et al., 2005), hence the incentive for the emergence of computational methods to predict PPIs. Computational methods are regarded as a complement to in vitro methods; in fact, a combination of experimental and computational methods can outperform either approach used alone, because it reduces the rate of false positives (Mahdavi and Lin, 2007; Shoemaker and Panchenko, 2007a). There are different classes of PPI prediction methods:

1. Machine learning-based methods, including random forest (Chen and Liu, 2005), support vector machines (SVM) (Ben-Hur and Noble, 2005; Lo et al., 2005; Shen et al., 2007), naïve Bayes (Lu et al., 2005) and multilayer perceptron (MLP) (Keedwell and Narayanan, 2005). Different sequence-based and non-sequence-based features are used to train these methods.

2. Genomic context and structure of proteins-based methods; for example gene co-expression (Ideker et al., 2002), three-dimensional structural information (Aloy et al., 2004; Aloy and Russell, 2003), gene neighboring (Ideker et al., 2002), gene fusion (Enright et al., 1999) and phylogenetic relationships (Jothi et al., 2005).

3. Network topology-based methods (Chen et al., 2006; Liu et al., 2008).

4. Text mining or literature mining-based methods (Jaeger et al., 2008; Oyama et al., 2002).

Apart from the above-mentioned methods, codon usage can also be used to predict PPIs. In fact, codon usage similarity between two genes can be applied to recognize co-expressed genes in yeast (Jansen et al., 2003). Co-expressed genes also have similar synonymous codon usage in human and some other living organisms (Najafabadi et al., 2009). It has also been stated that the codon usage of functionally and physically interacting proteins in a living organism is informative for predicting PPIs (Najafabadi and Salavati, 2008). The evidence for this claim is that the codon usage of interacting protein pairs differs significantly from that of randomly chosen pairs (Zhou et al., 2012). This relationship may be due to function-specific codon usage, which is based on selective charging of the tRNA isoacceptors (Elf et al., 2003). The relationship has also been confirmed experimentally (Dittmar et al., 2005).

Physical PPIs are formed by surface amino acid residues of the two proteins (Fraser et al., 2004), and not all residue combinations are equally acceptable as contact residues in a PPI (Lunt et al., 2010). Structural details of PPIs have been conserved between homologous proteins in related species, and such surface residues are constrained at the interface between protein pairs (Lunt et al., 2010). Hence, PPIs can be predicted by identifying mutually-constrained surface residues between interacting protein partners (Schug et al., 2009; Szurmant et al., 2008). This suggests that a PPI prediction approach based on mutually-constrained residues is possible (Procaccini et al., 2011); i.e. if two proteins share mutually-constrained residues across multiple species, then it can be inferred that those two proteins interact. It should be noted that mutually-constrained residues between interacting protein partners cannot be detected correctly using only one species (Morcos et al., 2011).

Reconstruction of biological networks can be achieved with different types of data. In our proposed method, we integrated data related to different levels of the central dogma to reconstruct PPI networks (Chen et al., 2001). In this work, we relied upon PPI prediction using the similarity between two genes in terms of their co-expression values and codon usage, together with the identification of mutually-constrained surface residues between the protein products of those two genes. Indeed, gene pairs with a high degree of gene expression correlation and codon usage similarity, in addition to mutually-constrained residues across related species, can be excellent candidates for PPI prediction. Three descriptors, representing different properties of a protein that have been shown to be very important in protein function and PPI detection, were used to characterize the protein: genomic context information (codon usage similarity), transcription-based information (gene co-expression values), and structure-based information (conservation of surface residues between the protein products of a gene pair). In addition to providing comprehensive information about the protein, these descriptors include the conservation of surface residues, which is a novel feature in the PPI prediction problem.

The aim of this work is PPI prediction where these three types of data are taken into account. Accordingly, this study has led to the development of a novel method called PPI prediction by integration of Co-expression, Codon usage and Conservation data (PPIccc, pronounced: PPI triple c), which can predict PPIs using integration of three descriptors (gene expression data, codon usage analysis and conserved regions of protein surface residues).

MATERIALS AND METHODS

It is expected that integration of gene expression values, codon usage similarity analysis and information related to evolutionarily conserved regions in the surface residues of proteins can be applied to high-performance PPI prediction (Fig. 1A). This work relied upon three main steps: determination of gene co-expression on a pre-processed gene dataset, computation of codon usage-based gene similarity for the same pre-processed gene dataset, and calculation of mutual constraint between surface residues of the related proteins. In the final step, PPIs are predicted using five integrated and four non-integrated methods (nine methods in total). Then, the performance of the nine methods is evaluated against a gold standard database. The Human Protein Reference Database, HPRD (Prasad et al., 2009), was used as the reference database for validation of the results. We used the PPIs that were confirmed by at least two different experimental methods in this database (Fig. 1B). This subset includes 15,763 out of a total of 39,240 interactions; the total number of proteins was 5,632.

Fig. 1.

Overview of PPIccc method. (a) Usage of different levels of central dogma for PPI network reconstruction. Integration of codon usage similarity analysis with co-expressed genes (identified using microarray data) and mutually-constrained surface residues for each gene pair were applied to predict PPI. It finally leads to PPI network reconstruction. (b) HPRD database was applied for validation of results and discovery of accurate PPI.

Extracting and pre-processing of datasets

In the first step, the raw files (.CEL files) of five melanoma datasets (namely, GSE8401, GSE22083, GSE12445, GSE12627 and GSE9118), consisting of 298 samples, were extracted from GEO/NCBI (Eskandarpour et al., 2009; Harlin et al., 2009; Muthusamy et al., 2006; Tock et al., 2011; Xu et al., 2008). These datasets were selected because of the similarity in their experimental conditions. The microarray platform of the melanoma datasets was GPL96, the [HG-U133A] Affymetrix Human Genome U133A Array (Fig. 2A). Pre-processing steps must be performed before combining these datasets, in order to remove statistical biases and construct a corrected, combined dataset (Sims et al., 2008). The pre-processing phase consisted of three steps applied to the raw files (.CEL files):

Fig. 2.

Pre-processing steps for microarray datasets. (a) Extracting of .CEL files related to 5 melanoma datasets. (b) Removing of bad probe-sets using cleaner1.03 (c) Data normalization and Presence/Absence calls (P/A calls) using MAS5.0 (d) Batch effect removal by ComBat. (e) Present/Absent genes after pre-processing steps.

a) Probe cleaning by Cleaner1.03 for the purpose of probe filtering and removal of low quality probe-sets (Alvarez et al., 2009) (Fig. 2B).

b) Data normalization (Hubbell et al., 2002) and Presence/Absence calls (P/A calls) for detection of present and absent genes (Warren, 2010) using MAS5.0 (Fig. 2C).

c) Removal of batch effects by ComBat (Johnson et al., 2007) (Fig. 2D). The final dataset consisted of expression values related to 7569 human genes (Fig. 2E).

Acquisition of co-expressed genes

The Algorithm for the Reconstruction of Accurate Cellular Networks, ARACNE (Margolin et al., 2006), was run on the pre-processed melanoma dataset because it has a low false-positive rate. In this package, co-expressed genes are recognized after computation of mutual information (MI) between each gene pair and removal of the weakest interaction in each gene triplet. MI is a good metric for determining expression similarity between two genes and is preferred to the Pearson correlation coefficient, since the latter can reveal only linear and direct relationships (Daub et al., 2004). The mutual information of two variables X and Y is calculated as:

$$I(X;Y) = \sum_{i,j} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i)\,P(y_j)},$$

(1)

where P(x_i, y_j) is the joint probability, and P(x_i) and P(y_j) are the marginal probabilities, of the expression values of genes X and Y. The maximal information coefficient (MIC) has been suggested as another metric to determine expression similarity between two genes (Reshef et al., 2011). MIC can evaluate all types of functional relationships between two genes, and also gives similar scores to equally noisy relationships. We computed the MIC matrix between each gene pair by using the maximal information-based non-parametric exploration (MINE) software. This MIC matrix was also used as an input file for ARACNE. The ARACNE parameters were set to the following values: kernel width using the accurate method = 0.15, MI threshold = 0.04 and DPI = 0.01. At the end of this step, we had produced two output files for gene co-expression using ARACNE: one consisting of co-expressed genes based on MI and another consisting of co-expressed genes based on MIC. In the final step, these two gene co-expression matrices were used for PPI prediction.
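As a small illustration of Eq. 1, the sketch below estimates MI from two expression profiles that have already been discretized into bins. This is only a plug-in estimate on hypothetical toy data: ARACNE itself estimates MI with a kernel method, the function name `mutual_information` is ours, and the natural logarithm is assumed because the source does not state the base.

```python
import math
from collections import Counter

def mutual_information(x_bins, y_bins):
    """Plug-in estimate of Eq. 1 (in nats) from two equal-length
    sequences of discretized expression levels."""
    n = len(x_bins)
    joint = Counter(zip(x_bins, y_bins))   # counts for P(x_i, y_j)
    px = Counter(x_bins)                   # counts for P(x_i)
    py = Counter(y_bins)                   # counts for P(y_j)
    mi = 0.0
    for (xi, yj), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[xi] / n) * (py[yj] / n)))
    return mi

# Hypothetical binned expression profiles over four samples.
gene_x = [0, 0, 1, 1]
gene_y = [0, 0, 1, 1]   # fully determined by gene_x
gene_z = [0, 1, 0, 1]   # statistically independent of gene_x

print(mutual_information(gene_x, gene_y))  # equals ln 2
print(mutual_information(gene_x, gene_z))  # equals 0
```

Unlike the Pearson coefficient, this quantity is zero only when the two profiles are independent, whatever the shape of the dependence.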

Computation of codon usage similarity between genes

This phase consisted of four steps (Fig. 3): a) extraction of coding region sequences for human genes; b) computation of the real score for codon usage similarity of two genes using the Fisher exact test (FET) (Conniffe, 1991) and the Fisher combined probability test (Fisher’s method); c) calculation of the final p-value by iterated simulation of human gene sequences; and d) final comparison of codon usage similarity between human genes, i.e. the statistical significance test.

Fig. 3.

Overview of Computation of codon usage similarity between two genes. (a) Extraction of coding sequences of human genes. (b) Real score calculation for codon usage similarity of two genes using FET. (c) Iterated substitution of human genes' sequences. (d) Calculation of final p-value for comparison of two genes as codon usage. (e) Overall flowchart for computation of final p-value to compare two genes in the light of codon usage.

Extraction of coding sequences of human genes

Coding region sequences of human genes are extracted for codon usage similarity analysis of each gene pair. For this purpose, the whole genome of Homo sapiens and its related known gene file containing the entire information of human genes were downloaded from the genome browser at the University of California, Santa Cruz (UCSC) (Kent et al., 2002) (Fig. 3A). Afterwards, coding region sequences of human genes are trimmed based on known gene file information, e.g. chromosomal location, strand orientation, and number of exons and the start/end points of exons.

Real score for codon usage similarity of two genes

FET is a statistical test for the analysis of contingency tables with small sample sizes, and can be used to test the codon usage similarity of two genes (Plotkin et al., 2004). We used FET to calculate the real score for codon usage similarity of two genes (Fig. 3B). In this test, a p-value is calculated separately for each amino acid from the absolute frequencies of its synonymous codons in the two genes, and finally the p-values for all amino acids of the two genes are combined by Fisher’s method as follows:

$$\text{Combined p-value} = -2 \sum_{i=1}^{k} \ln(p_i),$$

(2)

where p_i is the p-value for the i-th amino acid and k is the total number of amino acids that have synonymous codons. The combined p-value is considered as the real score between two genes (Table 1).

Table 1. Calculation of combined p-value using FET and Fisher’s method for comparison of two putative genes in the light of codon usage similarity

Codon	Gene 1	Gene 2	Amino acid	P-value	Combined P-value
CUU	12	8	Leu	0.071	$-2\sum_{i=1}^{k}\ln(p_i) = 31.57$
CUC	7	12
CUA	3	14
CUG	7	10
AUU	4	7	Ile	0.049
AUC	16	7
AUA	15	24
AAU	10	25	Asn	4.00E-05
AAC	22	5
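The combined score in Table 1 can be checked by applying Eq. 2 to the three per-amino-acid p-values listed there. The sketch below does only that combining step; the per-amino-acid FET p-values are taken from the table as given and are not recomputed, and the function name `fisher_combined_score` is an illustrative choice.

```python
import math

def fisher_combined_score(p_values):
    """Eq. 2: -2 times the sum of ln(p_i) over the per-amino-acid p-values."""
    return -2.0 * sum(math.log(p) for p in p_values)

# Per-amino-acid FET p-values as listed in Table 1.
table1_p_values = {"Leu": 0.071, "Ile": 0.049, "Asn": 4.00e-05}

score = fisher_combined_score(table1_p_values.values())
print(f"{score:.3f}")  # about 31.575, reported as 31.57 in Table 1
```

The smaller the individual p-values, the larger the combined score, so a large real score indicates a pronounced difference in synonymous codon frequencies between the two genes.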

Iterated substitution of human genes sequences

After computation of the real score for codon usage similarity of each gene pair, a large number of random scores were generated for each gene pair (Fig. 3C). To this end, we produced numerous putative DNA sequences for each gene by substituting synonymous codons for each amino acid without changing the protein sequence of the original gene. The DNA sequence simulation was thus based on the assumption that no change occurs in the protein sequence related to each transcript. The parameter N, set to 10⁶, is the number of random sequences that was considered sufficient for our sequence simulation. The combined p-value was calculated for each randomly generated gene sequence using FET and Fisher’s method (i.e. random scores). Subsequently, the number of times, n, that random scores were greater than the real score for each gene pair was counted (Fig. 3D). This process is depicted schematically in Fig. 3E. The final p-value for each gene pair was calculated as follows (with a pseudocount of 1):

$$\text{Final p-value} = \frac{1+n}{1+N},$$

(3)

where N and n are as defined above. Most of the implemented methods take advantage of parallelization techniques and were run on the high performance computing (HPC) cluster at the Iranian Institute of Research in Fundamental Sciences (Tehran), which comprises ~400 computational cores. Due to the enormous size of the data and the limited resources, the tasks took approximately six months to complete.
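The simulation and Eq. 3 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the synonymous-codon table is a partial one covering only the amino acids of Table 1, the replacement codon is drawn uniformly at random (the source does not specify the sampling scheme), and the random scores are made-up numbers.

```python
import random

# Partial synonymous-codon table, limited to the amino acids shown in
# Table 1 (an illustrative subset, not the full genetic code).
SYNONYMOUS = {
    "Leu": ["CUU", "CUC", "CUA", "CUG", "UUA", "UUG"],
    "Ile": ["AUU", "AUC", "AUA"],
    "Asn": ["AAU", "AAC"],
}
CODON_TO_AA = {c: aa for aa, codons in SYNONYMOUS.items() for c in codons}

def randomize_synonymous(codons, rng):
    """Replace each codon by a uniformly drawn synonymous codon
    (assumed scheme), leaving the encoded protein unchanged."""
    return [rng.choice(SYNONYMOUS[CODON_TO_AA[c]]) for c in codons]

def final_p_value(real_score, random_scores):
    """Eq. 3: (1 + n) / (1 + N), where n counts the random scores
    greater than the real score and N is the number of random scores."""
    n = sum(1 for s in random_scores if s > real_score)
    return (1 + n) / (1 + len(random_scores))

rng = random.Random(0)
original = ["CUU", "AUC", "AAU", "CUG"]
simulated = randomize_synonymous(original, rng)
# The amino acid sequence is preserved by construction.
assert [CODON_TO_AA[c] for c in simulated] == [CODON_TO_AA[c] for c in original]

# Hypothetical random scores: 3 of the 9 exceed the real score of 31.57.
random_scores = [10.2, 45.0, 28.9, 33.1, 5.4, 19.7, 40.8, 22.3, 30.0]
print(final_p_value(31.57, random_scores))  # (1 + 3) / (1 + 9) = 0.4
```

The pseudocount keeps the final p-value strictly positive even when no random score exceeds the real one, so the smallest attainable value is 1/(1 + N).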

Final comparison of codon usage similarity

Null and alternative hypotheses (H₀ and H₁) of the codon usage similarity test indicated codon usage similarity and non-similarity of two genes, respectively. If the final p-value > 0.05 (high p-values, accepting H₀), it means that the two genes have similar codon usage, and if the final p-value ≤ 0.05 (low p-values, rejecting H₀), we can conclude that the two genes have different codon usage.

Calculating mutually-constrained conservation

It is necessary to use homologous proteins across related species for such prediction. It should also be mentioned that the correct identification of mutually-constrained surface residues between two proteins is based on known protein interactions; however, we need to computationally predict the likely residues involved in PPI.

For this purpose, all homologous sequences of each human protein in nine available animal species (Table 2) were obtained from the publicly available HomoloGene database at NCBI. This stage involved several steps (Fig. 4). The human protein sequences are the products of the genes in the pre-processed gene dataset from the previous step. The multiple sequence comparison by log-expectation (MUSCLE) algorithm (Edgar, 2004) was then run to obtain a multiple sequence alignment (MSA) of each human protein sequence and its homologous sequences (Fig. 4A). In parallel, the surface residues of the human proteins were detected by predicting solvent accessibility from the protein sequence with a random forest method, RSARF (Fig. 4B) (Pugalenthi et al., 2012). Furthermore, conserved blocks of surface residues were detected between each human protein sequence and its homologous sequences in each MSA (Fig. 4C). Accordingly, all possible pairs of conserved blocks were concatenated (Weigt et al., 2009) for similarity analysis of the surface residues of their proteins (Fig. 4D). Finally, direct coupling analysis (DCA) was carried out for all concatenated conserved blocks, and direct information (DI) values were calculated for all positions in each of them (Fig. 4E). DI values were calculated as follows (Lunt et al., 2010):

$$DI_{ij} = \sum_{A_i, A_j} P_{ij}^{(dir)}(A_i, A_j) \ln \frac{P_{ij}^{(dir)}(A_i, A_j)}{f_i(A_i)\, f_j(A_j)},$$

(4)

where DI_ij is the DI value between the i-th and j-th positions of the homologous protein sequences, f_i(A_i) and f_j(A_j) are the frequencies of residues A_i and A_j at the i-th and j-th positions, respectively, and P_ij^(dir)(A_i, A_j) is the direct pair distribution, which relates to two coupled variables with unique direct links.
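To make Eq. 4 concrete, the sketch below evaluates the sum for a single position pair once the direct pair distribution and the single-site frequencies are known. It does not perform the DCA inference that produces P_ij^(dir) from an alignment; the two-letter alphabet and all numbers are hypothetical, and the function name `direct_information` is ours.

```python
import math

def direct_information(p_dir, f_i, f_j):
    """Eq. 4 for one position pair (i, j).

    p_dir[a][b]: direct pair probability of residue a at i and b at j.
    f_i[a], f_j[b]: single-site residue frequencies at i and j.
    """
    di = 0.0
    for a, row in enumerate(p_dir):
        for b, p_ab in enumerate(row):
            if p_ab > 0.0:  # a zero-probability term contributes nothing
                di += p_ab * math.log(p_ab / (f_i[a] * f_j[b]))
    return di

# Hypothetical two-residue alphabet with uniform single-site frequencies.
f_i = [0.5, 0.5]
f_j = [0.5, 0.5]
uncoupled = [[0.25, 0.25], [0.25, 0.25]]  # product of the marginals
coupled = [[0.45, 0.05], [0.05, 0.45]]    # residues tend to co-occur

print(direct_information(uncoupled, f_i, f_j))          # 0.0
print(round(direct_information(coupled, f_i, f_j), 3))  # 0.368
```

A position pair whose direct pair distribution factorizes into its marginals carries no direct information, while stronger co-occurrence of particular residue pairs raises the DI value.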

Table 2. Species used from the HomoloGene database

Scientific name	English name	NCBI taxonomy ID
Mus musculus (M. musculus)	House mouse	10090
Rattus norvegicus (R. norvegicus)	Rat	10116
Danio rerio (D. rerio)	Zebrafish	7955
Gallus gallus (G. gallus)	Red junglefowl	9031
Macaca mulatta (M. mulatta)	Rhesus macaque	9544
Pan troglodytes (P. troglodytes)	Common chimpanzee	9598
Homo sapiens (H. sapiens)	Human	9606
Canis lupus familiaris (C. lupus)	Dog	9615
Bos taurus (B. taurus)	Cattle	9913

Fig. 4.

An overview for calculating of mutually-constrained conservation. (a) Finding of homologous sequences of human proteins in its relatives using Homologene/NCBI database. (b) Determination of surface residues in each human protein using RSARF program. (c) Determination of conserved blocks in surface residues between each human protein and its homologous sequences in human’s relatives. (d) Concatenation of each conserved blocks pair of surface residues related to two proteins. (e) Calculation of DI for each concatenated blocks using DCA.

The DI value measures the amount of direct coupling between two amino acid residues (positions i and j in a conserved block); it is the part of MI that arises from the direct coupling of two surface residues. We took the maximum DI value among all residue pairs of two proteins as the DI between the surface residue blocks of those two proteins. DI can be considered analogous to MI: high (significant) values of DI, i.e. strong direct coupling between two residues of a protein pair, might be exploited to predict physical contact for that protein pair. To determine a statistical significance threshold for DI, a ROC curve was plotted, which visually illustrates the sensitivity/specificity trade-off at varied thresholds (Fawcett, 2006). The ROC curve was drawn from the true and false positive rates at different DI thresholds ranging from 0 to 1 (0.01, 0.02, …, 0.99, 1). We then determined the optimum point of the ROC curve as the point at which the gradient of the curve equals 5% of its maximum gradient. This optimum point, which combines a high true positive rate with a low false positive rate, was used to set the threshold: the DI threshold corresponding to the optimum point of the ROC plot was taken as the statistical significance threshold of DI.
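One possible discrete reading of this threshold rule is sketched below. It assumes a roughly concave ROC curve, approximates the gradient by the slope of the segment between consecutive ROC points, and returns the threshold at which that slope first drops to 5% of the maximum slope. The function names and the ROC points are hypothetical and chosen only for illustration; the source does not describe how the gradient was computed.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, FPR, TPR) per threshold; a pair is predicted
    as interacting when its DI score is >= the threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / neg, tp / pos))
    return points

def saturation_threshold(points, fraction=0.05):
    """Threshold at the start of the first ROC segment whose slope is
    at most `fraction` of the maximum segment slope (assumed rule)."""
    ordered = sorted(points, key=lambda p: (p[1], p[2]))  # by FPR, then TPR
    segments = []
    for (t0, f0, r0), (_, f1, r1) in zip(ordered, ordered[1:]):
        if f1 > f0:  # skip vertical steps, whose slope is undefined
            segments.append((t0, (r1 - r0) / (f1 - f0)))
    if not segments:
        return None
    max_slope = max(slope for _, slope in segments)
    for t0, slope in segments:
        if slope <= fraction * max_slope:
            return t0
    return ordered[-1][0]

# Hypothetical ROC points (threshold, FPR, TPR), not taken from Fig. 6.
curve = [
    (0.8, 0.00, 0.00),
    (0.6, 0.05, 0.50),
    (0.4, 0.10, 0.80),
    (0.2, 0.30, 0.95),
    (0.1, 1.00, 1.00),
]
print(saturation_threshold(curve))  # 0.2 for these toy points
```

With real data, `roc_points` would be called on the per-pair DI scores and the gold-standard labels over the threshold grid 0.01, 0.02, …, 1 before the saturation point is located.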

PPI prediction methods

Finally, we predicted PPIs using five methods involving different integration of gene co-expression data, codon usage similarity of genes and mutual constraint of surface residues for proteins. We also predicted PPIs using four non-integrated methods (nine methods in total). Below is a detailed description of the PPI prediction methods:

I. PPI prediction only by DI values calculated between conserved blocks of each protein pair, which includes protein pairs with significant DI between their conserved blocks (only high DI values).

II. PPI prediction only by MI-based ARACNE results, which includes protein pairs with high MI between expression values of their genes (only high MI values).

III. PPI prediction only by MIC-based ARACNE results, which includes protein pairs with high MIC between expression values of their genes (only high MIC values).

IV. PPI prediction only by codon usage similarity between their genes, which includes protein pairs with significant similarity of codon usage between their genes (high p-value of FET).

V. PPI prediction by integration of MI-based ARACNE results and mutual constraint of surface residues for proteins, which includes protein pairs with significant similarity in mutual constraint of their surface residues (high DI), and high MI between expression of their genes (high MI).

VI. PPI prediction by integration of MIC-based ARACNE results and mutual constraint of surface residues for proteins, which includes protein pairs with significant similarity in mutual constraint of their surface residues (high DI) and high MIC between expression of their genes (high MIC).

VII. PPI prediction by integration of codon usage similarity and mutual constraint of surface residues for proteins, which includes protein pairs with significant similarity in mutual constraint of their surface residues (high DI) and significant similarity of codon usage between their genes (high p-value of FET).

VIII. PPI prediction by integration of MI-based ARACNE results, codon usage similarity and mutual constraint of surface residues for proteins, which includes protein pairs with significant similarity in mutual constraint of their surface residues (high DI), significant similarity of codon usage between their genes (high p-value of FET) and high MI between their gene expressions (high MI).

IX. PPI prediction by integration of MIC-based ARACNE results, codon usage and mutual constraint of surface residues for proteins, which includes protein pairs with significant similarity in mutual constraint of their surface residues (high DI), significant similarity of codon usage between their genes (high p-value of FET) and high MIC between their gene expressions (high MIC).

At the end, the performances of these methods are evaluated, yielding the best method.

Validation of results

HPRD is a very useful and common database for evaluation of performance in PPI prediction methods. We used HPRD as a reliable gold standard database to validate and make a comparison between results of the above-mentioned nine different PPI prediction methods.

RESULTS

After the three pre-processing steps on the gene expression dataset, four plots were produced by the ComBat software to check whether the primary assumptions were correct (Fig. 5). As can be inferred from these plots, the primary assumptions held, and hence the pre-processed dataset could be used in the next steps. In addition, we checked the quality of the codon usage similarity results using the Human Leukocyte Antigen (HLA) gene family located on chromosome 6. There were a few HLA genes in our dataset; these genes are good positive controls for a quality test of codon usage similarity. A pair of genes belonging to the HLA family is expected to show similar codon usage in the similarity test. The FET results for the HLA genes in our dataset indicated high similarity between these genes, and therefore FET results of appropriate quality are expected for other genes (Table 3). From the optimum point of the ROC curve, determined as described previously, the DI statistical significance threshold was set to 0.2 (Fig. 6). Subsequently, each PPI prediction falls into one of the four possible outcomes described below, on which the validation and evaluation of PPIccc are based.

Fig. 5.

Checking for primary assumptions of distribution in pre-processed dataset using ComBat. (a) Additive batch parameters for all genes which have normal distribution. (b) Multiplicative batch parameters for all genes which have gamma distribution. (c) Q-Q plot for Additive batch parameters of all genes. (d) Q-Q plot for multiplicative parameters of all genes.

Table 3. FET results for HLA gene family

Gene 1	Gene 2	Similarity test result
HLA-A	HLA-B	similar
HLA-A	HLA-C	similar
HLA-A	HLA-DMB	similar
HLA-A	HLA-E	similar
HLA-A	HLA-F	similar
HLA-A	HLA-G	similar
HLA-C	HLA-DMB	similar
HLA-DMB	HLA-E	similar
HLA-DMB	HLA-G	similar

Fig. 6.

ROC plot for determination of the DI statistical significance threshold. The arrow shows the optimum point of the curve. The optimum point of a saturating curve is the point at which the gradient of the curve equals 5% of its maximum gradient. DI values ≥ 0.2 are significant.

• True Positive (TP): Number of correct PPI predictions (or correctly accepted PPI predictions).

• True Negative (TN): Number of correct predictions of non-interacted proteins (or correctly rejected PPI predictions).

• False Positive (FP): Number of non-interacted proteins which have been falsely predicted as interacted proteins (or incorrectly accepted PPIs).

• False Negative (FN): Number of interacted proteins which have been falsely predicted as non-interacted proteins (or incorrectly rejected PPIs).

Confusion matrix, which is a specific table that allows evaluation of performance of a method (Stehman, 1997), was computed for PPIccc and other possible methods (Table 4).

Table 4. Confusion matrix for nine methods

Method of integration	Predicted class	Actual PPI	Actual non-PPI
DI-only	PPI	12701 (TP)	10245714 (FP)
DI-only	Non-PPI	3062 (FN)	16088693 (TN)
MI-only	PPI	15324 (TP)	12543856 (FP)
MI-only	Non-PPI	439 (FN)	13790551 (TN)
MIC-only	PPI	14789 (TP)	11066274 (FP)
MIC-only	Non-PPI	974 (FN)	15268133 (TN)
FET-only	PPI	14906 (TP)	11329717 (FP)
FET-only	Non-PPI	857 (FN)	15004690 (TN)
MI + DI	PPI	14084 (TP)	6988155 (FP)
MI + DI	Non-PPI	1679 (FN)	19346252 (TN)
MIC + DI	PPI	13906 (TP)	6330248 (FP)
MIC + DI	Non-PPI	1857 (FN)	20004159 (TN)
FET + DI	PPI	12582 (TP)	5286265 (FP)
FET + DI	Non-PPI	3181 (FN)	21048142 (TN)
MI + FET + DI	PPI	12945 (TP)	5756916 (FP)
MI + FET + DI	Non-PPI	2818 (FN)	20577491 (TN)
MIC + FET + DI (PPIccc)	PPI	10984 (TP)	4987265 (FP)
MIC + FET + DI (PPIccc)	Non-PPI	4779 (FN)	21347142 (TN)

DI-only means PPI prediction only by DI values; MI-only means PPI prediction only by gene co-expression using MI-based ARACNE; MIC-only means PPI prediction only by gene co-expression using MIC-based ARACNE; FET-only means PPI prediction only by codon usage similarity; MI + DI means PPI prediction by integration of DI values and gene co-expression using MI-based ARACNE; MIC + DI means PPI prediction by integration of DI values and gene co-expression using MIC-based ARACNE; FET + DI means PPI prediction by integration of DI values and codon usage similarity; MI + FET + DI means PPI prediction by integration of DI values, gene co-expression using MI-based ARACNE and codon usage similarity; MIC + FET + DI means PPI prediction by integration of DI values, gene co-expression using MIC-based ARACNE and codon usage similarity.

Protein-protein interaction networks are very sparse: the number of interacting protein pairs is far smaller than the number of possible protein pairs in a proteome. Hence, interaction data are highly imbalanced and can impose an unwanted bias on the classification problem. In such situations, several performance measures should be used to assess the classification results. For imbalanced data, the F-measure provides more insight into the behavior of a classifier than other metrics such as accuracy (He and Garcia, 2009). Recall is the proportion of real PPIs that are correctly detected by a method, whereas specificity is the proportion of non-interacting pairs that are correctly recognized as negative. These metrics are calculated as follows:

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{5} \]

\[ \mathrm{Specificity} = \frac{TN}{TN + FP} \tag{6} \]

The highest values of recall (0.97) and specificity (0.81) were obtained with MI-only and PPIccc, respectively. Another important performance metric is accuracy, which denotes the closeness of the predicted results to the actual (true) results. We used accuracy for the final decision and selection among the different methods of PPI prediction. PPIccc had the highest accuracy (0.81), but only marginally (Fig. 7). Because of the imbalance in the dataset, accuracy and specificity were almost the same for each method. Accuracy is calculated by the following formula:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7} \]
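
The near-equality of accuracy and specificity on this dataset can be seen by rewriting the accuracy formula. The short derivation below is only an illustration; the symbols N (total number of protein pairs) and \pi (fraction of pairs that are true PPIs) are notation introduced here and do not appear in the original text.

\[
N = TP + TN + FP + FN, \qquad \pi = \frac{TP + FN}{N}
\]

\[
\mathrm{Accuracy} = \frac{TP + TN}{N}
= \frac{TP + FN}{N}\cdot\frac{TP}{TP + FN} + \frac{TN + FP}{N}\cdot\frac{TN}{TN + FP}
= \pi\,\mathrm{Recall} + (1 - \pi)\,\mathrm{Specificity}
\]

In Table 4, TP + FN = 15763 for every method and N = 26350170, so \pi \approx 6.0 \times 10^{-4}. The recall term therefore contributes almost nothing, and accuracy is essentially equal to specificity.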

Fig. 7.

Comparison of accuracy between different methods. The highest value of accuracy is related to our proposed method (PPIccc).

Another metric is the geometric mean (GM) of the true positive rate (recall) and the true negative rate (specificity), which balances performance on the positive and negative classes. PPI prediction by the MIC + DI method had the highest GM (0.82). The F-measure, another metric for evaluating PPI prediction, was highest for PPIccc. The GM and F-measure are defined as:

\[ \mathrm{GM} = \sqrt{\mathrm{Specificity} \cdot \mathrm{Recall}} \tag{8} \]

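\[ F\text{-measure} = \frac{2\,(\mathrm{Recall} \cdot \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}} \tag{9} \]

where Precision = TP/(TP + FP).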
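
As a minimal sketch, Eqs. 5 to 9 can be evaluated directly from the confusion-matrix counts in Table 4. The function name `classification_metrics` and the dictionary keys are illustrative choices, not part of the original work, and precision is assumed to take its standard definition TP/(TP + FP). Values recomputed this way may differ from the published ones in the last digit, depending on how the originals were rounded.

```python
from math import sqrt


def classification_metrics(tp, fp, fn, tn):
    """Evaluate Eqs. 5-9 for one confusion matrix (hypothetical helper)."""
    recall = tp / (tp + fn)                      # Eq. 5
    specificity = tn / (tn + fp)                 # Eq. 6
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. 7
    precision = tp / (tp + fp)                   # assumed standard definition
    gm = sqrt(specificity * recall)              # Eq. 8
    f_measure = 2 * recall * precision / (recall + precision)  # Eq. 9
    return {
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "gm": gm,
        "f_measure": f_measure,
    }


# Counts copied from Table 4.
ppiccc = classification_metrics(tp=10984, fp=4987265, fn=4779, tn=21347142)
mic_di = classification_metrics(tp=13906, fp=6330248, fn=1857, tn=20004159)

for name, metrics in (("PPIccc", ppiccc), ("MIC + DI", mic_di)):
    summary = ", ".join(f"{key}={value:.4f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

The same call can be repeated for the other seven rows of Table 4 to compare methods on all metrics at once.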

These metrics were used to evaluate the performance of each method (Table 5). Based on the performance reported in this table, the combination of gene co-expression (using MIC-based ARACNE), codon usage similarity and DI values, together designated PPIccc, can predict PPIs with high performance.

Table 5. Performance metrics for nine methods of PPI prediction

Method No.	Method of PPI prediction	Recall	Specificity	GM	F-measure
I	DI-only	0.81	0.61	0.70	0.002
II	MI-only	0.97	0.52	0.71	0.002
III	MIC-only	0.94	0.58	0.74	0.003
IV	FET-only	0.95	0.57	0.73	0.003
V	MI + DI	0.89	0.73	0.81	0.004
VI	MIC + DI	0.88	0.76	0.82	0.004
VII	FET + DI	0.79	0.80	0.79	0.004
VIII	MI + FET + DI	0.82	0.78	0.80	0.004
IX	MIC + FET + DI (PPIccc)	0.70	0.81	0.75	0.005

DISCUSSION

The function of a protein depends on its interactions with other proteins. PPI detection is of the utmost importance in understanding and elucidating the regulatory mechanisms of cellular processes such as DNA replication, transcription and metabolic pathways.

In this study, we have proposed a method, named PPIccc, which integrates gene co-expression information (using MIC-based ARACNE), codon usage-based gene similarity and mutual constraint in surface residues of protein sequences for more accurate PPI prediction (accuracy = 0.81). This information can be obtained for any protein whose sequence, gene sequence and expression values are known. Integrating these types of information gives a higher ability to predict PPIs than non-integrative methods, which have a high rate of false positives (Yu et al., 2010). MIC-based ARACNE is used to predict co-expressed genes in PPIccc, but MI-based ARACNE can also produce acceptable results (accuracy = 0.78). Because the MI-based algorithm has lower computational costs than MIC-based ARACNE, MI can be applied instead of MIC to recognize gene co-expression when computational hardware is limited. Furthermore, PPIs can be predicted almost as well as with PPIccc (accuracy = 0.80) by considering only codon usage similarity and mutual constraint of protein surface residues, i.e., sequence-based features alone.

As shown in Table 4, the proposed method produces many false positives (4987265), but it assigns about 81% of all possible protein pairs to the non-interacting class ((4779 + 21347142)/26350170 ≈ 0.81), and it correctly recognizes 81% of the true non-interacting pairs (specificity = 0.81; Table 5). Therefore, our method is useful for filtering out a considerable number of non-interacting pairs from all possible protein pairs, and it can substantially reduce the experimental cost of testing protein pairs for interaction.
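
The filtering argument can be made concrete with a short calculation on the PPIccc counts from Table 4. This is a sketch only: the helper `screening_summary` and its output keys are hypothetical names introduced for illustration, and "pairs left to test" simply means the pairs predicted as PPI (TP + FP).

```python
def screening_summary(tp, fp, fn, tn):
    """Summarize how a classifier narrows the set of candidate pairs."""
    total = tp + fp + fn + tn
    predicted_positive = tp + fp   # pairs that would still go to the wet lab
    predicted_negative = fn + tn   # pairs filtered out by the classifier
    return {
        "total_pairs": total,
        "pairs_left_to_test": predicted_positive,
        "fraction_filtered_out": predicted_negative / total,
        "known_ppis_retained": tp / (tp + fn),
    }


# PPIccc row of Table 4.
summary = screening_summary(tp=10984, fp=4987265, fn=4779, tn=21347142)
for key, value in summary.items():
    print(f"{key}: {value:.4f}" if isinstance(value, float) else f"{key}: {value}")
```

Under these counts, roughly four fifths of all pairs are removed from consideration while about 70% of the HPRD interactions remain among the pairs still to be tested.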

The same conclusion can also be inferred from the accuracy of the DI-only, MIC-only and FET-only methods (0.61, 0.58 and 0.57, respectively). The DI-only and FET-only (sequence-based) methods perform almost identically, and their accuracy is also similar to that of the MIC-only method. This conclusion is important because extracting these sequence-based features is simple and inexpensive: they do not require costly experiments. Hence, if experimental data for gene co-expression are limited, the sequence-based features can be useful in predicting PPIs.

CONCLUSION

The descriptors used to encode each protein are extracted from three different types of protein information. These descriptors are critical to protein function and PPI detection, and include genomic context information (codon usage similarity), expression-based information (gene co-expression values) and structure-based information (conservation of surface residues). Beyond providing comprehensive information about the proteins, these descriptors include the conservation of surface residues, which is a novel feature in the PPI prediction problem.

Our results confirmed the high potential of combining gene co-expression information, codon usage similarity and mutual constraint in surface residues of proteins to improve PPI prediction. We proposed a more robust PPI prediction method, designated PPIccc, which integrates two different types of data: experimental data (gene co-expression) and sequence information (codon usage and protein blocks of surface residues). A combination of experimental and sequence information is expected to enhance the performance of PPI prediction, because it captures various informative aspects of the data and decreases the number of false positives. Sequence-based information (codon usage similarity and mutual constraint of protein surface residues) can significantly improve PPI prediction when combined with gene expression values. Sequence-based features are also useful for PPI prediction on their own, without gene expression values. Other biological networks, in particular tissue-specific networks, and other protein features in machine-learning schemes can be evaluated in future studies.

ACKNOWLEDGMENTS

We greatly appreciate all the people who collaborated with us in this project, especially Dr. Javad Zahiri and Dr. Ali Najafi for their kind assistance, and Mr. Vahid Ashrafiyan for his help in programming. We are also very grateful for the support provided by the administrator of the computational cluster at the Iranian Institute of Research in Fundamental Sciences (IPM).

SUPPLEMENTARY FILES

The code script for exon extraction from the whole human genome, coding sequence of human genes, MIC matrix, results of MI and MIC-based ARACNE, the code script of real score for the codon usage similarity test, codon usage similarity results using FET, DCA code script, gene expression file after pre-processing, RSARF results for the detection of surface residues of human protein sequences, used protein blocks for DCA, and DI scores are available at this address: http://cbp.ut.ac.ir/PPIcococo/.

REFERENCES

Aloy, P., and Russell, R. B. (2003) InterPreTS: protein interaction prediction through tertiary structure. Bioinformatics 19, 161–162.
Aloy, P., Bottcher, B., Ceulemans, H., Leutwein, C., Mellwig, C., Fischer, S., Gavin, A.-C., Bork, P., Superti-Furga, G., Serrano, L., and Russell, R. B. (2004) Structure-based assembly of protein complexes in yeast. Science 303, 2026–2029.
Alvarez, M. J., Sumazin, P., Rajbhandari, P., and Califano, A. (2009) Correlating measurements across samples improves accuracy of large-scale expression profile experiments. Genome Biol. 10, R143.
Ben-Hur, A., and Noble, W. S. (2005) Kernel methods for predicting protein-protein interactions. Bioinformatics 21(suppl 1), i38–i46.
Chen, J., Hsu, W., Lee, M. L., and Ng, S.-K. (2006) Increasing confidence of protein interactomes using network topological metrics. Bioinformatics 22, 1998–2004.
Chen, T., Filkov, V., and Skiena, S. S. (2001) Identifying gene regulatory networks from experimental data. Parallel Comput. 27, 141–162.
Chen, X.-W., and Liu, M. (2005) Prediction of protein-protein interactions using random decision forest framework. Bioinformatics 21, 4394–4400.
Conniffe, D. (1991) R. A. Fisher and the development of statistics - a view in his centenary year. Journal of the Statistical and Social Inquiry Society of Ireland 26, 55–108.
Daub, C. O., Steuer, R., Selbig, J., and Kloska, S. (2004) Estimating mutual information using B-spline functions–an improved similarity measure for analysing gene expression data. BMC Bioinformatics 5, 118.
Dittmar, K. A., Sorensen, M. A., Elf, J., Ehrenberg, M., and Pan, T. (2005) Selective charging of tRNA isoacceptors induced by amino-acid starvation. EMBO rep. 6, 151–157.
Edgar, R. C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797.
Elf, J., Nilsson, D., Tenson, T., and Ehrenberg, M. (2003) Selective charging of tRNA isoacceptors explains patterns of codon usage. Science 300, 1718–1722.
Emamjomeh, A., Goliaei, B., Zahiri, J., and Ebrahimpour, R. (2014) Predicting of protein–protein interactions between human and hepatitis C virus via an ensemble learning method. Mol. BioSyst. 10, 3147–3154. DOI:10.1039/c4mb00410h.
Enright, A. J., Iliopoulos, I., Kyrpides, N. C., and Ouzounis, C. A. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86–90.
Fawcett, T. (2006) An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874.
Franzosa, E., Linghu, B., and Xia, Y. (2009) Computational reconstruction of protein–protein interaction networks: algorithms and issues. In: Computational Systems Biology (eds.: McDermott, J., Samudrala, R., Bumgarner, R., Montgomery, K., and Ireton, R.), pp.89–100. Humana Press, New York.
Eskandarpour, M., Huang, F., Reeves, K. A., Clark, E., and Hansson, J. (2009) Oncogenic NRAS has multiple effects on the malignant phenotype of human melanoma cells cultured in vitro. Int. J. Cancer 124, 16–26.
Fraser, H. B., Hirsh, A. E., Wall, D. P., and Eisen, M. B. (2004) Coevolution of gene expression among interacting proteins. Proc. Natl. Acad. Sci. USA 101, 9033–9038.
Harlin, H., Meng, Y., Peterson, A. C., Zha, Y., Tretiakova, M., Slingluff, C., McKee, M., and Gajewski, T. F. (2009) Chemokine expression in melanoma metastases associated with CD8⁺ T-cell recruitment. Cancer Res. 69, 3077–3085.
He, H., and Garcia, E. A. (2009) Learning from imbalanced data. IEEE Trans. Knowledge and Data Eng. 21, 1263–1284.
Hou, J., and Chi, X. (2012) Predicting protein functions from PPI networks using functional aggregation. Math. Biosci. 240, 63–69.
Hubbell, E., Liu, W.-M., and Mei, R. (2002) Robust estimators for expression analysis. Bioinformatics 18, 1585–1592.
Ideker, T., Ozier, O., Schwikowski, B., and Siegel, A. F. (2002) Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18(suppl 1), S233–S240.
Jaeger, S., Gaudan, S., Leser, U., and Rebholz-Schuhmann, D. (2008) Integrating protein-protein interactions and text mining for protein function prediction. BMC Bioinformatics 9(suppl 8), S2.
Jansen, R., Bussemaker, H. J., and Gerstein, M. (2003) Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models. Nucleic Acids Res. 31, 2242–2251.
Johnson, W. E., Li, C., and Rabinovic, A. (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127.
Jothi, R., Kann, M. G., and Przytycka, T. M. (2005) Predicting protein-protein interaction by searching evolutionary tree automorphism space. Bioinformatics 21(suppl 1), i241–i250.
Keedwell, E., and Narayanan, A. (2005) Discovering gene networks with a neural-genetic hybrid. IEEE/ACM Trans. Comput. Biol. Bioinform. 2, 231–242.
Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., and Haussler, D. (2002) The human genome browser at UCSC. Genome Res. 12, 996–1006.
Liu, G., Li, J., and Wong, L. (2008) Assessing and predicting protein interactions using both local and global network topological metrics. Genome Inform. 21, 138–149.
Lo, S. L., Cai, C. Z., Chen, Y. Z., and Chung, M. C. (2005) Effect of training datasets on support vector machine prediction of protein-protein interactions. Proteomics 5, 876–884.
Lu, L. J., Xia, Y., Paccanaro, A., Yu, H., and Gerstein, M. (2005) Assessing the limits of genomic data integration for predicting protein networks. Genome Res. 15, 945–953.
Lunt, B., Szurmant, H., Procaccini, A., Hoch, J. A., Hwa, T., and Weigt, M. (2010) Inference of direct residue contacts in two-component signaling. Methods Enzymol. 471, 17–41.
Mahdavi, M. A., and Lin, Y.-H. (2007) False positive reduction in protein-protein interaction predictions using gene ontology annotations. BMC Bioinformatics 8, 262.
Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R. D., and Califano, A. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7(suppl 1), S7.
Miller, J. A., Horvath, S., and Geschwind, D. H. (2010) Divergence of human and mouse brain transcriptome highlights Alzheimer disease pathways. Proc. Natl. Acad. Sci. USA 107, 12698–12703.
Morcos, F., Pagnani, A., Lunt, B., Bertolino, A., Marks, D. S., Sander, C., Zecchina, R., Onuchic, J. N., Hwa, T., and Weigt, M. (2011) Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl. Acad. Sci. USA 108, E1293–E1301.
Muthusamy, V., Duraisamy, S., Bradbury, C. M., Hobbs, C., Curley, D. P., Nelson, B., and Bosenberg, M. (2006) Epigenetic silencing of novel tumor suppressors in malignant melanoma. Cancer Res. 66, 11187–11193.
Najafabadi, H. S., and Salavati, R. (2008) Sequence-based prediction of protein-protein interactions by means of codon usage. Genome Biol. 9, R87.
Najafabadi, H. S., Goodarzi, H., and Salavati, R. (2009) Universal function-specificity of codon usage. Nucleic Acids Res. 37, 7014–7023.
Oyama, T., Kitano, K., Satou, K., and Ito, T. (2002) Extraction of knowledge on protein-protein interaction by association rule discovery. Bioinformatics 18, 705–714.
Pawson, T., and Nash, P. (2000) Protein–protein interactions define specificity in signal transduction. Genes Dev. 14, 1027–1047.
Plotkin, J. B., Robins, H., and Levine, A. J. (2004) Tissue-specific codon usage and the expression of human genes. Proc. Natl. Acad. Sci. USA 101, 12588–12591.
Prasad, T. K., Goel, R., Kandasamy, K., Keerthikumar, S., Kumar, S., Mathivanan, S., Telikicherla, D., Raju, R., Shafreen, B., Venugopal, A., et al. (2009) Human protein reference database—2009 update. Nucleic Acids Res. 37, D767–D772.
Procaccini, A., Lunt, B., Szurmant, H., Hwa, T., and Weigt, M. (2011) Dissecting the specificity of protein-protein interaction in bacterial two-component signaling: orphans and crosstalks. PLoS One 6, e19729.
Pugalenthi, G., Kumar Kandaswamy, K., Chou, K.-C., Vivekanandan, S., and Kolatkar, P. (2012) RSARF: prediction of residue solvent accessibility from protein sequence using Random Forest method. Protein Pept. Lett. 19, 50–56.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011) Detecting novel associations in large data sets. Science 334, 1518–1524.
Rhodes, D. R., Tomlins, S. A., Varambally, S., Mahavisno, V., Barrette, T., Kalyana-Sundaram, S., Ghosh, D., Pandey, A., and Chinnaiyan, A. M. (2005) Probabilistic model of the human protein-protein interaction network. Nat. Biotechnol. 23, 951–959.
Schug, A., Weigt, M., Onuchic, J. N., Hwa, T., and Szurmant, H. (2009) High-resolution protein complexes from integrating genomic information with molecular simulation. Proc. Natl. Acad. Sci. USA 106, 22124–22129.
Sharon, I., Davis, J. V., and Yona, G. (2009) Prediction of protein–protein interactions: a study of the co-evolution model. In: Computational Systems Biology (eds.: McDermott, J., Samudrala, R., Bumgarner, R., Montgomery, K., and Ireton, R.), pp.61–88. Humana Press, New York.
Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., Chen, K., Li, Y., and Jiang, H. (2007) Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA 104, 4337–4341.
Shoemaker, B. A., and Panchenko, A. R. (2007a) Deciphering protein–protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput. Biol. 3, e43.
Shoemaker, B. A., and Panchenko, A. R. (2007b) Deciphering protein–protein interactions. Part I. Experimental techniques and databases. PLoS Comput. Biol. 3, e42.
Sims, A. H., Smethurst, G. J., Hey, Y., Okoniewski, M. J., Pepper, S. D., Howell, A., Miller, C. J., and Clarke, R. B. (2008) The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets–improving meta-analysis and prediction of prognosis. BMC Med. Genomics 1, 42.
Stehman, S. (1997) Selecting and interpreting measures of thematic classification accuracy. Remote Sens. Environ. 62, 77–89.
Szurmant, H., Bobay, B. G., White, R. A., Sullivan, D. M., Thompson, R. J., Hwa, T., Hoch, J. A., and Cavanagh, J. (2008) Co-evolving motions at protein–protein interfaces of two-component signaling systems identified by covariance analysis. Biochemistry 47, 7782–7784.
Theofilatos, K. A., Dimitrakopoulos, C. M., Tsakalidis, A. K., Likothanassis, S. D., Papadimitriou, S. T., and Mavroudi, S. P. (2011) Computational approaches for the prediction of protein-protein interactions: A survey. Current Bioinformatics 6, 398–414.
Tock, C. L., Turner, L. R., Altiner, A., Batra, P., Booher, S. L., Coelho, S. G., Warner, J. A., Therrien, J. P., Turner, M. L., Miller, S. A., et al. (2011) Transcriptional signatures of full-spectrum and non-UVB-spectrum solar irradiation in human skin. Pigment Cell Melanoma Res. 24, 972–974.
Torkamani, A., and Schork, N. J. (2009) Identification of rare cancer driver mutations by network reconstruction. Genome Res. 19, 1570–1578.
Torkamani, A., Dean, B., Schork, N. J., and Thomas, E. A. (2010) Coexpression network analysis of neural tissue reveals perturbations in developmental processes in schizophrenia. Genome Res. 20, 403–412.
Wang, K., Saito, M., Bisikirska, B. C., Alvarez, M. J., Lim, W. K., Rajbhandari, P., Shen, Q., Nemenman, I., Basso, K., Margolin, A. A., et al. (2009) Genome-wide identification of post-translational modulators of transcription factor activity in human B cells. Nat. Biotechnol. 27, 829–837.
Warren, P. (2010) Presence-Absence Calls on AffyMetrix HG-U133 Series Microarrays with panp. http://bioconductor.uib.no/2.6/bioc/vignettes/panp/inst/doc/panp.pdf.
Weigt, M., White, R. A., Szurmant, H., Hoch, J. A., and Hwa, T. (2009) Identification of direct residue contacts in protein–protein interaction by message passing. Proc. Natl. Acad. Sci. USA 106, 67–72.
Wells, J. A., and McClendon, C. L. (2007) Reaching for high-hanging fruit in drug discovery at protein–protein interfaces. Nature 450, 1001–1009.
Xu, L., Shen, S. S., Hoshida, Y., Subramanian, A., Ross, K., Brunet, J.-P., Wagner, S. N., Ramaswamy, S., Mesirov, J. P., and Hynes, R. O. (2008) Gene expression changes in an animal melanoma model correlate with aggressiveness of human melanoma metastases. Mol. Cancer Res. 6, 760–769.
Yu, J., Guo, M., Needham, C. J., Huang, Y., Cai, L., and Westhead, D. R. (2010) Simple sequence-based kernels do not predict protein–protein interactions. Bioinformatics 26, 2610–2614.
Zahiri, J., Hannon Bozorgmehr, J., and Masoudi-Nejad, A. (2013a) Computational prediction of protein–protein interaction networks: algorithms and resources. Curr. Genomics 14, 397–414.
Zahiri, J., Yaghoubi, O., Mohammad-Noori, M., Ebrahimpour, R., and Masoudi-Nejad, A. (2013b) Protein-protein interaction prediction from PSSM based evolutionary information. Genomics 102, 237–242.
Zhang, L. V., Wong, S. L., King, O. D., and Roth, F. P. (2004) Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 5, 38.
Zhou, Y., Zhou, Y. S., He, F., Song, J., and Zhang, Z. (2012) Can simple codon pair usage predict protein–protein interaction? Mol. BioSyst. 8, 1396–1404.
