2020 Volume 17 Pages 2-13
Protein functions can be predicted based on their three-dimensional structures. However, many multidomain proteins have unstable structures, making it difficult to determine the whole structure in biological experiments. Additionally, multidomain proteins are often decomposed and identified based on their domains, with the structure of each domain often found in public databases. Recent studies have advanced structure prediction methods of multidomain proteins through computational analysis. In existing methods, proteins that serve as templates are used for three-dimensional structure prediction. However, when no protein template is available, the accuracy of the prediction is decreased. This study was conducted to predict the structures of multidomain proteins without the need for whole structure templates.
We improved structure prediction methods by performing rigid-body docking from the structure of each domain and reranking a structure closer to the correct structure to have a higher value. In the proposed method, the score for the domain-domain interaction obtained without a structural template of the multidomain protein and score for the three-dimensional structure obtained during docking calculation were newly incorporated into the score function. We successfully predicted the structures of 50 of 55 multidomain proteins examined in the test dataset.
Interaction residue pair information of the protein-protein complex interface contributes to domain reorganizations even when a structural template for a multidomain protein cannot be obtained. This approach may be useful for predicting the structures of multidomain proteins with important biochemical functions.
More than half of prokaryote and eukaryote genes produce multidomain proteins with multiple partial structures known as protein domains [1,2]. Each protein domain is folded into a tight and stable structure that is conserved among different multidomain proteins .
Because protein functions differ depending on their three-dimensional (3D) structures, determining these structures is very important . Typically, the 3D structure of a protein is determined by X-ray crystallography analysis, nuclear magnetic resonance analysis, or electron microscopy. However, determining structures experimentally is difficult because multidomain proteins are generally difficult to crystallize [5,6]. Even if the entire 3D structure has not been elucidated, structure information for protein domains is collected in public databases such as the Protein Data Bank (PDB) . Therefore, experimental costs can be reduced by computationally predicting the whole structure based on the structure of each domain. Structure prediction methods include ab initio methods [8–10], template-based methods [11–13], and template-free methods [14–16].
Ab initio methods require high levels of computational resources to construct an entire protein structure, and thus it is difficult to predict the structures of all multidomain proteins using these methods . Template-based methods require homologous templates whose 3D structural information is known; when such proteins are not available or only low sequence similarity proteins are available, prediction accuracy may be low . Template-free methods are based on the property that protein domains are conserved even in multidomain proteins and can be used to predict structures by rigid-body docking with the 3D structure of protein domains. This method is superior to other methods in that the entire multidomain protein structure can be predicted from the 3D structure of protein domains stored in the PDB. Hirako, S. et al.  used individual domains of tertiary structures in two-domain proteins and structural docking tools for protein complexes to generate structural models for two-domain protein conformational prediction. They developed a scoring method for selecting an appropriate association model between two domains. However, the scoring method requires information on the structure of a homologous template multidomain protein and cannot be used when the template structures do not exist. The number of protein tertiary structures that can function as templates is continually increasing . A template structure may be sufficient for protein complex structure prediction ; still this approach may not be effective in multidomain proteins .
In this study, we developed a new scoring method, PINE for selecting a structural model of a two-domain protein even when template structures cannot be obtained. This method estimates the pattern of interaction between domains using dimeric protein-protein interaction (PPI) residue pair information. As a result, the association model of the two domains can be selected with high accuracy when a template structure cannot be obtained.
First, we obtained the structure of a protein with two domains from PDB. Next, we separated each domain at the linker region and predicted the original structure from these two domains. The structural model was generated using a rigid-body docking tool, and reranking (scoring) was performed for the generated structural model using a scoring function. By reranking, a structure close to the original structure was predicted to have a higher rank. A structure whose root mean square deviation (RMSD) with the original structure was within 10 Å (acceptable structure) in the top 10 positions was defined as a successful prediction (Fig. 1).
Problem setting. Step 1: We obtained a two-domain protein from the PDB and divided each domain according to the definition. Next, we generated a structural model using a rigid-body docking tool with two input domain structures. Step 2: We calculated some scores for each structural model. Step 3: We evaluated and reranked each model with a score function using some scores. As a result, the prediction was considered as successful when there was at least one acceptable structure within the top 10 solutions.
Division of multidomain proteins into domains was performed according to the definition of SCOP . However, the region between domains, known as the linker, is not always clearly defined. Therefore, we defined the linker region as the region between the last residue of the secondary structure of the N-terminal domain and first residue of the secondary structure of the C-terminal domain. The region without secondary structure between the domains was considered as the linker. DSSP  was used to determine whether secondary structures were formed for each residue. Additionally, the linker region was excluded during docking calculation and during model evaluation.
DINE is a method of reranking (scoring) the results of rigid-body docking of each domain structure for two-domain proteins. First, 2,000 domain-domain docking poses are generated by rigid-body docking using ZDOCK , after which scoring is performed by calculating the linear sum of the binding energy score Szrank, inter-domain distance score Sete, and domain interface score Sint. Finally, the DINE score is obtained from these scores as a linear weighted sum as follows.
The binding energy score (Szrank) is a value calculated by ZRANK , a protein complex scoring tool. The score zr obtained by ZRANK is based on van der Waals energy, electrostatic interaction energy, and desolvation energy [22,23]; a smaller value is preferable . Szrank is calculated with the following equation using zr, where zrmax and zrmin are respectively the maximum and minimum zr values for the generated poses.
The domain interface score (Sint) is a value of the interface of the binding pose estimated from the homology template structure. Protein domains with 30% or more sequence homology may have the same spatial arrangement and interaction surface within multidomain proteins [25,26]. Therefore, previous studies were conducted to improve prediction accuracy by predicting the interaction surface using the known structure of a multidomain protein. The KIP method  was used to predict interaction surfaces and obtain Sint. The KIP method uses a database in which multidomain proteins whose entire structures and interaction residues are known and are clustered by homology. The authors searched for proteins homologous to the query multidomain protein in the database and predicted interacting residues. Sint is the ratio of this predicted interaction residue to the interaction residue of each structural model.
The inter-domain distance score (Sete) is calculated from the statistics of the residue length of the linker region and inter-domain distance. Sete is calculated using the following equation
where L is the number of residues in the linker region, de is the distance between domains of the structural model, μe(L) and σe(L) are the mean and standard deviation of the distance between domains when the number of linker residues is L, respectively; μe(L) and σe(L) are based on the statistics from 1,657 multidomain proteins .
The DINE Sint function cannot be calculated when no template proteins are available. DINE uses structural information of homologous multidomain proteins as query proteins to calculate Sint. Therefore, in this study, we propose a new scoring method named PINE. PINE solved this problem by using a heterodimeric interaction residue pair score Sppi, which is a new scoring term for predicting the interaction surface, without using a protein as a homologous template, and docking score Sdock, calculated during rigid-body docking. Figure 2 shows the flow of the proposed PINE method.
a. Outline of the proposed method. Step 1: Obtain domains as in the existing method and generate 10,800 structural models for each multidomain protein with the rigid-body docking tool MEGADOCK. Docking score is calculated at the time of this docking calculation. Step 2: Calculate the proposed score for each structural model. Step 3: Rerank the structural model with the score function PINE using each score and predict the structure. b. Calculation of the interaction amino acid residue score Sppi. The PPI matrix P was generated from protein complex structures in advance and the domain contact matrix C was generated from a predicted two-domain protein structure. Next, Sppi of the predicted structure was calculated with P and C. This score predicts interactions without requiring a template for the overall structure of the multidomain protein.
Generation of the structural model was performed using MEGADOCK 4.0.2 [27,28]. The top three structural models of the docking score were output for each rotation angle, and the number of rotation angles was 3,600. Therefore, 3×3,600=10,800 poses were used for structural prediction of one multidomain protein (the option -t 3 -N 10800 was used). Among the domains of multidomain proteins, we considered the domain with a large number of residues as the receptor (argument of -R option) and that with few residues as the ligand (argument of -L option). Moreover, domain-domain docking was performed by excluding the linker region.
Rather than using the score based on the interaction surface prediction in the score function of the existing method, Sint, two new terms were used. The interaction amino acid residue score (Sppi) derived from the PPI and the docking score (Sdock) derived from the docking calculation with MEGADOCK are new terms in the proposed score function of PINE. Szrank and Sete are the same as those used in the previous study . To calculate Szrank, hydrogen must be added to the structural model, which was performed using reduce ver. 3.23 . The score function is defined as follows:
The interaction amino acid residue score is a term that focuses on the combination of interacting residues. The domain-domain interaction interface is predicted from the pair of amino acid residues involved in the PPI. This eliminates the need to know the entire structure of the multidomain protein homologous to the prediction of the interaction surface. In this study, if the distance between Cα atoms in two domains was within 8 Å, the residue pair was defined as interacting. Proteins are often multimers, and the PDB catalogs conformations of multimers. Therefore, among the dimers stored in the PDB, the number of amino acid residue pairs present in interacting positions was counted using 12,532 complex structures excluding redundancy based on UniProtID. As a result, a 20×20 upper triangular matrix (PPI matrix) P=(pij) indicating the number of interacting amino acid residues pairs was obtained. However, for each element in the PPI matrix, the number of interaction residue pairs counted was the value obtained by min-max normalization of the minimum value to 0 and maximum value to 1 as follows:
In addition, the number of amino acid residue pairs interacting between domains was counted from each domain docking structure model generated, and the 20×20 upper triangular matrix (domain contact matrix) C=(cij) was determined. For these two matrices, the interaction amino acid residue score Sppi of each model was calculated by the following formulas:
which has a minimum value of 0 and maximum value of 1.
Model generation was performed using MEGADOCK ver. 4.0.2 [27,28]. Sdock is the docking score calculated for each of the 10,800 structural models to be generated and was min-max normalized, so that the score ranges between 0 and 1.
For consistency purposes, we used the same training and test datasets that were used by Hirako, S. et al. . The training set contained 62 non-redundant two-domain proteins originally proposed by Wollacott, A. M. et al.  excluding the 14 proteins defined as single-domain proteins in SCOP . The test set contained 55 non-redundant two-domain proteins used by Cheng, T. M. K. et al. . Parameter optimization of the score function weights, wzrank, wete, wppi, and wdock, was conducted using the training dataset, and evaluation of PINE was performed using the test dataset. These training and test datasets are shown in Tables 1 and 2, respectively.
a The first 4-letters are PDB ID and the 5th letter is chain ID. † It is also included in Cheng dataset (Table 2).
a The first 4-letters are PDB ID and the 5th letter is chain ID. † It is also included in Wollacott dataset (Table 1).
The score function weights wzrank, wete, wppi, and wdock were optimized using the training dataset. Each weight was assigned an integer value of 1–10. First, the structural model was reranked using a score function with all weights set to 1. As a result, 51 of 62 proteins in the training dataset had at least one structure model whose RMSD was less than 10 Å, which is considered as an acceptable model, within the top 10 poses. From this initial state, we changed each weight by 1 and searched for the weight (optimized weight) that maximized the number of proteins with at least one acceptable model within the top 10 poses.
RMSD was used to evaluate the generated structural model, which was defined as follows:
where N is the number of atoms and di is the distance between atoms with the same residue number for the structural model and native structure. Additionally, when calculating the RMSD of the structural model, we only considered Cα atoms between the native structure of the domain with fewer residues and structural model among the two domains of the structural model. We fitted the domain with a large number of residues (receptor) and calculated the RMSD of the domain with a small number of residues (ligand).
Evaluation of the score function was performed using a test dataset. When 10,800 structural models were reranked using a score function for one multidomain protein, the prediction was defined as a ‘success’ if at least one ‘acceptable’ (RMSD <10 Å) model within the top 10 positions was obtained. For the 55 proteins in the test dataset, we determined the score function as the number of successful predictions.
We optimized the four weights wzrank, wete, wppi, and wdock of the score function using the training set for optimization (62 proteins). In the initial state in which all weights were 1, prediction was successful for 51 of 62 proteins. After optimizing the weights, the optimized weights were wzrank=9, wete=2, wppi=1, and wdock=4. When the training dataset was predicted using the score function with this combination of weights, prediction was successful for 52 proteins. The protein (PDB ID: 1BKB) that was newly successfully predicted by weight optimization, had a structural model with a smaller RMSD reranked higher than with the initial state.
After reranking the test dataset (55 proteins) using the optimized weights shown in the previous section, the proposed method obtained ‘acceptable’ predictions (where the top 10 models include at least one model with RMSD <10 Å) for 50 multidomain proteins among the 55 tested. On the other hand, the baseline method (only using Szrank and Sete) in which there was no template and the interaction surface obtained acceptable predictions for 47 proteins among 55 (while the original DINE successfully predicted 46 proteins ). Thus the proposed method, PINE, successfully identified 3 more proteins than the baseline method. In addition, PINE obtained 49 ‘good’ predictions (where the top 10 models include at least one model with RMSD <5 Å) among 55 proteins, whereas DINE obtained 39 among 55 .
When reranking was performed using only interaction amino acid residue scores, only one of the 55 proteins showed an acceptable model within the top 10 positions. The successfully predicted protein (PDB ID: 1BAG) reranked a structural model with an RMSD of 4.3 Å at rank 9.
When the structural model was reranked by docking score only, the prediction was successful for 48 of 55 proteins. The score can predict more proteins than other terms when it was calculated using only one term.
To investigate the contribution of each term (Szrank, Sete, Sppi, Sdock) to the score function, the prediction was performed using a score function in which the combination of terms used was changed (Table 3). The weight of the score function was also optimized in the same manner. The test dataset was reranked using this score function, and Figure 3 shows the distribution of acceptable models.
Total number of acceptable structures in each score function. The score function using the binding energy score reranks many acceptable structures to higher ranks. Moreover, although the interaction amino acid residue score could predict only one protein in the test dataset, the score function containing this term reranks many acceptable structures to higher ranks. Particularly, comparing the proposed method with the score function excluding interaction amino acid residue score from the proposed method, the prediction accuracy was the same, whereas the number of acceptable structures within the top 500 predicted by the former method was larger than that predicted by the latter method.
The weight of each score term in the proposed method was optimized using a training dataset. Among these weights, a largest value was assigned to the binding energy score, wzrank=9, which was more than 2-fold higher than the other weights. Additionally, when the score function containing Szrank was compared to that not containing this value, a higher success rate was achieved. Therefore, the binding energy score was a major contributor to this score function. However, 5 of 10 proteins that could not be predicted using only the binding energy score were predicted using the proposed method.
Prediction using only docking score was successful for 48 of 55 proteins. This was the highest prediction accuracy among those using a single score term. In addition, the weight of Sdock and wdock had the second highest value (=4) after wzrank. Therefore, docking scores were considered to contribute significantly to the accuracy of the proposed method. Among the 10 proteins that failed to be predicted by the binding energy score alone, 5 proteins were successfully predicted by the proposed method. Because these proteins were reranked to the top by prediction based on the docking score, prediction using the proposed method was successful. In fact, the proposed method showed a higher success rate than the score function using the three terms other than Sdock.
Prediction accuracy using only the term based on the inter-domain distance Sete was not high (success rate of 75%), and prediction using a single score was difficult. However, the score function with Szrank, Sdock, and Sete using two terms with high prediction accuracy in a single case and inter-domain distance score showed higher prediction accuracy than that using only Szrank and Sdock. This may be because the distance between the domains was limited.
The value of wppi was the smallest among the four weights. In addition, the number of proteins successfully predicted with only this term was one of the 55 proteins, showing the lowest prediction accuracy. As shown in Figure 4, there was variation from the predicted protein structure (PDB ID: 1BAG), which was the only successful prediction using only Sppi. The only model with an RMSD of less than 10 Å was the 9th model, and the models ranked 1st to 3rd were not similar to the native structure. However, the proposed method using the interaction amino acid residue score predicts more acceptable structures at higher ranks (Fig. 3). This indicates that the type of residue pair interacting in the protein complex also affects interactions between domains.
PDB ID: 1BAG’s receptor domain (gray), 1BAG’s native ligand domain (magenta), four ligand domain models (orange), and the linker region (cyan). This is the only protein in which the permissive structure exists within the top 10 in the prediction based on the interaction amino acid residue score alone. However, the top three structures are far from the native position, and their positions are disjointed. The acceptable model predicted to be 9th. Even for other proteins, predictions with only interaction amino acid residue scores may dislocate the position of the structural model to prevent concentration near the native structure.
Table 4 shows the proteins for which prediction failed. These five cases, also failed when the prediction was performed based on binding energy. This indicates that prediction using the proposed method fails if the wzrank value is large and prediction accuracy is poor. Additionally, the interaction surface of the native structure of the proteins for which prediction failed were smaller than 1,400 Å2. In these cases, because of the weak interaction, the results of protein docking became worse , and thus prediction based on binding energy and the generation accuracy of the structural model by docking showed worse results. Therefore, this prediction failed.
Figure 5 shows an example of successful prediction in this study. This protein (PDB ID: 1AOR) was also reranked as 1st in the docking score and binding energy score by ZRANK. Of the generated structural models, the centroids of the top 100 ligand models ranked by docking scores were plotted. The results indicated that the position of the ligand domain was correctly predicted. In this case, the structure model ranked 1st according the docking score was also predicted based on the binding energy score and the proposed method. Moreover, the interaction surface was predicted accurately, and the high-rank structure models were gathered near the native structure. Another case visualizes how PINE works successfully. The protein in Figure 6 (PDB ID: 1BKB) suggests that the ligand centroids of the top 100 models by PINE were closer to the native position than the ligand centroids of the top 100 from the initial docking.
Native protein structure of PDB ID: 1AOR and center of gravity for top 100 ligands in docking score. Gray indicates the receptor domain of 1AOR, magenta indicates the native ligand domain, orange indicates top 1 ligand structure model predicted by PINE, cyan indicates the linker region, and yellow spheres indicate the center of gravity of the top 100 ligand models generated by initial docking.
Native protein structure of 1BKB and center of gravity for top 100 ligands in docking score and PINE. Gray indicates the receptor domain of 1P5U, magenta indicates the ligand domain, cyan indicates the linker region, yellow sphere indicates the center of gravity of the ligand in the structural model generated by initial docking, and green sphere indicates the center of gravity of the ligand in the structural model generated by PINE.
However, the structural model of the ligand domain for the failed protein (PDB ID: 1P5U) was concentrated at a position that differed from that of the native structure (Fig. 7). This indicates that the model generation accuracy by MEGADOCK was low for this protein. Moreover, the prediction was poor even when reranking by binding energy was performed; as a result, prediction failed when using the proposed method. As described in the previous section, protein 1P5U may have failed the prediction because of its small interaction surface. The interaction surface area of the successfully predicted protein 1AOR was 3,201 Å2, whereas that of the failed protein 1P5U was 585 Å2. This indicates that 1P5U domains interact on a small interface compared to ordinary PPI interface. The domain-domain interaction form was constrained by its linker, leading to distance limitation, and was different from PPI with freely forming interaction. Hence, domains in a multidomain protein can interact with each other not only in a broad interface but also on a narrow interface. In such a case, accurate prediction using free form docking may be difficult. Figure 7 shows that structural models may be generated on the wrong surface. As observed for other failed proteins, the center of gravity of the top 100 structural models of the docking score was not concentrated at the position of the ligand domain in the native structure. In contrast, in most proteins which were predicted at the 1st rank, the center of gravity was concentrated at the position in the native structure. We expect that it is possible to generate a highly accurate structural model by modifying it so that the interaction surface can be limited when generating the model.
Native protein structure of 1P5U and center of gravity for top 100 ligands in docking score and PINE. Gray indicates the receptor domain of 1P5U, magenta indicates the ligand domain, cyan indicates the linker region, yellow sphere indicates the center of gravity of the ligand in the structural model generated by initial docking, and green sphere indicates the center of gravity of the ligand in the structural model generated by PINE.
Unlike PPI, the interaction between domains of multidomain proteins is limited in distance by linking. Thus, it is not always the case that the domain binding is coupled with an interface which obtained optimal binding free energy. Therefore, we verified whether the model selection could be improved by changing the domain-domain linking score, which limits the distance between domains. Specifically, the Sete formula was modified as follows to give a penalty depending on the distance between domains.
As a result of this improvement, the protein (PDB ID: 1QCS), as an example, had the number of correct models within the top 100 increased from 12 by original PINE to 22 by modified PINE. However, the S'ete did not improve overall performance. If the linker length is short, scoring with Sete will work, but it may not work well if the linker length is long. As shown in Figures 6 and 7, some models were not properly located within reach of the linker. As a result, it can be considered that it is difficult to score appropriately only by statistical information of linker length and interdomain distance. Development of a more refined score function depending on the secondary structure of the linker and the vector direction at the domain terminal is needed. In addition, since a turn that suddenly changes direction at the terminal of the linker is unlikely to occur intrinsically, it is necessary to consider other methods such as removing a structural model that takes such a difficult conformation.
Most proteins have multiple domains. Because these domains are folded and stable, various methods can be used to predict the whole protein. The rigid-body docking method uses the 3D structure of the domains and does not require a template for predicting the whole structure of a multidomain protein; this method can also predict whole structures from partial structures. In this study, we improved the method for predicting multidomain protein structures with two domains through computation. Existing methods for predicting multidomain protein structures by rigid-body docking using protein domains require template proteins with homology to the query proteins, particularly in the interface area.
In this study, we proposed a method for predicting multidomain protein structure using an interaction amino acid residue score without requiring a template protein structure. This score estimates domain-domain interactions based on protein-protein interactions. The interaction and docking scores calculated by docking analyses were used as the score function rather than the interface prediction of the existing method. As a result, the prediction was possible even when no template protein structure was available; by using the interaction amino acid residue score, more acceptable structures could be reranked to higher values. Using the interaction amino acid residue score alone made reranking into the top 10 difficult, but the score contributed to improving the reranking. This suggests that the prediction of the interaction surface between domains is based on the PPI. Additionally, in cases where the prediction accuracy of docking and energy scoring is poor, the interaction surfaces tend to be small. In such a case, a score that can predict interaction surfaces between domains is important. For the structural prediction of multidomain proteins that are difficult to predict, designing interaction amino acid residue scores corresponding to cases with small interaction surfaces should be further examined. A previous study  gave a uniform score for the structural model that fits the linker length condition based on the inter-domain distance. Studies are needed to improve this approach, such as by gradually changing this score or giving a higher score to a structural model closer to the native structure.
We would like to thank Dr. Masafumi Shionyu for providing the detailed multidomain protein dataset information and for insightful discussion. This work was partially supported by JSPS KAKENHI (Grant Nos. 17H01814, 18K18149), JST Research Complex Program “Wellbeing Research Campus: Creating new values through technological and social innovation,” MEXT Program for Building Regional Innovation Ecosystems “Program to Industrialize an Innovative Middle Molecule Drug Discovery Flow through Fusion of Computational Drug Design and Chemical Synthesis Technology,” and AMED Platform Project for Supporting Drug Discovery and Life Science Research (Basis for Supporting Innovative Drug Discovery and Life Science Research (BINDS)) (Grant No. JP18am0101112). We would like to thank Editage (www.editage.jp) for English language editing.
SM, MO and YA declare that they have no conflict of interest.
SM, MO, YA: conceived and designed the experiments; SM: performed the experiments; SM, MO, YA: analyzed the data; SM, MO, YA: wrote the paper. All the authors approved the final version of the manuscript.