Translational and Regulatory Sciences
Online ISSN : 2434-4974
Biochemistry
New era in structural biology with the AlphaFold program
Ken-ichi MIYAZONO, Masaru TANOKURA

2022 Volume 4 Issue 2 Pages 48-52

Abstract

Proteins control all biological processes. Therefore, understanding protein functions is indispensable for elucidating each life phenomenon, including the pathogenic mechanisms of diseases. In structural biology, three-dimensional structures of proteins are used to uncover their functions. Thus far, more than 180,000 structures, determined using X-ray/neutron crystallography, nuclear magnetic resonance, or cryo-electron microscopy, have been deposited in the Protein Data Bank. These structures have significantly contributed to our understanding of life. During the summer of 2021, two artificial intelligence (AI) programs that can predict protein structures were released (AlphaFold and RoseTTAFold). These AI programs can predict highly accurate three-dimensional structures of proteins from their amino acid sequences. Consequently, structural biologists and other scientists can now easily predict the structure of a protein of interest without requiring any specialized skill or equipment. Furthermore, AlphaFold accelerates experimental protein structure determination because the program-generated structures can be excellent starting models. However, these AI programs use only information based on amino acid sequences. They cannot predict the structures of proteins in complex with other molecules, or the conformational changes that proteins undergo while interacting with other proteins or performing vital biological processes. In this review, we discuss the significance of AlphaFold in structural biology.

Highlights

The functions of proteins are tightly associated with their three-dimensional structures. Therefore, understanding the structure of proteins is essential to elucidate their functions. In summer 2021, two artificial intelligence programs for protein three-dimensional structure prediction (AlphaFold and RoseTTAFold) were released. These programs can predict protein structures from amino acid sequences with high accuracy. As a result, any scientist can now predict the structure of a protein of interest without requiring any specialized skill or equipment. However, these programs cannot predict the structure of all proteins. Therefore, the experimental determination of protein structures remains indispensable.

Introduction

Proteins are biomacromolecules consisting of amino acids and are indispensable for governing life processes. In general, a single chain of amino acids translated from an mRNA folds into a specific three-dimensional structure according to the chemical properties of each amino acid. Understanding protein structures is crucial because protein functions, such as the mechanism by which they catalyze a specific chemical reaction or interact with other chemicals or biomacromolecules, depend on their three-dimensional structures. Structural biology is the study of protein function through three-dimensional structures. Protein structures can also be used to design and optimize chemical compounds that regulate the functions of target proteins and to identify suitable compounds for clinical trials (structure-based drug design). By the end of 2021, more than 180,000 structures of biomacromolecules, including proteins, nucleic acids (DNA and RNA), and their complexes, had been deposited in the Protein Data Bank (PDB), and their coordinates are freely available [1]. These biomacromolecular structures have been experimentally determined by X-ray or neutron crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy. High-resolution protein structures determined by these methods have significantly improved our knowledge of the associated molecular mechanisms, including disease pathogenesis.

In summer 2021, two artificial intelligence (AI) programs, AlphaFold (DeepMind) [2, 3] and RoseTTAFold (Baker group) [4], were released. These AI programs were trained on a large number of known protein sequences and structures. They can accurately predict three-dimensional protein structures from amino acid sequences, even in the absence of homologous protein structures. Although structure prediction by these AI programs requires high-performance computing systems, researchers can run them through web servers, such as ColabFold [5] (preprint) and Robetta (https://robetta.bakerlab.org). In addition, a database (AlphaFold Protein Structure Database) that includes the predicted structures of the human proteome has been released [6]. The development of these programs has made it easier for all scientists to predict the structure of a protein of interest. As protein structures contain considerable information about their functions, these programs have the potential to accelerate research in all fields of biology.

In this review, we discuss the advantages and limitations of these programs. A highly accurate protein structure generated by the AI programs can be an excellent starting model for structure determination using X-ray crystallography. However, these programs use only amino acid sequences as inputs for structure prediction. Therefore, they cannot predict the structures of proteins in complex with other molecules or the conformational changes of proteins. Although AlphaFold and RoseTTAFold can accurately predict the three-dimensional structures of proteins, the experimental determination of protein structures is still required.

Protein Structure Determination Using a Model Generated by AlphaFold

X-ray crystallography is the most powerful method for determining the three-dimensional structures of biomacromolecules. Thus far, more than 87% of the structures in the PDB have been determined by X-ray crystallography. To determine a protein structure by X-ray crystallography, the protein of interest must be isolated with high purity and crystallized. The obtained protein crystals are then exposed to X-rays to collect their diffraction images, and the diffraction patterns are integrated and scaled. To convert the scaled dataset into a three-dimensional image of the electron density in the crystal, the phases of the diffracted X-rays, which are not recorded in the diffraction images, must be estimated (this process is called “phasing”). The protein structure is then built to fit the electron density map. One of the bottlenecks in X-ray crystallography is the initial phasing step, which is performed in one of two ways. If a known protein structure is expected to resemble the target protein structure (amino acid sequence identity >30%), the initial phases can be estimated using the molecular replacement method, in which information from the known structure is used to estimate the initial phases. The molecular replacement method requires no additional experimental procedure; therefore, it is optimal for structure determination when a suitable model structure is available. In contrast, if no known protein structure resembles the target protein, the initial phases can be estimated using single- or multi-wavelength anomalous dispersion methods or the multiple isomorphous replacement method. Structure determination using these methods requires the preparation of heavy-atom derivative crystals and the collection of their X-ray diffraction data. AlphaFold and RoseTTAFold can predict highly accurate protein structures (including membrane protein structures) from amino acid sequences [2, 4]. Therefore, once crystal diffraction data have been collected, the protein structure can be determined by the molecular replacement method using an AlphaFold-generated model structure. This approach is powerful because it has the potential to determine the structure of any protein by molecular replacement without the need to prepare heavy-atom derivative crystals (Fig. 1A).
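The >30% sequence identity guideline mentioned above can be checked with a few lines of code once two sequences have been aligned. The following is a minimal sketch, not a method from this review: the function name percent_identity is hypothetical, the input is assumed to be a pair of already-aligned sequences of equal length with "-" marking gaps, and identity is counted over positions where neither sequence has a gap, which is only one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two pre-aligned sequences (gaps written as '-').

    Assumption: identity is computed over columns where neither sequence
    has a gap; other conventions divide by alignment or sequence length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue  # skip gap columns
        compared += 1
        if res_a.upper() == res_b.upper():
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy alignment for illustration only (not real protein sequences).
    identity = percent_identity("ACDE-F", "ACDQGF")
    print(f"{identity:.1f}% identity")
    print("above the 30% guideline" if identity > 30.0 else "below the 30% guideline")
```

A value below the guideline suggests that a homolog may be a poor search model, which is the situation in which a predicted model becomes especially useful.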

Fig. 1.

Determination of protein structure using an AlphaFold-generated model structure. (A) Flowchart of protein structure determination by X-ray crystallography using the structure predicted by AlphaFold. (B) Structure of prolyl endoprotease (PEP), predicted by AlphaFold and RoseTTAFold. The ribbon is colored from blue (at the N-terminus) to red (at the C-terminus). (C) Superposition of the experimentally determined PEP structure (green) with the structure predicted by AlphaFold (grey). Four glycans and three disulfide bonds are shown by stick models and sphere models, respectively. The position at which AlphaFold could not predict an accurate structure is indicated by a red dotted square. (D) Superposition of the experimentally determined electron density map (blue mesh) with the structure predicted by AlphaFold (stick models). The catalytic triad residues of PEP (Asp-His-Ser) are labeled in white. (E) The active site structure of PEP. PEP has a wide-open catalytic pocket (sites 1 to 3) compared to its homologs to recognize large substrates, such as Pro-X bonds in proteins. All protein structures are depicted using PyMOL (http://www.pymol.org/). The electron density map is depicted using the Coot software [24].

In our previous study, we determined the structure of prolyl endoprotease (PEP) by the molecular replacement method using an AlphaFold-generated model structure [7]. PEP is a monomeric serine protease from Aspergillus niger that catalyzes the hydrolysis of peptide bonds between proline and X in proteins [8]. PEP can be used for debittering protein hydrolysates [9], alleviating the symptoms of celiac disease caused by proline-rich gluten-derived T cell epitopes [10], and protein digestion during hydrogen-deuterium exchange mass spectrometry (HDX-MS) assays [11]. PEP belongs to the serine peptidase family S28, which also includes prolylcarboxypeptidase (PRCP), which removes a C-terminal amino acid adjacent to proline in peptides [12], and dipeptidyl peptidase 7 (DPP7), which removes an N-terminal X-Pro dipeptide from a peptidic substrate [13]. The most striking difference between these enzymes is that although PRCP and DPP7 only recognize the terminal regions of peptides, PEP can recognize Pro-X bonds within proteins. This difference in substrate specificity suggests that PEP has a characteristic active site structure that enables the recognition of bulky substrates, such as Pro-X bonds in proteins. To reveal the substrate recognition mechanism of PEP, we crystallized PEP and collected its X-ray diffraction data. However, we failed to determine the PEP structure for more than 15 years because PEP has low amino acid sequence identity with its homologs of known structure and because we could not produce heavy-atom derivative crystals of PEP. To overcome these difficulties, we predicted the PEP structure using the AlphaFold and RoseTTAFold programs and used the predicted structures as templates for the molecular replacement method (Fig. 1A) [7].

The PEP structure predicted by AlphaFold closely resembled that predicted by RoseTTAFold (Fig. 1B). The root mean square deviation (RMSD) between these two predicted structures was 1.00 Å for 432 Cα atoms. The PEP structure was easily determined by the molecular replacement method using its AlphaFold-generated model. PEP consists of an α/β hydrolase domain that contains the catalytic triad (Ser179-His491-Asp458) and an SKS domain rich in helical structures. The electron density map showed that PEP has four N-linked glycans and three disulfide bonds. The experimentally determined PEP structure showed high similarity to that predicted by AlphaFold (Fig. 1C). The RMSD between the experimentally determined and AlphaFold-predicted PEP structures was 0.497 Å for 448 Cα atoms. The experimentally determined electron density map of the PEP active site fitted well with the structure predicted by AlphaFold (Fig. 1D). These observations indicate that AlphaFold predicted a highly accurate PEP structure. The experimentally determined PEP structure, solved using the AlphaFold model, showed that PEP has a wide-open catalytic pocket compared to PRCP and DPP7. This characteristic catalytic pocket structure is predicted to be vital for protein substrate recognition by PEP (Fig. 1E) [7].
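The RMSD values quoted above summarize how far apart matched Cα atoms are after two structures have been superposed. The sketch below illustrates the arithmetic only, under stated assumptions: the two coordinate sets are assumed to be already superposed (for example, by a structure viewer) and to list the same residues in the same order; the function names ca_coordinates and rmsd are hypothetical and are not taken from the programs used in this review.

```python
import math


def ca_coordinates(pdb_text):
    """Collect (x, y, z) of C-alpha atoms from ATOM records in PDB-format text.

    Relies on the fixed-column PDB layout: atom name in columns 13-16 and
    x, y, z in columns 31-38, 39-46, and 47-54.
    """
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def rmsd(coords_a, coords_b):
    """RMSD between two equally long, already superposed coordinate lists.

    Assumption: no fitting is done here; atoms are paired by list position.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Toy coordinates in angstroms, for illustration only.
    model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
    experiment = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]
    print(f"RMSD = {rmsd(model, experiment):.3f} Å over {len(model)} Cα atoms")
```

Because every atom pair contributes a squared distance, a small number of badly placed residues can dominate the value, so RMSD is usually reported together with the number of atoms compared, as in the text.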

Structures that AlphaFold Cannot Predict

AlphaFold and RoseTTAFold use only amino acid sequences as input; therefore, these programs cannot predict complex structures of proteins with other molecules, such as substrates and ions, nor can they predict conformational changes of proteins caused by binding to other compounds or post-translational modifications. In the case of the PEP structure determination, AlphaFold and RoseTTAFold could not predict the glycan structures and their dependent conformational changes (Fig. 1B, 1C) [7]. Interactions of proteins with other molecules and conformational changes are important aspects of protein function; hence, it is indispensable to experimentally determine protein structures.

Intermolecular interactions between proteins (protein–protein interactions, PPI) are involved in most biological processes. Therefore, understanding the structures of protein complexes is important to clarify the mechanism by which they function and how they are regulated (PPIs are considered promising drug targets). AlphaFold and RoseTTAFold can predict the structures of individual proteins and their multimers. For protein complex prediction, an enhanced version of AlphaFold (AlphaFold-Multimer) was released [14] (preprint). In addition, RoseTTAFold has been used to predict PPIs in core eukaryotic protein complexes [15]. To evaluate the accuracy of PPI prediction, we predicted the complex protein structures involved in transforming growth factor (TGF)-β signaling (Fig. 2A).

Fig. 2.

Prediction of protein–protein interactions by AlphaFold. (A) TGF-β signaling in cells. R-SMAD proteins (SMAD2 and SMAD3) form various transcription factor complexes with SMAD cofactors to regulate TGF-β signal-dependent gene expression. (B) Experimentally determined SMAD2/3-cofactor complex structures. SARA (SMAD2/3 phosphorylation activator), FOXH1 (transcription factor), SKI (transcription corepressor), MAN1 (SMAD2/3 dephosphorylation activator), and CBP (transcription coactivator) are shown in different colors, as indicated. (C) Superposition of the experimentally determined complex structures with the AlphaFold-predicted complexes (red).

TGF-β is a multifunctional cytokine that regulates various biological processes, including cell proliferation, differentiation, apoptosis, immune response, autophagy, cell migration, and extracellular matrix formation [16]. Therefore, dysregulation of TGF-β signaling causes diseases such as cancer and fibrosis [17, 18]. In cells stimulated by TGF-β, transcription factors, such as SMAD2 and SMAD3, are phosphorylated. SMAD2 and SMAD3 form several transcription factor complexes with other proteins (SMAD cofactors) to regulate TGF-β-dependent gene expression. Thus far, the structures of the SMAD2/3–SARA complexes [19, 20], SMAD3–FOXH1 complex [21], SMAD2–SKI complex [21], SMAD2–MAN1 complex [22], and SMAD2–CBP complex [23] have been determined by X-ray crystallography (Fig. 2B). Among the SMAD cofactors, SARA, FOXH1, SKI, and CBP interact with SMAD2/3 via their intrinsically disordered regions. To evaluate the accuracy of protein complex structure prediction by AlphaFold, we predicted the structures of these SMAD–cofactor complexes using AlphaFold on the ColabFold server. Three of the five complexes (those with SARA, MAN1, and CBP) were correctly predicted (Fig. 2C). This result indicates that AlphaFold can predict protein complex structures, but with relatively low accuracy. Therefore, predicted complex structures must be validated using other methods, such as binding assays or experimental structure determination.

Conclusions

AlphaFold and RoseTTAFold, AI programs trained on a large number of known protein sequences and structures, can predict three-dimensional structures of proteins with high accuracy. These programs enable scientists to predict protein structures using only amino acid sequences without requiring any specialized skill or equipment. In addition, the accurately predicted structures generated by these methods can be excellent starting models for experimental structure determination. The highly accurate structures predicted by these programs have the potential to accelerate research in all fields of biology. However, because these programs use only amino acid sequences as input, they cannot predict the structures of proteins in complex with other molecules, such as substrates and ions, nor can they predict conformational changes. AlphaFold and RoseTTAFold can predict protein–protein interactions, but the structure prediction of a protein complex is less accurate than that of a single protein. Understanding the mechanisms of intermolecular interactions and conformational changes is essential for understanding and modifying protein functions. Therefore, the experimental determination of protein structures remains indispensable.

Conflict of Interest

The authors declare no conflict of interest.

Acknowledgments

This work was supported by the Targeted Proteins Research Program (TPRP) of the Ministry of Education, Culture, Sports, Science, and Technology, Japan, and by JSPS KAKENHI grant numbers 15K14708, 17K19581, 23228003, and 20H02910.

References
 
© 2022 Catalyst Unit

This article is licensed under a Creative Commons [Attribution-NonCommercial-NoDerivatives 4.0 International] license.
https://creativecommons.org/licenses/by-nc-nd/4.0/