2022 Volume 4 Issue 2 Pages 45-47
Approaches to protein structure prediction are roughly divided into two categories: deductive and inductive. Inductive approaches, including AlphaFold, have been the more effective to date because the patterns of protein structures, the so-called folds or topologies, are expected to be limited in number. In this review, we introduce and outline AlphaFold, which has emerged from the recent increase in datasets of protein sequences and structures. In addition, we discuss its effects on drug discovery and development.
The increasing amounts of data on protein sequences and structures and the development of deep learning have led to breakthroughs in protein structure prediction. A recent neural network-based model, AlphaFold, enables accurate structure prediction for a significant portion of proteins. Highly accurate three-dimensional structural models of most proteins, including many drug targets, are now readily available. This is expected to accelerate further studies on drug discovery and development. In this article, we present an overview of AlphaFold and its applications, including our research on the predicted structure models provided by AlphaFold.
Recent rapid accumulation of data related to biomolecules, particularly nucleic acids and proteins, and the development of deep learning are spurring progress in various sciences, including drug discovery and development. In protein structure prediction, an epoch-making method, AlphaFold, was proposed [1]. This report introduces recent trends in protein structure prediction leading to this success and explains the relationship between the significant increase in data and deep learning development. This success is expected to spur advancements in various life science research efforts, including drug discovery and development.
Large amounts of protein amino acid sequence information have been obtained from large-scale sequence data provided by massively parallel sequencing (MPS). The rapid accumulation of data in protein science has enabled the construction of multiple sequence alignments (MSAs) from numerous related (homologous) sequences for many proteins. This progress has improved the accuracy of both profile–profile comparison and contact prediction methods that use MSAs. The prediction accuracy of AlphaFold strongly depends on the number of (effective) sequences included in the MSA containing the protein to be predicted. AlphaFold mainly prepares MSAs using UniRef [2], but in some cases an MSA containing a sufficient number of related proteins cannot be constructed from UniRef sequence data alone. Even in such cases, a sufficient number of sequences can be prepared using a larger database such as the Big Fantastic Database (BFD) [3]. The MSA calculation method FAMSA [4] was used to construct BFD. FAMSA uses the MIQS amino acid score matrix that we proposed [5] as its default matrix to search for distantly related sequences with high accuracy and sensitivity.
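To make the notion of "effective" sequences concrete, the following is a minimal sketch of one common convention, in which each sequence is down-weighted by the number of near-identical sequences in the MSA. The source does not specify the weighting scheme, so the function names and the 0.8 identity threshold are assumptions chosen for illustration and are not AlphaFold's exact definition.

```python
def sequence_identity(a, b):
    """Fraction of aligned columns with the same residue (columns with gaps are ignored)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def effective_sequence_count(msa, identity_threshold=0.8):
    """Sum of 1/n_i, where n_i counts the sequences (including i itself)
    whose identity to sequence i is at or above the threshold.

    The threshold value is an illustrative assumption.
    """
    neff = 0.0
    for seq_i in msa:
        neighbours = sum(
            sequence_identity(seq_i, seq_j) >= identity_threshold for seq_j in msa
        )
        neff += 1.0 / neighbours
    return neff


# Toy alignment: three near-identical sequences and one distant homolog.
msa = [
    "MKTAYIAKQR",
    "MKTAYIAKQR",  # exact duplicate of the first sequence
    "MKTAYLAKQR",  # 90% identical to the first sequence
    "MASGWV-EHL",  # distantly related, contains one gap
]
print(len(msa), effective_sequence_count(msa))
```

In this toy alignment the three near-identical sequences together contribute a weight of one, so the four raw sequences count as roughly two effective sequences. This illustrates why a large but redundant MSA can still be "shallow" for prediction purposes.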
The recent “enrichment” of protein structure information has also been remarkable. As of 2006, Structural Classification of Proteins [6], a hierarchical classification database of three-dimensional protein structures, showed that only approximately 3% of PDB [7] files had corresponded to “new” folds since 1997 [8]. Moreover, the target proteins in the Critical Assessment of Structure Prediction (CASP) experiments [9] are rarely chosen from proteins that have a clear similarity to proteins with a known structure. Nevertheless, a large proportion of them showed structural similarity to known structures. These observations illustrate a situation in which inductive prediction approaches that use known three-dimensional structure information, including AlphaFold, are effective.
A notable feature of AlphaFold is that it integrates contact prediction results with the three-dimensional structure information of template proteins. AlphaFold takes two types of input: an MSA containing similar protein sequences, and information on the template structures identified by HH-Search [10]. The MSA is stored as a tensor called the “MSA representation”, with dimensions s × r × c (c = 256 channels). This includes the profile values at each residue position of the protein. Template structure information is expressed as a tensor called the “pair representation”, with dimensions r × r × c (c = 128 channels), which contains information regarding the three-dimensional structure, such as the distances between residues within the template structures. Here, s is the number of protein sequences contained in the MSA, and r is the sequence length of the target protein. The MSA representation and pair representation are refined through calculations in Evoformer, which forms the first half of AlphaFold. In each Evoformer unit, these representations are repeatedly improved through mutual influence by “self-attention”.
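The self-attention operation mentioned above can be sketched as follows. This is a deliberately simplified, single-head version in plain Python: the learned query, key, and value projections are replaced by the identity, and the pair bias and gating used inside Evoformer are omitted. All names are hypothetical, and the input stands for one row of an MSA-like representation (r residues, each with c channels).

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def self_attention(features):
    """Single-head scaled dot-product self-attention with identity projections.

    features: list of r vectors (one per residue), each of length c.
    Returns (updated vectors, r x r attention weights).
    """
    channels = len(features[0])
    scale = math.sqrt(channels)
    # Each residue (query) scores every residue (key), regardless of
    # how far apart the two are in the sequence.
    weights = [
        softmax([dot(query, key) / scale for key in features])
        for query in features
    ]
    # Each updated vector is the attention-weighted average of all vectors.
    updated = [
        [sum(w * value[k] for w, value in zip(row, features)) for k in range(channels)]
        for row in weights
    ]
    return updated, weights


# Residues 0 and 3 carry similar features although they are far apart in sequence.
features = [[3.0, 0.0], [0.0, 3.0], [0.0, 3.0], [3.0, 0.0]]
updated, weights = self_attention(features)
print([round(w, 3) for w in weights[0]])
```

Because the weights depend only on feature similarity and not on sequence separation, residue 0 attends to residue 3 as strongly as to itself, which is the property that makes attention suitable for capturing long-range contacts.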
Another prominent feature of AlphaFold is that it is a deep-learning model that directly outputs the 3D coordinates of the target protein. Based on the MSA representation and pair representation refined by Evoformer, the 3D coordinates are calculated in the latter half of AlphaFold (the structure module). This calculation uses the first row of the MSA representation, which is called the single representation and has dimensions r × c (c = 384 channels). It includes information regarding the three-dimensional structure embedded in the pair representation. Using this representation as input, a translation vector and a rotation matrix are calculated for each amino acid of the predicted structure, with every amino acid treated as independent and initially placed at the origin. These give each amino acid a suitable position and orientation. Each amino acid is represented by a triangle whose vertices are the three main-chain atoms N, Cα, and C. Based on this virtually independent arrangement of the amino acids, the three-dimensional coordinates of all atoms except hydrogen are output using information such as the average bond lengths between atoms of amino acids, the bond angles, and the predicted rotation angles of the side chains. A deep-learning model that enables highly accurate prediction can be constructed by learning this series of steps end-to-end using a loss function that explicitly incorporates the deviation between the “correct answer” structure and the predicted structure. AlphaFold also calculates confidence scores, such as per-residue predicted local distance difference test (plDDT) scores, for its predictions. The lDDT score was originally developed to assess how well the local environment in a target structure is reproduced by a prediction [11]. AlphaFold is trained to predict this score against the “true” per-residue lDDT score, which is based on the distance differences of Cα-atom pairs between a target structure and the predicted structure.
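The placement of each amino acid by a rotation and a translation can be illustrated with a short sketch. It assumes approximate idealized local coordinates for the N, Cα, and C atoms (illustrative values in angstroms, with Cα at the origin) and uses hand-chosen frames. In AlphaFold the frames are predicted by the network, and side-chain torsion angles, which are omitted here, are needed to place the remaining atoms. All function names are hypothetical.

```python
import math

# Approximate idealized backbone positions in a residue's local frame
# (angstroms, C-alpha at the origin). Illustrative values only.
LOCAL_BACKBONE = {
    "N": (-0.525, 1.363, 0.0),
    "CA": (0.0, 0.0, 0.0),
    "C": (1.526, 0.0, 0.0),
}


def rotation_about_z(angle_rad):
    """3x3 rotation matrix for a rotation about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


def apply_frame(rotation, translation, point):
    """Return rotation @ point + translation for one 3D point."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def place_backbone(frames):
    """Place the rigid N-CA-C triangle of every residue.

    frames: list of (rotation, translation) pairs, one per residue.
    Returns one dict of global atom coordinates per residue.
    """
    return [
        {atom: apply_frame(rot, trans, xyz) for atom, xyz in LOCAL_BACKBONE.items()}
        for rot, trans in frames
    ]


# Two residues: one left at the origin, one rotated by 90 degrees and shifted.
frames = [
    (rotation_about_z(0.0), (0.0, 0.0, 0.0)),
    (rotation_about_z(math.pi / 2), (3.8, 0.0, 0.0)),
]
for residue in place_backbone(frames):
    print({atom: tuple(round(v, 3) for v in xyz) for atom, xyz in residue.items()})
```

Because each frame is a rigid transformation, the bond lengths and angles inside every triangle are preserved no matter which rotation and translation are applied. Only the relative placement of the residues changes.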
The predicted structures are ranked according to their chain plDDT scores, which are the averages of their per-residue plDDT scores.
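As an illustration of the scores discussed here, the sketch below computes a Cα-only per-residue lDDT between a reference and a predicted structure and ranks models by the mean of their per-residue scores. It assumes the commonly used lDDT settings (a 15 Å inclusion radius and tolerance thresholds of 0.5, 1, 2, and 4 Å). AlphaFold's plDDT is a prediction of this quantity produced by the network, which is not reproduced here, and the function and model names are hypothetical.

```python
import math

THRESHOLDS = (0.5, 1.0, 2.0, 4.0)  # distance-difference tolerances in angstroms
INCLUSION_RADIUS = 15.0            # only pairs closer than this in the reference count


def lddt_ca(true_ca, pred_ca):
    """Per-residue lDDT on C-alpha coordinates, scaled to 0-100."""
    scores = []
    for i in range(len(true_ca)):
        preserved, considered = 0, 0
        for j in range(len(true_ca)):
            if i == j:
                continue
            d_true = math.dist(true_ca[i], true_ca[j])
            if d_true >= INCLUSION_RADIUS:
                continue
            diff = abs(d_true - math.dist(pred_ca[i], pred_ca[j]))
            # Count how many tolerance thresholds this pair satisfies.
            preserved += sum(diff < t for t in THRESHOLDS)
            considered += len(THRESHOLDS)
        scores.append(100.0 * preserved / considered if considered else 0.0)
    return scores


def rank_models(per_residue_scores):
    """Sort model names by the mean of their per-residue scores, best first."""
    means = {name: sum(s) / len(s) for name, s in per_residue_scores.items()}
    return sorted(means, key=means.get, reverse=True)


# Toy reference: four C-alpha atoms on a line, 3.8 angstroms apart.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
# Toy prediction: the last residue is displaced by 3 angstroms.
predicted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (14.4, 0.0, 0.0)]
print(lddt_ca(reference, predicted))

# Ranking by the chain-level average of (hypothetical) per-residue plDDT values.
print(rank_models({"model_1": [90.0, 80.0], "model_2": [95.0, 92.0]}))
```

The displaced residue receives the lowest score because all of its distances to neighbouring residues deviate from the reference, whereas the other residues lose only the single pair that involves it.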
AlphaFold incorporates a wealth of deep-learning technologies that have developed rapidly in recent years. In particular, its use of the attention mechanism proposed in the field of natural language processing is prominent. Self-attention [12], which is used to improve the MSA and pair representations, was adopted in the Transformer, where it greatly improved the accuracy of machine translation tasks, and it is now being applied to a wide variety of tasks. Because self-attention excels at learning long-distance dependencies among the elements of a sequence, its application to contact prediction, where long-distance interactions are important, was a natural development. For the MSA, a task similar to the “masked language model” (MLM) of bidirectional encoder representations from transformers (BERT) [13], a Transformer-based language model, is introduced explicitly into the loss function. The characteristics of protein amino acid sequences are learned by masking 15% of the amino acids in the MSA sequences (called the cluster centers) and estimating them. This learning appears to suppress the deterioration of prediction accuracy to some degree, particularly when the input MSA data are scarce.
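The masking step of such an MLM-like task can be sketched as follows. This is a simplification: every selected residue is replaced by a single placeholder symbol, whereas BERT-style schemes also substitute random or unchanged tokens for some of the selected positions. The mask symbol, the fixed random seed, and the function name are assumptions made for illustration.

```python
import random

MASK_TOKEN = "#"  # hypothetical placeholder symbol for a masked position


def mask_msa(msa, mask_fraction=0.15, seed=0):
    """Mask a fraction of the non-gap residues in an MSA.

    Returns the masked MSA and a dict mapping (row, column) to the hidden
    residue, which a model would be trained to recover.
    """
    rng = random.Random(seed)
    positions = [
        (row, col)
        for row, seq in enumerate(msa)
        for col, residue in enumerate(seq)
        if residue != "-"
    ]
    n_masked = round(mask_fraction * len(positions))
    chosen = rng.sample(positions, n_masked)

    masked = [list(seq) for seq in msa]
    targets = {}
    for row, col in chosen:
        targets[(row, col)] = msa[row][col]
        masked[row][col] = MASK_TOKEN
    return ["".join(seq) for seq in masked], targets


msa = ["MKTAYIAKQR", "MKTAYLAKQR"]
masked_msa, targets = mask_msa(msa)
print(masked_msa)
print(targets)
```

During training, the loss for this task would compare the model's predicted residue at each masked position with the hidden residue stored in `targets`, so that the model must infer the missing amino acid from the remaining sequence context.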
Generally, AlphaFold models are expected to be highly accurate for most proteins and domains. However, as these protein models do not contain ligands, it is important to infer the ligands of proteins based on their structural model(s) for drug discovery and development. To this end, we presumed that our method for constructing PoSSuM [14], which is a database for finding similar small-molecule binding sites on proteins, would be useful because the method is scalable [15]. Comparing known and putative ligand-binding sites in proteins has become more beneficial in drug discovery and development, owing to the accumulation of protein-ligand complex structures. For example, PoSSuM was used to find potential targets of the flavonoid diosmetin, which shows anti-AML (Acute Myeloid Leukemia) activity [16].
Although AlphaFold models are highly accurate, it has been reported that a slightly contracted pocket in an AlphaFold model may have a “distorting” effect in a docking study [17]. In contrast, our method can identify both known and putative binding sites on protein models based on their physicochemical and geometric similarities, with some tolerance, to existing ligand-binding sites on proteins (Fig. 1). This may compensate for the effects originating from a contracted pocket, and the comparison results will provide clues for finding seed compounds and developing lead compounds for a target protein. We plan to expand our PoSSuM database to include human protein models produced by AlphaFold.
Fig. 1. An imputed ligand in an AlphaFold model. The AlphaFold model of human glutathione transferase (cyan) is depicted with an imputed ligand, GSDHN (magenta).
The large amount of predicted structural data of proteins provided by AlphaFold is expected to have a strong effect on drug discovery and development. For example, drug discovery based on structural information of target proteins has become easier than ever. Such structural data may help in designing candidate compounds and in performing in silico screening. Moreover, off-target prediction and drug repurposing based on structural information may become more active. These improvements in efficiency are expected to accelerate drug discovery and development.
The authors have no conflicts of interest related to this report or description of the study.
The author would like to express his appreciation for the support from the Platform Project for Supporting Drug Discovery and Life Science Research (Basis for Supporting Innovative Drug Discovery and Life Science Research (BINDS)) from AMED under Grant Number JP21am0101110.