Biophysics and Physicobiology
Online ISSN : 2189-4779
ISSN-L : 2189-4779
Review Article (Invited)
Quantitative analysis of protein dynamics using a deep learning technique combined with experimental cryo-EM density data and MD simulations
Shigeyuki Matsumoto, Shoichi Ishida, Kei Terayama, Yasushi Okuno

2023 Volume 20 Issue 2 Article ID: e200022

Abstract

Protein functions associated with biological activity are precisely regulated by both tertiary structure and dynamic behavior. Thus, elucidating high-resolution structures and obtaining quantitative information on in-solution dynamics are essential for understanding the molecular mechanisms. The main experimental approaches for determining tertiary structures include nuclear magnetic resonance (NMR), X-ray crystallography, and cryogenic electron microscopy (cryo-EM). Among these, recent remarkable advances in the hardware and analytical techniques of cryo-EM have enabled the determination of an increasing number of novel atomic structures of macromolecules, especially those with large molecular weights and complex assemblies. In addition to these experimental approaches, deep learning techniques, such as AlphaFold 2, accurately predict structures from amino acid sequences, accelerating structural biology research. Meanwhile, quantitative analyses of protein dynamics are conducted using experimental approaches, such as NMR and hydrogen-deuterium exchange mass spectrometry, and computational approaches, such as molecular dynamics (MD) simulations. Although these procedures can quantitatively explore dynamic behavior at high resolution, fundamental difficulties, such as signal crowding and high computational cost, greatly hinder their application to large and complex biological macromolecules. In recent years, machine learning techniques, especially deep learning techniques, have been actively applied to structural data to identify features in big data that are difficult for humans to recognize. Here, we review our approach to accurately estimate dynamic properties associated with local fluctuations from three-dimensional cryo-EM density data using a deep learning technique combined with MD simulations.

Significance

The experimentally derived structural data of macromolecules reflect the conformational states found in the samples. Three-dimensional cryo-EM density data implicitly contain dynamics information on the target molecule because they are reconstructed from numerous particle images representing the variable conformations that arise from in-solution dynamics. This indicates the potential of cryo-EM data for quantitative investigation of dynamics. A deep learning technique, the three-dimensional convolutional neural network, combined with molecular dynamics simulations, formulates the relationship between the density data and quantitative dynamics information, allowing dynamics properties to be extracted from cryo-EM density maps alone.

Introduction

Protein function is precisely regulated by its three-dimensional (3D) structure and dynamic properties. Thus, it is important to elucidate both features at the atomic level to understand the molecular mechanisms of protein functions. High-resolution 3D structures are experimentally determined by X-ray crystallography, cryogenic electron microscopy (cryo-EM), and nuclear magnetic resonance (NMR). Meanwhile, quantitative analysis of in-solution dynamic behavior requires other experimental or computational approaches, such as NMR, hydrogen-deuterium exchange mass spectrometry (HDX-MS) [1], and molecular dynamics (MD) simulations [2]. Their complementary use uncovers detailed structural features associated with protein function (Fig. 1). Among the experimental approaches for investigating structural properties, cryo-EM, aided by recent developments in hardware and analysis techniques, has increasingly uncovered novel 3D structures at atomic or near-atomic resolution, particularly those of large and complex macromolecules, accelerating the development of structural biology [3–6]. Nevertheless, quantitative analysis of the dynamic behavior of such large and complex molecules through conventional experimental and computational approaches is substantially difficult owing to fundamental limitations, such as significant signal crowding and extremely high computational cost. Therefore, the quantitative analysis of the dynamic behavior of large and complex macromolecules targeted by cryo-EM is an important issue in current structural biology research.

Figure 1 

Procedures for investigating protein functions. Protein function is investigated from both aspects of the tertiary structure and the dynamics.

The structural data obtained by the experimental and computational approaches are collected and maintained in databases (Table 1). Protein Data Bank (PDB) contains structural models of biological macromolecules [7]. Electron Microscopy Public Image Archive (EMPIAR) [8] and Electron Microscopy Data Bank (EMDB) [9] contain cryo-EM image data and 3D density maps reconstructed from them, respectively. Biological Magnetic Resonance Bank (BMRB) is a database of NMR data for biological macromolecules [10]. Biological Structure Model Archive (BSM-Arc) is a recently developed database that publishes structural information obtained by computational methods, such as MD simulations and homology modeling methods [11]. These databases are publicly available and facilitate research in structural biology.

Table 1  Databases of structural data associated with biological macromolecules
Database | Contents | URL | Ref
PDB | Experimentally determined 3D structures and computed models | https://www.rcsb.org/ | [7]
EMDB | Cryo-EM density maps and tomograms | https://www.ebi.ac.uk/emdb/ | [8]
EMPIAR | Raw images in cryo-EM investigations | https://www.ebi.ac.uk/empiar/ | [9]
BMRB | NMR data in investigations of biological macromolecules and metabolites | https://bmrb.io/ | [10]
BSM-Arc | Structural data obtained by computational works | https://bsma.pdbj.org/ | [11]
AlphaFold DB | Atomic models predicted by AI system | https://alphafold.ebi.ac.uk/ | [12,13]

Structural studies using machine learning (ML) techniques, particularly deep learning (DL) techniques, have been made possible by the availability of accumulated structural data, and many structural studies combined with these techniques have been reported in recent years. Among these studies, AlphaFold2 [12,13] is one of the most impressive. By applying DL techniques to the structural data in PDB and the primary sequence information, AlphaFold2 has achieved highly accurate predictions of 3D protein structures. DL techniques have also been intensively applied to cryo-EM data, which have increasingly accumulated due to recent technical breakthroughs. cryoDRGN [14], with an image encoder–volume–decoder architecture, reconstructs heterogeneous cryo-EM maps from single-particle images. Topaz-Denoise [15] is a noise reduction method that uses an ML framework, Noise2Noise [16], and has been shown to improve the signal-to-noise ratio (SNR) of raw images by approximately 100 times. Emap2sec [17] successfully estimated secondary structure information from intermediate-resolution 3D cryo-EM maps using a 3D convolutional neural network (CNN) [18–20], which shows high performance in object detection and classification in 3D images [21–23]. These studies and other DL techniques [24] indicate that ML approaches for finding features hidden in big structural data have become a powerful tool in structural biology research.

Here, we introduce our recently developed DL-based approach, Dynamics Extraction From cryo-EM Map (DEFMap), for predicting dynamic information only from cryo-EM 3D density maps [25]. This paper is an extended version of a Japanese review [26].

Dynamic Properties Hidden in Cryo-EM

In cryo-EM single-particle analysis (SPA) [3,27,28], a 3D density map is reconstructed from a large number of single-particle images of biological macromolecules found in the micrographs (Fig. 2). Because the specimen is prepared by rapidly freezing protein solutions, the single-particle images represent the various conformational states found in solution. The reconstructed 3D density maps therefore reflect the dynamic behavior; in other words, dynamics information is hidden in the density maps. Specifically, the map intensities of rigid regions (e.g., the protein interior forming the hydrophobic core) are strong, whereas those of flexible regions (e.g., loop regions exposed on the molecular surface) tend to be weak because the various conformational states are averaged (Fig. 2). It is generally recognized that such a relationship exists between map intensities and protein dynamics. However, it is difficult to quantitatively estimate dynamics properties from the map intensities alone, as these intensities are affected by several factors other than dynamics, such as local denaturation during sample preparation and preferred particle orientation. In fact, the correlation between the dynamics properties determined by MD simulations and the raw map intensities at the corresponding regions was relatively poor (Fig. 3A, left panels). Thus, other methods, such as HDX-MS and MD simulations, have been additionally applied for the quantitative analysis of protein dynamics.
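
The averaging effect described above can be illustrated with a toy calculation. The following Python sketch is not part of DEFMap; it assumes a one-dimensional grid, a Gaussian-shaped density for a single atom, and hypothetical ensemble positions, and it only shows that averaging over displaced copies of an atom lowers the peak intensity.

```python
import math


def averaged_density(centers, grid, sigma=1.0):
    """Average 1D Gaussian 'atom' densities over an ensemble of positions.

    centers: atom position in each ensemble member (hypothetical values, in Å)
    grid:    grid coordinates at which the density is sampled (in Å)
    """
    return [
        sum(math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centers)
        / len(centers)
        for x in grid
    ]


# Grid from -10 Å to +10 Å in 0.5 Å steps (illustrative choice).
grid = [i * 0.5 for i in range(-20, 21)]

# Rigid atom: every ensemble member places it at the same position.
rigid = averaged_density([0.0] * 5, grid)

# Flexible atom: ensemble members are spread over several positions.
flexible = averaged_density([-4.0, -2.0, 0.0, 2.0, 4.0], grid)

# The flexible atom yields a lower, broader averaged density.
print(max(rigid), max(flexible))
```

In real maps the intensity also depends on the factors listed above (local denaturation, preferred orientation), which is why this simple relationship alone does not give a quantitative estimate.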

Figure 2 

Overall workflow to reconstruct 3D cryo-EM maps. In the single-particle analysis, 3D cryo-EM maps are reconstructed from a vast number of particle images representing biological macromolecules. The specimens are prepared by rapidly freezing the protein sample solution.

Figure 3 

Correlation of MD-derived dynamics (DynamicsMD) with raw map intensities, the derived local resolution estimates, and the values predicted by DEFMap (DynamicsDEFMap). (A) Improvement of the correlations with DynamicsMD using DEFMap relative to raw map intensities. Correlation plots for raw map intensities (left panels) and DynamicsDEFMap (right panels) with DynamicsMD are shown along with their corresponding regression lines (orange). Each point represents a residue-specific value calculated by averaging the values over each residue. r denotes the correlation coefficient. (B) Comprehensive comparison of the correlation coefficients for DynamicsDEFMap with those for raw map intensities and the derived local resolution estimates. The correlation coefficients are calculated against DynamicsMD. Each point represents an individual cryo-EM map used in the evaluations, and the relationships with raw map intensities and the local resolution estimates are colored in orange and navy, respectively. Regarding local resolution estimates, 10 out of 25 datasets were excluded from the plots because they exhibited inverse correlations. The y=x line is represented by a black dashed line.

Developing a 3D CNN Model to Extract Dynamics Properties from Cryo-EM Maps

DEFMap formulates the relationship between the density data of a cryo-EM map and the corresponding protein dynamics properties within a supervised learning framework utilizing a 3D CNN. To achieve this, many datasets are required for model training. The 3D cryo-EM maps used as explanatory variables are available in EMDB. However, no database is available that contains the corresponding dynamics information to be used as objective variables. In this study, the dynamics data were therefore generated using MD simulations with the program GROMACS [29]. The initial structures for the simulations were prepared from the experimentally determined atomic coordinates deposited in PDB [7]. Disordered regions containing fewer than 7 residues were modeled, and other non-natural termini were capped with acetyl or formyl groups. As target proteins for the training, we selected 25 proteins based on the following criteria: (1) proteins with relatively small molecular weights and a soluble nature, for convenience of the MD simulations; and (2) proteins whose 3D density maps were determined at a resolution better than 4.5 Å.

Because DEFMap is designed to predict local dynamics properties, the local density data centered on the positions of the heavy atoms in the corresponding atomic model were extracted from the overall 3D maps as subvoxels with grid lengths of 15 Å. As preprocessing for efficient model training, a 5 Å low-pass filter and unification of the grid width (1.5 Å/grid) were applied to the downloaded cryo-EM maps (Fig. 4). After data augmentation by rotating the subvoxels by 90° in the xy, xz, and yz planes, 4,249,300 input datasets were prepared. The logarithm of the root-mean-square fluctuation (RMSF), which represents the atomic fluctuations from the averaged positions in MD trajectories with a length of 30 ns, was used as the dynamics information (Fig. 4). The 3D CNN model trained using the prepared datasets quantitatively predicts the local dynamics properties in a regression manner from the 3D cryo-EM map alone.
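
The dataset preparation steps in this paragraph can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the DEFMap implementation: the function names are hypothetical, the subvoxel is assumed to span 10 grid points per edge (15 Å at 1.5 Å/grid), a single 90° turn per plane is assumed for the augmentation, the natural logarithm is assumed for the RMSF (the base is not stated in the text), and the trajectory is assumed to be already superposed on a reference structure.

```python
import numpy as np

GRID_SPACING = 1.5  # Å per grid point after resampling (value from the text)
BOX_EDGE = 10       # assumption: 15 Å / 1.5 Å per grid point = 10 grid points


def extract_subvoxel(density, atom_xyz, origin=(0.0, 0.0, 0.0)):
    """Cut a cubic box of BOX_EDGE grid points around one heavy atom.

    density is a 3D array indexed as [x, y, z]; atom_xyz and origin are in Å.
    Returns None when the box would extend beyond the map (skipped here).
    """
    offset = np.asarray(atom_xyz, dtype=float) - np.asarray(origin, dtype=float)
    center = np.rint(offset / GRID_SPACING).astype(int)
    start = center - BOX_EDGE // 2
    stop = start + BOX_EDGE
    if np.any(start < 0) or np.any(stop > np.asarray(density.shape)):
        return None
    return density[start[0]:stop[0], start[1]:stop[1], start[2]:stop[2]]


def augment_by_rotation(box):
    """Return the box plus one 90-degree rotation in each of the three planes."""
    boxes = [box]
    for axes in ((0, 1), (0, 2), (1, 2)):  # xy, xz, and yz planes
        boxes.append(np.rot90(box, k=1, axes=axes))
    return boxes


def log_rmsf(trajectory):
    """Log of per-atom RMSF for a trajectory of shape (n_frames, n_atoms, 3)."""
    mean_pos = trajectory.mean(axis=0)                     # averaged positions
    sq_dev = ((trajectory - mean_pos) ** 2).sum(axis=2)    # (n_frames, n_atoms)
    return np.log(np.sqrt(sq_dev.mean(axis=0)))            # natural log assumed
```

Each extracted box would be paired with the log-RMSF value of the atom at its center, so one map contributes as many training examples as it has heavy atoms (times the augmentation factor).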

Figure 4 

Schematic diagram of dataset preparation and learning in DEFMap. The dynamics information, the logarithm of RMSF, is generated by MD simulations, the initial structures of which were modeled from the atomic coordinates downloaded from PDB (the upper workflow). The preprocessed cryo-EM density data obtained from EMDB are used as inputs to the neural network, which is composed of three 3D convolutional layers with Leaky ReLU activation, max pooling, and dropout, followed by two dense layers (the lower workflow). Different numbers of filters (64, 128, and 256) are applied to the three 3D convolutional layers.
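
To make the layer stack in Fig. 4 more concrete, the following Python sketch traces output shapes and trainable weight counts through a network of the described form. Only the three convolutional layers with 64, 128, and 256 filters and the two dense layers come from the caption; the input edge of 10 grid points, the 3×3×3 kernels, the 'same' padding, the pooling size of 2, and the width of the first dense layer are assumptions made purely for illustration, and the function name is hypothetical.

```python
def trace_network(box_edge=10, conv_filters=(64, 128, 256), kernel=3,
                  pool=2, dense_units=128):
    """Return (layer name, output shape, trainable weights) for each stage.

    Leaky ReLU and dropout are omitted because they neither change the
    tensor shape nor add trainable weights.
    """
    rows = []
    edge, channels = box_edge, 1  # one density value per grid point
    for i, filters in enumerate(conv_filters, start=1):
        # 'Same'-padded convolution keeps the edge length; count weights + biases.
        params = (kernel ** 3 * channels + 1) * filters
        rows.append((f"conv{i}", (edge, edge, edge, filters), params))
        # Max pooling has no weights and shrinks each edge (floor division).
        edge = max(edge // pool, 1)
        channels = filters
        rows.append((f"pool{i}", (edge, edge, edge, channels), 0))
    flat = edge ** 3 * channels  # flattened feature vector fed to the dense layers
    rows.append(("dense1", (dense_units,), (flat + 1) * dense_units))
    rows.append(("dense2", (1,), dense_units + 1))  # single regression output
    return rows


for name, shape, params in trace_network():
    print(f"{name:7s} {str(shape):20s} {params:>9d}")
```

Under these assumed hyperparameters the spatial extent shrinks from 10 to 5, 2, and 1 grid points while the channel count grows, and the final dense layer emits one value per subvoxel, matching the regression setting described in the text.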

The performance of the constructed model was evaluated by a leave-one-out cross-validation method, in which one of the 25 proteins was used as a test dataset and the remaining 24 were used as training datasets. The correlations between the predicted values and the dynamics were evidently improved compared with those calculated from the raw map intensities (Fig. 3A). The mean (±variance) of the correlation coefficient r obtained in the 25-fold cross validation was 0.665 (±0.124), whereas that calculated from the raw map intensities and the local resolution estimates, which are conventionally used as indices of the dynamics in cryo-EM analyses [30,31], was 0.459 (±0.179) and 0.510 (±0.091), respectively (Fig. 3B). This indicates that DEFMap successfully extracted the patterns associated with the dynamics properties from the cryo-EM density data. While the present DL-based method should provide similar information as local resolution estimates, DEFMap was found to capture the dynamics-associated features better than local resolution estimates on the current datasets. This advantage may be attributed to the supervised learning framework, which enables the model to learn a large amount of density data derived from multiple proteins, i.e., big data.
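
The evaluation protocol can be sketched as follows. This is an illustrative outline rather than the authors' code: the helper names are hypothetical, and the sketch assumes that atom-level values are averaged per residue before the correlation is computed, as described in the caption of Fig. 3.

```python
import math
from collections import defaultdict


def leave_one_out_splits(protein_ids):
    """Yield (training proteins, held-out protein) for every protein in turn."""
    for held_out in protein_ids:
        yield [p for p in protein_ids if p != held_out], held_out


def residue_means(atom_values, atom_residue_ids):
    """Average atom-level values over each residue."""
    grouped = defaultdict(list)
    for value, res_id in zip(atom_values, atom_residue_ids):
        grouped[res_id].append(value)
    return {res_id: sum(vals) / len(vals) for res_id, vals in grouped.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Splitting by protein rather than by subvoxel keeps all local boxes of the test protein out of training, so each fold measures how well the model transfers to an unseen map.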

Performance of DEFMap Against External Datasets

Because DEFMap learns a large number of the local features found in cryo-EM maps, the constructed model is expected to show high generalization performance for external data not used in the training. To confirm the performance against external datasets, we predicted the dynamics for three newly selected cryo-EM datasets (EMD-4241/6FE8 [32], EMD-7113/6BLY [33], and EMD-20308/6PCV [34]). The predicted values for all cases agreed well with the MD-derived dynamics values, with correlation coefficients r of 0.727, 0.748, and 0.711, respectively, indicating that DEFMap can make accurate predictions for external datasets (Fig. 5A, left panels). Mapping the predicted values onto the 3D structures showed that DEFMap, like the MD simulations, successfully captured general structural features, such as the rigidity of the protein interiors and the flexibility of the regions exposed to the bulk solvent (Fig. 5A, right panels). It should be noted that the prediction performance gradually declined as the overall resolution of the density maps worsened, and DEFMap was applicable to maps with a resolution of up to approximately 6–7 Å (Fig. 5B). This can be explained by the loss of detailed local structural information as the resolution decreases. Because the local resolutions of cryo-EM maps are known to vary widely across the molecule, the prediction results for regions with extremely low local resolution should be carefully interpreted.

Figure 5 

DEFMap-predicted results for cryo-EM maps not included in the training datasets. (A) Comparison of the dynamics values derived from MD simulations and DEFMap prediction for the three external datasets (EMD-4241, EMD-7113, and EMD-20308). The dynamics profiles as a function of residue IDs and the mapping onto the 3D atomic models, colored as indicated in the color bar, are shown in the left and right panels, respectively. The residue IDs in the dynamics profiles are numbered in accordance with their order in the corresponding PDB file; r denotes the correlation coefficient. The atomic models are derived from PDB [PDB ID: 6fe8, 6bly, and 6pcv]. (B) DEFMap performance at variable map resolutions. The cryo-EM maps used for the training dataset are low-pass-filtered to the target overall resolutions, and the resulting maps are used for the model training. (C) Comparison of the computed values with the experimentally derived dynamics data. The values predicted with DEFMap, those derived from MD simulations, and the experimentally determined values are denoted as DynamicsDEFMap, DynamicsMD, and DynamicsHDX-MS, respectively. The experimental data are derived from Ref. [34]. DynamicsDEFMap and DynamicsMD are converted to fragment-specific values by averaging the values over the residues in each fragment detected by the experiments. r denotes the correlation coefficient.

Because DEFMap learns computationally calculated dynamics information, it is important to verify the predicted values against experimentally determined dynamics properties. Fortunately, dynamics data determined by HDX-MS are publicly available for one of the external datasets used in the evaluation: EMD-20308/6PCV [34]. We compared the predicted and MD-derived dynamics values with the experimentally determined values and found that both correlated well with the experimental data, with correlation coefficients r of 0.743 and 0.791, respectively (Fig. 5C). These evaluation results using external datasets emphasize that predictions using DEFMap can provide insights equivalent to those obtained by experimental approaches from 3D cryo-EM maps that users have determined themselves. We therefore further explored the impact of DEFMap on structural biology research, and the findings are introduced in the following section.
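
HDX-MS reports one value per peptide fragment, whereas DEFMap and MD yield residue-level values, so the comparison in Fig. 5C requires averaging over the residues of each fragment. The Python sketch below illustrates that conversion; the function name, the dictionary-based input, and the inclusive (start, end) fragment convention are assumptions for illustration and are not taken from the DEFMap code.

```python
def fragment_averages(residue_values, fragments):
    """Average residue-level dynamics values over each HDX-MS fragment.

    residue_values: dict mapping residue ID -> predicted or MD-derived value
    fragments:      list of (first_residue, last_residue) tuples, inclusive
    Fragments without any covered residue yield None.
    """
    averages = []
    for first, last in fragments:
        covered = [residue_values[r] for r in range(first, last + 1)
                   if r in residue_values]
        averages.append(sum(covered) / len(covered) if covered else None)
    return averages


# Hypothetical residue-level values and fragment boundaries.
values = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
print(fragment_averages(values, [(1, 2), (3, 4), (10, 12)]))
```

The resulting fragment-specific values can then be correlated against the experimental values fragment by fragment.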

Impact of DEFMap on Structural Biology Research

Biological phenomena are supported by numerous molecular interactions. When a ligand interacts with a biological macromolecule, the dynamics of the binding sites are generally suppressed by conformational stabilization. We attempted to detect ligand-induced modulation of the dynamics properties at the binding sites using DEFMap. For this purpose, three macromolecules, for which cryo-EM maps of both the unbound (apo form) and bound (holo form) states have been determined, were selected (apo, holo: EMD-20080, EMD-20081 [35]; EMD-9616, EMD-9622 [36]; EMD-3957, EMD-3956 [37]), and their dynamic properties were analyzed using DEFMap and MD simulations. As expected, significant suppression of the dynamics at the ligand-binding site was detected in both DEFMap and MD simulations (Fig. 6A).

Figure 6 

Case studies using DEFMap in structural biology research. (A) Comparisons of the dynamics for apo (red) and holo forms (black) at the ligand-binding sites. The residues located at the ligand-binding sites are identified by a 5 Å cutoff from the ligands, and their averaged values are compared. The error bars indicate standard deviations (*p<0.01). (B) Schematic image of the protein assembly of the RDM1-DMS3-DRD1 peptide complex. The regions indicated by black dashed rectangles with the labels 1 and 2 correspond to the expanded regions in (C). (C) Mapping of the differences in the DEFMap-predicted dynamics of the RDM1-DMS3 complex between apo and holo forms onto the atomic models. The mapped values are calculated by subtracting the values of the apo form from those of the holo form. The resulting values are colored as indicated in the color bar, and lower values denote ligand-induced suppression of the dynamics. The DRD1 peptide and the disordered regions in the apo form are colored in green and dark gray, respectively. The cryo-EM map is shown in light gray. Expanded images of the regions indicated by the black dashed rectangles in the overall image of the complex are shown. The atomic models are derived from PDB [PDB ID: 6ois and 6oit]. (D) Prediction of the dynamics for extremely large macromolecules using DEFMap. The predicted values are mapped onto the 3D cryo-EM maps and colored as indicated in the color bar. The color range is defined by the minimum and maximum values in the individual prediction. The scale bar represents 50 Å and is indicated by black lines.

DEFMap predicts the dynamic properties of the entire molecule. Mapping the difference in dynamics between the apo and holo forms onto the 3D map therefore visualizes the overall ligand-induced dynamics changes. As an example, we analyzed a protein assembly associated with DNA methylation (Fig. 6B): the Arabidopsis defective in meristem silencing 3 (DMS3)-RNA-directed DNA methylation 1 (RDM1) complex as the apo form, and its complex with the ligand, the defective in RNA-directed DNA methylation 1 (DRD1) peptide, as the holo form (apo, holo: EMD-20080, EMD-20081). Interestingly, additional suppression of the dynamics was observed in regions distant from the ligand-binding site, including the RDM1-DMS3 interaction interface and the hinge region of DMS3 involved in DRD1 peptide recognition (Fig. 6C). This suggests that DRD1 peptide binding stabilizes the RDM1-DMS3 complex formation and the conformation of the DMS3 hinge region. It should be noted that no significant structural differences were found between the experimentally constructed atomic models of the apo and holo forms in the regions showing conformational stabilization (Fig. 6C). This observation indicates that such dynamic changes cannot be identified from the experimentally determined atomic models alone, emphasizing the usefulness of DEFMap in structural biology research.
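
The two analyses behind Fig. 6A and Fig. 6C, selecting binding-site residues with a 5 Å cutoff and subtracting apo values from holo values, can be sketched in Python as follows. The data layout (dictionaries keyed by residue ID, coordinates as tuples in Å) and the function names are hypothetical and chosen only for illustration.

```python
import math


def binding_site_residues(residue_atoms, ligand_atoms, cutoff=5.0):
    """Return IDs of residues with any atom within `cutoff` Å of any ligand atom.

    residue_atoms: dict mapping residue ID -> list of (x, y, z) coordinates
    ligand_atoms:  list of (x, y, z) coordinates of the ligand
    """
    site = set()
    for res_id, atoms in residue_atoms.items():
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            site.add(res_id)
    return site


def dynamics_difference(apo, holo):
    """Holo minus apo dynamics for residues present in both forms.

    Negative values indicate ligand-induced suppression of the dynamics.
    """
    return {res_id: holo[res_id] - apo[res_id] for res_id in holo if res_id in apo}


# Hypothetical coordinates and dynamics values.
residues = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]}
ligand = [(3.0, 0.0, 0.0)]
print(binding_site_residues(residues, ligand))
print(dynamics_difference({1: -0.5, 2: 0.0}, {1: -1.0, 2: 0.0, 3: 1.0}))
```

Residues that are disordered in one form have no value there and are dropped from the difference, which mirrors how disordered apo regions are shown separately in Fig. 6C.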

Other advantages of dynamics analysis with DEFMap are that it is not limited by molecular size and does not require an atomic model. Conventional experimental approaches for extremely large molecules, such as viral particles, are significantly hampered by signal crowding. In addition, MD simulations of such molecules require high computational costs and high-resolution atomic models to capture reliable behavior. DEFMap can easily provide their dynamics properties if cryo-EM maps are available [38–41] (Fig. 6D).

Limitation in Prediction Using DEFMap

Prediction using a supervised learning technique generally performs poorly for data not included in the training dataset. DEFMap achieves high generalization performance because it learns a vast amount of local density data. Nevertheless, the prediction performance was relatively poor for maps with extremely low resolutions. Furthermore, for unusual density data, such as those found in transmembrane regions and those derived from post-translational modifications, good prediction performance is not expected because such data were excluded from the current training datasets for convenience in performing the MD simulations. Expanding the training datasets with flexible-domain data and such unusual density data is a simple solution to these limitations, although it requires high computational costs. Development of an environment for using supercomputers, including the state-of-the-art Fugaku, will strongly support this solution.

Conclusion

Here, we demonstrated the potential of cryo-EM data for quantitatively analyzing protein dynamics by developing a framework, DEFMap, using an integrated approach of an experimental technique, an MD simulation, and a DL technique. This study advances cryo-EM-based analysis techniques in the field of structural biology. The DEFMap code and the trained models are available in the GitHub repository (https://github.com/clinfo/DEFMap). It can be used by preparing an environment with TensorFlow, Keras, HTMD [42], and EMAN2 [43], all of which are freely available for academic use. However, it may not be easy for some users to construct a computational environment. We have recently developed and released ColabDEFMap, which can be run on Google Colaboratory, a Python programming and execution environment provided by Google (Fig. 7, accessible from the GitHub repository of DEFMap). This will allow more researchers to easily obtain the prediction results from DEFMap without the need for complex environment construction or programming. We hope that the widespread use of DEFMap will accelerate research in structural biology.

Figure 7 

A screenshot of the visualization of the predicted results in ColabDEFMap.

Conflict of Interest

The authors declare no conflicts of interest.

Author Contributions

S. M., S. I., T. K. and Y. O. wrote the manuscript.

Data Availability

The models and the preprocessed input data are available in the Zenodo public repository with the DOI of https://doi.org/10.5281/zenodo.4317158.

Acknowledgements

This work was supported by MEXT as ‘Priority Issue on the Post K computer (Building Innovative Drug Discovery Infrastructure Through Functional Control of Biomolecular Systems)’ and as ‘Program for Promoting Researches on the Supercomputer Fugaku (Application of Molecular Dynamics Simulation to Precision Medicine Using Big Data Integration System for Drug Discovery)’. S.M. was supported by JSPS KAKENHI Grant Number JP17K15106 and JP22K06112.

References
 
© 2023 THE BIOPHYSICAL SOCIETY OF JAPAN