2020 Volume 17 Pages 161-176
We introduced a conformational sampling method in an earlier report: The multi-dimensional virtual-system coupled molecular dynamics (mD-VcMD) enhances conformational sampling of a biomolecular system by computer simulations. Herein, new sampling method, a subzone-based mD-VcMD, is presented as an extension of mD-VcMD. Then, the subzone-based method is extended further using a genetic algorithm (GA) named the GA-guided mD-VcMD. In these methods, iterative simulation runs are performed to increase the sampled region gradually. The new methods have the following benefits: (1) They are free from a production run: i.e., all snapshots from all iterations are useful for analyses. (2) They are free from fine tuning of a weight function (probability distribution function or potential of mean force). (3) A canonical ensemble (i.e., a thermally equilibrated ensemble) is generated from a simple procedure. A thermodynamic weight is assigned to each snapshot. (4) Selective sampling can be performed for particularly addressing a poorly sampled region without breaking the proportion of the canonical ensemble if the poorly sampled conformational region emerges in sampling. By applying the methods to a simple system that involves an energy barrier between potential-energy minima, we demonstrated that the new methods have considerably higher sampling efficiency than the original mD-VcMD does.
We introduced a generalized ensemble method named “genetic-algorithm guided multi-dimensional virtual-system coupled molecular dynamics (GA-guided mD-VcMD)” to enhances conformational sampling of a biomolecular system. This method is free from fine tuning of a weight function, and a production run is not required. A thermodynamic weight is assigned to each snapshot by a simple procedure. Statistics of poorly sampled conformations are compensated by a selective sampling procedure. This method was applied to a simple system that involves an energy barrier between potential-energy minima, and demonstrated that GA-guided mD-VcMD has a considerably high sampling efficiency.
Generalized ensemble methods are useful for effectively searching a wide conformational space of a molecular system [1,2]. These methods enhance sampling along a reaction coordinate (RC) such as energy or a structural parameter (e.g., a root-mean-square difference from a reference conformation). Moreover, they are powerful for the exploration of considerably stable states (major basins) in the conformational space. However, these methods might overlook less-stable states (minor basins): One can assume that a major basin and a minor basin, the positions of which are separated in the full-dimensional conformational space, mutually overlap in the RC axis. As a result, conformational trapping might occur. The minor basins might be overlooked [3,4]. This oversight does not lead to a severe difficulty when the minor basin is out of scope. However, an alternative approach must be used when the minor basin acts as bridges among the major basins or when it has key features for exerting biophysical functions of the system.
Adaptive umbrella sampling [5,6] can avoid such an oversight or trapping problem because one can set the RC such that the major and minor basins are discriminated along the RC axis. However, this method requires fine tuning of a weight function (potential of mean force (PMF)) along the RC axis. Practically speaking, the difficulty of the fine tuning might result in a very long simulation (or increment of the number of iterative simulations) to sample the conformational space widely.
To avoid these difficulties, we introduced a multi-dimensional virtual-system coupled molecular dynamics (mD-VcMD) simulation [7,8], which is designated herein as the original mD-VcMD (or simply the original method). This method enhances conformational sampling with repetition of iterative simulations in a multi-dimensional RC space. Furthermore, this method is free from fine tuning of the weight function because the weight function does not appear in the proceedings of the iterative simulations. We applied the original mD-VcMD method to a large and complicated system consisting of a ligand (endothelin-1) and its receptor (human endothelin type B receptor; one of the GPCR proteins), where the receptor was wrapped by cholesterols, embedded in an explicit membrane, and surrounded by an explicit solvent [8].
We encountered a difficulty in the original-mD-VcMD study: Once a poorly sampled region emerged in the multi-dimensional RC space in an iterative simulation, this region might cause conformational trapping in a subsequent iteration. Consequently, a more robust method is necessary to advance the iteration.
For this purpose, in the present study, we introduce two methods with extension of the original mD-VcMD: a subzone-based mD-VcMD method, which is an extension of the original mD-VcMD; and a GA-guided mD-VcMD method, which extends the subzone-based mD-VcMD to expand sampling to non-sampled RC regions using genetic algorithm (GA). As presented in this paper, we explain the GA-guided and original mD-VcMD methods. We apply the methods to a simple model with a potential-energy barrier between potential-energy minima. Because of the simplicity of this model, the difficulty of the sampling task is determined by the effective barrier height between the minima, which is modulated readily by changing the simulation temperature. At a high temperature, both the original and GA-guided mD-VcMD methods showed quick convergence of the numerical distribution to the analytical distribution of the system. By contrast, at a lower temperature, only the GA-guided method exhibited quick convergence.
We introduce many terms and quantities in this paper and its related Supplementary Information (SI). Supplementary Table S1 presents positions for which the terms and quantities are defined or explained. Sections 1–4 of SI are preparatory sections for the original mD-VcMD method. Section 5 of SI explains the original mD-VcMD method itself [7,8]. The original method produces the statistical weight Qcano(L) assigned to zones ζL (virtual zones) introduced into the RC space (precise definition of Qcano(L) is given in SI). Integration of Qcano(L) over all zones provides the probability distribution of the system in the entire RC space. The subzone-based and GA-guided mD-VcMD methods are extensions of the original method to compute Qcano(L).
The subzone-based and GA-guided methods are designed to treat a multiple-RC space, just as the original mD-VcMD method. In fact, the original method was applied to a complicated system using a seven-dimensional RC space [8]. However, to simplify the explanation, we use a one-dimensional (1D), two-dimensional (2D), or three-dimensional (3D) RC space in this paper. One can increase the space dimensionality straightforwardly.
Here, we present the subzone-based method to compute Qcano(L). The sections below are based on sections 1–5 of SI. Figure 1A, which is substantially the same as Supplementary Figure S2C, is a portion of a 2D RC space, where four linked zones shown by differently colored frames overlap in an intersection (shaded region). The arrows indicate directions to shift the linked zones to remove the inter-zone overlaps. Figure 1B is an extended RC space resulted from the zone shifting, where the virtual states (indices assigned to nine virtual zones) are numbered as L1, ..., L9 with arbitrary order. Subscripts i for Li are not those to specify linked virtual states (section 3 of SI). The linked virtual states are expressed with parentheses as L(i). The extended RC space is introduced for convenience to make explanation simple. The zone-overlap exists as in Figure 1A in the actual mD-VcMD simulation.
2D RC space is divided into zones. Panel (A) is equivalent to Supplementary Figure S2C. Black arrows indicate directions to shift zones to remove inter-zone overlap (shaded region). (B) Extended RC space resulted from the zone shifts. Li (i=1, ..., 9) and Sj (j=1, ..., 4) are the respective indices for zones and subzones. Shaded and checked subzones are described in the text.
Here we introduce subzones, which overlap perfectly on intersections in the original RC space. The shaded region in Figure 1A corresponds to four shaded regions in Figure 1B. The subzones are these shaded regions in Figure 1B. Similarly, four checked subzones in Figure 1B overlap on another intersection (not presented in Fig. 1A). Once the subzones are introduced, any zone is constructed by subzones. For instance, in Figure 1B, each zone Li consists of four subzones with indices expressed as Sj (j=1, ..., 4). In an m-dimensional RC space, a zone comprises 2m subzones in general. Numbering of the subzones is arbitrary. Whereas L specifies the zone position in the whole RC space, S specifies the relative position of a subzone in each zone.
We express a subzone S in zone ζL as [S, L]. We denote the number of snapshots detected in the subzone S during the iteration M by c[M](S, L), and normalize it by the number of snapshots detected in the zone ζL as
(1) |
This normalization results in
Here, we characterize zones as three types: This zone is called an empty zone (E zone) if all the subzones in zone ζL are not sampled (c[M](S, L)=0,
Time development of sampled and non-sampled subzones in a zone along three (first, second and third) iterations. Four types of time development are presented as panels (A)–(D). “Average [1;3]” is c[1;M](S, L) (Eq. 5) over the three iterations (M=3). Shaded and white subzones respectively represent sampled and non-sampled subzones. Checked subzones for “Zone correction” are corrected ones by Eq. 7 (procedure 2) for panel (C) and by Eq. 8 (procedure 3) for panel (D).
Next, we introduce two delta functions δ and D as
(2) |
and
(3) |
Then, we average the SS count c[I](S, L) (I=1, ..., M) at each [S, L] over the first to M-th iterations with the following procedures.
Procedure 1: If ζL is a CS zone in one or more iterations (Fig. 2A), then the average of SS count is defined as
(4) |
In Eq. 4, IS and E zones are eliminated from the averaged SS counts. Superscript [1;M] denotes that the SS counts are computed using an iteration window from the first to M-th iterations.
Procedure 2: If ζL is an IS or E zone in all iterations and a CS zone is not involved, then average is calculated as
(5) |
The resultant SS counts do not vary when c[I](S, L) is replaced by δ[I](S, L)c[I](S, L) in Eq. 5. A CS zone can result from combining IS zones, as presented in Figure 2B.
By contrast, as shown in Figure 2C, some non-sampled subzones (say [S', L]) might remain after averaging, which results in an IS zone for the zone and c[1;M](S', L)=0. Then, we correct the zero SS counts as follows: First, non-zero SS counts are averaged in ζL:
(6) |
and replace the zero SS count by
(7) |
Figure 2C presents an example for this case. Finally, all subzones in the IS zone have a non-zero value: c[1;M](S, L)>0 (
Procedure 3: All of the SS counts are zero if ζL is an E zone in all the iterations. Then we correct c[1;M](S, L) simply as
(8) |
Figure 2D shows procedure 3.
Computation of Qcano(L) from SS countsIn this section, we present a method to calculate
Assuming nis intersections in the entire RC space, and assuming that these intersections are numbered in an arbitrary order, then we designate these subzones as a subzone pair if a couple of subzones mutually overlap in the RC space. Here, the number of subzone pairs originated from the j-th intersection (j=1, ..., nis) is represented by npair(j): npair(j)=nlink(nlink–1)/2; also, nlink is a function of j (nlink=nlink(j)) because nlink depends on the position of the j-th intersection in the RC space. Figure 3 is a 2D RC space presented in the extension form. Four subzones represented by red circles originate from an intersection. The subzones generate six (=4×3/2) subzone pairs, which are indicated by red lines. Similarly, four subzones shown by blue circles originate from another intersection. Six blue lines specify subzone pairs. Similar arguments can be made for the mD RC space. One can assume a k-th subzone pair with respect to the j-th intersection (k=1, ..., npair(j)). We denote the indices of the pairing subzones by
2D RC space presented in extended form as Figure 1B. Four subzones in red circles overlap to a single intersection. Six red lines represent six subzone pairs. Similarly, four subzones in blue circles overlap to another intersection with six subzone pairs indicated by blue lines. Green lines represent subzone pairs that are related to a zone
Next, we define a modulated SS count ρ by multiplying a factor σL to c[1;M](S, L) as
(9) |
where σL depends on L. Finally, we define an objective function F as
(10) |
where
(11) |
The multiplication factors,
(12) |
where the summation is taken over all subzones in zone ζL.
Minimization of F corresponds to fitting the function c[1;M](S, L) among neighboring zones by modulating σ. This fitting is shown schematically in Figure 4, which is presented in a 1D RC space to make the fitting procedure simple. Figure 4A shows a plot of the SS counts c[1;M](S, L) before fitting, where the SS counts of two pairing subzones (i.e., two open circles in a broken-line rectangle) are not mutually similar. Figure 4B presents the modulated SS counts after fitting, where the two modulated SS counts pairing (circles unified in the rectangle) have mutually equal values.
In both panels, the x-axis is presented by three measures: λ, L, and S. They are scalar because the RC space is 1D in this figure. (A) Open circles represent SS counts c[1;M](L, S). Pairing subzones are connected by broken-line rectangles. Broken line represents the level of c[1;M](L, S)=1. (B) Green-colored filled circles represent modulated SS counts
In the 1D RC space, the objective function F converges analytically to zero by minimization. In a 2D or higher dimensional RC space, however, F might not converge to zero; also, σmin is not determined analytically. Consequently, one might minimize F by a Monte Carlo simulation with modulation of σ. Using the converged σmin,
(13) |
which is equivalent to Eq. S7 in SI. Then the (M+1)-th iteration is done using
It is noteworthy that
Furthermore, we can assess the accuracy of the resultant
(14) |
where Npair(L) represents the number of subzone pairs related to the zone ζL. The summation is taken over all subzone pairs related to the zone ζL. Equation 14 can be regarded as a part of Eq. 10.
If the subzone-based mD-VcMD simulation is sufficiently long, then the modulated SS counts fit well between the pairing subzones in theory, which results in Elocal(L)→0 (
Procedures 1–3 in subsection “Introduction of subzone and subzone’s snapshot count (SS count)” proposed the method to fill SS counts c[1;M](S, L) for all the CS, IS, and E zones when the M-th iteration is finished. Then, we may proceed to the next iteration (iteration M+1) using the obtained SS counts. However, these procedures do not provide a method to presume the SS counts for IS and E zones. Here, we introduce the GA-guided method to fill c[1;M](S, L) in the IS and E zones.
GA databaseFor preparation, we generate a database (called GA database) used for the GA procedure as follows: First, we consider a compact block composed of CS zones:
(15) |
This block is called a GA block. We refer to the zone at the center of the block (Δi=Δj=Δk=···=0) as a central GA zone. Figures 5A and 5B respectively portray the blocks in 2D and 3D RC spaces. In the 2D RC space, GA blocks are classified into three types: inside, side, and corner GA blocks, depending on the position of the central GA zone in the RC space. In the 3D RC space, blocks of four types are possible: inside, face, side, and corner GA blocks. If a block is not an inside one, then the variations of Δi, Δj, Δk, ... in Eq. 15 might range from 0 to +1 or from –1 to 0. In a higher dimensional RC space, we should classify the blocks into more than four types. When the M-th iteration has been done, we have SS counts from various iteration windows: c[1;1](S, L), c[1;2](S, L), ..., c[1;M](S, L), as computed using Eqs. 4 and 5. Then, the GA database consists of GA blocks from those various iteration windows.
(A) GA blocks in 2D RC space. A GA block, which consists of CS zones, is classified into three depending on the position of the central GA zone (marked with a red circle) in the RC space: blocks for which the central GA zones are at the side and corner of the RC space are called side and corner GA blocks, respectively. A block is called an inside GA block if the central GA zone is inside the RC space. (B) In 3D RC space, blocks of four types are possible: inside, face, side, and corner. The central GA zone for the inside GA block is hidden. Although this figure is presented for blocks in GA database, the same block classification is possible for recovery blocks and blocks in GA generation Genk.
Figure 6A portrays an example of GA blocks in the 2D RC space. We designate the GA blocks as GAj (j=1, ..., 6), where j is set in an arbitrary order. The GA database might involve GA blocks that are taken from the same area of the RC space but from different iteration windows. We regard those GA blocks as different ones in the GA database. Consequently, the number of GA blocks in the GA database increases with progress of the iteration.
(A) GA blocks and recovery blocks in 2D RC space. Six GA blocks GA1, ..., GA6 are shown by frames of different colors. Its central GA zone is marked by a circle with the same color. CS, IS, and E zones in GA blocks are shown by rectangles of different tones. Broken-line frames represent recovery blocks. (B) Four recovery blocks (REC1, ..., REC4) and their recovery zones (REC zone 1, ..., 4) are selected from panel (A). Although more recovery zones are possible in panel (A), they are not discussed herein. (C) Reset recovery blocks REC1, ..., REC4.
Each zone in GAi is expressed by its relative position to the central GA zone. The relative position is expressed using a relative index Λ: Λ=[Λ(α), Λ(β), Λ(γ), ...] and Λ(h)=–1, 0, 1 (h=α, β, γ, ...) if the block GAi is an inside block. For a non-inside block, Λj ranges as Λ(h)=–1, 0 or Λ(h)=0, –1. It is noteworthy that a subzone S in a zone Λ is expressed by its relative position in the zone (Fig. 1B). Then, we specify the subzone by [S, Λ]. The SS count for the subzone [S, Λ] in GAi is expressed as
Next, we proceed to estimation of SS counts in IS and E zones without using procedures 2 and 3 in subsection “Introduction of subzone and subzone’s snapshot count (SS count)”. After the M-th iteration, we select an IS or E zone. One can assume that the selected zone is surrounded by Nsurr zones. For instance, in the 2D RC space, Nsurr=8, 5 or 3 for an inside, side, or corner zone, respectively (Fig. 5A), and for the 3D RC zone, Nsurr=26, 17, 11, or 7 for an inside, face, side, or corner zone, respectively (Fig. 5B). If the selected IS/E zone is surrounded by more than Nsurr/2 CS zones, then this selected zone is designated as a recovery zone (REC zone). Figure 6B presents examples of four REC zones, where REC zones 1, 2, and 3 are surrounded respectively by six, six, and five CS zones of eight zones (Nsurr=8). REC zone 4 is surrounded by four CS zones of five zones (Nsurr=5). A compact block consisting of the REC zone and zones surrounding the REC zone is called a recovery block (REC block) if a selected zone is judged as a REC zone. The i-th REC block is denoted by RECi. Similarly to GA blocks, REC blocks are classified into some types: inside, side, and corner blocks for a 2D RC space, and inside, face, side, corner blocks for a 3D RC space.
In fact, more REC zones can be taken than those shown in Figure 6A. Note that the REC blocks can overlap in the RC space: REC2 overlaps REC1 and REC3 in Figure 6A. We also note that the SS counts, c[1;M](S, L), in a REC block are computed only from the iteration window [1;M] (Eqs. 4 and 5), in contrast to those in a GA block, which are computed from various iteration windows [1;1], ..., [1;M].
Resetting REC blockWe use indices [S, Λ] to specify subzones in a REC block, as well as those in a GA block, and refer to an SS count, c[1;M](S, Λ), in RECi as
To assess similarity of the SS counts between blocks RECi and GAj, we introduce a correlation coefficient ccor(RECi, GAj):
(16) |
where
(17) |
(18) |
(19) |
(20) |
(21) |
(22) |
(23) |
The double summations
We convert ccor(RECi, GAj) to a score function Escore(RECi, GAj) as
(24) |
The delta function
One may replace the SS counts of the REC zone of RECi by those of the central GA zone of GAj if Esimi is smaller than a certain value. However, we assess the quality of GAj in the next section to judge if the replacement is appropriate.
Overlap of SS-count patterns among zones in a GA blockThis subsection explains a method to assess the quality of a GA block GAj by introducing Ephys, which quantifies the spatial overlap of SS counts between overlapping subzones within the block. In fact, Ephys is a quantity expressed only by the SS counts in GAj independent of RECi.
We consider two subzones [S, Λ] and [S', Λ'] in GAj. First, we introduce a δ function δov(L, S; L', S'), which is used as a flag to detect overlapping subzones in GAj:
(25) |
For instance, δov(L, S; L', S')=1 for any subzone pair taken from the shaded ones in Figure 1B.
Similarly to Eq. 9, the SS counts
(26) |
where
(27) |
The summation
Although F(σ) (Eq. 10) was defined for the whole RC space, Ephys is defined for a single GA block. Finally, we minimize Ephys with modulating {σL} using a Monte Carlo scheme. The smaller the minimized Ephys(GAj) becomes, the better the quality of the block GAj is. The Ephys used in the next section is this minimized one.
First generation of GATo start the GA procedure, the first generation, denoted as Gen1, is set by taking GA blocks from the GA database. Now, we specifically examine block RECi, which means that the SS counts in RECi are determined by the GA operation. Then, we calculate Esimi(RECi, GAj) and Ephys(GAj) for all the GA blocks in the GA database, and define the score function as
(28) |
where ωs and ωp are weights set by the user, such as 1.0 and 2.0, respectively, as described in our earlier report [9]. The smaller the Escore is, the more similar the SS-count patterns of GAj to RECi and the better the overlap of SS counts among zones in GAj.
Next, we sort the resultant scores
The first generation is prepared for each of the REC blocks. The size of a REC block depends on its place in the RC space, as described previously. We express the size of RECi and GAj as
Presuming that we have the k-th GA generation, denoted as Genk, with respect to a recovery block RECi, then the generation update from Genk to Genk+1 is presented in Figure 7. Remember that the first generation of GA consists of
GA process to update the k-th GA generation Genk to the (k+1)-th one Genk+1. Details for the process are explained in the text.
Next, we randomly select
Up to this point,
To generate Genk+2, we repeat similar procedures to Genk+1 as done for Genk: (1) Calculate the score function between each REC block (say RECi) and all
(29) |
where MEMj,k+1 stands for the j-th member (
We repeat this cycle Ncycle times, where Ncycle is set by the user. Finally, the SS counts of the REC zone in RECi (i.e., the central zone of RECi) are replaced by those of the central zone of the best-scored member of
Figure 8A is a part of 2D RC space, where the magenta-colored circles are REC zones. The SS counts for these REC zones are estimated by the GA (Fig. 8B). Then, one might calculate
Part of 2D RC space for which the axes are L(i) (or λ(i)) and L(j) (or λ(j)). Three zone types are represented by different tones. Colored circles are REC zones described in the text.
We explain details for mutation and exchange. In mutation, a pair of
(30) |
Figure 9 presents an image of slices in the 3D RC space to help understanding of the slices. Each slice consists of nine zones.
Image of slices in the 3D RC space.
In mutation, a slice, say Λ(x), is selected randomly from slices of
In exchange, after selecting a slice randomly from slices of
No middle slice (Λ(x)=0) is used for the mutation and exchange. The GA procedure is used to estimate the SS counts of a REC zone. The REC zone is involved in the middle slices of a REC block. To maintain logical consistency between the REC zone and the member in Genk, we eliminated the middle slice.
A technical point must be noted: In mutation and exchange, the size of the operated slice of a REC block should be included in or should be the same as that of a block in Genk. The size of the operated slice of the block in Genk should be included in or should be the same as that of a GA block.
The initial conformation of the first iteration can be any conformation for all the original, subzone-based and GA-guided mD-vcMD methods. However, to discover pathways from the unstable state to the most stable conformation (i.e., the native tertiary structure), the initial conformation should be far from the most stable conformation. Then, it is readily apparent that such sampling is free from knowledge of the native tertiary structure.
We have performed multiple runs for the mD-VcMD method to obtain the thermodynamically plausible distribution (the canonical distribution) of a biological system [10,11]. One can assume that iteration M, which consists of Nrun runs, is finished. The initial conformation of the i-th run of iteration M+1 can be the last conformation of the i-th run of iteration M. The first iteration is an exception, where the initial conformations are a single conformation as described above.
A more appropriate method to prepare the initial conformations is the following: Imagine an ensemble of the last conformations of the first to M-th iterations. The ensemble comprises of Nrun×M conformations. Supplementary Figure S4A schematically presents those conformations in the RC space, which is presented two-dimensionally for convenience. Then, we divide the RC space into small patches of equal size and count patches that involve one or more conformations. The patches are called “occupied patches.” A patch is not the virtual zone (ζL). Eighteen occupied patches are presented by shaded rectangles in Supplementary Figure S4. Some occupied patches may involve a single conformation. The others involve more conformations. Next, we select Nrun patches randomly from the occupied ones, and select one conformation randomly from each of the selected patches. Red-filled circles are shown in the figure assuming that Nrun=10. Those conformations are used for the initial conformations for iteration M+1. We divide the RC space into smaller patches if the number of occupied patches is smaller than Nrun. However, if the number of occupied patches is considerably larger than Nrun, then we increase the patch size. We designate this method as the “evenly distributed-initial-conformation (EDIC) method.”
Selectively distributed initial conformations (SDIC)Another method can be used to set the initial conformations. It is simple and useful when a poorly sampled region emerged in the RC space. Assuming again that the M-th iteration has finished and assuming that an RC region was poorly sampled in all iterations, then the accuracy of
This selective sampling procedure does not work if
By contrast, in the subzone-based and GA-guided methods, the artificial increase of
We introduced a selective sampling procedure in an earlier report [8]. However, the selective sampling introduced herein differs from that described in that earlier report.
Thermodynamic Weight Assigned to Sampled ConformationsIt is fundamentally important to assign a thermodynamic weight to each snapshot for analyzing the resultant conformational ensemble. Given a zone ζL, the thermodynamic weight assigned to ζL is Qcano(L). If the simulation is at the M-th iteration, then
In the subzone-based and GA methods, however, sampling is flexible. For instance, according to the selective sampling proposed above, the number of snapshots in zones might be considerably uneven, which means that the thermodynamic weight assigned to a snapshot is not equivalent to Qcano(L). We present a simple method for assigning weight: Consider a snapshot, as detected in ζL, and presume that this snapshot is the i-th one in the conformational ensemble. Then, the thermodynamic weight, wi, assigned to the i-th snapshot is
(31) |
where
We earlier applied the original mD-VcMD method and a conventional Monte Carlo method to a model system [12]. Although the analytical conformational distribution of the system was not obtained because of the complexity of its potential energy function, both methods mutually converged. Importantly, the convergence from the original mD-VcMD was 200 times quicker than that from the conventional Monte Carlo method.
As described below, we check the efficiency of the GA-guided and original mD-VcMD methods using another model system for which the conformational distribution is obtainable analytically.
Model systemIn the model, a single particle with coordinates expressed by [x, y, z], moves in a 3D space at a temperature on a potential-energy surface, Emodel, defined as
(32) |
where r is the distance of the particle from the origin [0, 0, 0]:
(33) |
Actually, Emodel has a spherical symmetry around the origin in the 3D space. Parameters D, a, and b in Eq. 32 are set respectively to 0.045, 10, and 20. We set the temperature T to 1.5 or 3.0 for the simulations. Parameters D, a, b and temperature T are introduced as non-dimensional quantities in this application. Then, the Boltzmann factor is defined simply as exp[–Emodel/T].
Originally, the RCs λ(α), λ(β), and λ(γ) were defined by the inter-atom-group distances in a molecular system (Section 1 of SI), and the motions of RCs were calculated from the motions of the molecular conformation. By contrast, in this application, λ(α), λ(β), and λ(γ) respectively correspond directly to x, y, and z without assuming the molecular system. The function Emodel(r) is simple as shown in Supplementary Figure S5A of SI. This figure shows two energy basins at r=0.0 and r≈4.0, which are mutually separated by an energy barrier at r≈2.0. Supplementary Figure S6 demonstrates the potential energy in the 2D space (r2=x2+y2; z=0). We designate the energy basins at r=0.0 and r≈4.0 respectively as the “central basin” and “periphery basin.” The periphery basin is deeper than the central basin. The energy barrier at r≈2.0 is called “major barrier.”
We restrict the sampling region in a cubic box by setting the lower and upper limits to the coordinates as –3.0≤x≤3.0, –3.0≤y≤3.0, and –3.0≤z≤3.0. In the outside of the cubic box, Emodel increases rapidly. By this restriction, this model system has nine energy basins in a strict sense. One is the central basin at the origin (x=y=z=0). The periphery basin at r≈2.0 in Supplementary Figure S5A is split into eight sub-basins near the edges of the cubic box. The energy barriers among the eight sub-basins are considerably lower than the major barrier. They are overcome easily during simulation at the examined temperatures. Therefore, we do not specifically examine differences among the eight sub-basins but specifically examine traversal between the central and periphery basins in this study.
The virtual zones are set as listed in Supplementary Table S2. We divided each of x-, y-, and z-axes equally into 11 virtual zones (
The difficulty of sampling can be modulated by varying the system temperature. Supplementary Figure S5B shows the Boltzmann factor, exp[–Emodel(r)/T], assigned to a microscopic state located at r. We note that the factor is not 4πr2exp[–Emodel(r)/T]. Measuring from the most probable position (r≈2.0), the probability at the major barrier is about 10–5 at T=3.0 and 10–10 at T=1.5. This difference of the probability strongly influences the sampling efficiency of the original mD-VcMD method, as demonstrated later. We used a computer program myPresto/omegagene for the simulations [13].
SimulationsWe moved the particle by a Monte Carlo (MC) simulation for both the original and GA-guided mD-VcMD simulations. The current simulation should be called “VcMC” because MC is done instead of molecular dynamics (MD). However, “VcMD” is used for convenience in this study. Step size Δr=[Δx2+Δy2+Δz2]1/2 of the MC simulation was 0.02: Δr distributed evenly from 0.0 to 0.02 isotropically in the 3D space. The initial position of the particle for the first iteration was set at the origin [0, 0, 0]. We performed 30 runs in each iteration (Nrun=30). The number of simulation steps for a run was 1×107. The total simulation length for an iteration was 3×108 steps. We used the EDIC method to prepare the initial conformations for iteration M (M≠1).
In the GA-guided method, when an iteration finishes, CS, IS, and E zones are detected. Then, REC zones are set and the SS counts in the REC zones are calculated to proceed into the next iteration. Generally, with the progress of the iterative simulation, the CS zones increases. Therefore, the IS or E zones decreases, which means that the REC zones decrease. Finally, when the IS or E zones vanished (i.e., the REC zones vanished), we switched the GA-guided method to the subzone-based method.
For the GA-guided VmMD method at T=1.5 and 3.0, respectively, 10 and 5 iterations were performed, whereas 13 and 5 iterations were done, respectively, for the original mD-VcMD at T=1.5 and 3.0. The interval for attempting inter-virtual state transition (Δτ; see section 4 of SI) is set to 10 steps of MC simulation for both the GA-guided and original VmMD methods.
To assess the sampling, we calculated correlation coefficients between the resultant
(34) |
where
(35) |
and
(36) |
The terms in Eq. 34 are given as
(37) |
(38) |
(39) |
(40) |
and
(41) |
Therein, Nsite represents the number of virtual zones in the system (Nsite=11×11×11). Supplementary Figure S5B shows that the Boltzmann factor in the periphery basin (r≈4) is considerably larger than that of the other regions at both temperatures. Therefore,
To assess the overall shape of Qcano(L) in the correlation calculation, we defined another coefficient with setting g[M] and ganal as
(42) |
and
(43) |
Then, the new correlation coefficient is defined as
(44) |
The terms involved in Eq. 44 are defined by the same equations as Eqs. 37–41 with replacement of f[M]→g[M] and fanal→ganal.
Figures 10A and 10B demonstrate the correlation functions as a function of iteration No. M calculated respectively from the GA-guided mD-VcMD at T=1.5 and 3.0. Figures 10C and 10D respectively show those from the original mD-VcMD at T=1.5 and 3.0. Figures 10B and 10D indicate that both the GA-guided and original methods provided a quick convergence of
Correlation coefficients,
At the lower temperature (T=1.5), the free-energy barrier became considerably higher than that at T=3.0. For the GA-guided method,
Figure 11 displays the spatial patterns of
Spatial patterns of Qcano(L) (i.e., dependence of Qcano on L) at T=1.5. Although three sides of the box are parallel to L(x)-, L(y)-, and L(y)-axes, because potential energy Emodel has spherical symmetry, we do not specify the axes. (A) Patterns for analytical
Lastly in this application section, we assessed the SDIC method by assuming a situation that the central region (i.e., the central basin) was insufficiently sampled during preceding iterations. Note that the SDIC method is not applicable to the original mD-VcMD method, in theory.
In the subsection presented above, we obtained
(45) |
Subsequently, we performed the 11-th iteration of the GA-guided method at T=1.5 using
Results show high correlation coefficients between
Therefore, we re-calculated the correlation coefficients locally taking zones with L(h)=3, ..., 9
Benefits of the subzone-based and GA-guided methods are the following: (1) All snapshots from all iterations are useful for analysis: no production run is necessary, or all simulations are production run. By contrast, in the original mD-VcMD method or other enhanced sampling methods, the production run is mandatory usually, and only the production run is used for analysis. (2) To proceed with sampling, no fine tuning of a weight function is used, which is also a benefit in comparison to multicanonical or other adaptive umbrella sampling methods. (3) The statistical weight wi (Eq. 31) is found after sampling. The method used to compute wi is simple. (4) Statistical significance of a poorly sampled region can be increased by selective sampling. In the original mD-VcMD method, the selective sampling accumulates local errors in Qcano(L), which results in slow convergence.
We applied the GA-guided and original mD-VcMD methods to the model system. The model system simplicity enables control of the sampling problem difficulty by tuning the height of the free-energy barrier (i.e., by varying the simulation temperature). At the higher temperature, both the original and GA-guided methods worked well, showing quick convergence of Qcano(L) to the analytical one. However, at the lower temperature, only the GA-guided method showed quick convergence. Furthermore, we demonstrated that the SDIC method worked well to provide Qcano(L) in the poorly sampled region without breaking the global shape of Qcano(L) in the RC space.
GA-guided mD-VcMD is applicable to any complicated biomolecular system for which the potential energy is defined for an MD simulation. We generated computer programs for the GA-guided mD-VcMD method, and applied them to molecular binding of a flexible medium-sized ligand (a peptide of about ten amino-acid residues long) to a receptor protein in an explicit solvent [9]. The resultant conformational ensemble produced a binding free-energy landscape in which various structural clusters (the unbound state, fragile encounter complexes, and the most stable native-like complex) are distributed. This result demonstrated that the GA-guided mD-VcMD simulation sampled the wide conformational space.
All authors declare that they have no conflict of interest.
All authors contributed equally to this study.
A preliminary version of this work was deposited in the arxiv (2006.06950v2) on August 13, 2020.
We deeply appreciate Prof. Haruki Nakamura from Osaka Univ. for insightful comments. J. H. was supported by JSPS KAKENHI Grant No. 16K05517 and by the Development of Core Technologies for Innovative Drug Development Based Upon IT from Japan Agency for Medical Research and Development (AMED). N. K. was supported by a JSPS KAKENHI Grant (No. JP20H03229). K. K. was supported by JSPS KAKENHI Grant No. 16K18526. This work was supported as an HPCI System Research Project (Project IDs: hp190017, hp190018, hp190027, hp200063, hp200090, and hp200025). Simulations were performed on the TSUBAME3.0 supercomputers at the Tokyo Institute of Technology, in part under the Cooperative Research Program of the Institute for Protein Research, Osaka University, CR-19-05 and CR-20-05.