GA-guided mD-VcMD: A genetic-algorithm-guided method for multi-dimensional virtual-system coupled molecular dynamics

Junichi Higo; Ayumi Kusaka; Kota Kasahara; Narutoshi Kamiya; Itaya Hayato; Xie Qilin; Takuya Takahashi; Ikuo Fukuda; Kentaro Mori; Yutaka Hata; Yoshifumi Fukunishi

doi:10.2142/biophysico.BSJ-2020008

Abstract

We introduced a conformational sampling method in an earlier report: The multi-dimensional virtual-system coupled molecular dynamics (mD-VcMD) enhances conformational sampling of a biomolecular system by computer simulations. Herein, new sampling method, a subzone-based mD-VcMD, is presented as an extension of mD-VcMD. Then, the subzone-based method is extended further using a genetic algorithm (GA) named the GA-guided mD-VcMD. In these methods, iterative simulation runs are performed to increase the sampled region gradually. The new methods have the following benefits: (1) They are free from a production run: i.e., all snapshots from all iterations are useful for analyses. (2) They are free from fine tuning of a weight function (probability distribution function or potential of mean force). (3) A canonical ensemble (i.e., a thermally equilibrated ensemble) is generated from a simple procedure. A thermodynamic weight is assigned to each snapshot. (4) Selective sampling can be performed for particularly addressing a poorly sampled region without breaking the proportion of the canonical ensemble if the poorly sampled conformational region emerges in sampling. By applying the methods to a simple system that involves an energy barrier between potential-energy minima, we demonstrated that the new methods have considerably higher sampling efficiency than the original mD-VcMD does.

Significance

We introduced a generalized ensemble method named “genetic-algorithm guided multi-dimensional virtual-system coupled molecular dynamics (GA-guided mD-VcMD)” to enhances conformational sampling of a biomolecular system. This method is free from fine tuning of a weight function, and a production run is not required. A thermodynamic weight is assigned to each snapshot by a simple procedure. Statistics of poorly sampled conformations are compensated by a selective sampling procedure. This method was applied to a simple system that involves an energy barrier between potential-energy minima, and demonstrated that GA-guided mD-VcMD has a considerably high sampling efficiency.

Introduction

Generalized ensemble methods are useful for effectively searching a wide conformational space of a molecular system [1,2]. These methods enhance sampling along a reaction coordinate (RC) such as energy or a structural parameter (e.g., a root-mean-square difference from a reference conformation). Moreover, they are powerful for the exploration of considerably stable states (major basins) in the conformational space. However, these methods might overlook less-stable states (minor basins): One can assume that a major basin and a minor basin, the positions of which are separated in the full-dimensional conformational space, mutually overlap in the RC axis. As a result, conformational trapping might occur. The minor basins might be overlooked [3,4]. This oversight does not lead to a severe difficulty when the minor basin is out of scope. However, an alternative approach must be used when the minor basin acts as bridges among the major basins or when it has key features for exerting biophysical functions of the system.

Adaptive umbrella sampling [5,6] can avoid such an oversight or trapping problem because one can set the RC such that the major and minor basins are discriminated along the RC axis. However, this method requires fine tuning of a weight function (potential of mean force (PMF)) along the RC axis. Practically speaking, the difficulty of the fine tuning might result in a very long simulation (or increment of the number of iterative simulations) to sample the conformational space widely.

To avoid these difficulties, we introduced a multi-dimensional virtual-system coupled molecular dynamics (mD-VcMD) simulation [7,8], which is designated herein as the original mD-VcMD (or simply the original method). This method enhances conformational sampling with repetition of iterative simulations in a multi-dimensional RC space. Furthermore, this method is free from fine tuning of the weight function because the weight function does not appear in the proceedings of the iterative simulations. We applied the original mD-VcMD method to a large and complicated system consisting of a ligand (endothelin-1) and its receptor (human endothelin type B receptor; one of the GPCR proteins), where the receptor was wrapped by cholesterols, embedded in an explicit membrane, and surrounded by an explicit solvent [8].

We encountered a difficulty in the original-mD-VcMD study: Once a poorly sampled region emerged in the multi-dimensional RC space in an iterative simulation, this region might cause conformational trapping in a subsequent iteration. Consequently, a more robust method is necessary to advance the iteration.

For this purpose, in the present study, we introduce two methods with extension of the original mD-VcMD: a subzone-based mD-VcMD method, which is an extension of the original mD-VcMD; and a GA-guided mD-VcMD method, which extends the subzone-based mD-VcMD to expand sampling to non-sampled RC regions using genetic algorithm (GA). As presented in this paper, we explain the GA-guided and original mD-VcMD methods. We apply the methods to a simple model with a potential-energy barrier between potential-energy minima. Because of the simplicity of this model, the difficulty of the sampling task is determined by the effective barrier height between the minima, which is modulated readily by changing the simulation temperature. At a high temperature, both the original and GA-guided mD-VcMD methods showed quick convergence of the numerical distribution to the analytical distribution of the system. By contrast, at a lower temperature, only the GA-guided method exhibited quick convergence.

Methods and Protocols

We introduce many terms and quantities in this paper and its related Supplementary Information (SI). Supplementary Table S1 presents positions for which the terms and quantities are defined or explained. Sections 1–4 of SI are preparatory sections for the original mD-VcMD method. Section 5 of SI explains the original mD-VcMD method itself [7,8]. The original method produces the statistical weight Q_cano(L) assigned to zones ζ_L (virtual zones) introduced into the RC space (precise definition of Q_cano(L) is given in SI). Integration of Q_cano(L) over all zones provides the probability distribution of the system in the entire RC space. The subzone-based and GA-guided mD-VcMD methods are extensions of the original method to compute Q_cano(L).

The subzone-based and GA-guided methods are designed to treat a multiple-RC space, just as the original mD-VcMD method. In fact, the original method was applied to a complicated system using a seven-dimensional RC space [8]. However, to simplify the explanation, we use a one-dimensional (1D), two-dimensional (2D), or three-dimensional (3D) RC space in this paper. One can increase the space dimensionality straightforwardly.

Subzone-based mD-VcMD Method to Compute Q_cano(L)

Introduction of subzone and subzone’s snapshot count (SS count)

Here, we present the subzone-based method to compute Q_cano(L). The sections below are based on sections 1–5 of SI. Figure 1A, which is substantially the same as Supplementary Figure S2C, is a portion of a 2D RC space, where four linked zones shown by differently colored frames overlap in an intersection (shaded region). The arrows indicate directions to shift the linked zones to remove the inter-zone overlaps. Figure 1B is an extended RC space resulted from the zone shifting, where the virtual states (indices assigned to nine virtual zones) are numbered as L₁, ..., L₉ with arbitrary order. Subscripts i for L_i are not those to specify linked virtual states (section 3 of SI). The linked virtual states are expressed with parentheses as L₍_i₎. The extended RC space is introduced for convenience to make explanation simple. The zone-overlap exists as in Figure 1A in the actual mD-VcMD simulation.

Figure 1

2D RC space is divided into zones. Panel (A) is equivalent to Supplementary Figure S2C. Black arrows indicate directions to shift zones to remove inter-zone overlap (shaded region). (B) Extended RC space resulted from the zone shifts. L_i (i=1, ..., 9) and S_j (j=1, ..., 4) are the respective indices for zones and subzones. Shaded and checked subzones are described in the text.

Here we introduce subzones, which overlap perfectly on intersections in the original RC space. The shaded region in Figure 1A corresponds to four shaded regions in Figure 1B. The subzones are these shaded regions in Figure 1B. Similarly, four checked subzones in Figure 1B overlap on another intersection (not presented in Fig. 1A). Once the subzones are introduced, any zone is constructed by subzones. For instance, in Figure 1B, each zone L_i consists of four subzones with indices expressed as S_j (j=1, ..., 4). In an m-dimensional RC space, a zone comprises 2^m subzones in general. Numbering of the subzones is arbitrary. Whereas L specifies the zone position in the whole RC space, S specifies the relative position of a subzone in each zone.

We express a subzone S in zone ζ_L as [S, L]. We denote the number of snapshots detected in the subzone S during the iteration M by c^[^M^](S, L), and normalize it by the number of snapshots detected in the zone ζ_L as

c[M]S,L∑S'∈Lc[M]S',L→c[M]S,L.

(1)

This normalization results in ∑S∈Lc[M]S,L=1. We refer to this quantity c^[^M^](S, L) as subzone’s snapshot count (SS count) at [S, L]. The SS count for a non-sampled subzone in iteration M is zero: c^[^M^](S, L)=0.

Here, we characterize zones as three types: This zone is called an empty zone (E zone) if all the subzones in zone ζ_L are not sampled (c^[^M^](S, L)=0, ∀S in L). This zone is designated as a completely sampled zone (CS zone) if all subzones in ζ_L are sampled (c^[^M^](S, L)>0, ∀S in L). Finally, if some subzone is empty and the others are sampled in ζ_L, then this zone is called an incompletely sampled zone (IS zone). Figure 2 presents those zones assuming that the first, second, and third iterations are performed sequentially using a 2D RC space.

Figure 2

Time development of sampled and non-sampled subzones in a zone along three (first, second and third) iterations. Four types of time development are presented as panels (A)–(D). “Average [1;3]” is c^[1;^M^](S, L) (Eq. 5) over the three iterations (M=3). Shaded and white subzones respectively represent sampled and non-sampled subzones. Checked subzones for “Zone correction” are corrected ones by Eq. 7 (procedure 2) for panel (C) and by Eq. 8 (procedure 3) for panel (D).

Next, we introduce two delta functions δ and D as

δ[M]S,L=1 (if cMS,L>0 in iteration M)0 if cMS,L=0

(2)

and

D[M]L=1 (if ζL is CS zone in iteration M)0 (otherwize).

(3)

Then, we average the SS count c^[^I^](S, L) (I=1, ..., M) at each [S, L] over the first to M-th iterations with the following procedures.

Procedure 1: If ζ_L is a CS zone in one or more iterations (Fig. 2A), then the average of SS count is defined as

c[1;M]S,L=∑I=1MD[I]L c[I]S,L∑I=1MD[I]L.

(4)

In Eq. 4, IS and E zones are eliminated from the averaged SS counts. Superscript [1;M] denotes that the SS counts are computed using an iteration window from the first to M-th iterations.

Procedure 2: If ζ_L is an IS or E zone in all iterations and a CS zone is not involved, then average is calculated as

c[1;M]S,L=∑I=1M c[I]S,L∑I=1Mδ[I]L,S.

(5)

The resultant SS counts do not vary when c^[^I^](S, L) is replaced by δ^[^I^](S, L)c^[^I^](S, L) in Eq. 5. A CS zone can result from combining IS zones, as presented in Figure 2B.

By contrast, as shown in Figure 2C, some non-sampled subzones (say [S', L]) might remain after averaging, which results in an IS zone for the zone and c^[1;^M^](S', L)=0. Then, we correct the zero SS counts as follows: First, non-zero SS counts are averaged in ζ_L:

c-(L) =∑S∈Lc[1;M]S,L∑S∈Lδ[1;M]S,L,

(6)

and replace the zero SS count by c-(L):

c[1;M]S',L=c-L (for nonsampled subzones)

(7)

Figure 2C presents an example for this case. Finally, all subzones in the IS zone have a non-zero value: c^[1;^M^](S, L)>0 (∀S in ζ_L).

Procedure 3: All of the SS counts are zero if ζ_L is an E zone in all the iterations. Then we correct c^[1;^M^](S, L) simply as

c[1;M]S,L=1 (if ζL is E zone in all iterations).

(8)

Figure 2D shows procedure 3.

Computation of Q_cano(L) from SS counts

In this section, we present a method to calculate Qcano[M]L by fitting SS counts c^[1;^M^](S, L) among neighboring subzones. The idea of subzone fitting is originated from our earlier study [4] (see Eq. 4.2 of Ref. [4]). The fitting procedure is shown schematically by the distribution transformation presented in the earlier report (Fig. 5B→Fig. 5C of Ref. [4]), where fractions (short colored curves) of a canonical distribution produce the full distribution (black-line curve). Although the fitting method presented in that earlier report is not the SS count-fitting but function-fitting, the logic is the same.

Assuming n_is intersections in the entire RC space, and assuming that these intersections are numbered in an arbitrary order, then we designate these subzones as a subzone pair if a couple of subzones mutually overlap in the RC space. Here, the number of subzone pairs originated from the j-th intersection (j=1, ..., n_is) is represented by n_pair(j): n_pair(j)=n_link(n_link–1)/2; also, n_link is a function of j (n_link=n_link(j)) because n_link depends on the position of the j-th intersection in the RC space. Figure 3 is a 2D RC space presented in the extension form. Four subzones represented by red circles originate from an intersection. The subzones generate six (=4×3/2) subzone pairs, which are indicated by red lines. Similarly, four subzones shown by blue circles originate from another intersection. Six blue lines specify subzone pairs. Similar arguments can be made for the mD RC space. One can assume a k-th subzone pair with respect to the j-th intersection (k=1, ..., n_pair(j)). We denote the indices of the pairing subzones by pk,j(1) and pk,j(2), where superscripts (1) and (2) show simply that the two subzones are pairing. Similarly, we designate the zone indices (virtual states) regarding the subzone pairs as qk,j(1) and qk,j(2). Then, the pairing subzones in the entire RC space are expressed as [Spk,j(1),Lqk,j(1)] and [Spk,j(2),Lqk,j(2)]. Any of the pairing subzones in the RC space can be specified with varying k and j.

Figure 3

2D RC space presented in extended form as Figure 1B. Four subzones in red circles overlap to a single intersection. Six red lines represent six subzone pairs. Similarly, four subzones in blue circles overlap to another intersection with six subzone pairs indicated by blue lines. Green lines represent subzone pairs that are related to a zone ζL7, as described in the text.

Next, we define a modulated SS count ρ by multiplying a factor σ_L to c^[1;^M^](S, L) as

ρ[M]S,L;σL=σLc[1;M]S,L,

(9)

where σ_L depends on L. Finally, we define an objective function F as

F(σ)=∑j=1nis∑k=1npair(j)fj,k-1,

(10)

where

fj,k=max⁡ρMSpk,j(1),Lqk,j(1);σLqk,j(1),ρMSpk,j(2),Lqk,j(2);σLqk,j(2)min⁡ρMSpk,j(1),Lqk,j(1);σLqk,j(1),ρMSpk,j(2),Lqk,j(2);σLqk,j(2).

(11)

The multiplication factors, σL1, σL2, ..., are packed in a vector form: σ=[σL1,σL2,⋯]. Equation 11 is invariant in exchange of superscripts (1) and (2). Double summations in Eq. 11 move over all subzone pairs in the entire RC space. The “–1” in Eq. 10 is introduced to set F to zero when ρM(Spk,j1,Lqk,j1;σLqk,j1)=ρM(Spk,j2,Lqk,j2;σLqk,j2). Then, we minimize F by modulating σ and refer to the resultant σ providing the minimized F as σmin=[σL1min,σL2min,…]. Finally, Qcano[M](L) is given as

Qcano[M]L=∑Si∈L ρ[M]Si,L;σLmin,

(12)

where the summation is taken over all subzones in zone ζ_L.

Minimization of F corresponds to fitting the function c^[1;^M^](S, L) among neighboring zones by modulating σ. This fitting is shown schematically in Figure 4, which is presented in a 1D RC space to make the fitting procedure simple. Figure 4A shows a plot of the SS counts c^[1;^M^](S, L) before fitting, where the SS counts of two pairing subzones (i.e., two open circles in a broken-line rectangle) are not mutually similar. Figure 4B presents the modulated SS counts after fitting, where the two modulated SS counts pairing (circles unified in the rectangle) have mutually equal values.

Figure 4

In both panels, the x-axis is presented by three measures: λ, L, and S. They are scalar because the RC space is 1D in this figure. (A) Open circles represent SS counts c^[1;^M^](L, S). Pairing subzones are connected by broken-line rectangles. Broken line represents the level of c^[1;^M^](L, S)=1. (B) Green-colored filled circles represent modulated SS counts ρ[M]Sk,Lj;σLmin. Red filled circles are QcanoML/2={ρMS1,Lk;σLkmin+ρMS2,Lk;σLkmin}/2. Pairing subzones, which are unified in broken-line rectangles, have the same modulated SS count.

In the 1D RC space, the objective function F converges analytically to zero by minimization. In a 2D or higher dimensional RC space, however, F might not converge to zero; also, σ^min is not determined analytically. Consequently, one might minimize F by a Monte Carlo simulation with modulation of σ. Using the converged σ^min, Qcano[M]L is calculated (Eq. 12). Also, JL(i) is set as

JL(i)[M]=Qcano[M]L(i)∑j=1nlink1Qcano[M]L(j) -1,

(13)

which is equivalent to Eq. S7 in SI. Then the (M+1)-th iteration is done using JL(i)[M].

It is noteworthy that Qcano[M]L is calculated using Qcano[M-1]L in the original method (Eq. S8 of SI). Also, a convergence check (Qentire[M]L≈const; see section 5 of SI) is necessary to judge the quality of the resultant Qcano[M]L. The subzone-based method requires neither Qcano[M-1] nor a convergence check. The fewer requirements stand as a benefit of the subzone-based method, and also of the GA method explained later.

Furthermore, we can assess the accuracy of the resultant Qentire[M]L locally in each zone: Given a zone ζ_L, we first list subzone pairs relating to the zone. The green lines in Figure 3 show subzone pairs related to a zone ζL7. We express the i-th subzone pair as [Spi(1),L] and [Spi(2),Lqi(2)]. In the i-th subzone pair, the superscript (1) is assigned to the subzone [Spi(1),L] involved in ζ_L. Also, (2) represents subzone [Spi(2),Lqi(2)], which is not involved in ζ_L in the subzone pair. Then, a local function, E_local(L), is defined as

ElocalL= 1Npair(L)∑i=1Npair(L)max⁡ρMSpi(1),L;σLmin,ρMSpi(2),Lqi(2);σLqi(2)minmin⁡ρMSpi(1),L;σLmin,ρMSpi(2),Lqi(2);σLqi(2)min -1,

(14)

where N_pair(L) represents the number of subzone pairs related to the zone ζ_L. The summation is taken over all subzone pairs related to the zone ζ_L. Equation 14 can be regarded as a part of Eq. 10.

If the subzone-based mD-VcMD simulation is sufficiently long, then the modulated SS counts fit well between the pairing subzones in theory, which results in E_local(L)→0 (∀L). In an actual simulation, the sampling accuracy might be uneven: Some RC regions might be sampled poorly. One need not sample this region further if the region is unimportant biophysically. However, if the region is important, then the region should be sampled selectively. A simple method for selective sampling is presented in subsection “Selectively distributed initial conformations (SDIC)”. We can judge that a zone ζ_L is sampled accurately when E_local(L)≤E_sh is satisfied, where E_sh is a threshold set by a user. For our actual sampling, we set E_sh=4 [9].

GA to Determine c^[1;^M^](S, L) for IS and E Zones

Procedures 1–3 in subsection “Introduction of subzone and subzone’s snapshot count (SS count)” proposed the method to fill SS counts c^[1;^M^](S, L) for all the CS, IS, and E zones when the M-th iteration is finished. Then, we may proceed to the next iteration (iteration M+1) using the obtained SS counts. However, these procedures do not provide a method to presume the SS counts for IS and E zones. Here, we introduce the GA-guided method to fill c^[1;^M^](S, L) in the IS and E zones.

GA database

For preparation, we generate a database (called GA database) used for the GA procedure as follows: First, we consider a compact block composed of CS zones:

{ζL: L= i+∆i,j+∆j,k+∆k,…; ∆i, ∆j, ∆k,… =-1, 0,+1]}.

(15)

This block is called a GA block. We refer to the zone at the center of the block (Δi=Δj=Δk=···=0) as a central GA zone. Figures 5A and 5B respectively portray the blocks in 2D and 3D RC spaces. In the 2D RC space, GA blocks are classified into three types: inside, side, and corner GA blocks, depending on the position of the central GA zone in the RC space. In the 3D RC space, blocks of four types are possible: inside, face, side, and corner GA blocks. If a block is not an inside one, then the variations of Δi, Δj, Δk, ... in Eq. 15 might range from 0 to +1 or from –1 to 0. In a higher dimensional RC space, we should classify the blocks into more than four types. When the M-th iteration has been done, we have SS counts from various iteration windows: c^[1;1](S, L), c^[1;2](S, L), ..., c^[1;^M^](S, L), as computed using Eqs. 4 and 5. Then, the GA database consists of GA blocks from those various iteration windows.

Figure 5

(A) GA blocks in 2D RC space. A GA block, which consists of CS zones, is classified into three depending on the position of the central GA zone (marked with a red circle) in the RC space: blocks for which the central GA zones are at the side and corner of the RC space are called side and corner GA blocks, respectively. A block is called an inside GA block if the central GA zone is inside the RC space. (B) In 3D RC space, blocks of four types are possible: inside, face, side, and corner. The central GA zone for the inside GA block is hidden. Although this figure is presented for blocks in GA database, the same block classification is possible for recovery blocks and blocks in GA generation Gen_k.

Figure 6A portrays an example of GA blocks in the 2D RC space. We designate the GA blocks as GA_j (j=1, ..., 6), where j is set in an arbitrary order. The GA database might involve GA blocks that are taken from the same area of the RC space but from different iteration windows. We regard those GA blocks as different ones in the GA database. Consequently, the number of GA blocks in the GA database increases with progress of the iteration.

Figure 6

(A) GA blocks and recovery blocks in 2D RC space. Six GA blocks GA₁, ..., GA₆ are shown by frames of different colors. Its central GA zone is marked by a circle with the same color. CS, IS, and E zones in GA blocks are shown by rectangles of different tones. Broken-line frames represent recovery blocks. (B) Four recovery blocks (REC₁, ..., REC₄) and their recovery zones (REC zone 1, ..., 4) are selected from panel (A). Although more recovery zones are possible in panel (A), they are not discussed herein. (C) Reset recovery blocks REC₁, ..., REC₄.

Each zone in GA_i is expressed by its relative position to the central GA zone. The relative position is expressed using a relative index Λ: Λ=[Λ⁽^α⁾, Λ⁽^β⁾, Λ⁽^γ⁾, ...] and Λ⁽^h⁾=–1, 0, 1 (h=α, β, γ, ...) if the block GA_i is an inside block. For a non-inside block, Λ_j ranges as Λ⁽^h⁾=–1, 0 or Λ⁽^h⁾=0, –1. It is noteworthy that a subzone S in a zone Λ is expressed by its relative position in the zone (Fig. 1B). Then, we specify the subzone by [S, Λ]. The SS count for the subzone [S, Λ] in GA_i is expressed as cGAiS,Λ.

Recovery zone

Next, we proceed to estimation of SS counts in IS and E zones without using procedures 2 and 3 in subsection “Introduction of subzone and subzone’s snapshot count (SS count)”. After the M-th iteration, we select an IS or E zone. One can assume that the selected zone is surrounded by N_surr zones. For instance, in the 2D RC space, N_surr=8, 5 or 3 for an inside, side, or corner zone, respectively (Fig. 5A), and for the 3D RC zone, N_surr=26, 17, 11, or 7 for an inside, face, side, or corner zone, respectively (Fig. 5B). If the selected IS/E zone is surrounded by more than N_surr/2 CS zones, then this selected zone is designated as a recovery zone (REC zone). Figure 6B presents examples of four REC zones, where REC zones 1, 2, and 3 are surrounded respectively by six, six, and five CS zones of eight zones (N_surr=8). REC zone 4 is surrounded by four CS zones of five zones (N_surr=5). A compact block consisting of the REC zone and zones surrounding the REC zone is called a recovery block (REC block) if a selected zone is judged as a REC zone. The i-th REC block is denoted by REC_i. Similarly to GA blocks, REC blocks are classified into some types: inside, side, and corner blocks for a 2D RC space, and inside, face, side, corner blocks for a 3D RC space.

In fact, more REC zones can be taken than those shown in Figure 6A. Note that the REC blocks can overlap in the RC space: REC₂ overlaps REC₁ and REC₃ in Figure 6A. We also note that the SS counts, c^[1;^M^](S, L), in a REC block are computed only from the iteration window [1;M] (Eqs. 4 and 5), in contrast to those in a GA block, which are computed from various iteration windows [1;1], ..., [1;M].

Resetting REC block

We use indices [S, Λ] to specify subzones in a REC block, as well as those in a GA block, and refer to an SS count, c^[1;^M^](S, Λ), in REC_i as cRECiS,Λ. One should recall that the REC zone in all REC blocks are IS or E zone (Fig. 6B). Now we reset the SS counts cRECiS,Λ of REC_i as cRECiS,Λ→1 (∀S; Λ=[0, 0, 0]). For that reason, the REC zones become E zones (Fig. 6C). The SS counts in this reset REC zone are estimated using the GA procedure described below.

Similarity of spatial patterns of SS counts between of REC_i and GA_j

To assess similarity of the SS counts between blocks REC_i and GA_j, we introduce a correlation coefficient c_cor(REC_i, GA_j):

ccorRECi,GAj= <RECi×GAj>-<RECi><GAj>SDRECi×SDGAj,

(16)

where

<RECi>=1NSΛ∑Λ∈RECi∑S∈ΛcRECiS,Λ,

(17)

<GAj>=1NSΛ∑Λ∈GAj∑S∈ΛδRECi(S,Λ)cGAj(S,Λ),

(18)

SDRECi=1NSΛ∑Λ∈RECi∑S∈ΛcRECiΛ,S2-<RECi>21/2,

(19)

SDGAj=1NSΛ∑Λ∈GAj∑S∈ΛδRECi(S,Λ)cGAj(S,Λ)-<GAi>21/2,

(20)

<RECi×GAj>=1NSΛ∑Λ∈RECi∑S∈ΛcRECiΛ,ScGAjΛ,S,

(21)

NSΛ=∑Λ∈RECi∑S∈ΛδRECi(S,Λ),

(22)

δRECiS,Λ=1 (If cRECiΛ,S≠0)0 (else).

(23)

The double summations ΣΛ∈RECiΣS∈Λ and ΣΛ∈GAjΣS∈Λ are taken respectively over all subzones in REC_i and GA_j. In Eq. 21, the summation ΣΛ∈RECi operates not only to cRECiΛ,S but also to cGAjΛ,S: The same relative positions of Λ in GA_j and REC_i are summed in a coordinated manner.

We convert c_cor(REC_i, GA_j) to a score function E_score(REC_i, GA_j) as

EsimiRECi,GAj= 1ccorRECi,GAj-1 (if ccorRECi,GAj>csh)1csh-1 (if ccorRECi,GAj≤csh).

(24)

The delta function δRECiS,Λ is used in c_cor(REC_i, GA_j). Therefore, a non-zero cGAjΛ,S in GA_j is eliminated from the computation of E_simi if cRECiS,Λ=0. Furthermore, the REC zone of REC_i is always eliminated from E_simi because δRECiS,Λ=0 for Λ=[0, 0, 0]. Parameter c_sh is introduced to avoid a negative value of E_simi by a negative correlation of c_cor(REC_i, GA_j). Consequently, c_sh can be set to a small positive value (c_sh=0.01 for instance).

One may replace the SS counts of the REC zone of REC_i by those of the central GA zone of GA_j if E_simi is smaller than a certain value. However, we assess the quality of GA_j in the next section to judge if the replacement is appropriate.

Overlap of SS-count patterns among zones in a GA block

This subsection explains a method to assess the quality of a GA block GA_j by introducing E_phys, which quantifies the spatial overlap of SS counts between overlapping subzones within the block. In fact, E_phys is a quantity expressed only by the SS counts in GA_j independent of REC_i.

We consider two subzones [S, Λ] and [S', Λ'] in GA_j. First, we introduce a δ function δ_ov(L, S; L', S'), which is used as a flag to detect overlapping subzones in GA_j:

δovS,Λ;S',Λ'=1 (S,Λ and S',Λ' are overlapping)0 (else).

(25)

For instance, δ_ov(L, S; L', S')=1 for any subzone pair taken from the shaded ones in Figure 1B.

Similarly to Eq. 9, the SS counts cGAjS,Λ can be scaled by a factor σ_Λ, and generate modulated SS counts as ρGAjS,Λ;σL=σΛcGAjS,Λ. Then, we define a function E_phys as

Ephys(GAj)=∑Λ∈GAjΛ≠0∑Λ'∈GAjΛ'≠0,Λ'≠Λ δovS,Λ;S',Λ'P(σL,σL'), 　

(26)

where

PσL,σL'=max⁡ρGAjS,Λ;σL,ρGAjS,Λ;σL'min⁡ρGAjS,Λ;σL,ρGAjS,Λ;σL'-1.

(27)

The summation ΣΛ'∈GAjΛ'≠0,Λ'≠Λ in Eq. 26 is taking pairs of different zones (Λ'≠Λ). The summations are taken with eliminating zone pairs related to the central GA zone (Λ≠0 and Λ'≠0). The number of zones in a GA block depends on the place where the block is taken from the RC space (Fig. 5). Therefore, the summation of Eq. 26 is taken over existing zones in the GA block. For instance, a surface GA block involves 18 zones in Figure 5B.

Although F(σ) (Eq. 10) was defined for the whole RC space, E_phys is defined for a single GA block. Finally, we minimize E_phys with modulating {σ_L} using a Monte Carlo scheme. The smaller the minimized E_phys(GA_j) becomes, the better the quality of the block GA_j is. The E_phys used in the next section is this minimized one.

First generation of GA

To start the GA procedure, the first generation, denoted as Gen₁, is set by taking GA blocks from the GA database. Now, we specifically examine block REC_i, which means that the SS counts in REC_i are determined by the GA operation. Then, we calculate E_simi(REC_i, GA_j) and E_phys(GA_j) for all the GA blocks in the GA database, and define the score function as

Escore(RECi,GAj)= ωsEsimi(RECi,GAj)+ωpEphys(GAj), 　

(28)

where ω_s and ω_p are weights set by the user, such as 1.0 and 2.0, respectively, as described in our earlier report [9]. The smaller the E_score is, the more similar the SS-count patterns of GA_j to REC_i and the better the overlap of SS counts among zones in GA_j.

Next, we sort the resultant scores {EscoreRECi,GAj, ∀j} in ascending order and collect NGAall GA blocks from the first score. Those NGAall GA blocks construct the first generation, denoted as Gen₁.

The first generation is prepared for each of the REC blocks. The size of a REC block depends on its place in the RC space, as described previously. We express the size of REC_i and GA_j as nαRECi×nβRECi×nγRECi×⋯ and nαGAj×nβGAj×nγGAj×⋯, respectively. Then, the GA blocks used in E_score can be expected to satisfy the following relations: nhRECi≤nhGAj (h=α, β, γ, ...).

GA generation update

Presuming that we have the k-th GA generation, denoted as Gen_k, with respect to a recovery block REC_i, then the generation update from Gen_k to Gen_k₊₁ is presented in Figure 7. Remember that the first generation of GA consists of NGAall blocks. This number NGAall is kept in all generations of GA. Next, one can select the NGAtop members from the first to NGAtop-th scored members from Gen_k (NGAall>NGAtop) and pass them to the next generation Gen_k₊₁, as shown in the red-framed rectangle in Figure 7. Consequently, NGAtop members in Gen_k₊₁ are taken from Gen_k without modulation.

Figure 7

GA process to update the k-th GA generation Gen_k to the (k+1)-th one Gen_k₊₁. Details for the process are explained in the text.

Next, we randomly select NGAmut members from Gen_k, which are denoted as MEMk(1), MEMk(2), ..., MEMk(NGAmut), and the same number of GA blocks from the GA database, denoted as GA⁽¹⁾, GA⁽²⁾, ..., GA(NGAmut). Then, we replace a part of MEMk(l) by a part of GA⁽^l⁾ (l=1,…,NGAmut). This procedure corresponds to mutation in GA. In Figure 7, the cyan-colored zones of a member in Gen_k are replaced by the cyan-colored zones in a GA block. Similarly, the yellow ones of a block in Gen_k are replaced by the yellow ones in a GA block. The mutated members are put in Gen_k₊₁.

Up to this point, NGAtop+NGAmut generation members are set for Gen_k₊₁. Here, we randomly select NGAexc(=NGAall-NGAtop-NGAmut) from Gen_k and generate NGAexc/2 pairs of the selected members: NGAtop and NGAmut should be set so that NGAexc is an even number. We denote the m-th pair by MEMk(m)(1)−MEMk(m)(2) (m=1,…,NGAexc/2), where superscripts (1) and (2) show that the two members are pairing. Then, a part of MEMk(m)(1) and a part of MEMk(m)(2) are exchanged. In Figure 7, two magenta parts are mutually exchanged. Two green parts are done too. The members generated by the exchange are put in Gen_k₊₁. This procedure corresponds to crossover or exchange in GA. Up to this point, NGAall members are prepared for Gen_k₊₁. Details for mutation and exchange are presented later.

To generate Gen_k₊₂, we repeat similar procedures to Gen_k₊₁ as done for Gen_k: (1) Calculate the score function between each REC block (say REC_i) and all NGAall members in Gen_k₊₁ (not all blocks in the GA database):

EscoreRECi,MEMj,k+1= wsEsimi(RECi,MEMj,k+1)+wpEphys(MEMj,k+1),

(29)

where MEM_j,k₊₁ stands for the j-th member (j=1,…,NGAall) in Gen_k₊₁. (2) The first to NGAtop scored members of Gen_k₊₁ are passed to the next generation Gen_k₊₂. (3) Randomly selecting NGAmut members fromGen_k₊₁ and NGAmut GA blocks from the GA database, mutations are executed. The generated NGAmut members are passed to Gen_k₊₂. (4) Selecting NGAexc members randomly from Gen_k₊₁, generate NGAexc/2 pairs of members. Then exchange is done in each pair. The resultant NGAexc members are passed to Gen_k₊₂. Up to this point, NGAall members are prepared for Gen_k₊₂.

We repeat this cycle N_cycle times, where N_cycle is set by the user. Finally, the SS counts of the REC zone in REC_i (i.e., the central zone of REC_i) are replaced by those of the central zone of the best-scored member of GenNcycle. The user might exit the cycle before reaching the N_cycle cycle if E_simi and E_phys of a member become smaller than the respective thresholds set by the user.

Multiple GA procedure

Figure 8A is a part of 2D RC space, where the magenta-colored circles are REC zones. The SS counts for these REC zones are estimated by the GA (Fig. 8B). Then, one might calculate Qcano[M]L and JL(i)[M] from these SS counts and then proceed to the (M+1)-th iteration. In Figure 8B, however, two zones indicated by cyan-colored circles emerge as REC zones anew. Therefore, we can advance the second GA procedure to estimate the cyan-colored zones. Figure 8C shows that the cyan-colored zones are determined. However, another REC zone (light-yellow zone) appears in Figure 8C. Therefore, we can multiply the GA procedure. When a new REC zone does not appear anymore, we quit the procedure. Then we proceed to the (M+1)-th iteration. Supplementary Figure S3 depicts the procedures used for the GA-guided mD-VcMD method as a flow chart.

Figure 8

Part of 2D RC space for which the axes are L⁽ⁱ⁾ (or λ⁽ⁱ⁾) and L⁽^j⁾ (or λ⁽^j⁾). Three zone types are represented by different tones. Colored circles are REC zones described in the text.

Mutation and exchange

We explain details for mutation and exchange. In mutation, a pair of MEMk(i) (i=1,…,NGAmut) and a GA block GA⁽ⁱ⁾ is treated. In exchange, a pair of MEMk(m)(1) (m=1,…,NGAexc/2) and MEMk(m)(2) is treated. In either operation, we select a slice from MEMk(i), GA⁽ⁱ⁾, MEMk(m)(1), and MEMk(m)(2). Imagine slices in a block as

Λ(h)=-1 +1 (h=α,β,γ,…).

(30)

Figure 9 presents an image of slices in the 3D RC space to help understanding of the slices. Each slice consists of nine zones.

Figure 9

Image of slices in the 3D RC space.

In mutation, a slice, say Λ⁽^x⁾, is selected randomly from slices of MEMk(i). Then, a slice parallel to Λ⁽^x⁾ is selected randomly from slice Λ⁽^x⁾=–1 or Λ⁽^x⁾=–1 of GA⁽ⁱ⁾. Then, the selected slice of MEMk(i) is replaced by that of GA⁽ⁱ⁾, which generates a new member for Gen_k₊₁.

In exchange, after selecting a slice randomly from slices of MEMk(m)(1), a slice is selected randomly from slices of MEMk(m)(2) in the same manner as that used for mutation. Then, the selected slices are mutually exchanged. This procedure generates two members for Gen_k₊₁.

No middle slice (Λ⁽^x⁾=0) is used for the mutation and exchange. The GA procedure is used to estimate the SS counts of a REC zone. The REC zone is involved in the middle slices of a REC block. To maintain logical consistency between the REC zone and the member in Gen_k, we eliminated the middle slice.

A technical point must be noted: In mutation and exchange, the size of the operated slice of a REC block should be included in or should be the same as that of a block in Gen_k. The size of the operated slice of the block in Gen_k should be included in or should be the same as that of a GA block.

Initial Conformations of Simulation

Evenly distributed initial conformations (EDIC)

The initial conformation of the first iteration can be any conformation for all the original, subzone-based and GA-guided mD-vcMD methods. However, to discover pathways from the unstable state to the most stable conformation (i.e., the native tertiary structure), the initial conformation should be far from the most stable conformation. Then, it is readily apparent that such sampling is free from knowledge of the native tertiary structure.

We have performed multiple runs for the mD-VcMD method to obtain the thermodynamically plausible distribution (the canonical distribution) of a biological system [10,11]. One can assume that iteration M, which consists of N_run runs, is finished. The initial conformation of the i-th run of iteration M+1 can be the last conformation of the i-th run of iteration M. The first iteration is an exception, where the initial conformations are a single conformation as described above.

A more appropriate method to prepare the initial conformations is the following: Imagine an ensemble of the last conformations of the first to M-th iterations. The ensemble comprises of N_run×M conformations. Supplementary Figure S4A schematically presents those conformations in the RC space, which is presented two-dimensionally for convenience. Then, we divide the RC space into small patches of equal size and count patches that involve one or more conformations. The patches are called “occupied patches.” A patch is not the virtual zone (ζ_L). Eighteen occupied patches are presented by shaded rectangles in Supplementary Figure S4. Some occupied patches may involve a single conformation. The others involve more conformations. Next, we select N_run patches randomly from the occupied ones, and select one conformation randomly from each of the selected patches. Red-filled circles are shown in the figure assuming that N_run=10. Those conformations are used for the initial conformations for iteration M+1. We divide the RC space into smaller patches if the number of occupied patches is smaller than N_run. However, if the number of occupied patches is considerably larger than N_run, then we increase the patch size. We designate this method as the “evenly distributed-initial-conformation (EDIC) method.”

Selectively distributed initial conformations (SDIC)

Another method can be used to set the initial conformations. It is simple and useful when a poorly sampled region emerged in the RC space. Assuming again that the M-th iteration has finished and assuming that an RC region was poorly sampled in all iterations, then the accuracy of Qcano[M]L for the poorly sampled RC region can be increased using a simple procedure: We then select conformations which are in or near the poorly sampled RC region, from trajectories done previously (iterations 1 to M). Then, the (M+1)-th iteration is done starting from those conformations.

This selective sampling procedure does not work if Qcano[M]L is calculated using the original method (Eq. S8 in SI) because this procedure increases QentireM(L) locally. This local inflation of QentireM(L) breaks the balance of QentireM, resulting in errors of QcanoM over a wide RC region.

By contrast, in the subzone-based and GA-guided methods, the artificial increase of QentireM(L) vanishes by fitting ρ[M]S,L;σLmin among overlapping zones. This lack of an artificial increase is an important advantage of the subzone-based and GA methods against the original method. We designate this method as “selectively distributed-initial-conformation (SDIC).”

We introduced a selective sampling procedure in an earlier report [8]. However, the selective sampling introduced herein differs from that described in that earlier report.

Thermodynamic Weight Assigned to Sampled Conformations

It is fundamentally important to assign a thermodynamic weight to each snapshot for analyzing the resultant conformational ensemble. Given a zone ζ_L, the thermodynamic weight assigned to ζ_L is Q_cano(L). If the simulation is at the M-th iteration, then QcanoL=Qcano[M]L. In the original mD-VcMD method, the weight assigned to the snapshot detected in ζ_L is proportional to Q_cano(L) if Q_cano(L)≈const is satisfied in the simulation because the number of snapshots detected in zones is approximately equivalent.

In the subzone-based and GA methods, however, sampling is flexible. For instance, according to the selective sampling proposed above, the number of snapshots in zones might be considerably uneven, which means that the thermodynamic weight assigned to a snapshot is not equivalent to Q_cano(L). We present a simple method for assigning weight: Consider a snapshot, as detected in ζ_L, and presume that this snapshot is the i-th one in the conformational ensemble. Then, the thermodynamic weight, w_i, assigned to the i-th snapshot is

wi=QcanoL/nsnap[1;M](L),

(31)

where nsnap1;M(L) signifies the number of snapshots detected in ζ_L in all iterations. Although w_i is dependent on L in Eq. 31, once the weight is assigned, we can forget L. If a relation nsnap1;ML≈const is satisfied, then w_i≈Q_cano(L) substantially.

Application to a Simple System

We earlier applied the original mD-VcMD method and a conventional Monte Carlo method to a model system [12]. Although the analytical conformational distribution of the system was not obtained because of the complexity of its potential energy function, both methods mutually converged. Importantly, the convergence from the original mD-VcMD was 200 times quicker than that from the conventional Monte Carlo method.

As described below, we check the efficiency of the GA-guided and original mD-VcMD methods using another model system for which the conformational distribution is obtainable analytically.

Model system

In the model, a single particle with coordinates expressed by [x, y, z], moves in a 3D space at a temperature on a potential-energy surface, E_model, defined as

Emodel(r)=D r2(r2-a)(r2-b),

(32)

where r is the distance of the particle from the origin [0, 0, 0]:

r2=x2+y2+z2.

(33)

Actually, E_model has a spherical symmetry around the origin in the 3D space. Parameters D, a, and b in Eq. 32 are set respectively to 0.045, 10, and 20. We set the temperature T to 1.5 or 3.0 for the simulations. Parameters D, a, b and temperature T are introduced as non-dimensional quantities in this application. Then, the Boltzmann factor is defined simply as exp[–E_model/T].

Originally, the RCs λ⁽^α⁾, λ⁽^β⁾, and λ⁽^γ⁾ were defined by the inter-atom-group distances in a molecular system (Section 1 of SI), and the motions of RCs were calculated from the motions of the molecular conformation. By contrast, in this application, λ⁽^α⁾, λ⁽^β⁾, and λ⁽^γ⁾ respectively correspond directly to x, y, and z without assuming the molecular system. The function E_model(r) is simple as shown in Supplementary Figure S5A of SI. This figure shows two energy basins at r=0.0 and r≈4.0, which are mutually separated by an energy barrier at r≈2.0. Supplementary Figure S6 demonstrates the potential energy in the 2D space (r²=x²+y²; z=0). We designate the energy basins at r=0.0 and r≈4.0 respectively as the “central basin” and “periphery basin.” The periphery basin is deeper than the central basin. The energy barrier at r≈2.0 is called “major barrier.”

We restrict the sampling region in a cubic box by setting the lower and upper limits to the coordinates as –3.0≤x≤3.0, –3.0≤y≤3.0, and –3.0≤z≤3.0. In the outside of the cubic box, E_model increases rapidly. By this restriction, this model system has nine energy basins in a strict sense. One is the central basin at the origin (x=y=z=0). The periphery basin at r≈2.0 in Supplementary Figure S5A is split into eight sub-basins near the edges of the cubic box. The energy barriers among the eight sub-basins are considerably lower than the major barrier. They are overcome easily during simulation at the examined temperatures. Therefore, we do not specifically examine differences among the eight sub-basins but specifically examine traversal between the central and periphery basins in this study.

The virtual zones are set as listed in Supplementary Table S2. We divided each of x-, y-, and z-axes equally into 11 virtual zones (ζ1(h), ..., ζ11(h); h=x, y, z). The width of each zone is 1.0, and the lower and upper limits for the axes are, respectively, –3.0 and 3.0.

The difficulty of sampling can be modulated by varying the system temperature. Supplementary Figure S5B shows the Boltzmann factor, exp[–E_model(r)/T], assigned to a microscopic state located at r. We note that the factor is not 4πr²exp[–E_model(r)/T]. Measuring from the most probable position (r≈2.0), the probability at the major barrier is about 10^–5 at T=3.0 and 10^–10 at T=1.5. This difference of the probability strongly influences the sampling efficiency of the original mD-VcMD method, as demonstrated later. We used a computer program myPresto/omegagene for the simulations [13].

Simulations

We moved the particle by a Monte Carlo (MC) simulation for both the original and GA-guided mD-VcMD simulations. The current simulation should be called “VcMC” because MC is done instead of molecular dynamics (MD). However, “VcMD” is used for convenience in this study. Step size Δr=[Δx²+Δy²+Δz²]^1/2 of the MC simulation was 0.02: Δr distributed evenly from 0.0 to 0.02 isotropically in the 3D space. The initial position of the particle for the first iteration was set at the origin [0, 0, 0]. We performed 30 runs in each iteration (N_run=30). The number of simulation steps for a run was 1×10⁷. The total simulation length for an iteration was 3×10⁸ steps. We used the EDIC method to prepare the initial conformations for iteration M (M≠1).

In the GA-guided method, when an iteration finishes, CS, IS, and E zones are detected. Then, REC zones are set and the SS counts in the REC zones are calculated to proceed into the next iteration. Generally, with the progress of the iterative simulation, the CS zones increases. Therefore, the IS or E zones decreases, which means that the REC zones decrease. Finally, when the IS or E zones vanished (i.e., the REC zones vanished), we switched the GA-guided method to the subzone-based method.

For the GA-guided VmMD method at T=1.5 and 3.0, respectively, 10 and 5 iterations were performed, whereas 13 and 5 iterations were done, respectively, for the original mD-VcMD at T=1.5 and 3.0. The interval for attempting inter-virtual state transition (Δτ; see section 4 of SI) is set to 10 steps of MC simulation for both the GA-guided and original VmMD methods.

To assess the sampling, we calculated correlation coefficients between the resultant Qcano[M](L) and the analytical Q_cano(L) denoted as QcanoanalL:

Ccornornal(M)=<f[M]×fanal>-<f[M]><fanal>SDf[M]×SDfanal,

(34)

where

fM(L)=QcanoM(L),

(35)

and

fanal(L)=Qcanoanal(L).

(36)

The terms in Eq. 34 are given as

<f[M]>=1Nsite∑Lf[M]L,

(37)

<fanal>=1Nsite∑Lfanal(L),

(38)

<f[M]×fanal> = 1Nsite∑Lf[M]L×fanalL,

(39)

SDf[M]=1Nsite∑Lf[M]L2-<f[M]L>21/2,

(40)

and

SDfanal=1Nsite∑LfanalL2-<fanal>21/2.

(41)

Therein, N_site represents the number of virtual zones in the system (N_site=11×11×11). Supplementary Figure S5B shows that the Boltzmann factor in the periphery basin (r≈4) is considerably larger than that of the other regions at both temperatures. Therefore, Ccornornal is determined mostly by QcanoM in the periphery basin, ignoring the other regions.

To assess the overall shape of Q_cano(L) in the correlation calculation, we defined another coefficient with setting g^[^M^] and g^anal as

gM(L)=log10QcanoM(L),

(42)

and

ganal(L)=log10Qcanoanal(L).

(43)

Then, the new correlation coefficient is defined as

Ccorlog10(M)=<gM×ganal>-<gM><ganal>SDgM×SDganal.

(44)

The terms involved in Eq. 44 are defined by the same equations as Eqs. 37–41 with replacement of f^[^M^]→g^[^M^] and f^anal→g^anal.

Figures 10A and 10B demonstrate the correlation functions as a function of iteration No. M calculated respectively from the GA-guided mD-VcMD at T=1.5 and 3.0. Figures 10C and 10D respectively show those from the original mD-VcMD at T=1.5 and 3.0. Figures 10B and 10D indicate that both the GA-guided and original methods provided a quick convergence of QcanoM to Qcanoanal at T=3.0, which means that both methods are useful when the free-energy barrier is low.

Figure 10

Correlation coefficients, Ccornornal[M] and Ccorlog10[M] as a function of iteration ordinal number M between analytical Qcanoanal and the simulated Qcano[M]. Red and blue lines respectively represent Ccorlog10 and Ccornornal for all panels. (A) and (B) Correlation coefficients between Qcano[M] from the GA-guided method and Qcanoanal at T=1.5 and 3.0, respectively. (C) and (D) Those between Qcano[M] from the original method and Qcanoanal at T=1.5 and 3.0, respectively.

At the lower temperature (T=1.5), the free-energy barrier became considerably higher than that at T=3.0. For the GA-guided method, QcanoM converged again quickly to Qcanoanal: Even in iteration 3, Ccornormal(3)=0.94 and Ccorlog103=0.94. For the original method, it is interesting that QcanoM converged to Qcanoanal once: Ccornormal(3)=0.99 and Ccorlog10(3)=0.99. However, Ccornormal and Ccorlog10 started decreasing for M≥4. Consequently, if we can quit the sampling at iteration 3, then the original method is also a good method, even at T=1.5. However, determination of the optimal number of iterations is not straightforward in general because the analytical Qcanoanal(L) is unknown a priori for a molecular system.

Figure 11 displays the spatial patterns of QcanoM calculated from the simulations as well as the analytical Qcanoanal at T=1.5. The analytical one has spherical symmetry around the origin (Fig. 11A). Figure 11B shows that Qcano10 from iteration 10 of the GA-guided method is similar to the analytical one: Ccornormal(10)=0.96 and Ccorlog10(10)=1.00. Figure 11C is Qcano3 from iteration 3 of the original method, which is also similar to the analytical one, whereas the distribution shape around the central basin is somewhat deviated from the spherical shape. By contrast, Figure 11D shows Qcano30 from iteration 30 of the original method, which is largely deviated from the analytical one. At T=3.0, the special patterns of QcanoM were similar to Qcanoanal at iteration 3. The patterns were not broken in further iterative simulations (data not shown).

Figure 11

Spatial patterns of Q_cano(L) (i.e., dependence of Q_cano on L) at T=1.5. Although three sides of the box are parallel to L⁽^x⁾-, L⁽^y⁾-, and L⁽^y⁾-axes, because potential energy E_model has spherical symmetry, we do not specify the axes. (A) Patterns for analytical Qcanoanal, (B) those from the 10-th iteration of the GA-guided method, (C) those from the third iteration of the original method, and (D) those from the 30-th iteration of the original method. Red and blue contours respectively correspond to log₁₀(Q_cano(L))=–3 and –9, where Q_cano is normalized as Σ_LQ_cano(L)=1.

SDIC method to assess the GA-guided mD-VcMD method

Lastly in this application section, we assessed the SDIC method by assuming a situation that the central region (i.e., the central basin) was insufficiently sampled during preceding iterations. Note that the SDIC method is not applicable to the original mD-VcMD method, in theory.

In the subsection presented above, we obtained Qcano10 from the 10-th iteration of the GA-guided method. To test the SDIC method efficiency, we generated Qcanomodi(L) by modifying Qcano10(L) as follows: Qcano10(L) for zones in the central region (L⁽^h⁾=4, ..., 8; h=x, y, z) was replaced by a small positive value ε. This procedure assumes that the central region was not sampled or poorly sampled in iterations 1, ..., M. The white region in Supplementary Figure S7 of SI is the central region. Actually, ε should be neither zero nor negative because such a value of ε results in an infinite or negative IVT probability JL(i)[M] (Eq. 13). Here, we set ε by the smallest value of Qcano10(L) at the top of the major barrier. The Qcano10(L) for the shaded region of Supplementary Figure S7 was taken from the original Qcano10(L). Then QcanomodiL is defined as

QcanomodiL=ε for zones in the central regionQcano10L for outside of the central region.

(45)

Subsequently, we performed the 11-th iteration of the GA-guided method at T=1.5 using Qcanomodi. We denote Q_cano resultant from the 11-th iteration as Qcano11. The number of runs and the simulation length of a run are the same as those for the iterations done earlier (N_run=30 and 1×10⁷, respectively). All of the 30 runs started from a single conformation at the origin (red filled circle in Supplementary Fig. S7), which is involved in the central region. In other words, the initial conformations were selected from the poorly sampled region. This is a simple example of SDIC.

Results show high correlation coefficients between Qcano[11]L and QcanoanalL: Ccornormal=0.97 and Ccorlog10=1.0. However, it is noteworthy that the coefficients between QcanomodiL and QcanoanalL were also high: Ccornormal=0.96 and Ccorlog10=0.97 because the central region is small compared with the whole region: 53113=0.094. In other words, the correlation coefficients obtained as presented above was originated from outside of the central region (shaded region in Supplementary Fig. S7).

Therefore, we re-calculated the correlation coefficients locally taking zones with L⁽^h⁾=3, ..., 9 (∀h), as shown two-dimensionally by the region surrounded by the cyan-colored line in Supplementary Figure S7. The volume ratio is: 5373=0.36. The local correlation coefficients are denoted as Bcornormal and Bcorlog10. By this calculation, we can check not only the accuracy of Qcano[11]L in the central region but also the connection of Qcano[11]L between the central and outside regions. The resultant local correlations between QcanomodiL and QcanoanalL were Bcornormal=0.74 and Bcorlog10=0.11. In contrast, those between Qcano11L and QcanoanalL were Bcornormal=0.98 and Bcorlog10=0.99. This increment of Bcornormal and Bcorlog10 manifests that the central region (the poorly sampled region) was sampled appropriately using the SDIC method without breaking the entire shape of Q_cano(L). This means that Q_cano(L) was connected correctly between the central and outside regions.

Closing Remarks

Benefits of the subzone-based and GA-guided methods are the following: (1) All snapshots from all iterations are useful for analysis: no production run is necessary, or all simulations are production run. By contrast, in the original mD-VcMD method or other enhanced sampling methods, the production run is mandatory usually, and only the production run is used for analysis. (2) To proceed with sampling, no fine tuning of a weight function is used, which is also a benefit in comparison to multicanonical or other adaptive umbrella sampling methods. (3) The statistical weight w_i (Eq. 31) is found after sampling. The method used to compute w_i is simple. (4) Statistical significance of a poorly sampled region can be increased by selective sampling. In the original mD-VcMD method, the selective sampling accumulates local errors in Q_cano(L), which results in slow convergence.

We applied the GA-guided and original mD-VcMD methods to the model system. The model system simplicity enables control of the sampling problem difficulty by tuning the height of the free-energy barrier (i.e., by varying the simulation temperature). At the higher temperature, both the original and GA-guided methods worked well, showing quick convergence of Q_cano(L) to the analytical one. However, at the lower temperature, only the GA-guided method showed quick convergence. Furthermore, we demonstrated that the SDIC method worked well to provide Q_cano(L) in the poorly sampled region without breaking the global shape of Q_cano(L) in the RC space.

GA-guided mD-VcMD is applicable to any complicated biomolecular system for which the potential energy is defined for an MD simulation. We generated computer programs for the GA-guided mD-VcMD method, and applied them to molecular binding of a flexible medium-sized ligand (a peptide of about ten amino-acid residues long) to a receptor protein in an explicit solvent [9]. The resultant conformational ensemble produced a binding free-energy landscape in which various structural clusters (the unbound state, fragile encounter complexes, and the most stable native-like complex) are distributed. This result demonstrated that the GA-guided mD-VcMD simulation sampled the wide conformational space.

Conflict of Interest

All authors declare that they have no conflict of interest.

Author Contributions

All authors contributed equally to this study.

A preliminary version of this work was deposited in the arxiv (2006.06950v2) on August 13, 2020.

Acknowledgments

We deeply appreciate Prof. Haruki Nakamura from Osaka Univ. for insightful comments. J. H. was supported by JSPS KAKENHI Grant No. 16K05517 and by the Development of Core Technologies for Innovative Drug Development Based Upon IT from Japan Agency for Medical Research and Development (AMED). N. K. was supported by a JSPS KAKENHI Grant (No. JP20H03229). K. K. was supported by JSPS KAKENHI Grant No. 16K18526. This work was supported as an HPCI System Research Project (Project IDs: hp190017, hp190018, hp190027, hp200063, hp200090, and hp200025). Simulations were performed on the TSUBAME3.0 supercomputers at the Tokyo Institute of Technology, in part under the Cooperative Research Program of the Institute for Protein Research, Osaka University, CR-19-05 and CR-20-05.

References

Corresponding author

Correction information

Register with J-STAGE for free!