Biophysics and Physicobiology
Online ISSN : 2189-4779
ISSN-L : 2189-4779
Regular Article
Exploring hydrophilic sequence space to search for uncharted foldable proteins by AlphaFold2
Naoki Tomita, Hiroki Onoda, Leonard M. G. Chavas, George Chikenji

2025 Volume 22 Issue 1 Article ID: e220005

Abstract

Proteins typically fold into unique three-dimensional structures, largely driven by interactions between hydrophobic amino acids. This understanding has helped improve our knowledge of protein folding. However, recent research has shown an exception to this idea, demonstrating that specific threonine-rich peptides have a strong tendency to form β-hairpin structures even though their amino acid sequences are highly hydrophilic. This finding suggests that the hydrophilic amino acid sequence space still leaves room for exploring foldable amino acid sequences. In this study, we conducted a systematic exploration of the repetitive amino acid sequence space using AlphaFold2 (AF2), focusing on sequences composed exclusively of hydrophilic residues, to investigate their potential for adopting unique structures. The exploration suggested that several repetitive threonine-rich sequences adopt distinctive conformations and that these conformational shapes can be influenced by the length of the sequence unit. Moreover, analysis of a structural dataset suggested that threonine contributes to structural stabilization by forming non-polar atom packing that tolerates unsatisfied hydrogen bonds, while also supporting other residues in forming hydrogen bonds. Our findings will broaden the horizons for the discovery of foldable amino acid sequences consisting solely of hydrophilic residues and help clarify unknown mechanisms of protein structural stabilization.

Significance

The importance of hydrophobic amino acids in protein folding is well-supported by various studies and has significantly contributed to our understanding of protein folding. However, this also significantly narrows our perspective when identifying foldable sequences. This research highlights the potential for entirely hydrophilic sequences to adopt stable conformations, opening new avenues for discovering protein sequences with previously unrecognized structural and functional properties.

 Introduction

Proteins are polymers of amino acids that adopt unique tertiary structures. The folding process is primarily driven by hydrophobic interactions among hydrophobic residues, and various studies have provided substantial evidence to support this view. One example is calorimetric research indicating that the temperature dependence of the free energy of protein folding resembles that of the free energy of transfer of non-polar model compounds from water into non-polar media, particularly with respect to the cold denaturation of proteins [1]. Additionally, it is generally accepted as empirical evidence that the hydrophobicity of residues in the cores of globular proteins is strongly conserved and correlates more closely with their structure than other types of interactions do [2–4]. The study by Jacobsen and Linderstrøm-Lang further suggested that electrostatic interactions are not a principal force in protein folding [5]. Building on these findings, Dill highlighted that the packing of non-polar groups leads proteins to fold into their unique structures [6]. Uversky et al. also confirmed that highly hydrophilic proteins alone cannot adopt distinctive structures [7]. Hence, hydrophobic interactions among hydrophobic amino acids are deemed critical in the formation and stabilization of native protein structures.

On the other hand, Hu et al. reported a notable finding: some threonine-rich peptides, though extremely hydrophilic and almost devoid of hydrophobic residues, can form stable β-structures, providing counterexamples to the traditional emphasis on the hydrophobic core [8]. This observation suggests that the number of foldable amino acid sequences in hydrophilic sequence space may be larger than previously expected. Demonstrating the existence of completely hydrophilic, foldable amino acid sequences could reveal new opportunities for discovering novel foldable sequences and enhance our understanding of protein folding through previously unexplored mechanisms.

In this study, we systematically explore the repetitive amino acid sequence space using AlphaFold2 (AF2) [9], focusing on sequences composed entirely of hydrophilic residues, to evaluate their potential for structure formation. After this exploration, we performed molecular dynamics (MD) simulations for some of the predicted structures and assessed whether they remain stable. Additionally, we analyzed structural data from the Evolutionary Classification Of Protein Domains (ECOD) database [10] to investigate the role of threonine and to help interpret the findings from our sequence space exploration. Through the sequence space exploration and MD simulations, we found that several repetitive amino acid sequences rich in threonine have a strong tendency to form various, likely stable β-solenoid structures even in the absence of hydrophobic residues, with the structural shapes influenced by the length of the sequence unit. Our database analysis suggested that these results are due to the ability of threonine to form structural cores that support hydrogen bond formation by other residues.

 Materials and methods

 Selecting hydrophilic amino acids

In this study, we categorized threonine, glutamic acid, lysine, glutamine, asparagine, serine, histidine, arginine, aspartic acid, proline, and glycine as hydrophilic amino acids. This classification follows the hydropathy index defined by Kyte and Doolittle [11], which measures the relative hydrophilicity or hydrophobicity of amino acids. We selected the amino acids with negative hydropathy index values, excluding tyrosine and tryptophan. These two amino acids were excluded because the benzene ring they contain is strongly hydrophobic, so they should be considered amphiphilic rather than hydrophilic.
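The selection rule can be illustrated with a short sketch. The hydropathy values are those of the published Kyte-Doolittle scale; the variable names (`KD`, `AMPHIPHILIC`, `HYDROPHILIC`) are illustrative and not taken from the authors' code.

```python
# Kyte-Doolittle hydropathy index, keyed by one-letter amino acid code.
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}

# Tyrosine and tryptophan have negative indices but are treated as amphiphilic.
AMPHIPHILIC = {"Y", "W"}

# Keep residues with a negative index that are not flagged as amphiphilic.
HYDROPHILIC = sorted(aa for aa, h in KD.items() if h < 0 and aa not in AMPHIPHILIC)

print(len(HYDROPHILIC), "".join(HYDROPHILIC))
```

Applying the two filters leaves exactly the 11 residues listed in the text.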

 Generating amino acid sequences for sequence space exploration

This study focuses on the use of only 11 hydrophilic amino acids. Despite this limitation, the number of possible sequences of N residues is 11^N, making it challenging and time-consuming to predict the structure of all sequences for a typical protein domain of size N using AF2. Moreover, identifying foldable sequences by randomly selecting from this vast sequence space is anticipated to be highly difficult. Consequently, instead of searching the entire sequence space, this study samples a more restricted subset of it. To restrict the sequence space to be explored, sequences are randomly generated under the two conditions described below.

 Condition 1: Repetitive sequences of a short sequence pattern:

One strategy to limit the sequence space to be explored is to use repetitive sequences of short sequence patterns. The length of repeated sequences in this study was set to range from 3 to 10 residues. Each unit was repeated to achieve a total sequence length of approximately 150 residues.

 Condition 2: Random sequence generation by biased amino acid probability:

Sequence patterns with lengths ranging from 3 to 10 residues were generated randomly. Threonine-rich sequences were produced to assess whether sequences with higher threonine content have a greater ability to fold than others. Here, a threonine-rich sequence is defined as one in which threonine occurs with a high probability, while the other 10 amino acids occur with equally low probabilities. When the proportions of the non-abundant amino acids are equalized in this way, the proportion of the abundant amino acid must be more than 10% for it to be enriched. However, excessively increasing this proportion may compromise the diversity of the amino acid sequences in the explored sequence space. Taking these two considerations into account, we adopted a value of 30%, at which the proportion of the abundant amino acid is sufficiently distinct from that of each of the other amino acids (7%) while still allowing diverse sequences to be investigated. For comparison, X-rich sequences were generated in the same way, with X representing each of the other 10 amino acids. Additionally, sequences with equal occurrence frequencies for all 11 amino acids served as a control.

Employing these 12 sequence generation methods, 200 sequences were generated for each of 8 different lengths (3 to 10 residues), yielding a total of 19,200 sequences (12 methods×8 lengths×200 sequences).
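The two conditions can be combined into a minimal generation sketch. The function names and the way the repeat count is chosen (rounding 150 divided by the unit length) are assumptions made for illustration; the text only states that the total length is approximately 150 residues.

```python
import random

HYDROPHILIC = "TEKQNSHRDPG"  # the 11 hydrophilic residues used in the study


def unit_weights(rich=None, p_rich=0.30):
    """Sampling probabilities per residue; rich=None gives the uniform control."""
    n = len(HYDROPHILIC)
    if rich is None:
        return [1.0 / n] * n
    p_other = (1.0 - p_rich) / (n - 1)  # 0.07 when p_rich is 0.30
    return [p_rich if aa == rich else p_other for aa in HYDROPHILIC]


def repetitive_sequence(unit_len, rich=None, total_len=150, rng=random):
    """Draw one random unit (Condition 2) and repeat it (Condition 1)."""
    unit = "".join(rng.choices(HYDROPHILIC, weights=unit_weights(rich), k=unit_len))
    n_repeats = round(total_len / unit_len)  # assumption: nearest repeat count
    return unit, unit * n_repeats


# 11 X-rich methods plus the uniform control, 8 unit lengths, 200 sequences each.
print(12 * 8 * 200)

unit, seq = repetitive_sequence(6, rich="T", rng=random.Random(0))
print(unit, len(seq))
```

Passing a seeded `random.Random` instance keeps a run reproducible, which is convenient when the same sequences must later be matched to their predictions.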

 Structure prediction with AF2

For each of the 19,200 sequences, protein structure predictions were performed using AF2 (LocalColabFold ver.1.5.2 [12]) in single-sequence mode. We set the number of recycles to 12, balancing the gain in prediction confidence and accuracy that a larger number of recycles provides [9] against the limits of our computational resources. A sequence was deemed to fold reliably into a unique structure if its predicted structure had an average predicted Local Distance Difference Test (pLDDT) score exceeding 90.0 and a predicted Template Modeling (pTM) score above 0.6. Sequences meeting these two criteria were designated as Predicted Foldable Sequences (PFSs). The ratio of the number of PFSs to the total number of sequences for each sequence generation method (8×200) was then calculated.
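The two thresholds translate into a simple filter. The sketch below assumes that the per-residue pLDDT values and the pTM score of each prediction have already been loaded; the function names are hypothetical.

```python
def is_pfs(plddt_per_residue, ptm, plddt_cutoff=90.0, ptm_cutoff=0.6):
    """True if mean pLDDT exceeds 90.0 and pTM is above 0.6 (both strict)."""
    mean_plddt = sum(plddt_per_residue) / len(plddt_per_residue)
    return mean_plddt > plddt_cutoff and ptm > ptm_cutoff


def pfs_ratio(predictions):
    """Fraction of PFSs among (plddt_per_residue, ptm) pairs of one method."""
    flags = [is_pfs(plddt, ptm) for plddt, ptm in predictions]
    return sum(flags) / len(flags)
```

For one generation method, `predictions` would hold the 8×200 = 1,600 results across all unit lengths.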

 Detailed configuration of the MD simulation

To validate the ability of the PFSs to form stable structures, we performed MD simulations. For these simulations, we selected the 50 PFSs with the highest pLDDT values (see Supplementary Table S1) to test the structural stability of as many PFSs as possible within our computational resources. The simulation of each PFS's predicted structure was carried out using the program GROMACS [13], the CHARMM27 force field [14], and the TIP3P water model [15], following the steps described below.

First, the system was constructed from the predicted structure, water molecules, and ions (Na⁺ or Cl⁻) added to neutralize the charges in the predicted structure. The volume of the system was set so that the distance between the predicted structure and each edge of the system was 10 Å. Second, the constructed system was relaxed by energy minimization. Third, the system was equilibrated under NVT and then NPT ensemble conditions for 100 ps each at 300 K. During the NPT equilibration, the pressure of the system was set to 1 atm. Finally, we carried out the production simulation for 100 ns at 300 K under NPT ensemble conditions at the same pressure as in the NPT equilibration.

 Classifying the shapes of predicted structures

After the sequence space exploration, the shapes of the predicted structures of the PFSs were analyzed. For this purpose, we first performed secondary structure assignments using the STRIDE program [16]. As described in the Results and discussion section, many of the predicted structures were β-helices, tandem repeat structures formed by the association of parallel β-sheets in a helical pattern. The shape of a β-helix is characterized by the number of strands per turn (denoted as N), which we assume can be calculated using the following equation:

  
$$N = \frac{\Delta S}{n} \tag{1}$$

where ΔS stands for the sequence separation between residue pairs that are hydrogen-bonded in the main chain (nearly constant in β-helices), and n is the length of the repetitive sequence unit. Based on this calculation, we classified the shapes of β-helices as ‘sandwich’ (N=2), ‘triangle’ (N=3), ‘square’ (N=4), ‘pentagon’ (N=5), and ‘circle’ (N≥6). Structures primarily composed of antiparallel β-sheets were categorized as ‘meander.’ Structures dominated by α-helices were classified as ‘α-helix rich.’ The validity of this classification scheme was verified manually by visually examining a considerable number of predicted structures.
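As a minimal sketch of the β-helix part of this classification, the function below takes the number of strands per turn to be the sequence separation ΔS divided by the unit length n and maps it to a shape name. Rounding the ratio to the nearest integer and returning "unclassified" for values below 2 are assumptions for illustration; how ΔS is extracted from the hydrogen bond pattern, and how meander and α-helix-rich structures are detected, are not covered.

```python
def classify_beta_helix(delta_s, unit_len):
    """Map strands per turn (delta_s / unit_len) to a shape label."""
    n_strands = round(delta_s / unit_len)  # assumption: round to nearest integer
    if n_strands >= 6:
        return "circle"
    names = {2: "sandwich", 3: "triangle", 4: "square", 5: "pentagon"}
    return names.get(n_strands, "unclassified")


# A unit of 6 residues with hydrogen-bonded partners 18 residues apart
# gives three strands per turn.
print(classify_beta_helix(18, 6))
```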

 Composing the dataset of experimentally determined structures

As detailed in the Results and discussion section, our computational sequence space exploration indicates that threonine-rich sequences have a greater ability to fold than others. To demonstrate that this result can be interpreted using statistical data from known structures, we constructed and analyzed a structural dataset. Domain structures were sourced from the ECOD database (development version 288) and filtered at a 40% sequence identity threshold to minimize the redundancy of the dataset without compromising its diversity. In addition, we included only X-ray structures with an average B-factor below 50 Å² and a resolution finer than 2.0 Å, criteria that ensure the inclusion of high-quality, accurately resolved structures suitable for reliable comparison. The resulting dataset consisted of 8,037 non-redundant domain structures.

 Analysis of the dataset

In our dataset analysis, we evaluated the extent of hydrogen bond formation and non-polar atom contacts within the structures taking into account the environments of each residue as these factors significantly influence structural stability. These evaluations were performed using the four calculations described below.

 Calculation 1: Identification of buried residues in the structures

One of the key indicators of the environment of a single residue in a protein structure is the degree of burial (or solvent exposure). We computed the Half-Sphere Exposure (HSE) of each residue in our dataset using Hamelryck’s method [17]. The HSE serves as a good indicator of how deeply each residue is buried. It is obtained by counting the number of Cα atoms within a 12 Å hemisphere centered at the Cα atom of a given residue. This hemisphere is obtained by dividing a sphere centered at that Cα atom with a plane perpendicular to the Cα-Cβ vector. Of the two hemispheres, the one containing the Cβ atom of the given residue was used (see Figure 1A).

Figure 1  (A) Definition of the hemisphere used to compute an HSE. (B) Distribution of HSEs for all residues across all structures in the dataset. The threshold for identifying buried residues is set to 15 (highlighted in black).

Figure 1B illustrates the distribution of HSEs calculated for all residues across all structures in the dataset, which shows a couple of peaks. We chose an HSE threshold of 15 to distinguish between buried and exposed residues. The rationale behind this choice is that, when the peak at a contact number of 0 is excluded, this value provides the optimal separation between the right-side and left-side distributions. Consequently, residues with HSEs greater than 15 were classified as buried residues.
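The counting procedure and the burial threshold can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation used in the study: 12 Å is interpreted as the hemisphere radius, coordinates are given as (x, y, z) tuples in Å, and a Cβ position is assumed to be available for every residue (how glycine is handled is not specified in the text).

```python
import math


def hse_up(ca_coords, cb_coords, radius=12.0):
    """Count, per residue, the other CA atoms inside the CB-side hemisphere."""
    counts = []
    for i, (ca_i, cb_i) in enumerate(zip(ca_coords, cb_coords)):
        axis = [b - a for a, b in zip(ca_i, cb_i)]  # CA -> CB vector
        count = 0
        for j, ca_j in enumerate(ca_coords):
            if j == i:
                continue
            v = [x - a for a, x in zip(ca_i, ca_j)]  # CA_i -> CA_j vector
            within = math.sqrt(sum(c * c for c in v)) <= radius
            same_side = sum(p * q for p, q in zip(v, axis)) > 0.0
            if within and same_side:
                count += 1
        counts.append(count)
    return counts


def is_buried(hse, threshold=15):
    """Residues with an HSE greater than the threshold count as buried."""
    return hse > threshold
```

The dot product with the Cα-Cβ vector decides on which side of the dividing plane a neighbouring Cα atom lies, so no explicit plane equation is needed.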

 Calculation 2: Detection of hydrogen bond formation

To assess hydrogen bond formation in each structure, we used the program HBplus [18], which identifies the pairs of donor and acceptor atoms involved in hydrogen bonds.

 Calculation 3: Estimating the tightness of non-polar atom packing

To quantify the tightness of the non-polar atom packing in each structure, which is one measure of protein structure stability, we used the program NACCESS [19], which calculates the Relative Solvent Accessibility (RSA) of each atom in a molecule [20]. Using the results obtained from this program, we calculated the “Relative buried surface of non-polar atoms”, which correlates with atom packing efficiency. It was defined by the following formula:

  
$$\text{“Relative buried surface of non-polar atoms”} = 100 - \frac{1}{N_{nonpolar}} \sum_{i=1}^{N_{nonpolar}} RSA_i \; [\%] \tag{2}$$

Here, Nnonpolar denotes the total number of non-polar atoms in a structure and RSAi refers to the RSA value of the i-th atom.
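A minimal sketch of this quantity follows, assuming that Equation (2) is 100 minus the mean RSA of the non-polar atoms and that the per-atom RSA values (in percent) have already been extracted from the solvent accessibility output. The function name is hypothetical.

```python
def relative_buried_surface(rsa_nonpolar):
    """100 minus the mean RSA (in %) over the non-polar atoms of a structure."""
    return 100.0 - sum(rsa_nonpolar) / len(rsa_nonpolar)


# Three non-polar atoms that are 0%, 20% and 40% solvent accessible.
print(relative_buried_surface([0.0, 20.0, 40.0]))
```

A higher value means that less non-polar surface is exposed to solvent, which is read here as tighter packing.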

 Calculation 4: Normalized Pointwise Mutual Information (NPMI) in the hydrogen bond network among side chains of structures from ECOD

In this research, we calculated the NPMI value for each amino acid pair to quantify their tendency to form hydrogen bonds with their side chains, allowing for a comparison of these tendencies within a specific range. The value of NPMI of amino acid pair (A, B) is calculated by the formula below:

  
$$NPMI(A,B) = \frac{\log \dfrac{f(A,B)}{f(A)\,f(B)}}{-\log f(A,B)} \tag{3}$$

$$f(A,B) = \frac{N_{AB} + \lambda}{N_{pair}} \tag{4}$$

$$f(A) = \frac{N_{A} + \lambda}{N_{single}} \tag{5}$$

where f(A) represents the frequency of amino acid A among residues engaged in hydrogen bond formation, while f(A,B) denotes the frequency of the amino acid pair (A, B) among pairs that form hydrogen bonds through their side chains; Nsingle and Npair correspond to the total number of residues involved in hydrogen bond formation and the total number of residue pairs whose side chains form hydrogen bonds, respectively; NA denotes the number of amino acid A residues involved in hydrogen bond formation and NAB represents the number of amino acid pairs (A, B) forming hydrogen bonds via their side chains; λ is a pseudocount introduced to prevent the value of NPMI(A,B) from diverging. In this paper, Nsingle, Npair, NA and NAB were calculated using the results from HBplus, and the value of λ was set to 0.01.

NPMI(A,B) takes values between −1 and 1. A value close to 1 indicates that the side chains of amino acids A and B have a strong tendency to form hydrogen bonds with each other. Conversely, a value close to −1 indicates a tendency for the amino acids to avoid interacting with each other.
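The calculation can be sketched as follows, assuming the standard NPMI form in which the pointwise mutual information log[f(A,B)/(f(A)f(B))] is normalized by −log f(A,B). The function name and argument order are illustrative; the counts would come from the hydrogen bond assignments.

```python
import math


def npmi(n_ab, n_a, n_b, n_pair, n_single, lam=0.01):
    """NPMI of a residue pair from hydrogen bond counts and a pseudocount."""
    f_ab = (n_ab + lam) / n_pair    # pair frequency
    f_a = (n_a + lam) / n_single    # single-residue frequencies
    f_b = (n_b + lam) / n_single
    pmi = math.log(f_ab / (f_a * f_b))
    return pmi / (-math.log(f_ab))


# A pair never observed together stays finite thanks to the pseudocount.
print(npmi(0, 50, 50, 100, 100))
```

When the pair frequency equals the product of the single frequencies, the numerator is log 1 = 0 and the score is 0, which is the reference point separating favoured from disfavoured pairs.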

 Results and discussion

 Threonine exhibits a strong tendency to stabilize β-strands

We first confirmed the ability of threonine-rich (T-rich) sequences to adopt a unique conformation. To provide a basis for comparison, we also assessed the folding ability of the other X-rich sequences and of sequences without amino acid frequency bias, using the methods outlined in the Materials and methods section. Figure 2A shows the percentage of PFSs among all the sequences generated. A much larger fraction of the T-rich sequences is predicted to fold than of any other sequence type, demonstrating the unique nature of T-rich sequences, which can fold into unique structures even though they are composed solely of hydrophilic amino acids.

Figure 2  (A) Proportions of reliable structures in the generated sequences across different amino acid content rate conditions. (B) Percentage of residues forming each secondary structure within all predicted structures of PFSs identified from threonine-rich sequence space exploration.

We next asked what kinds of structures T-rich sequences composed solely of hydrophilic amino acids fold into. To gain insight into the predicted structures, we performed secondary structure assignment and counted the residues forming α-helices, β-strands, and loops, as described in the Materials and methods section. Figure 2B depicts the percentage of residues assigned to α-helices, β-strands, and loops out of the total number of residues of the T-rich PFSs. Most of the residues (about 80%) formed β-strands. This outcome suggests that threonine has a strong tendency to stabilize β-sheet structures.

 Relationship between sequence unit lengths and structural shapes

From the structure prediction of 1,600 T-rich sequences, we identified 207 PFSs. As shown in Figure 2B, the majority of these predicted structures are predominantly composed of β-sheets. A more detailed analysis of the hydrogen bond patterns within the main chains revealed that most of these structures form highly regular and periodic parallel β-sheets. These periodic parallel β-sheets are characteristic of β-helices, which can be further classified based on the number of β-strands per turn. We classified β-helix structures and other structural types based on their hydrogen bonding patterns using the methods outlined in the Materials and methods section. Consequently, the predicted PFS structures were categorized as α-helix, meander, or β-helix, with β-helix structures further classified into sandwich, triangle, square, pentagon, and circle categories (see Figure 3A).

Figure 3  (A) Examples of structures categorized into each structural class. All the structures shown here were obtained from structure predictions conducted under the threonine-rich condition. (B) The number of sequences predicted to reliably form each distinct structural shape. The breakdown of each count is highlighted by different colors corresponding to each sequence unit length.

Furthermore, analysis of the predicted structures demonstrated a distinct relationship between sequence unit lengths and structural shapes (Figure 3B). For example, longer unit lengths (n=8–10) predominantly adopt the ‘sandwich’ shape, while shorter unit lengths favor polygonal shapes; sequences with a unit length of 6, for instance, display a strong preference for the triangle shape. This observation suggests that the structural shapes of repetitive peptides can be controlled by adjusting the sequence unit length.

 Mechanism of structural stabilization by threonine

In this section, we examine why threonine uniquely stabilizes the folded structures and promotes the β-strand formation more effectively than other hydrophilic amino acids. To this end, we conducted a database analysis to characterize the statistical properties of threonine and other hydrophilic amino acids, demonstrating how these properties contribute to the stabilization of the predicted structure of the PFSs identified in this study. This investigation specifically focused on hydrogen bond formation and non-polar atom packing.

First, we calculated the proportion of buried amino acids that form hydrogen bonds via their side chains (Figure 4A). This analysis revealed that the buried side chains of threonine and serine have the lowest propensity to form hydrogen bonds inside the structures. This finding suggests that these two amino acids tolerate unsatisfied hydrogen bonds in a buried environment better than other hydrophilic amino acids do, which makes them more suitable for core formation. As illustrated in Figure 4B, the predicted structures of the PFSs presented in this study show numerous instances where threonine forms the core without satisfying hydrogen bonds. In contrast, other amino acids are less suitable for the core because of their strong tendency to satisfy hydrogen bonds. This characteristic restricts their ability to contribute to the core structure unless their side chain hydrogen bonds are fully satisfied.

Figure 4  (A) Proportion of buried amino acids satisfying hydrogen bonds by side chains. The ratio of buried residues to the total occurrences of each amino acid is provided below its notation. (B) Examples of threonine residues that do not satisfy hydrogen bonds and are buried within the predicted structure of a PFS. The overall structure of the PFS is depicted in stick representation, with threonine side chains colored green and hydrogen bonds illustrated as yellow dotted lines. (C) Matrix of NPMI quantifying the tendency of amino acid pairs to form hydrogen bonds with their side chains. Each axis represents different types of amino acids, and each cell is colored according to the value of NPMI. (D) Illustration of the well-packed structure of a natural protein (PDB ID: 4Q1Q) forming elaborate hydrogen bond networks inside the structure. Threonine side chains are colored green and hydrogen bonds are depicted as yellow dotted lines.

Second, we investigated the propensity of each amino acid pair to form hydrogen bonds via their side chains (Figure 4C). The data indicate that four amino acids (threonine, serine, asparagine, and glutamine) exhibit primarily positive NPMI values with all nine hydrophilic amino acids that contain polar atoms in the side chain, suggesting that these four amino acids are capable of forming side-chain-to-side-chain hydrogen bonds with other polar amino acids. In contrast, the remaining five amino acids (aspartic acid, glutamic acid, histidine, lysine, and arginine) show highly negative NPMI values depending on the counterpart amino acid, indicating that these amino acids have counterparts with which side-chain hydrogen bonding is particularly difficult to establish.

As described above, threonine and serine can exist within the interior of protein structures regardless of whether their side-chain hydrogen bonds are satisfied, and they can favorably form hydrogen bonds with any other amino acid side chain. Consequently, the side chains of these two amino acids can help satisfy the hydrogen bonds of other amino acid side chains within the structure. This, in turn, enables other amino acids to contribute to the formation of the structural core. Based on these observations and considerations, we conclude that threonine and serine possess the unique ability to promote core formation involving only hydrophilic amino acid residues, including themselves. Indeed, several naturally occurring β-solenoid proteins stabilize their tertiary structures by a similar mechanism, as demonstrated in Figure 4D.

As described above, threonine and serine are the most suitable for core formation among hydrophilic amino acids. However, as shown in Figure 2A, threonine-rich sequences demonstrate a markedly greater folding ability than serine-rich sequences. What accounts for this difference, given that the only structural difference between serine and threonine is a single methyl group? To answer this question, we investigated how mutating all threonines in the threonine-rich PFSs to serine affects both the packing of the folded structures and their predicted conformations. First, we examined the effect of the mutation on the tightness of non-polar atom packing, a key measure of stability (see the Materials and methods section). Figure 5A compares the tightness of non-polar atom packing in the predicted structures of threonine-rich PFSs with that of their corresponding mutants. The mutant structures were generated by simply removing the Cγ atom from each threonine in the predicted structures of the threonine-rich sequences. In all cases, the original structures exhibited tighter non-polar packing than the mutants, with the mutants showing an average reduction in tightness of about 5%. Does this 5% reduction significantly impact stability or folding capability? To address this, we performed structure predictions on all mutants and compared their pLDDT scores with those of the original PFSs (Figure 5B). The pLDDT scores of the original PFSs consistently exceed 90, while those of the corresponding mutants generally fall below 60. In structural terms, most of the mutants are predicted to adopt random coils, whereas the PFSs are predicted to form well-packed structures (Figure 5C, 5D). This result suggests that the 5% reduction in tightness caused by the mutation was sufficient to lead to a loss of folding ability. This leads us to conclude that threonine is considerably more effective than serine in promoting structural stabilization.
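Both forms of the threonine-to-serine substitution are simple to sketch: a string replacement for the sequences submitted to structure prediction, and a filter over PDB-format ATOM records for the packing comparison. The function names are hypothetical, and two details are assumptions rather than statements from the text: the removed methyl carbon is taken to be the atom named CG2 (heavy atoms only), and renaming the residue to SER and OG1 to OG is an optional convenience for downstream tools.

```python
def mutate_sequence_thr_to_ser(seq):
    """Replace every threonine (T) with serine (S) in a one-letter sequence."""
    return seq.replace("T", "S")


def strip_thr_methyl(pdb_lines, rename=True):
    """Drop the CG2 atom of each THR; optionally relabel the residue as SER."""
    out = []
    for line in pdb_lines:
        # PDB columns: atom name 13-16, residue name 18-20 (1-based).
        if line.startswith("ATOM") and line[17:20] == "THR":
            atom = line[12:16].strip()
            if atom == "CG2":
                continue  # remove the methyl carbon
            if rename:
                if atom == "OG1":
                    line = line[:12] + " OG " + line[16:]
                line = line[:17] + "SER" + line[20:]
        out.append(line)
    return out
```

All other atoms and their coordinates are left untouched, so any change in the packing measure comes only from the missing methyl carbon.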

Figure 5  (A) Comparison of the tightness of non-polar atom packing between the predicted structures from sequence space exploration under the threonine-rich condition and the same structures after deletion of the methyl groups in the threonine side chains. (B) Comparison of pLDDT scores from structure predictions for reliable threonine-rich sequences versus the same sequences with all threonine residues replaced by serine. (C, D) A predicted structure of a PFS and a predicted structure of the same sequence with threonines replaced by serines. Each structure is depicted in stick representation. Threonine residues that form the structural core and the serine residues corresponding to them are colored green.

Additionally, several previous studies corroborate threonine’s essential role in stabilizing β-structures. Midya et al. demonstrated that threonine side chains arranged in the β-sheet can form an ideal hydrogen-bond network with water molecules. Cγ atoms facilitate the formation of locally ordered, ice-like, low-density water molecules in the hydration layer through hydrophobic interactions, while hydroxyl groups integrate these molecules into a quasi-ice-like layered structure via hydrogen bonding [21]. Creamer & Rose also reported that β-branched amino acids, such as threonine, valine, and isoleucine, tend to form β-strands due to the entropic advantages of their side-chain conformations [22]. Collectively, these studies, along with our findings, suggest that threonine has a natural propensity to form β-strands without compromising structural stability, both within and outside of the core. Thus, threonine’s unique properties likely account for the high number of foldable sequences identified in our exploration of threonine-rich sequence space.

 Assessment of the foldability and stability of PFSs

In this study, we regarded artificially generated amino acid sequences with high pLDDT as PFSs because high pLDDT values are generally reliable indicators of foldability in natural proteins [9]. However, it remains uncertain whether this applies to non-natural sequences. Therefore, to examine the foldability of the PFSs, we performed all-atom MD simulations, a computational method distinct from AF2 (see the Materials and methods section), and assessed the stability of the PFSs' structures. It should be noted that this approach relies on the correlation between foldability and stability [23,24]. MD simulations were run for 100 ns at 300 K on the 50 PFSs with the highest pLDDT scores, using the AF2-predicted structures as initial structures. Stability was assessed using a simple criterion: a structure was classified as stable if the largest root-mean-square deviation (RMSD) between the initial structure and the structures in the MD trajectory was less than 6 Å, and as unstable if it was greater than 6 Å. This RMSD threshold was taken from the conclusion of Reva et al. [25]. The MD simulations showed that 23 of the 50 PFSs satisfied the stability criterion, while the remainder did not (Supplementary Figure S1, S2). These findings indicate that high pLDDT scores alone do not guarantee foldability. Nevertheless, approximately half of the PFSs were stable, supporting the hypothesis that some PFSs may be foldable. For example, the predicted structure of Sequence ID 8 appears to be extremely stable (see Supplementary Movie S1). Furthermore, even among those with an RMSD greater than 6 Å, some sequences had a large root-mean-square fluctuation (RMSF) only near the N- and C-termini, with very small fluctuations in the middle (Supplementary Figure S3). For those sequences, adding stable globular domains at the termini [26] could enhance overall stability and create a foldable sequence. Experimental validation is the most reliable way to confirm the foldability of PFSs, and it is our next challenge.
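The stability criterion reduces to a threshold on the largest RMSD along a trajectory. The sketch below assumes that the RMSD of every frame to the initial structure (in Å) has already been computed; the function name is hypothetical, and treating a maximum of exactly 6 Å as unstable is an assumption, since the text leaves that boundary case open.

```python
def classify_stability(rmsd_per_frame, threshold=6.0):
    """'stable' if the largest RMSD to the initial structure is below 6 Å."""
    return "stable" if max(rmsd_per_frame) < threshold else "unstable"


# Three short illustrative trajectories (RMSD values in Å).
trajectories = {
    "a": [0.0, 1.2, 2.5, 2.1],
    "b": [0.0, 3.0, 7.4, 6.9],
    "c": [0.0, 4.8, 5.9, 5.5],
}
labels = {name: classify_stability(r) for name, r in trajectories.items()}
print(labels)
```

Because the criterion uses the maximum rather than the mean, a single large excursion is enough to label a structure unstable, which makes it a conservative test of the predicted fold.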

 Conclusion

This study explored the hydrophilic amino acid sequence space using AF2. We found that threonine-rich repetitive sequences have a significantly higher likelihood of folding into unique structures than other sequences, with most of these folded structures forming β-helices. The MD simulations partially supported this result, demonstrating that some of the structures remain highly stable over 100 ns. These β-helices were further classified into sandwich, triangle, square, pentagon, and circle categories. Interestingly, the structural shapes of the β-helices were found to be influenced by the length of the sequence units, whereas there was no clear correlation between the structural shapes and the threonine content of the sequences (see Supplementary Figure S4). We also explored why threonine-rich sequences uniquely stabilize folded structures and promote β-strand formation more effectively than sequences rich in other hydrophilic amino acids. Our findings indicate that the strong tendency of threonine to stabilize β-structures arises from two primary factors. First, threonine can form the structural core by leveraging its methyl group and its capacity to tolerate unsatisfied hydrogen bonds. This property allows threonine to participate in non-polar atom packing, contributing to tightly packed structures. Second, threonine plays a critical role in facilitating hydrogen bonding between the side chains of various residues, enabling other polar amino acids to be included in the core with their hydrogen bonds satisfied. This research provides valuable insights that could guide the discovery of sequences with novel structural or physicochemical properties. A logical next step would be to test these computational predictions experimentally by investigating the physical properties and determining the structures of the identified sequences.

 Conflict of interest

The authors declare no conflicts of interest.

 Author contributions

N.T. and G.C. designed this research. N.T. performed calculations and analyzed data. N.T., H.O., L.M.G.C., and G.C. co-wrote the manuscript. H.O., L.M.G.C., and G.C. advised on the data analysis.

 Data availability

The data are available from the corresponding author upon reasonable request. Example files for each structural class are available on GitHub at the following repository: https://github.com/GeorgeChikenji/Exploring-hydrophilic-sequence-space-to-search-for-uncharted-foldable-proteins-by-AlphaFold2.

 Acknowledgements

This work was partially supported by the Platform Project for Supporting Drug Discovery and Life Science Research (Basis for Supporting Innovative Drug Discovery and Life Science Research; BINDS) from AMED under Grant Number JP23ama121001 (L.M.G.C.), JST ACT-X Grant Number JPMJAX22B5 (H.O.), and JSPS KAKENHI Grant Number 22H00406 (G.C.).

References
 
© 2025 THE BIOPHYSICAL SOCIETY OF JAPAN