2025, Volume 25, pp. 79-89
Important fundamental questions remain to be answered in biology, particularly concerning the coordination of many proteins and overall stability of living organisms. To address these questions, we clarify—by analogy with phase diagrams in materials science—that a nucleotide composition plot (NC-plot) for genomes can be regarded as a phase diagram of life. The NC-plot for a genome considerably deviates from a completely random composition, and it provides insights into the mechanism by which genomic order emerges from random mutations. Furthermore, an analysis of the NC-plot for all genes in a genome reveals relationships between nucleotide composition and various protein properties, including differences in evolutionary rates and distributions of protein folds. Finally, an analysis of the NC-plot for viral genomes suggests that both the coordination of proteins and stability of organisms are determined by genome processing systems (i.e., intracellular factors).
The fundamental questions in biology are mainly concerned with the macroscopic properties of organisms. For example, while all proteins are encoded by genes within a genome, the mechanisms by which they achieve such a high degree of coordination are largely unknown. Moreover, it is unclear why a complete organism remains remarkably stable even though the changes caused by mutations in individual genes are considered neutral [1,2,3]. We believe that these questions arise from our lack of a deep understanding of the overall principles governing the genome, the blueprint of life. This situation becomes even clearer when compared with our understanding of physical matter [4]. For physical matter, a balanced development has been achieved between molecular-level understanding (i.e., statistical mechanics) and macroscopic understanding (i.e., thermodynamics). While our understanding of living organisms at the molecular level has advanced rapidly since the mid-20th century [5], our comprehension of organisms at the macroscopic level has not progressed comparably. Since the beginning of the 21st century, whole-genome analysis has led to extensive research in many fields, including medicine, on the effects of individual genes on organisms. Nevertheless, no research field has been established for characterizing the macroscopic properties of organisms using a small number of parameters, akin to thermodynamics for characterizing materials [4].
To address these persistent challenges in biology, we have been studying entire genome sequences. We compiled our findings and, in 2024, published a book that summarizes our 40 years of research [4]. In this paper, we propose that essential properties of organisms can be captured with a small number of parameters through information analysis of genome sequences. First, we summarize the content of that book and introduce the phase diagram of life, a framework for characterizing essential macroscopic properties of organisms using a small set of parameters. Specifically, by using nucleotide composition plots (NC-plots) for genomes, which visualize the composition at each codon position, we demonstrate that real genomes deviate significantly from a completely random composition and explain that this pattern corresponds to the phase diagram of life. Then, we analyze all proteins of a single species using the NC-plot for all genes in a genome (in this case, the human genome) and discuss characteristics of the proteins encoded within that genome, including the evolutionary rate and structural folds. Furthermore, by analyzing viral genome data, we aim to demonstrate that the determinant of the NC-plot is an intracellular factor.
2.1 The mystery of the coexistence of randomness in elementary processes and stability of states
A widely held view in biological sciences is that the immense complexity of living organisms makes it impossible to characterize an entire organism with only a few parameters. However, based on the similarities between matter and biological genomes, we hypothesize that a novel method of information processing for genomes enables the characterization of organisms’ essential properties using a limited set of parameters.
The first similarity lies in the elementary processes of change. Matter is composed of many molecules that move randomly while interacting with surrounding molecules. In other words, the elementary processes governing the state of matter are random in nature. Similarly, genomes are composed of DNA (deoxyribonucleic acid) sequences, and changes in these sequences are caused by the introduction of random mutations. Thus, genomes can be regarded as collections of random mutations. Overall, the elementary processes responsible for state changes in both matter and genomes are random.
The second similarity is that macroscopic states are extremely stable despite the randomness of elementary processes. For matter, when the random motion of molecules within a system reaches equilibrium, the system becomes highly stable, which makes it possible to construct phase diagrams according to variables such as temperature and pressure. In living organisms, it is unclear whether random mutations within the genome of a species are in equilibrium. Nevertheless, it is now well established that most mutations are neutral, which does not contradict the notion that random mutations introduced into a genome may be in equilibrium. Moreover, both individual organisms and species exhibit remarkable stability. Thus, from a macroscopic perspective, matter and living organisms are similar in that their states are stable.
Based on these similarities between matter and genomes, we aim to estimate the parameters that can be used to construct a phase diagram of life. For matter, typical parameters for defining states are temperature and pressure, which are intensive parameters. Generally, at least one of the parameters used in a phase diagram must be intensive; hence, a phase diagram cannot be constructed using only extensive parameters such as volume [6]. As temperature and pressure are intensive parameters in matter, they are suitable for phase diagrams.
We use the simplest intensive parameter derived from genome sequences, namely, the nucleotide composition. However, when DNA sequences are translated, they are converted into amino acids in codon units of three nucleotides. Statistically, the physicochemical properties of amino acids are biased depending on the position within a codon. Thus, we consider the nucleotide composition at each codon position to be the simplest set of intensive parameters for the phase diagram of life, resulting in a 12-dimensional plot [7]. We designate this graphical representation as the NC-plot.
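The 12 composition values described above (four nucleotides at each of three codon positions) can be computed from a coding sequence as in the following minimal sketch. The function name `codon_position_composition` and the example sequence are illustrative assumptions, not taken from the original analysis; the sketch assumes the input is an in-frame coding sequence containing only A, C, G, and T.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def codon_position_composition(cds: str) -> list[dict[str, float]]:
    """Return nucleotide fractions at codon positions 1, 2, and 3.

    Hypothetical helper: assumes `cds` is an in-frame coding sequence.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    composition = []
    for pos in range(3):
        # Every third nucleotide, starting at this codon position.
        counts = Counter(cds[pos:usable:3])
        total = sum(counts[n] for n in NUCLEOTIDES)
        composition.append(
            {n: (counts[n] / total if total else 0.0) for n in NUCLEOTIDES}
        )
    return composition


# Toy sequence with codons ATG GCT TTT GTA AAA (illustrative only).
example = codon_position_composition("ATGGCTTTTGTAAAA")
for position, fractions in enumerate(example, start=1):
    print(position, fractions)
```

Concatenating the three dictionaries gives one point in the 12-dimensional composition space; applying the function to a whole genome would require concatenating or pooling all of its coding sequences first.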
2.2 Phase diagram of life depicted in NC-plot
Since the beginning of the 21st century, the complete genome sequences of many organisms have been analyzed. Using the genome sequences of 2664 species (including eubacteria, archaea, protists, fungi, plants, and animals), we obtained the corresponding NC-plots. We found that while the NC-plots were widely dispersed at the third codon position, all organisms were remarkably clustered in a narrow region at the first and second positions. Therefore, this narrow region within an eight-dimensional composition space can be called the habitable zone in the composition space [7], as it represents the range of compositions in which life can exist. The main characteristics of the habitable zone are as follows:
In practice, visualizing distributions in an eight-dimensional space is challenging. Hence, we calculate the distances (d₁ and d₂) between the actual nucleotide composition and a completely random composition (where the proportion of each nucleotide is 0.25) using eqs. (1) and (2). By reducing the composition at the first and second codon positions to one dimension each using these equations, we compress the eight-dimensional space into a two-dimensional representation [4].
| (1) |
| (2) |
The representation of the eight-dimensional plot in two dimensions clarifies the deviation of actual genomic composition from the completely random composition. Furthermore, when the overall GC content of the genome is biased, the deviation from the random composition becomes even clearer.
To further clarify the bias between the actual genomic composition and completely random composition, we can reduce d₁ and d₂ to a single dimension using eq. (3). The resulting histogram of D₁₂ exhibits a distinct sharp peak at approximately 0.2.
| (3) |
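The reduction from eight dimensions to d₁, d₂, and D₁₂ can be sketched as follows. The exact forms of eqs. (1) to (3) are not reproduced in this text, so the sketch assumes a plain Euclidean distance from the uniform composition (0.25 for each nucleotide) and a Euclidean combination of d₁ and d₂; both are assumptions, and the function names and example compositions are hypothetical.

```python
import math


def distance_from_random(fractions: dict[str, float]) -> float:
    """Distance between a composition and the uniform composition (0.25 each).

    ASSUMPTION: Euclidean distance; the source equations may differ.
    """
    return math.sqrt(sum((fractions.get(n, 0.0) - 0.25) ** 2 for n in "ACGT"))


def combined_distance(d1: float, d2: float) -> float:
    """Combine d1 and d2 into one value (ASSUMPTION: Euclidean norm)."""
    return math.hypot(d1, d2)


# Made-up compositions at the first and second codon positions.
first = {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15}
second = {"A": 0.30, "C": 0.25, "G": 0.20, "T": 0.25}

d1 = distance_from_random(first)
d2 = distance_from_random(second)
D12 = combined_distance(d1, d2)
print(d1, d2, D12)
```

Under these assumptions, a genome with a perfectly uniform composition at both positions sits at the origin of the (d₁, d₂) plane, and any bias moves it outward.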
If nucleotides are generated randomly and their composition deviates by 0.2, the rarity of such an event can be evaluated based on the corresponding standard deviation. Considering the four types of nucleotides, when a sequence of N nucleotides is generated randomly, the standard deviation σ of the composition is calculated as follows:
| $\sigma = \sqrt{\dfrac{p(1-p)}{N}} = \sqrt{\dfrac{3}{16N}}$, with $p = 0.25$ | (4) |
In prokaryotic genomes, which typically contain approximately 10⁶ codons, a standard deviation of approximately 4 × 10⁻⁴ is obtained. Hence, the observed deviation of 0.2 from the completely random composition corresponds to roughly 500σ. This indicates an extraordinarily rare event, equivalent to a probability of approximately exp(−125,000) [4]. Moreover, as each species occupies nearly a single point in the composition space, individual organisms represent even rarer entities in this compositional framework.
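The arithmetic behind this estimate can be retraced with a short calculation. The sketch assumes the binomial standard deviation σ = √(p(1 − p)/N) with p = 0.25, which reproduces the quoted value of about 4 × 10⁻⁴ for N = 10⁶, and it uses the Gaussian weight exp(−z²/2) as the measure of rarity; the function and variable names are illustrative.

```python
import math


def composition_sigma(n_sites: int, p: float = 0.25) -> float:
    """Standard deviation of a nucleotide fraction over n_sites random draws.

    ASSUMPTION: binomial form sqrt(p * (1 - p) / n_sites) with p = 0.25.
    """
    return math.sqrt(p * (1.0 - p) / n_sites)


sigma = composition_sigma(1_000_000)  # about 4.3e-4
z_exact = 0.2 / sigma                 # deviation in units of sigma
log_prob_exact = -0.5 * z_exact ** 2  # exponent of exp(-z^2 / 2)

# Same calculation with sigma rounded to 4e-4, as quoted in the text.
z_rounded = 0.2 / 4e-4
log_prob_rounded = -0.5 * z_rounded ** 2

print(sigma, z_exact, log_prob_exact)
print(z_rounded, log_prob_rounded)
```

With the unrounded σ the deviation is about 462σ and the exponent about −1.07 × 10⁵; rounding σ to 4 × 10⁻⁴ gives the 500σ and exp(−125,000) quoted above. Either way, the conclusion that the event is extraordinarily rare is unchanged.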
As emphasized by Schrödinger [8] and Monod [9] in the mid-20th century, living organisms exhibit remarkably low entropy. Based on the discussion above, even the position of a genome within a narrow compositional space is sufficient to explain the low entropy of life. Therefore, we refer to the region in the composition space where genomes can exist as the habitable zone [7]. We call this narrow region the habitable zone rather than the phase diagram of life because simply being confined to a narrow region is insufficient to constitute a phase diagram. In fact, for a process to be described as a phase diagram, the random process must reach equilibrium as an elementary process. To examine this, we analyze the nucleotide composition of partial genomic sequences and investigate whether the fluctuations in the NC-plot follow a Gaussian distribution.
Let us first consider matter. In a substance in the gaseous state, numerous molecules move randomly. These molecules frequently collide with each other, exchanging energy during these collisions. Hence, the motion of each individual molecule is constantly changing. When this random motion as a whole reaches equilibrium, the energy distribution (i.e., velocity distribution) becomes constant, and the temperature and pressure stabilize. At this stage, properties within small subsystems still exhibit fluctuations, but the distribution of these fluctuations is Gaussian, which makes it possible to confirm that the system has reached equilibrium.
An analogous phenomenon may be observed in the compositional space derived from genome sequences of living organisms. In an individual organism, the overall nucleotide composition of the genome remains essentially constant. Nevertheless, for smaller subsequences within the genome, the nucleotide composition exhibits fluctuations that depend on the length of the sequence considered. Provided that mutational events are at equilibrium, these compositional fluctuations may conform to a Gaussian distribution. Furthermore, once equilibrium is established, the limited region corresponding to the habitable zone may appropriately be referred to as the phase diagram of life.
We perform the following genome analysis. The genome of a species consists of many genes, and the DNA sequence of each gene is a partial sequence of the entire genome. While the overall nucleotide composition of the genome remains nearly constant, the nucleotide composition within these partial sequences is expected to fluctuate considerably. By examining both the average magnitude and shape of the distribution of nucleotide composition fluctuations in genes, we can evaluate the nature of mutation introduction in the genome. In particular, if the emerging distribution is Gaussian, mutations are likely close to equilibrium.
Whole-genome analyses have been conducted for many species. In this study, we examined the nucleotide composition of all genes in the human genome and plotted them in a two-dimensional compositional space as a scatter plot of d₁ versus d₂, as shown in Fig. 1(A). The distributions projected onto the horizontal (d₁) and vertical (d₂) axes appear to be approximately Gaussian, as shown in Fig. 1(B). In other words, the nucleotide composition of partial genome sequences fluctuates randomly following an approximately Gaussian distribution. This suggests that the fluctuations in these partial sequences are nearly in equilibrium. Therefore, we consider that the habitable zone within the compositional distribution of the genome is the phase diagram of life.
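As a point of comparison for the fluctuation argument, the following sketch simulates the null case in which every site of a partial sequence is drawn independently with probability 0.25 for a given nucleotide. It is a simplified illustration only: it tracks the fraction of a single nucleotide rather than d₁ or d₂, it uses no genome data, and the function name, sequence length, and sample size are assumptions chosen for the example.

```python
import math
import random
import statistics


def simulated_fraction_sd(n_sites: int, n_genes: int, seed: int = 0) -> float:
    """Spread of one nucleotide's fraction across random partial sequences.

    ASSUMPTION: independent sites, each with probability 0.25.
    """
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_genes):
        hits = sum(1 for _ in range(n_sites) if rng.random() < 0.25)
        fractions.append(hits / n_sites)
    return statistics.pstdev(fractions)


observed = simulated_fraction_sd(n_sites=68, n_genes=20000)
expected = math.sqrt(0.25 * 0.75 / 68)  # binomial standard deviation
print(observed, expected)
```

In this idealized setting the simulated spread agrees with the binomial standard deviation, and the distribution of fractions is close to Gaussian for sequence lengths of this order. Comparing such a null model with the measured distributions is one way to judge how close the real fluctuations are to the random case.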

Figure 1. (A) NC-plot (d₁ vs. d₂ scatter plot) of all genes in human genome and (B) projections of distribution onto d₁ and d₂ axes
The projections of the distribution onto the horizontal (d₁) and vertical (d₂) axes are approximately Gaussian. When fitted with a single Gaussian distribution, the standard deviations are nearly identical along both axes.
3.1. Standard deviation analysis of Gaussian distribution in NC-plot for all genes
As mentioned in Section 2, the nucleotide composition of genes approximately follows a Gaussian distribution, indicating randomness and equilibrium. However, the human genome sequence also encodes the three-dimensional structures and functions of all proteins essential for human life. Therefore, the Gaussian distribution observed in the NC-plot of Fig. 1(A) likely reflects a complex internal structure. In this section, we discuss the various types of information embedded within the Gaussian distribution shown in Fig. 1(A).
Fig. 1(B) shows the results of fitting a Gaussian distribution to each axis projection of the NC-plot in Fig. 1(A) derived from the human genome sequence. The obtained standard deviations are 0.0524 for d₁ and 0.0555 for d₂. In general, when the four types of nucleotides (A, T, G, and C) are randomly generated N times, the standard deviation of their composition is given by eq. (4). By deriving the value of N from the standard deviation, we obtain N = 68 for d₁ and N = 61 for d₂. These values do not correspond to the size of an entire protein (approximately 580 residues) but to that of a subdomain.
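The values of N quoted above follow from inverting the standard deviation formula. The sketch assumes the binomial form σ = √(p(1 − p)/N) with p = 0.25, so that N = p(1 − p)/σ²; the function name `effective_length` is illustrative.

```python
def effective_length(sigma: float, p: float = 0.25) -> float:
    """Number of random draws N that yields a given composition spread.

    ASSUMPTION: sigma = sqrt(p * (1 - p) / N), hence N = p * (1 - p) / sigma^2.
    """
    return p * (1.0 - p) / sigma ** 2


n_d1 = effective_length(0.0524)  # fitted standard deviation along d1
n_d2 = effective_length(0.0555)  # fitted standard deviation along d2
print(round(n_d1), round(n_d2))  # 68 61
```

Both values are roughly one tenth of the 580-residue protein size mentioned in the text, which is the basis for interpreting the fluctuation unit as a subdomain rather than a whole protein.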
The results above lead to a hypothesis regarding protein formation. Specifically, when considering the scale of compositional fluctuations, proteins can be viewed as assemblies of small repeating subdomain sequences. Even if a protein sequence does not initially appear repetitive, we reasonably suggest that, from the perspective of compositional fluctuations, proteins are constructed from repeated units with the approximate sizes of subdomains.
As the codon table is highly biased regarding the physical properties of amino acids, the size of the unit of compositional fluctuation corresponds to the scale of fluctuations in these physical properties. The three-dimensional structures of many proteins comprise combinations of subdomains. For example, in immunoglobulins, subdomains consisting of β-sheets are assembled to form the overall structure. Similarly, the structures of triose-phosphate isomerase barrel proteins are created by the repetition of α-helices and β-sheets. Hence, many proteins are constructed through the repetition of subdomains, and this is thought to be reflected in the size of their compositional fluctuations. Therefore, the fluctuation size (60–70), as indicated by the standard deviations shown in Fig. 1(B), suggests that the entire coding region of the genome fluctuates on a scale comparable to the size of protein subdomains. In other words, even complex proteins appear to be constructed, in essence, from the repetition of small units, implying that their formation process is relatively simple.

Figure 2. (A) Four representative proteins (▶, fibrinopeptide; ◀, globin; ▲, cytochrome c; ▼, histone) on NC-plot for all genes in human genome, (B) Double logarithmic plot of height (h) above contour according to evolutionary rate (v) per protein [11]
3.2. Relationship between NC-plot for genes and evolutionary rates of proteins
Proteins have another remarkable aspect in that each one evolves at a unique rate. Proteins with higher functional importance tend to have slower evolutionary rates. This is because mutations that disrupt the function of critical proteins can prevent an organism from developing properly or surviving natural selection, ultimately leading to its demise. Hence, functionally important proteins—and the genes that encode them—typically evolve slowly. Indeed, a strong correlation exists between the functional importance and evolutionary rate in proteins [10]. However, given that the central dogma of molecular biology dictates unidirectional information flow (DNA → RNA → protein), we consider the proposed causality—where protein function directly dictates evolutionary processes at the DNA level—to be biologically implausible. Instead, we hypothesize that a confounding factor at the DNA level (e.g., accuracy of the repair system) simultaneously influences both evolutionary rates and protein importance, thereby creating a spurious causal relationship.
Fig. 2(A) shows the positions in the NC-plot of four proteins with different evolutionary rates: histone, cytochrome c, globin, and fibrinopeptide [10]. The evolutionary rate is expressed here as the time required to achieve a 1% change per 100 amino acid residues. This time is 500 million years for histone, 20 million years for cytochrome c, 5.8 million years for globin, and 1.1 million years for fibrinopeptide. Even among these four groups of proteins, the evolutionary rates differ by more than two orders of magnitude. Fig. 2(A) shows contour lines at 10% intervals over the two-dimensional distribution, which serve as references for identifying the relative position of each protein within the Gaussian distribution. Setting the peak of the distribution to 1, the corresponding positions (h) for the proteins are 0.5 for histone, 1.3 for cytochrome c, 4.0 for globin, and 6.5 for fibrinopeptide. Fig. 2(B) presents the correlation between these values (h) and evolutionary rates (v) on a double-logarithmic scale. We observe the following strong correlation:
| (5) |
This correlation represents a completely novel discovery, whose results are summarized as follows:
A clear correlation is observed between protein evolutionary rates and their positions in the Gaussian distribution. Generally, as the accuracy of genomic repair systems increases and the frequency of mutation errors decreases, evolutionary rates slow down. In other words, if the precision of the repair system varies depending on the distance from a species-defining genomic target (i.e., the peak of the distribution), the evolutionary rate may be directly correlated with the position in the distribution. Furthermore, the correlation between a protein's functional importance and its evolutionary rate may be naturally explained by the positional relationships in the Gaussian distribution of the NC-plot.
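The double-logarithmic relationship between position and evolutionary rate can be explored with an ordinary least-squares line through the four quoted data points. The sketch below is only an illustration under stated assumptions: it regresses log₁₀(h) on log₁₀ of the quoted evolutionary time in million years, whereas v in eq. (5) may be defined as the reciprocal quantity (which would flip the sign of the slope), and it does not claim to reproduce the coefficients of eq. (5). The function name `loglog_fit` is hypothetical.

```python
import math

# (protein, evolutionary time in million years, position h), as quoted in the text.
DATA = [
    ("histone", 500.0, 0.5),
    ("cytochrome c", 20.0, 1.3),
    ("globin", 5.8, 4.0),
    ("fibrinopeptide", 1.1, 6.5),
]


def loglog_fit(points):
    """Least-squares line log10(h) = slope * log10(time) + intercept."""
    xs = [math.log10(t) for _, t, _ in points]
    ys = [math.log10(h) for _, _, h in points]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation in log-log space
    return slope, intercept, r


slope, intercept, r = loglog_fit(DATA)
print(slope, intercept, r)
```

With only four points, the fit indicates the direction and rough strength of the trend rather than a precise exponent: proteins that take longer to change lie at smaller h.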
The evolutionary rate refers to the time for new mutations to be established in a population. This rate is influenced by several factors, including the frequency of mutations, efficiency of DNA repair mechanisms, and strength of natural selection acting on the organism. Among the genes encoding the four types of proteins shown in Fig. 2, the evolutionary rates differ by more than two orders of magnitude. Despite these substantial differences, the Gaussian distribution in the NC-plot for genes does not exhibit a correspondingly large variation in evolutionary rate. Generally, when differences in reaction rates are not reflected in the overall outcome, the system is possibly in equilibrium with respect to the underlying random process. In this context, it is reasonable to assume that the nucleotide composition in each species has reached equilibrium, aligning with the concept of evolutionary neutrality.
3.3. Relationship between NC-plot for genes and protein folds
Fig. 2 also provides important insights into the three-dimensional structure of proteins because the folds of the four types of proteins are distinct. Therefore, the relationship between protein folds and their positions should be considered in the Gaussian distribution of the NC-plot. As discussed in [4], the number of amino acids that constitute the molecular recognition site in a protein is small compared with the entire amino acid sequence and can almost be regarded as noise. However, the protein fold that supports the molecular recognition site is formed by the entire amino acid sequence and may be closely related to the position in the NC-plot.
For the four types of proteins illustrated in Fig. 2(A), proteins with different folds are far from each other in the NC-plot. The SCOP (Structural Classification of Proteins) database classifies proteins by class (e.g., α/β), fold (structural folding pattern), superfamily (structure plus evolutionary relationships), and family (clear homology at the sequence level) [11]. Therefore, proteins that are similar in sequence (i.e., that belong to the same family or superfamily) are expected to be close in the NC-plot. The family level represents groups of proteins that share common functions, and it lies below the fold level. Thus, many families and folds can form a mosaic pattern in the Gaussian distribution fitted to the NC-plot.
Considerations of membrane proteins also reveal that the position in the NC-plot for genes depends on the type of protein. As thymine at the second position of codons encodes only hydrophobic amino acids, a high proportion of thymine at this position leads to the formation of membrane proteins. In this case, d₂ in the two-dimensional composition space becomes large. In other words, in the Gaussian distribution, membrane proteins composed of many transmembrane regions are likely located in areas with high d₂ values—a finding that has also been suggested by preliminary studies [7].
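The statement that thymine at the second codon position encodes only hydrophobic amino acids can be checked directly against the standard genetic code. The sketch below builds the standard codon table from its conventional one-line representation (bases ordered T, C, A, G); the helper name `residues_by_second_base` is illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def residues_by_second_base(base: str) -> set[str]:
    """Amino acids (one-letter codes) encoded by codons with `base` in position 2."""
    return {aa for codon, aa in CODON_TABLE.items() if codon[1] == base}


second_t = residues_by_second_base("T")
print(sorted(second_t))  # ['F', 'I', 'L', 'M', 'V']
```

All five residues (phenylalanine, isoleucine, leucine, methionine, and valine) are hydrophobic, so a gene enriched in thymine at the second position is necessarily enriched in hydrophobic residues.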
The various folds of water-soluble proteins can be treated in the same way, in that they are expected to occupy different regions of the NC-plot. Highly hydrophilic proteins, fibrous proteins, globular proteins with hydrophobic regions, and proteins with other distinctive features differ in their amino acid properties and likely occupy distinct regions in the composition space. In other words, when translated into amino acid sequences via the codon table, these various folds form a mosaic pattern in the Gaussian distribution.
The NC-plot for genes in a genome follows a Gaussian distribution and may appear featureless. However, due to the bias in the physical properties of amino acids reflected by the codon table, a complex internal structure must underlie this Gaussian distribution. Conversely, a mosaic pattern in a Gaussian distribution may make it possible to map genomic sequences onto protein tertiary structures.
The codon table is a cipher that translates DNA codons into amino acids. Considering the NC-plot for genes, this table can also be regarded as another cipher for three-dimensional structures. If the mosaic structure in the NC-plot can be elucidated in advance, it may become possible to predict protein folds to some extent based on the nucleotide composition. Furthermore, if folds can be predicted, it may be possible to partially infer functional types directly from a DNA sequence.

Figure 3. Two regulators of genomic change: regulation of nucleotide composition by intracellular factors and regulation of DNA sequence by external environmental factors
Genome* refers to the genome sequence in which mutations have been introduced, prior to the formation of the organism.
As shown in Fig. 2(B), all proteins appear to be orderly embedded in the Gaussian distribution. An important question is whether the factors determining the NC-plot derived from the genome are extracellular environmental factors (natural selection) or intracellular factors (genome processing systems). Although the survival of organisms is believed to be determined by natural selection, organisms are also robustly maintained by intracellular factors. While these concepts are highly complex, we attempted to address them through the analysis of viral genomes.
We briefly discussed this issue at the end of our book [4]. The work that led to the book was conducted during the novel coronavirus (COVID-19) pandemic; thus, we analyzed many COVID-19 variants. Consequently, our analysis was limited to only one type of virus. In NC-plots, we found that the host human genome and genomes of COVID-19 viruses that had infected humans appeared in close proximity [4].
In general, viruses entirely depend on the host genome processing system for replication. Inside a cell, a viral genome can, in a sense, be regarded as a partial sequence of the host genome. Therefore, if the NC-plot is determined by the genome processing system itself, the host and viral genomes should appear in very close proximity. In fact, as the COVID-19 and human genomes appear close to each other, we consider that the NC-plot is determined primarily by the host genome processing system (i.e., intracellular factors). However, this analysis was limited to only one type of virus, leaving several unsolved questions regarding discrepancies in the NC-plot between host and viral genomes. Viruses can be categorized into distinct types, such as DNA or RNA viruses, enveloped or non-enveloped viruses, and viruses with different host cell entry mechanisms. To clarify the influence of differences in viral types on the NC-plot, further analysis involving a wider variety of viruses is required.
Recently, our research group conducted a preliminary analysis of the genomes of several viruses that infect humans. As with COVID-19, all analyzed viral genomes appeared close to the human genome in the NC-plot. Although more detailed investigation is necessary, these findings suggest that the NC-plot is determined by intracellular factors, specifically the genome processing system.
This paper is an opinion article that presents several hypotheses, and many issues remain to be addressed. Nevertheless, we hope to convey that the influence of external environmental factors is secondary, and that the characteristics of organisms are fundamentally determined by intracellular factors, specifically the genome processing system (Fig. 3). We believe that the NC-plot adequately represents this insight.