2024 Volume 89 Issue 3 Pages 187-195
During the decoding of nucleotide triplets, known as codons, the preference for certain synonymous codons affects the efficiency and accuracy of protein translation. Therefore, identifying codon usage bias provides insights into the gene expression system of a particular organism. The unicellular red alga Cyanidioschyzon merolae features a remarkably simple genome structure and a small number of genes for a eukaryotic organism, yet its codon usage pattern remains unclear. Here we report the synonymous codon usage frequencies for protein-coding genes in C. merolae. By comparing codon usage frequencies, we discovered that not only the wobble position (the third nucleotide of the codon) but also the first nucleotide contributes to codon bias. The overall difference in the frequency of codon usage among genes is attributed to the selectivity for the G/C base, particularly evident in highly expressed genes. The extremely narrow range of the codon adaptation index for protein-coding genes suggests that C. merolae has not undergone frequent gene uptake from extracellular sources to date. We found that not only highly expressed genes but also many functional genes have been optimized for a specific set of codon usage patterns. In contrast, codon bias was very low in genes thought to be newly created de novo in the genome. These results suggest that the evolution of genes has progressed through replacement of codons with G/C-containing synonymous codons, and that codon substitutions would continue until the codon usage frequency has adjusted to the ideal ratio for gene function or necessity.
The genetic code contained in DNA is transcribed into RNA and then translated into protein, enabling the expression of function. To decipher the code from nucleotide sequences to amino acid sequences, nucleotide triplets known as codons are converted into corresponding amino acids by the ribosome via amino acid-specific tRNAs. The mismatch between the number of nucleotide triplet patterns and the available amino acids leads to codon degeneracy, which results in each amino acid, except methionine and tryptophan, having 2, 3, 4 or 6 synonymous codons. Interestingly, even though synonymous codons encode the same amino acid, their usage frequencies in protein-coding genes are not identical, a phenomenon known as codon usage bias (Ikemura 1985; Sharp et al. 1986; Liu et al. 2021). Previously, synonymous codons were thought to have no biological effect because they do not change the amino acid sequence of proteins. However, the fact that highly expressed genes, such as ribosomal protein genes, are extremely enriched in frequently used codons suggests that codon usage correlates with gene expression levels (Sharp and Li 1987; Zhoua et al. 2016). It is also known that codon usage bias often relates to corresponding tRNA gene copy numbers (Dong et al. 1996; López et al. 2020). Thus, differences in codon usage frequencies among genes are expected to result in varying levels of gene expression. Moreover, the usage of synonymous codons is now thought to widely influence the translation process, including factors such as elongation rate (Yu et al. 2015), translation efficiency (Frumkin et al. 2018), initiation and termination (Zhou et al. 2018; Barrington et al. 2023), and accuracy (Liu 2020). Thus, codon usage bias is considered a fundamental layer in the regulatory mechanism of gene expression, but the reasons why gene expression levels are associated with codon usage bias are not fully understood.
While optimal codons, which are prevalent in highly expressed protein-coding genes such as those encoding ribosomal proteins, have been shown to positively affect gene expression, the biological significance of rare codons remains unclear. Additionally, it has not been revealed why rare codons have not been eliminated from genomes in both prokaryotes and eukaryotes during evolution.
In this study, we characterized codon usage in the unicellular red alga Cyanidioschyzon merolae (Matsuzaki et al. 2004), which has a simple genome structure, to gain insight into the relationship between codon usage bias and gene expression. The simple or primitive features of C. merolae are evident in the components of the central dogma of molecular biology. The nuclear genome contains only 4,751 protein-coding genes, 99% of which are single-exon genes (Nozaki et al. 2007). Furthermore, while a few hundred copies of rRNA genes are generally found in eukaryotic genomes, only three copies have been identified in C. merolae. The extremely low number of rRNA genes suggests that protein translation is highly efficient and optimized in this organism. Our results suggest that the selectivity of synonymous codons in an open reading frame (ORF) is likely related not only to its expression level but also to its biological function. The narrow range of the codon adaptation index (CAI) across ORFs indicates low genomic fluidity in the C. merolae genome, allowing us to consider gene evolution based on the degree of codon optimization of each gene. New genes arising from non-genic regions of the genome use synonymous codons randomly, and many established genes have not fully optimized their codons, suggesting that the frequency of codon usage has been continuously optimized during evolution and that the optimal pattern of codon usage depends on gene function in this simple unicellular eukaryote.
Complete coding sequences were retrieved from the Cyanidioschyzon merolae Genome Project website (http://czon.jp/) for Cyanidioschyzon merolae 10D and from the Ensembl genome browser website (https://asia.ensembl.org/index.html) for Escherichia coli str. K-12 substr. MG1655 and Saccharomyces cerevisiae S288C. Coding sequences that are encoded in organelle genomes or are shorter than 30 nucleotides were excluded from the following analyses. A transcriptome dataset of periodic gene expression patterns in synchronized C. merolae cells was obtained from a previous study (Fujiwara et al. 2020).
Indices of codon usage

To normalize codon usage within coding sequences in the datasets, relative synonymous codon usage (RSCU) was computed for the 61 codons excluding stop codons. The RSCU value for a codon is the observed frequency of that codon divided by the frequency expected if all synonymous codons for the amino acid were used equally (Sharp and Li 1986). An RSCU value close to 1.0 indicates no bias for the corresponding codon, whereas a value greater than 1 indicates that the codon is used more frequently than expected and a value less than 1 indicates that it is used less frequently. RSCU values were calculated according to the formula given in previous reports.
$$\mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}}$$

where $X_{ij}$ is the number of occurrences of the j-th codon for the i-th amino acid, and $n_i$ is the number of alternative codons (1, 2, 3, 4, or 6) for the i-th amino acid.
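As a concrete illustration of the RSCU definition (observed count divided by the mean count over the synonymous codons of the same amino acid), the following minimal Python sketch recomputes the alanine values for r-protein ORFs from the counts listed in Table 1. The function name `rscu` and the dictionary layout are illustrative assumptions; the study itself used the coRdon R package and Excel.

```python
def rscu(counts, families):
    """Return a dict codon -> RSCU.

    counts:   dict codon -> observed number of occurrences (X_ij)
    families: dict amino acid -> list of its synonymous codons
    (both structures are hypothetical, chosen for this sketch)
    """
    values = {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined, so skip it
        expected = total / len(codons)  # (1 / n_i) * sum_j X_ij
        for c in codons:
            values[c] = counts.get(c, 0) / expected
    return values


# Alanine codon counts in r-protein ORFs, taken from Table 1.
ala_counts = {"GCU": 250, "GCC": 243, "GCA": 314, "GCG": 463}
ala_rscu = rscu(ala_counts, {"Ala": ["GCU", "GCC", "GCA", "GCG"]})
print({codon: round(value, 2) for codon, value in ala_rscu.items()})
# {'GCU': 0.79, 'GCC': 0.77, 'GCA': 0.99, 'GCG': 1.46}
```

By construction, the RSCU values of one amino acid sum to its number of synonymous codons, which is a convenient sanity check.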
The G+C content of the first (GC1), second (GC2) and third codon position (GC3) were then calculated. GC12 is the mean value of GC1 and GC2. These values were used for neutrality plots (Fig. 2A).
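The position-specific G+C contents can be sketched as follows. This is a minimal example under stated assumptions: the input is an in-frame coding sequence, any trailing partial codon is ignored, and no codons (such as the stop codon) are filtered out, because the text does not specify such details. The function name `gc_by_position` and the test sequence are hypothetical.

```python
def gc_by_position(cds):
    """Return (GC1, GC2, GC3, GC12) for an in-frame coding sequence.

    Assumption of this sketch: every complete codon is counted and a
    trailing partial codon, if any, is ignored.
    """
    cds = cds.upper()
    n_codons = len(cds) // 3
    if n_codons == 0:
        raise ValueError("sequence is shorter than one codon")
    gc = [0, 0, 0]  # G/C counts at codon positions 1, 2 and 3
    for i in range(n_codons):
        codon = cds[3 * i:3 * i + 3]
        for pos in range(3):
            if codon[pos] in "GC":
                gc[pos] += 1
    gc1, gc2, gc3 = (count / n_codons for count in gc)
    return gc1, gc2, gc3, (gc1 + gc2) / 2  # GC12 is the mean of GC1 and GC2


# Hypothetical four-codon sequence: ATG GCG CTC AAG
gc1, gc2, gc3, gc12 = gc_by_position("ATGGCGCTCAAG")
print(gc1, gc2, gc3, gc12)  # 0.5 0.25 1.0 0.375
```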

Fig. 1. Scatter plot comparing variances of RSCU values calculated from cytosolic r-protein genes for each amino acid and amino acid frequencies in all ORFs (n=4,751). Acidic and basic amino acids are shown in red and blue, respectively. Hydrophobic and hydrophilic amino acids are shown in green and black, respectively.

Fig. 2. (A–C) Scatter plots of mean G+C contents at the first and second codon positions (GC12) (A), effective number of codons (B) or codon adaptation index (C) versus G+C base contents at the third nucleotide position of codons (GC3). Red dashed lines in (A) indicate the means of GC12 and GC3. The red solid line in (B) indicates the expected values of the effective number of codons if the codon bias is due only to GC3. Pearson correlation coefficients are shown in the plots for (A) and (C).
The effective number of codons (ENC) is used to measure the magnitude of codon bias for an individual gene and is essentially independent of gene length (Wright 1990; Novembre 2002). Values of ENC range from 20 (for a gene with extreme bias using only one codon per amino acid) to 61 (for a gene with no bias using synonymous codons equally). The expected values of ENC for a given GC3 content were calculated as
$$\mathrm{ENC} = 2 + s + \frac{29}{s^{2} + (1 - s)^{2}}$$

where s represents the given GC3 value from 0 to 1.
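The expected ENC curve can be sketched in a few lines of Python. This assumes that the formula referred to above is the standard expectation from Wright (1990), ENC = 2 + s + 29 / (s² + (1 − s)²), with s the GC3 fraction; the function name `expected_enc` is illustrative.

```python
def expected_enc(s):
    """Expected ENC when codon bias is due to GC3 content alone.

    s is the GC3 fraction (0 to 1). The formula is the assumed
    standard curve of Wright (1990).
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("GC3 must lie between 0 and 1")
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


for s in (0.0, 0.5, 1.0):
    print(s, expected_enc(s))
# 0.0 31.0
# 0.5 60.5
# 1.0 32.0
```

The curve peaks near s = 0.5, where G/C-ending and A/U-ending codons are equally likely, and drops toward both extremes, so an ORF lying well below the curve uses fewer codons than its GC3 content alone would predict.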
The codon adaptation index (CAI) was used to estimate the extent of bias toward codons known to be preferred in highly expressed genes (Sharp and Li 1987). The CAI model assigns each codon a parameter, termed relative adaptiveness, which is defined as its frequency relative to that of the most often used synonymous codon in a set of highly expressed genes. The relative adaptiveness of a codon ($w_{ij}$) was first computed as
$$w_{ij} = \frac{\mathrm{RSCU}_{ij}}{\mathrm{RSCU}_{i\max}}$$

where $\mathrm{RSCU}_{ij}$ is the RSCU value of the j-th codon for the i-th amino acid, and $\mathrm{RSCU}_{i\max}$ is the RSCU value of the codon most often used to encode the i-th amino acid in a set of highly expressed genes. The values of w were computed from cytosolic r-protein genes as the reference gene set in this study. The CAI for a gene is then defined as the geometric mean of the $w_k$ values for its codons.
$$\mathrm{CAI} = \left(\prod_{k=1}^{L} w_k\right)^{1/L}$$

where L is the number of codons in the gene excluding methionine, tryptophan, and stop codons, and $w_k$ is the w value for the k-th codon in the gene. The CAI ranges from 0 to 1. Low values indicate little bias toward the preferred codons of the reference set, whereas a value of 1 indicates the strongest bias, in which only optimal codons are used, and implies a potentially higher expression level. All computations were performed using the coRdon R package (https://github.com/BioinfoHR/coRdon) and Excel software. The calculated values for the number of codon occurrences in the ORFs, GC content, ENC and CAI are available in Supplementary Data S1. The Pearson correlation coefficients and the two-sided P values calculated using the t-distribution are summarized in Supplementary Table S3.
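The two steps of the CAI calculation (relative adaptiveness w, then the geometric mean over the codons of a gene) can be sketched as below. This is a minimal illustration, not the coRdon implementation: the function names and the two-amino-acid reference set are assumptions made for the example, with the RSCU values for lysine and glutamate taken from the r-protein columns of Table 1. Codons without a w value (methionine, tryptophan, stop codons) are simply skipped, and the sketch assumes every w is greater than zero.

```python
import math


def relative_adaptiveness(ref_rscu, families):
    """w = RSCU of a codon / RSCU of the most used synonymous codon."""
    w = {}
    for codons in families.values():
        best = max(ref_rscu[c] for c in codons)
        for c in codons:
            w[c] = ref_rscu[c] / best
    return w


def cai(codons, w):
    """Geometric mean of w over the informative codons of one gene.

    Codons absent from w (Met, Trp, stop) are skipped; all w are
    assumed to be > 0, otherwise the logarithm is undefined.
    """
    logs = [math.log(w[c]) for c in codons if c in w]
    if not logs:
        raise ValueError("gene contains no informative codons")
    return math.exp(sum(logs) / len(logs))


# Reference RSCU values for r-protein ORFs (Table 1), two amino acids only.
ref_rscu = {"AAA": 0.75, "AAG": 1.25, "GAA": 0.74, "GAG": 1.26}
families = {"Lys": ["AAA", "AAG"], "Glu": ["GAA", "GAG"]}
w = relative_adaptiveness(ref_rscu, families)

optimal_gene = ["AUG", "AAG", "GAG", "AAG"]  # hypothetical, preferred codons only
mixed_gene = ["AUG", "AAA", "AAG"]           # hypothetical, one non-preferred codon
print(cai(optimal_gene, w), cai(mixed_gene, w))
```

Working in log space avoids numerical underflow when the product runs over the hundreds of codons of a real ORF.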
To investigate the pattern of codon usage in relation to gene expression levels in C. merolae, we first calculated the occurrence rate of each synonymous codon for each amino acid. Relative synonymous codon usage (RSCU) values were computed by dividing the observed number of codon occurrences by the number expected if all synonymous codons were equally frequent (see Materials and methods). For the RSCU calculation, ORFs of cytosolic ribosomal protein genes (r-protein genes) were used as the reference dataset, as these genes are highly transcribed throughout the cell cycle. RSCU values based on r-protein ORFs (RSCUr) showed that some specific codons are used more often than their synonymous codons (Table 1). Although the genomic nucleotide content of C. merolae is not strongly biased towards G+C, with 55.0% in the genome overall and 56.7% in protein-coding genes, all preferred codons among the 59 informative codons end with a G or C base. Thus, the selectivity for a G or C base at the third nucleotide position of the codon in protein-coding genes is likely to be a factor determining synonymous codon usage in C. merolae.
| Amino acid | Triplet | r-protein ORFs: Count | r-protein ORFs: RSCU | All ORFs: Count | All ORFs: RSCU | Emerging ORFs: Count | Emerging ORFs: RSCU |
|---|---|---|---|---|---|---|---|
| Ala | GCU | 250 | 0.79 | 60,085 | 0.88 | 887 | 1.04 |
| | GCC | 243 | 0.77 | 53,909 | 0.79 | 679 | 0.80 |
| | GCA | 314 | 0.99 | 69,998 | 1.03 | 959 | 1.13 |
| | GCG | 463 | 1.46 | 88,467 | 1.30 | 881 | 1.03 |
| Cys | UGU | 74 | 0.79 | 18,222 | 0.83 | 362 | 0.82 |
| | UGC | 113 | 1.21 | 25,827 | 1.17 | 522 | 1.18 |
| Asp | GAU | 228 | 0.84 | 59,706 | 0.99 | 789 | 1.08 |
| | GAC | 314 | 1.16 | 61,378 | 1.01 | 672 | 0.92 |
| Glu | GAA | 290 | 0.74 | 63,644 | 0.82 | 1,136 | 1.04 |
| | GAG | 491 | 1.26 | 91,451 | 1.18 | 1,048 | 0.96 |
| Phe | UUU | 197 | 0.99 | 38,059 | 0.94 | 595 | 1.00 |
| | UUC | 199 | 1.01 | 43,032 | 1.06 | 591 | 1.00 |
| Gly | GGU | 321 | 1.24 | 43,510 | 1.11 | 515 | 0.97 |
| | GGC | 347 | 1.34 | 52,957 | 1.35 | 585 | 1.10 |
| | GGA | 206 | 0.79 | 33,930 | 0.86 | 611 | 1.15 |
| | GGG | 163 | 0.63 | 27,082 | 0.69 | 421 | 0.79 |
| His | CAU | 99 | 0.66 | 27,749 | 0.91 | 524 | 1.09 |
| | CAC | 201 | 1.34 | 33,569 | 1.09 | 436 | 0.91 |
| Ile | AUU | 244 | 1.06 | 32,310 | 1.08 | 470 | 1.09 |
| | AUC | 361 | 1.57 | 42,927 | 1.44 | 469 | 1.09 |
| | AUA | 87 | 0.38 | 14,363 | 0.48 | 355 | 0.82 |
| Lys | AAA | 437 | 0.75 | 30,239 | 0.89 | 559 | 1.01 |
| | AAG | 727 | 1.25 | 37,703 | 1.11 | 547 | 0.99 |
| Leu | UUA | 59 | 0.32 | 11,851 | 0.28 | 256 | 0.47 |
| | UUG | 194 | 1.06 | 41,189 | 0.97 | 573 | 1.05 |
| | CUU | 189 | 1.03 | 39,074 | 0.92 | 666 | 1.22 |
| | CUC | 289 | 1.58 | 64,385 | 1.52 | 653 | 1.19 |
| | CUA | 74 | 0.40 | 19,764 | 0.47 | 351 | 0.64 |
| | CUG | 295 | 1.61 | 77,511 | 1.83 | 784 | 1.43 |
| Met | AUG | 286 | 1.00 | 45,112 | 1.00 | 608 | 1.00 |
| Asn | AAU | 88 | 0.53 | 24,191 | 0.77 | 496 | 0.92 |
| | AAC | 243 | 1.47 | 38,905 | 1.23 | 585 | 1.08 |
| Pro | CCU | 119 | 0.77 | 25,022 | 0.74 | 464 | 0.90 |
| | CCC | 113 | 0.73 | 26,616 | 0.79 | 376 | 0.73 |
| | CCA | 133 | 0.86 | 32,360 | 0.95 | 597 | 1.16 |
| | CCG | 256 | 1.65 | 51,612 | 1.52 | 620 | 1.21 |
| Gln | CAA | 146 | 0.61 | 37,404 | 0.70 | 656 | 0.96 |
| | CAG | 331 | 1.39 | 68,756 | 1.30 | 708 | 1.04 |
| Arg | CGU | 287 | 1.09 | 41,747 | 1.16 | 543 | 1.03 |
| | CGC | 448 | 1.70 | 61,906 | 1.72 | 649 | 1.23 |
| | CGA | 315 | 1.19 | 46,135 | 1.28 | 642 | 1.21 |
| | CGG | 335 | 1.27 | 37,653 | 1.05 | 444 | 0.84 |
| | AGA | 94 | 0.36 | 14,412 | 0.40 | 520 | 0.98 |
| | AGG | 106 | 0.40 | 13,578 | 0.38 | 380 | 0.72 |
| Ser | UCU | 97 | 0.83 | 24,223 | 0.74 | 558 | 0.90 |
| | UCC | 107 | 0.92 | 28,859 | 0.88 | 536 | 0.86 |
| | UCA | 92 | 0.79 | 24,277 | 0.74 | 611 | 0.98 |
| | UCG | 168 | 1.44 | 47,944 | 1.47 | 726 | 1.17 |
| | AGU | 87 | 0.75 | 25,815 | 0.79 | 488 | 0.79 |
| | AGC | 147 | 1.26 | 45,199 | 1.38 | 806 | 1.30 |
| Thr | ACU | 108 | 0.66 | 22,936 | 0.67 | 516 | 0.84 |
| | ACC | 162 | 0.98 | 35,391 | 1.03 | 448 | 0.73 |
| | ACA | 122 | 0.74 | 29,429 | 0.86 | 708 | 1.15 |
| | ACG | 267 | 1.62 | 49,436 | 1.44 | 793 | 1.29 |
| Val | GUU | 293 | 1.09 | 40,965 | 0.99 | 630 | 1.19 |
| | GUC | 242 | 0.90 | 42,596 | 1.03 | 544 | 1.03 |
| | GUA | 121 | 0.45 | 23,873 | 0.57 | 399 | 0.76 |
| | GUG | 420 | 1.56 | 58,641 | 1.41 | 539 | 1.02 |
| Trp | UGG | 113 | 1.00 | 37,507 | 1.00 | 456 | 1.00 |
| Tyr | UAU | 124 | 0.67 | 23,124 | 0.82 | 310 | 0.96 |
| | UAC | 247 | 1.33 | 33,609 | 1.18 | 339 | 1.04 |
Codons associated with the permuted tRNAs are shown in bold and italic.
Next, we compared the nucleotide preferences at the first nucleotide position of codons. As only leucine, arginine and serine are encoded by synonymous codons with different first nucleotides, we examined the nucleotide preferences at the first position using the codon usage frequencies for these amino acids. The mean RSCU values for codons starting with G/C are more than 1.5 times higher than those for codons starting with A/U (1.67-fold for leucine and 3.46-fold for arginine), whereas the ratio of codons starting with A to codons starting with U for serine is 1.01. To assess the impact of first-nucleotide selectivity on codon bias in the C. merolae genome, we considered the frequency of occurrence of each amino acid in the ORFs. Leucine (10.3%), arginine (8.7%) and serine (8.0%), which are encoded by synonymous codons with different first nucleotides, are among the most abundant amino acids in the ORFs (Table 2). Given that the variances in RSCU values for leucine and arginine, as well as their occurrence frequencies in ORFs, are very high (Fig. 1 and Table 2), the preference for codons starting with G/C and the avoidance of codons starting with A/U for these amino acids suggest that the first nucleotide position of the codons is also a crucial factor in genome-wide codon bias. These results suggest that not only G/C base selectivity at the third position, but also G/C base selectivity at the first nucleotide position is related to codon usage bias in C. merolae.
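The first-position ratios quoted above can be checked directly from the r-protein ORF counts in Table 1. Within one amino acid, every RSCU value is the codon count divided by the same expected count, so the ratio of mean RSCU values between two groups of codons equals the ratio of their mean counts. The following sketch relies on that identity; the grouping dictionary and function names are illustrative assumptions.

```python
# r-protein ORF codon counts from Table 1, grouped by first nucleotide.
GROUPS = {
    "Leu": {"C": [189, 289, 74, 295], "U": [59, 194]},    # CUN versus UUR
    "Arg": {"C": [287, 448, 315, 335], "A": [94, 106]},   # CGN versus AGR
    "Ser": {"A": [87, 147], "U": [97, 107, 92, 168]},     # AGY versus UCN
}


def mean(values):
    return sum(values) / len(values)


def first_base_ratio(groups, numerator, denominator):
    """Ratio of mean codon usage between two first-base groups.

    Equals the ratio of mean RSCU values, because all codons of one
    amino acid share the same expected count.
    """
    return mean(groups[numerator]) / mean(groups[denominator])


ratios = {
    "Leu": first_base_ratio(GROUPS["Leu"], "C", "U"),
    "Arg": first_base_ratio(GROUPS["Arg"], "C", "A"),
    "Ser": first_base_ratio(GROUPS["Ser"], "A", "U"),
}
print({aa: round(r, 2) for aa, r in ratios.items()})
# {'Leu': 1.67, 'Arg': 3.46, 'Ser': 1.01}
```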
| Amino acid | Count | Fraction | RSCU variance (r-protein ORFs) | RSCU variance (All ORFs) | RSCU variance (Emerging ORFs) |
|---|---|---|---|---|---|
| Ala | 272,459 | 0.111 | 0.078 | 0.037 | 0.015 |
| Cys | 44,049 | 0.018 | 0.043 | 0.030 | 0.033 |
| Asp | 121,084 | 0.049 | 0.025 | 0.000 | 0.006 |
| Glu | 155,095 | 0.063 | 0.066 | 0.032 | 0.002 |
| Phe | 81,091 | 0.033 | 0.000 | 0.004 | 0.000 |
| Gly | 157,479 | 0.064 | 0.088 | 0.062 | 0.019 |
| His | 61,318 | 0.025 | 0.116 | 0.009 | 0.008 |
| Ile | 89,600 | 0.036 | 0.237 | 0.156 | 0.016 |
| Lys | 67,942 | 0.028 | 0.062 | 0.012 | 0.000 |
| Leu | 253,774 | 0.103 | 0.254 | 0.296 | 0.114 |
| Met | 45,112 | 0.018 | 0.000 | 0.000 | 0.000 |
| Asn | 63,096 | 0.026 | 0.219 | 0.054 | 0.007 |
| Pro | 135,610 | 0.055 | 0.143 | 0.097 | 0.038 |
| Gln | 106,160 | 0.043 | 0.150 | 0.087 | 0.001 |
| Arg | 215,431 | 0.087 | 0.229 | 0.230 | 0.034 |
| Ser | 196,317 | 0.080 | 0.068 | 0.092 | 0.032 |
| Thr | 137,192 | 0.056 | 0.143 | 0.081 | 0.051 |
| Val | 166,075 | 0.067 | 0.159 | 0.088 | 0.025 |
| Trp | 37,507 | 0.015 | 0.000 | 0.000 | 0.000 |
| Tyr | 56,733 | 0.023 | 0.110 | 0.034 | 0.002 |
In addition to the canonical tRNA group, C. merolae contains a specific tRNA group, called permuted tRNAs, which require additional processing to mature because the 3′ and 5′ halves of the tRNA are swapped in the genome (Soma et al. 2007, 2013). For this reason, synonymous codons corresponding to the permuted tRNAs may influence codon usage bias in C. merolae. At present, we have not identified specific trends in the usage frequencies of the codons associated with permuted tRNAs, which include both preferred and non-preferred codons (Table 1). Thus, a potential influence of permuted tRNAs on codon usage bias cannot be excluded.
Assessment of codon structure at genome-wide single-nucleotide resolution

To identify the driving force shaping codon usage bias, we performed a neutrality analysis by comparing the average G+C content at the first and second codon positions (GC12) with that at the third codon position (GC3) for each ORF (Sueoka 1999) (Fig. 2A). In general, a statistically significant correlation between GC12 and GC3 in ORFs suggests that mutation bias is the main force shaping the codon usage pattern. In contrast, a narrow range of GC content distribution and no correlation between GC12 and GC3 suggest the presence of a force that promotes selection of codon usage. The GC content profiles of the protein-coding genes showed that more than 95% of the genes fell within a GC content range spanning 0.2 for both GC12 and GC3, and we found no correlation between GC12 and GC3 (Pearson r=−0.01, p=0.468). Thus, the synonymous codon usage pattern in C. merolae seems to originate mainly from selection rather than mutation bias.
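The correlation test behind the neutrality analysis can be sketched in plain Python. The example computes the Pearson coefficient between per-ORF GC12 and GC3 values and the t statistic from which a two-sided P value is obtained with n − 2 degrees of freedom. The four data points are hypothetical, and the function names are illustrative; the P value itself would require a t-distribution routine that is omitted here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical GC3 and GC12 values for four ORFs with no linear relationship.
gc3_values = [0.5, 0.6, 0.5, 0.6]
gc12_values = [0.6, 0.6, 0.7, 0.7]
r_toy = pearson_r(gc3_values, gc12_values)

# With the reported r = -0.01 and n = 4,751 ORFs, |t| stays well below 2,
# which is in line with the non-significant P value given in the text.
t_reported = t_statistic(-0.01, 4751)
print(r_toy, round(t_reported, 3))
```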
The genome-wide codon bias in C. merolae was then assessed using two quantities, the effective number of codons (ENC) and the CAI. Comparing the expected and observed ENC for each ORF allows an assessment of how far the codon usage of an ORF departs from equal usage of synonymous codons. Although the expected ENC for an ORF depends on GC3 (red solid line), many ORFs deviate from the expected values across the whole range of GC3 (Fig. 2B). Interestingly, the data indicated that codon usage bias could be classified into three types: ORFs with GC3-dependent bias, which have adopted the expected number of codons determined by GC3 values; ORFs with strong codon bias, which have adopted fewer codons than expected from GC3 values; and ORFs with no bias, characterized by highly random codon usage.
Next, we calculated the CAI for each ORF, which is widely used as an index of codon optimization for efficient translation. The reference dataset for calculating CAI values was the r-protein genes. The top 50 ORFs ranked by CAI included the core histone genes and 31 multi-copy genes (Supplementary Table S1). Given the high copy number of core histones required for DNA packaging and the fact that only a limited number of duplicated genes exist in the C. merolae genome, it is likely that the expression levels of these genes are very high in vivo. These facts suggest that the CAI is a reliable measure of codon usage optimization, even for the C. merolae ORFs. CAI values correlate positively with GC3 content (Pearson r=0.81, p<0.01) (Fig. 2C), whereas their correlation with GC12 is negligible (Pearson r=−0.07, p<0.01). Thus, the results also suggest that genes with higher CAI values have a greater degree of codon usage bias through selection, favoring synonymous codons ending in a G or C base.
Low genomic fluidity of C. merolae genome

Gene gain by lateral transfer could not only affect genomic fluidity (Soucy et al. 2015; Sibbald et al. 2020), but also alter the codon usage frequency in an organism. Thus, the range of CAI values would reflect an aspect of genomic fluidity in unicellular organisms. By comparing the CAI values of three unicellular organisms (the red alga C. merolae, the proteobacterium Escherichia coli, and the yeast Saccharomyces cerevisiae), we found that the CAI profile of C. merolae genes is remarkably narrower than those of the other organisms (Fig. 3). In addition to the narrow range, the CAI profile of C. merolae is characterized by a higher mean value (0.734±0.025) and unimodality. These facts suggest that the C. merolae genome has greater consistency or unity than those of the other organisms, indicating advanced codon optimization. In fact, because the habitat of C. merolae is confined to very acidic and high-temperature conditions, such as hot springs, the number of species in its ecological environment and the opportunities to acquire foreign DNA are likely fewer than for the other organisms compared. Therefore, given the low genomic fluidity of the C. merolae genome, we then analyzed the codon optimization of genes in terms of gene evolution.

Fig. 3. Histograms of codon adaptation index for ORFs in C. merolae 10D (n=4,751), Escherichia coli str. K-12 substr. MG1655 (n=4,239), and Saccharomyces cerevisiae S288C (n=6,550). Mean values are 0.734±0.025 for C. merolae, 0.447±0.099 for E. coli, and 0.447±0.101 for S. cerevisiae.
If optimization of codon usage has been facilitated for all genes through evolution, not only the codon usage pattern of highly expressed ORFs but also those of other ORFs would be adjusted to efficiently achieve their gene function. To test this hypothesis, we calculated RSCU using all ORFs (RSCUa) (Table 1). We found that the overall codon usage patterns of the two groups, r-protein ORFs and all ORFs, are highly similar, with a significant correlation (Pearson r=0.952, p<0.01). The basal trend in codon usage patterns remained consistent, even for genes expressed during specific periods of the cell cycle (Supplementary Table S2). These facts suggest that the underlying bias in codon usage patterns is likely not related to the expression levels of individual genes, but rather determined by intrinsic factors within the C. merolae cell, such as the abundance of available tRNAs. Additionally, the variances in RSCU values calculated from r-protein ORFs (mean variance for RSCUr=0.109) are higher than those calculated from all ORFs (mean variance for RSCUa=0.070), and the selectivity for preferred synonymous codons also appears to be greater in the highly expressed ORFs than in all ORFs (Tables 1 and 2). Consequently, since G/C-ending preferred codons are favored in highly expressed ORFs, this bias may help stabilize the codon-anticodon interaction through increased hydrogen bonding, potentially enhancing the efficiency of protein translation by ribosomes.
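The per-amino-acid RSCU variances compared above can be sketched as follows. The example assumes the population variance (dividing by the number of synonymous codons); with the r-protein ORF counts from Table 1, this assumption reproduces the Table 2 entries for alanine, cysteine and isoleucine. Function names are illustrative.

```python
def rscu_values(counts):
    """RSCU values of one amino acid from its synonymous codon counts."""
    expected = sum(counts) / len(counts)
    return [count / expected for count in counts]


def population_variance(values):
    """Variance with divisor n (an assumption of this sketch)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# r-protein ORF counts from Table 1.
var_ala = population_variance(rscu_values([250, 243, 314, 463]))  # GCU, GCC, GCA, GCG
var_cys = population_variance(rscu_values([74, 113]))             # UGU, UGC
var_ile = population_variance(rscu_values([244, 361, 87]))        # AUU, AUC, AUA
print(round(var_ala, 3), round(var_cys, 3), round(var_ile, 3))
# 0.078 0.043 0.237
```

Averaging such per-amino-acid variances over the 20 rows of Table 2 gives the mean variances quoted in the text.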
Codon usage patterns in de novo genes

De novo gene birth is the process by which new genes arise from non-genic regions of the genome (Carvunis et al. 2012; Van Oss and Carvunis 2019). Lastly, we examined the codon usage patterns in genes resulting from the de novo gene birth process. Although the mechanism of de novo gene birth is still not clear, one likely process is that a non-genic region first acquires transcription and an ORF, in either order, through mutation, resulting in the emergence of a novel gene that is born de novo and is not related to any pre-existing genes. In this case, the codon usage of newly born de novo genes would be expected to be random, rather than adapted to the codon usage pattern specific to the organism. As the simplest definition of a de novo gene is that it has no obvious homology with any reported gene, we focused on genes encoding a non-conserved and presumably newly born ORF, hereafter emerging ORFs, for the analysis.
Interestingly, when all ORFs were sorted according to CAI values, we found that the proportion of de novo genes was negatively correlated with CAI values (Pearson r=−0.917, p<0.01). Among the bottom 250 ORFs with the lowest CAI values, 117 ORFs (46.8%) are classified as de novo genes (Fig. 4). We classified the 100 de novo genes with the lowest CAI values as emerging ORFs and calculated RSCU values for these genes (RSCUe) to evaluate their codon usage frequencies (Table 1). The results indicated that the codon usage pattern of emerging ORFs follows a similar trend to that of highly expressed ORFs (Pearson r=0.648, p<0.01) (Fig. 5A), yet the distribution of RSCUe values is narrower than those of RSCUr or RSCUa (Fig. 5B). Additionally, the variance of the RSCUe values for the 61 codons was calculated to be 0.030, which is smaller than those for RSCUr (0.132) and RSCUa (0.093), suggesting that the codon bias of emerging ORFs is very weak. Thus, as expected, emerging ORFs resulting from de novo gene birth have largely random codon usage patterns. Taken together, these facts suggest that synonymous codons in a gene have been optimized in two steps. During gene evolution, codon usage is first adapted to intrinsic factors determined by cell structure and composition, such as the amount of tRNA available, and codons are then replaced with other synonymous codons in a specific ratio suited to gene function or necessity.
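The ranking analysis described above (sorting ORFs by CAI and counting non-conserved genes per block of 250) can be sketched as below. The gene list is a hypothetical toy dataset, the bin size is reduced to 2 so that the example stays readable, and the function name is an illustrative assumption.

```python
def nonconserved_per_bin(genes, bin_size=250):
    """Count non-conserved ORFs in consecutive bins of genes ranked by CAI.

    genes: list of (cai_value, is_nonconserved) tuples (hypothetical layout)
    Returns one count per bin, starting from the lowest CAI values.
    """
    ranked = sorted(genes, key=lambda gene: gene[0])
    return [
        sum(1 for _, flag in ranked[start:start + bin_size] if flag)
        for start in range(0, len(ranked), bin_size)
    ]


# Toy data: (CAI, non-conserved?) for six hypothetical ORFs.
toy_genes = [
    (0.70, True), (0.75, False), (0.68, True),
    (0.74, False), (0.72, True), (0.78, False),
]
bin_counts = nonconserved_per_bin(toy_genes, bin_size=2)
print(bin_counts)  # [2, 1, 0]

# Fraction reported in the text for the lowest-CAI bin of 250 ORFs.
lowest_bin_fraction = 117 / 250
print(lowest_bin_fraction)  # 0.468
```

Correlating the per-bin counts with the mean CAI of each bin, for example with a Pearson coefficient, would then quantify the negative trend reported in the text.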

Fig. 4. The histogram shows the number of genes encoding non-conserved ORFs in each bin of 250 genes. Genes are sorted according to their codon adaptation index values.

Fig. 5. (A) Scatter plot of the RSCU values calculated for r-protein ORFs and the RSCU values for all ORFs or emerging ORFs. (B) Distributions of RSCU values for genes encoding cytosolic r-protein ORFs, all ORFs and emerging ORFs. The bottom 100 genes with low CAI values are shown as representative of emerging ORFs.
Previous studies have shown that synonymous codon usage patterns affect translation speed and efficiency (Spencer et al. 2012; Yu et al. 2015). However, the mechanisms and effects of codon usage alterations on gene expression are still not fully understood. By studying codon usage patterns in the unicellular alga C. merolae, which has a simple genome, we propose that stepwise optimization of codon usage has occurred for each gene during evolution. Since frequently and rarely used codons were common to all ORFs regardless of gene expression level, the most fundamental process of codon optimization across all ORFs is likely to be substitution with preferred synonymous codons and avoidance of non-preferred codons. By analyzing G+C content in codons and comparing codon usage frequencies, we identified a trend that determines which synonymous codons are preferred and which are not. Although there are a few exceptions, synonymous codons with G or C bases are more likely to be accepted than codons with A or U bases. Furthermore, as clearly seen in the synonymous codon usage frequencies for leucine and arginine, the optimization process operates not only at the third nucleotide position, but also at the first nucleotide position. Since the trend is common not only to highly expressed ORFs but also to all ORFs, the selection of a representative synonymous codon would be an optimization that focuses intracellular resources on the synthesis of the corresponding tRNA and enables efficient protein translation. In highly expressed ORFs, the representative synonymous codons are adopted more frequently, and the use of the other synonymous codons is reduced. As all representative synonymous codons have a G or C base at the first and/or the third position, codon-anticodon interaction via the hydrogen bonds of G/C base pairs would contribute to more reliable and efficient translation of mRNA in a given time.
Optimization of codon usage frequency via gene evolution

Different patterns of codon usage between de novo genes and other functional genes suggest a process of gene evolution. According to the de novo gene birth model, de novo genes are newly created in non-genic regions, and their codon usage is assumed to be random. However, our results showed that while the bottom 100 de novo genes with low CAI values are not biased towards optimal synonymous codons and their codon usage is highly random, the CAI values for all de novo genes span a wide range. These facts suggest that genes have improved their expression efficiency by modifying codon usage frequency during evolution.
Several hypotheses have been proposed to explain the origin of codon usage bias. One of these, called the selection-mutation-drift theory (Bulmer 1991), suggests that codon usage bias results from a balance between selection, mutational pressure, and genetic drift. However, recent studies using genome sequences from various organisms, including both bacteria and eukaryotes, have shown that there are different types of bias that can affect the frequency of codon usage (Quax et al. 2015). Not only the frequency of synonymous codons in an ORF but also the order of synonymous codons is biased, which is called synonymous codon co-occurrence bias (Cannarrozzi et al. 2010; Shao et al. 2012; Zhang et al. 2013). This phenomenon also suggests that the presence of non-preferred codons could be an indispensable regulator in the central dogma of molecular biology. Indeed, our results indicate that the group of genes associated with cell and organelle division is not optimized like typical highly expressed genes, despite the need for high copy numbers of the proteins encoded by these genes. Given that many genes expressed during specific cell cycle phases in C. merolae are not fully optimized, the seemingly inefficient codon usage pattern might actually represent the appropriate codon frequency to fulfill their functions (Supplementary Table S2). Recent studies have demonstrated that preferred codons enhance the protein translation elongation rate, while non-preferred codons reduce it (Spencer et al. 2012; Yu et al. 2015). Although protein translation in C. merolae is not always at its fastest, our results suggest that the temporal regulation of protein translation through suboptimal codon usage is likely to facilitate appropriate protein folding speed and accuracy for functional proteins.
Generally, to study the molecular function of a protein of interest in a cell of another organism, the synonymous codons of the gene encoding the protein are redesigned according to the codon usage frequencies of the host organism. However, the synonymous codon usage of many genes encoding functional proteins does not perfectly match that of typical highly expressed genes. Our results therefore serve as a warning that excessive use of preferred synonymous codons in functional analyses might lead to misfolding of the target protein and misinterpretation of the phenomenon under study.
This work was supported by PRESTO from the Japan Science and Technology Agency (JPMJPR20EE to Y.Y.); the Human Frontier Science Program Career Development Award (no. CDA00049/2018-C to Y.Y.); Japan Society for the Promotion of Science KAKENHI (nos. JP18K06325 and 22H02653 to Y.Y.); and the Institute for Fermentation, Osaka (L-2020-2-008 to Y.Y.). We thank our lab colleagues for their support and advice during this project.
Conceptualization: Y.K., Y.Y.; Methodology: Y.K., S.K., Y.Y.; Investigation: Y.K., S.K., Y.Y.; Writing – original draft: Y.K., Y.Y.; Writing – review & editing: Y.K., S.K., Y.Y.; Visualization: Y.K., Y.Y.; Supervision: Y.Y.; Project administration: Y.Y.; Funding acquisition: Y.Y.
Supplementary information including Tables S1, S2, S3 and Data S1 is available online.
This is an invited article commemorating the author’s award of the CYTOLOGIA Encouragement Award.