The Tohoku Journal of Experimental Medicine
Online ISSN : 1349-3329
Print ISSN : 0040-8727
ISSN-L : 0040-8727
Regular Contribution
Trinucleotide Substitutions at Two Locations in the SARS-CoV-2 Nucleocapsid (N) Gene
Tetsuya Akaishi, Kei Fujiwara, Tadashi Ishii

2023 Volume 260 Issue 1 Pages 21-27

Abstract

The genomes of sarbecoviruses, including severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), incorporate mutations with short sequence exchanges based on unknown processes. Currently, the presence of such short-sequence exchanges among the genomes of different SARS-CoV-2 lineages remains uncertain. In the present study, multiple SARS-CoV-2 genome sequences from different clades or sublineages were collected from an international mass sequence database and compared to identify the presence of short sequence exchanges. Initial screening with multiple sequence alignments identified two locations with trinucleotide substitutions, both in the nucleocapsid (N) gene. The first exchange from 5'-GAT-3' to 5'-CTA-3' at nucleotide positions 28,280-28,282 resulted in a change in the amino acid from aspartic acid (D) to leucine (L), which was predominant in clade GRY (Alpha). The second exchange from 5'-GGG-3' to 5'-AAC-3' at nucleotide positions 28,881-28,883 resulted in an amino acid change from arginine and glycine (RG) to lysine and arginine (KR), which was predominant in GR (Gamma), GRY (Alpha), and GRA (Omicron). Both trinucleotide substitutions occurred before June 2020. The sequence identity rate between these lineages suggests that coincidental succession of single-nucleotide substitutions is unlikely. A Basic Local Alignment Search Tool (BLAST) sequence search revealed the absence of intermediating mutations based on single-base substitutions or overlapping indels before the emergence of these trinucleotide substitutions. These findings suggest that the trinucleotide substitutions could have developed via an en bloc exchange. In summary, trinucleotide substitutions at two locations in the SARS-CoV-2 N gene were identified. These mutations may provide insights into the evolution of SARS-CoV-2.

Introduction

The pandemic of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is still categorized among the primary public health concerns globally in 2023 (Johns Hopkins University 2023). In early 2022, the largest outbreak surge was reported worldwide owing to the emergence of the Omicron variant (Karim and Karim 2021; Araf et al. 2022). During the pandemic, the intermittent emergence of multiple variants of concern (VOCs) was observed, which played a primary role in maintaining the pandemic across various countries worldwide (Mlcochova et al. 2021; Tao et al. 2021). The mutation profiles of previous VOCs have been primarily evaluated and categorized based on single nucleotide substitutions (i.e., point mutations), especially in the receptor-binding domain (RBD) of the SARS-CoV-2 spike (S) gene S1 subunit (SeyedAlinaghi et al. 2021; Hajizadeh et al. 2022). Recently, hotspots of insertions/deletions (indels) have been reported in the open reading frame 1a (ORF1a) polyprotein-encoding gene and the N-terminal domain (NTD) of the S1 gene in various SARS-related coronaviruses (Akaishi 2022a), suggesting the importance of mutations at genomic sites other than the S1 RBD in the subgenus Sarbecovirus. Moreover, highly polymorphic indel sites have been identified in SARS-CoV-2 S1-NTD sampled from humans (Akaishi et al. 2022a). In addition to these traditional mutations with single nucleotide substitutions or indels, complete exchanges of short sequences by unknown mechanisms have been suggested to exist in the genomes of SARS-related coronaviruses (Akaishi et al. 2022b). However, the exact developmental processes of such short-sequence exchanges and their presence among different SARS-CoV-2 lineages remain uncertain.
To characterize the traits and developmental processes of such exchanges of consecutive nucleotides in SARS-CoV-2 genomes, the present study compared multiple whole genome sequences of SARS-CoV-2 belonging to different lineages and searched across the genomes for exchanges of consecutive bases that cannot be explained simply by a combination of conventional mutation types, such as point mutations, insertions, and deletions.

Methods

Evaluation of SARS-CoV-2 genome sequences

In the present study, 49 SARS-CoV-2 genome sequences sampled from humans were collected and compared to search for the presence of exchanges in consecutive bases that cannot be attributed to combinations of point mutations, insertions, and deletions. The genome sequence of the original Wuhan-Hu-1 strain was obtained from the NCBI GenBank database (Bethesda, MD, USA) with accession ID MN908947.3. Sequences from subsequent lineages were obtained from the Global Initiative on Sharing All Influenza Data (GISAID) database (Munich, Germany) (Elbe and Buckland-Merrett 2017; Shu and McCauley 2017; Khare et al. 2021; GISAID 2023), restricted to those registered and available by December 22nd, 2022. Four sequences were randomly selected from the database for each of the following clades or lineages: L, GH (Beta), GR (Gamma), G, GRY (Alpha), GK (Delta), and GRA (Omicron) sublineages BA.1, BA.2, BA.5, BQ.1, BQ.1.1, and XBB. A list of the 49 evaluated genome sequences is shown in Table 1.

Multiple sequence alignments and sequence identity analysis

With 49 whole genome sequences (nt 1-29,903), multiple sequence alignments were performed using Molecular Evolutionary Genetics Analysis Version 11 (MEGA11) software (Tamura et al. 2021). Regarding the alignment parameters, the gap opening penalty score was set to –400, and the gap extension penalty score was set to 0. Multiple sequence alignments were performed after dividing the collected sequences into two sets, each with the original Wuhan-Hu-1 sequence and 24 sequences from subsequent lineages (two sequences from each clade or sublineage). Using the aligned sequences, base positions with exchanges of ≥ 3 consecutive nucleotides were examined across the whole genomes. Furthermore, for sequences with exchanges of consecutive nucleotides, the sequence identity rate (%) in comparison with the original Wuhan-Hu-1 genome sequence was calculated using Multiple Alignment using Fast Fourier Transform (MAFFT) software, offered by the European Molecular Biology Laboratory (EMBL) (European Molecular Biology Laboratory 2023).
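
The screening for exchanges of ≥ 3 consecutive nucleotides can be illustrated with a short Python sketch. It is not the procedure performed in MEGA11; the function name find_substitution_runs, its parameters, and the 15-base toy fragment are hypothetical and chosen only for illustration. The sketch assumes two already-aligned sequences of equal length and treats gap or ambiguous columns as non-substitutions, so that indels are not counted as exchanges. Reported positions are alignment columns shifted by first_pos, which match genome coordinates only when the reference row contains no gaps.

```python
def find_substitution_runs(ref, query, min_len=3, first_pos=1):
    """Return (start, end, ref_bases, query_bases) for every run of at least
    min_len consecutive substituted columns in a pairwise alignment.

    Columns containing a gap ('-') or a non-ACGT symbol are not counted as
    substitutions and therefore break a run.
    """
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have equal length")
    ref, query = ref.upper(), query.upper()
    runs, start = [], None
    # Iterate one step past the end so that a trailing run is closed.
    for i in range(len(ref) + 1):
        is_sub = (
            i < len(ref)
            and ref[i] != query[i]
            and ref[i] in "ACGT"
            and query[i] in "ACGT"
        )
        if is_sub:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append(
                    (start + first_pos, i - 1 + first_pos, ref[start:i], query[start:i])
                )
            start = None
    return runs


# Toy fragment (hypothetical input): first column assumed to be nt 28,274.
reference = "ATGTCTGATAATGGA"
variant = "ATGTCTCTAAATGGA"
print(find_substitution_runs(reference, variant, first_pos=28274))
```

Isolated point substitutions and runs shorter than min_len are ignored, which mirrors the ≥ 3 consecutive nucleotide criterion stated above.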

Ethics and data availability

The present study was approved by the institutional review board of Tohoku University Graduate School of Medicine (approval number: 2022-1-720). The findings of this study are based on metadata associated with 14,329,052 sequences, which were sampled from humans and available on GISAID up to December 22nd, 2022, via EPI_SET_230112cr.

Table 1.

List of the evaluated 49 genome sequences of SARS-CoV-2 sampled from humans.

The genome sequence of the original Wuhan-Hu-1 was obtained from the NCBI GenBank database. The other 48 sequences, 4 from each clade or sublineage, were randomly selected from the GISAID database from the overall sequences that were registered and available by December 22nd, 2022.

GISAID, Global Initiative on Sharing All Influenza Data.

Results

Sites with an exchange of ≥ 3 consecutive nucleotides

Two sets of multiple sequence alignments with different sets of 25 sequences identified two sites with base exchanges in three consecutive nucleotides, both in the N gene. The first site was located at the N-terminal domain of the coding region of the N gene corresponding to the nucleotide positions of 28,280-28,282 nt in the SARS-CoV-2 genome sequence, and a change in the nucleotide sequence from “GAT” to “CTA” was observed. The coded amino acid was changed from aspartic acid (D) to leucine (L). This first trinucleotide exchange site was identified in both evaluated sequences from clade GRY (Alpha). The second site was located at the nucleotide positions of 28,881-28,883 nt in the middle of the SARS-CoV-2 N gene, and a change in the nucleotide sequence from “GGG” to “AAC” was observed. The coded amino acids were changed from the succession of arginine and glycine (203-204 amino acids: RG) to lysine and arginine (KR). This second trinucleotide exchange site was identified in the clades GR (Gamma), GRY (Alpha), and GRA (Omicron), including all evaluated sublineages of Omicron (BA.1, BA.2, BA.5, BQ.1, BQ.1.1, and XBB). The results of multiple sequence alignments with the evaluated 25 sequences at these two sites of trinucleotide exchanges are shown in Fig. 1.
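
The amino acid consequences of the two exchanges can be checked with a minimal Python sketch using the standard genetic code. Two points are assumptions not stated in the text: that the N coding region begins at nt 28,274, so that nt 28,881-28,883 cover the second and third bases of codon 203 and the first base of codon 204, and that the reference codons 203-204 read AGG GGA. The names CODON_TABLE and translate are illustrative.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a DNA string codon by codon, ignoring any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


# Site 1: codon 3 of N (nt 28,280-28,282), GAT -> CTA.
print(translate("GAT"), "->", translate("CTA"))

# Site 2: codons 203-204 (assumed reference AGG GGA, nt 28,880-28,885).
ref_203_204 = "AGGGGA"
assert ref_203_204[1:4] == "GGG"  # the three exchanged bases straddle two codons
mut_203_204 = ref_203_204[0] + "AAC" + ref_203_204[4:]
print(translate(ref_203_204), "->", translate(mut_203_204))
```

Because the second exchange straddles a codon boundary, a single three-base change alters two adjacent residues, which is consistent with the paired R203K and G204R substitutions described above.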

Probability of successive point mutations

Next, to exclude the possibility that the observed trinucleotide exchanges in the SARS-CoV-2 N gene arose through the gradual, coincidental accumulation of point substitutions at three successive nucleotide positions, the expected probability of observing three successive point mutations was estimated from the overall mutation rate between the original Wuhan-Hu-1 and one of the evaluated SARS-CoV-2 genomes from clade GRY (EPI_ISL_916362). Based on EMBL MAFFT sequence identity analysis of the two sequences, the sequence identity was 99.91% after excluding nucleotide positions with indels. Based on the nucleotide mutation rate of 0.09% (9 in 10,000 nucleotides), the expected probability of observing point mutations in three successive nucleotides across 29,903 bases (Wuhan-Hu-1 whole genome) was 29,903 × (9E-4)^3 ≈ 2.180E-05 (i.e., 2.180E-03%). As we observed two sites with a three-base exchange, the probability was (2.180E-05)^2 ≈ 4.752E-10 (i.e., 4.752E-08%). This value was sufficiently low; hence, we concluded that the two sites of trinucleotide exchange are unlikely to have developed by a coincidental succession of single base substitutions in three consecutive nucleotides.
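
The arithmetic above can be reproduced with a few lines of Python. This is only a restatement of the stated calculation under its own simplifying assumptions (a uniform per-site substitution probability of 9E-4, independence between neighboring sites, and independence between the two locations); the variable names are illustrative.

```python
GENOME_LENGTH = 29_903   # Wuhan-Hu-1 whole genome, in nucleotides
MUTATION_RATE = 9e-4     # 0.09%, i.e., 100% minus the 99.91% sequence identity

# Expected probability of three successive point mutations across the genome.
p_one_site = GENOME_LENGTH * MUTATION_RATE ** 3

# Probability of observing two such sites, treated as independent events.
p_two_sites = p_one_site ** 2

print(f"one site : {p_one_site:.3e} ({p_one_site * 100:.3e}%)")
print(f"two sites: {p_two_sites:.3e} ({p_two_sites * 100:.3e}%)")
```

Formatted to four significant figures, the two values agree with the 2.180E-05 and 4.752E-10 reported in the text.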

Number of sequences with each amino acid substitution in the GISAID database

Next, the overall number of sequences registered in the GISAID database with each of the three observed amino acid substitutions was investigated. These values were further evaluated within different GISAID clades and sublineages. The results are presented in Table 2. The N_D3L substitution was observed in 98.80% of sequences from the clade GRY (Alpha); however, it was also noted in a minority of sequences from the clades G (5.45%) and GR (Gamma; 16.72%). Meanwhile, the successive N_R203K and N_G204R substitutions were almost exclusively observed in GR (Gamma; 97.08%), GRY (Alpha; 92.68%), and GRA (Omicron; 95.93%).

Intermediating mutations based on single-base substitutions

To further rule out the possibility of gradual accumulation of single nucleotide substitutions in the first three-base exchange site at 28,280-28,282 nt, the numbers of registered sequences in the GISAID database with conceivable intermediating types of single nucleotide substitution linking 5'-GAT-3' to 5'-CTA-3' were evaluated. Three single-point substitutions are possible as a first step: 5'-GAT-3' (amino acid: D) to 5'-CAT-3' (amino acid: H), 5'-GTT-3' (amino acid: V), or 5'-GAA-3' (amino acid: E). Among the 14,329,052 registered sequences overall, the number of sequences with the N_D3H substitution was 91 (0.0006%), that with the N_D3V substitution was 259 (0.002%), and that with the N_D3E substitution was 913 (0.006%). These very low frequencies of conceivable single nucleotide substitutions linking 5'-GAT-3' to 5'-CTA-3' at nt 28,280-28,282 supported the interpretation that the observed trinucleotide exchange could have occurred simultaneously as an en bloc sequence exchange. As this three-base exchange started to gradually increase in clade G, the prevalence of sequences with each of the three possible intermediating substitutions was further evaluated among the 354,434 sequences belonging to clade G. The number of sequences with N_D3H was 21 (0.006%), that with N_D3V was 12 (0.003%), and that with N_D3E was 155 (0.04%). Again, the results supported the interpretation that the three-base exchange occurred en bloc at once and not via gradual accumulation of single base substitutions in successive base positions.
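
The three single-step intermediates and their prevalence can be enumerated with a short Python sketch. The counts and the database total are taken from the paragraph above; the function name single_step_intermediates and the dictionary layout are hypothetical, and the sketch does not query GISAID.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_step_intermediates(ref, alt):
    """Return every sequence that differs from ref by exactly one of the
    substitutions needed to turn ref into alt."""
    return [
        ref[:i] + alt[i] + ref[i + 1:]
        for i in range(len(ref))
        if ref[i] != alt[i]
    ]


TOTAL_SEQUENCES = 14_329_052
# Counts of N_D3H, N_D3V, and N_D3E reported in the text, keyed by amino acid.
COUNTS = {"H": 91, "V": 259, "E": 913}

for codon in single_step_intermediates("GAT", "CTA"):
    aa = CODON_TABLE[codon]
    share = 100 * COUNTS[aa] / TOTAL_SEQUENCES
    print(f"GAT -> {codon} (D3{aa}): n = {COUNTS[aa]}, {share:.4f}%")
```

Each intermediate changes the encoded residue (H, V, or E), so all three would be detectable as distinct amino acid substitutions in a database search, which is what makes their near absence informative.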

Intermediating mutations based on overlapping indels

Next, to exclude the possibility of a two-step indel process (i.e., a three-base insertion after a three-base deletion or a three-base deletion after a three-base insertion) for the first three-base exchange (amino acid: D > L) at the N-terminus of the N gene, a GISAID database search with N_ins3L, N_ins4L, and N_D3del was performed. None of the three indel patterns was identified among the 14,329,052 sequences registered in the GISAID database (n = 0/14,329,052; 0.0%, for all three types). This finding suggests that a multistep developmental process of three-base exchange based on an overlapping three-base insertion and three-base deletion is unlikely.

Examination via BLAST for conceivable intermediating mutations

Finally, to estimate the period of each conceivable intermediating sequence, with substitutions in two of the three nucleotide positions for the two locations in the N gene, a sequence search was performed using the Basic Local Alignment Search Tool (BLAST) from the NCBI. The obtained results, including the confirmed first date and the location of each intermediate sequence, are summarized in Table 3. In the first location with a trinucleotide substitution at 28,280-28,282 nt, the confirmed oldest mutant with 5'-CTA-3' dated back to May 24th, 2020 (GenBank Accession: ON299968.1), whereas the earliest intermediating sequences with two nucleotide substitutions dated back to December 10th, 2020, which was much later than the emergence of the mutant with the trinucleotide substitution. In the second location with a trinucleotide substitution at 28,881-28,883 nt, the confirmed oldest mutant with 5'-AAC-3' dated back to March 14th, 2020 (GenBank Accession: MW030211.1), whereas the earliest intermediating sequences with two nucleotide substitutions dated back to November 17th, 2020, which was also much later than the emergence of the mutant with the trinucleotide substitution.
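
The intermediates with two of the three substitutions, and the time gaps reported above, can be sketched in Python as follows. The dates are those stated in the paragraph; the function name k_step_intermediates and the FIRST_SEEN structure are hypothetical, and no BLAST query is performed.

```python
from datetime import date
from itertools import combinations


def k_step_intermediates(ref, alt, k):
    """Return the sequences obtained by applying exactly k of the
    substitutions that separate ref from alt."""
    differing = [i for i in range(len(ref)) if ref[i] != alt[i]]
    results = []
    for chosen in combinations(differing, k):
        seq = list(ref)
        for i in chosen:
            seq[i] = alt[i]
        results.append("".join(seq))
    return results


# Earliest collection dates reported in the text for each location:
# (full trinucleotide substitution, intermediate with two substitutions).
FIRST_SEEN = {
    "28,280-28,282": (date(2020, 5, 24), date(2020, 12, 10)),
    "28,881-28,883": (date(2020, 3, 14), date(2020, 11, 17)),
}

print(k_step_intermediates("GAT", "CTA", 2))
print(k_step_intermediates("GGG", "AAC", 2))
for site, (trinucleotide, intermediate) in FIRST_SEEN.items():
    lag = (intermediate - trinucleotide).days
    print(f"nt {site}: intermediates first seen {lag} days after the full exchange")
```

A positive lag at both locations means the fully exchanged sequences predate every two-substitution intermediate found, which is the ordering the BLAST search was designed to test.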

Fig. 1.

Identification of two sites of trinucleotide exchange in the SARS-CoV-2 N gene.

The panels show the results of two sets of multiple sequence alignments with 25 evaluated sequences (one with the original Wuhan-Hu-1 and 24 from subsequent lineages) in each set. Sequence alignments identified two nucleotide exchange sites in three consecutive nucleotides. Both sites were located in the N gene. Considering that the sequence identity between the original Wuhan-Hu-1 and the evaluated sequence from the clade GRY (Alpha) was 99.91%, the observation of two sites of three-base exchange based on a gradual accumulation of single nucleotide substitutions is not probable. Rather, a supposition of a new mutation type, an en bloc exchange of short consecutive nucleotides, is needed to explain the observed mutations.

Table 2.

Numbers of registered sequences with each amino acid substitution in the GISAID database.

The number (n) and prevalence (%) of registered SARS-CoV-2 genome sequences collected from humans with each of the three types of amino acid substitutions are listed. The number and rate were obtained within each of the following GISAID clades and sublineages: L, GH (Beta), GR (Gamma), G, GRY (Alpha), GK (Delta), and GRA (Omicron) sublineages BA.1, BA.2, BA.5, BQ.1, BQ.1.1, and XBB.

GISAID, Global Initiative on Sharing All Influenza Data; N, nucleocapsid.

Table 3.

The Basic Local Alignment Search Tool (BLAST) search results for the earliest strains with sequences linking the original sequences and mutants with trinucleotide substitution.

A BLAST sequence search was performed on February 2nd, 2023. In the first location at 28,280-28,282 nt, the oldest confirmed mutant with trinucleotide substitution dated back to May 24th, 2020, whereas the earliest intermediating sequences with two nucleotide substitutions dated back to December 10th, 2020. In the second location at 28,881-28,883 nt, the confirmed oldest mutant with trinucleotide substitution dated back to March 14th, 2020, whereas the earliest intermediating sequences with two nucleotide substitutions dated back to November 17th, 2020. In both locations, the mutants with a trinucleotide substitution were collected much earlier than other mutants with intermediating sequences, suggesting an en bloc development of the trinucleotide substitutions.

Discussion

In the present study, the presence of a possible new type of gene mutation, an en bloc exchange of short consecutive bases, was reported at two sites in the SARS-CoV-2 N gene. The possibility of coincidental accumulation of single nucleotide substitutions or overlapping indels was ruled out in the present study by performing a comprehensive BLAST sequence search and GISAID database search for each conceivable intermediating sequence linking the original strain and mutants with trinucleotide substitutions. Consequently, the observed trinucleotide substitutions in the SARS-CoV-2 N gene may imply a novel type of mutation, which is different from previously known traditional mutations, such as point mutations, insertions, deletions, inversions, duplications, translocations, or recombinations (Gu et al. 2008; Lee et al. 2012). Currently, the exact developmental mechanisms of sequence exchanges involving dozens of consecutive nucleotides in sarbecoviruses remain uncertain (Akaishi 2022a; Akaishi et al. 2022b); however, the results of the present study imply that en bloc sequence substitutions may have played a role in sarbecovirus evolution. Further studies are warranted to elucidate the presence and roles of en bloc sequence substitutions in viruses.

This study further showed that the prevalence of each trinucleotide substitution differed remarkably between clades of SARS-CoV-2, suggesting that trinucleotide substitutions may have played a role in the spread of the virus. Currently, profiles of mutations between different variants of concern are primarily compared based on mutations in the S1 RBD. However, mutations in other gene locations outside the S1 RBD, such as the S1 NTD or N gene, may also need to be considered. Generally, the N gene is highly conserved, and the frequency of mutation occurrence is much lower than that of the S gene (Thakur et al. 2022). A previous study that examined a recombinant SARS-CoV-2 Alpha variant with cloning techniques suggested that the R203K + G204R mutation could increase viral replication and enhance pathogenesis (Johnson et al. 2022). More specifically, the R203K + G204R mutation is located in the serine-rich domain of the N gene, and the phosphorylation level of this domain is considered to regulate nucleocapsid function via liquid-liquid phase separation (Carlson et al. 2020). By modulating the phosphorylation level of the nucleocapsid, the trinucleotide substitution may increase viral fitness with enhanced adaptation in humans.

Currently available genetic engineering technologies, including the clustered regularly interspaced short palindromic repeats (CRISPR)/CRISPR-associated system (Jinek et al. 2012; Ran et al. 2013), seem to be unable to realize en bloc exchange of short consecutive nucleotides in a single-stranded RNA genome without performing additional complicated processes, such as preparing a double-stranded DNA sequence and exchanging a specific position. Some unknown mechanisms that realize short-sequence exchanges in single-stranded RNA molecules may exist in natural environments, including host cells. Moreover, the mutation profiles of sarbecoviruses differ significantly from those of other virus species, such as adenoviruses or influenza viruses, with different frequencies and lengths of indels (Akaishi 2022b). This fact suggests another possibility that short sequence exchange is a phenomenon specific to some viruses, and the virus genomes may encode molecules that realize such mutations. SARS-CoV-2 RNA-dependent RNA polymerases (RdRp) play a major role in the replication machinery, and replication fidelity is remarkably influenced by mutations in some non-structural proteins (Eckerle et al. 2010; Pachetti et al. 2020). Further studies are needed to determine whether the viral replication machinery is a key player in the development of trinucleotide substitutions in the N gene.

The present study has several limitations. First, we only identified exchanges of three consecutive bases. Future studies are needed to determine whether there are en bloc sequence exchanges involving > 3 consecutive nucleotides in SARS-related coronaviruses or other organisms. Another limitation was that the present study only evaluated the genome sequences of samples derived from humans, and whether the observed three-base exchange occurred only in humans or also in other animal hosts could not be determined. Furthermore, whether the developmental mechanisms of such en bloc exchange mutations are coded in the viral genome itself or are realized by the transcription machinery of the host cell remains unknown. Studies are required to determine the exact mechanisms associated with en bloc short-sequence exchange in SARS-related coronaviruses. Lastly, research has shown that large populations, especially those with high mutation rates, can appear to fix multiple mutations simultaneously (Weinreich and Chao 2005), a pattern typically observed when low-fitness intermediates are bypassed. Similarly, studies of polymerase errors have shown that it is not uncommon for the polymerase to make mistakes at sites located close together (Drake 2007). These facts make it difficult to conclude that the observed trinucleotide substitutions in the SARS-CoV-2 N gene truly developed from a single en bloc sequence exchange.

In summary, the present study identified two locations of trinucleotide substitutions in the SARS-CoV-2 N gene, which were difficult to explain using traditional single nucleotide substitutions and/or indels. Further studies are warranted to determine the exact mechanisms underlying the substitution of continuous nucleotides.

Acknowledgments

We gratefully acknowledge all data contributors, i.e., the authors and their originating laboratories responsible for obtaining the specimens, and their submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based.

Author Contributions

T.A. conceived, performed analyses, and drafted the manuscript. K.F. performed analyses, verified the results, and critically reviewed and revised the manuscript. T.I. supervised the study, and critically reviewed and revised the manuscript.

Conflict of Interest

The authors declare no conflict of interest.

© 2023 Tohoku University Medical Press

This article is licensed under a Creative Commons [Attribution-NonCommercial-NoDerivatives 4.0 International] license.
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC-BY-NC-ND 4.0). Anyone may download, reuse, copy, reprint, or distribute the article without modifications or adaptations for non-profit purposes if they cite the original authors and source properly.
https://creativecommons.org/licenses/by-nc-nd/4.0/