SARS-CoV-2 HaploGraph: visualization of SARS-CoV-2 haplotypes spread in Japan 1

38 Since the early phase of the coronavirus disease 2019 (COVID-19) pandemic, a 39 number of research institutes have been sequencing and sharing high-quality severe acute 40 respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes to trace the route of infection in 41 Japan. To provide insight into the spread of COVID-19, we developed a web platform named 42 SARS-CoV-2 HaploGraph to visualize the emergence timing and geographical transmission 43 of the SARS-CoV-2 haplotypes. Using data from the GISAID EpiCoV database as of June 4, 44 2022, we created a haplotype naming system by determining the ancestral haplotype for 45 each wave and showed prefectural or region-specific haplotypes in each of the four epidemic 46 waves in Japan. The SARS-CoV-2 HaploGraph allows for interactive tracking of virus 47 evolution and geographical prevalence of haplotypes and aids in developing effective public 48 health control strategies during the global pandemic. The code and the data used for this 49 study are publicly available at: https://github.com/ktym/covid19/. 50


Introduction
Coronavirus disease 2019  has become the leading cause of morbidity and mortality worldwide since the end of 2019.COVID-19 is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) with a genome size of 29.9 kb (Khailany et al., 2020;Morens et al., 2020).SARS-CoV-2 is highly transmissible with a broad tissue tropism, causing rapid respiratory and gastrointestinal illnesses and long-term ramifications such as myocardial inflammation, which is a significant risk for the elderly (Harrison et al., 2020).Globally, as of July 25, 2022, 572,239,451 confirmed cases of COVID-19, including 6,390,401 deaths, have been reported to WHO [World Health Organization [WHO], 2022a], which makes COVID-19 one of the deadliest infectious diseases in history, as suggested by previous reports (Morens et al., 2020;Sampath et al., 2021).COVID-19 has become an enormous threat to public health and social and economic activities in many countries, including Japan (Smallwood et al., 2022;Nicola et al., 2020).
Japan has faced six COVID-19 epidemic waves caused by different lineages of SARS-CoV-2 by June 2022 (Outbreak.info, 2023).The first COVID-19 case in Japan was confirmed on January 15, 2020, from a person who had a history of staying in Wuhan, China, shortly after the COVID-19 outbreak [Ministry of Health, Labour and Welfare (MHLW), 2020].Reports of additional cases followed, including the passengers from Wuhan via chartered flights in January and the crew members and passengers on the Diamond Princess cruise ship in February (Arima et al., 2020;Yamagishi et al., 2020).By March, the number of confirmed cases continued to increase and peaked in April (Arima et al., 2021).It was reported that the first wave was contributed by at least two introductions of distinct strains: the earlier introduction in January and February from China, and the latter in March and April mainly from Europe, of which the Spike (S) protein contains an amino acid replacement of aspartate to glycine at position 614 (D614G) (Sekizuka et al., 2020a mSphere).
The second and third epidemic waves were caused by PANGO (Phylogenetic Assignment of Named Global Outbreak, Rambaut et al., 2020) lineages B.1.1.284 and B.1.1.214 [National Institute of Infectious Diseases (NIID), 2021a], respectively.It was reported that the two Japan-specific lineages, B.1.1.284 and B.1.1.214,were derived from the B.1.1.114lineage which was mainly found in Europe from March to April 2020 (NIID, 2021a).The second wave started with an increasing number of cases in late May 2020, then peaked during late July through August, declined in September, and plateaued through the first half of October (Arima et al., 2021).The third epidemic wave began to rise in the latter half of October 2020 and has resulted in substantial COVID-19 morbidity and mortality, surpassing both the total and fatal case counts from the first two waves combined.All indicators continued to increase through December 2020 and peaked in January 2021 in the third wave (Arima et al., 2021;Outbreak.info, 2023).
The fourth wave was caused by the first variant of concern (VOC), Alpha (including B.1.1.7 and Q lineages), which was first identified in the United Kingdom (UK) in late summer to early autumn 2020 (Volz et al., 2021).The fourth wave started to increase in Japan in March 2021 and reached its peak in May 2021 (NIID, 2021a;WHO, 2022a).The epidemiological studies estimating the reproductive numbers conducted in the UK showed that the Alpha variant containing N501Y replacement in the S protein has a 35% to 100% higher transmissibility than the pre-existing SARS-CoV-2 variants (Graham et al., 2021;Davies et al., 2021;Volz et al., 2021;Leung et al., 2021).The N501Y mutation enhances the Alpha variant's affinity for angiotensin-converting enzyme 2 (ACE2), the cellular receptor, that facilitates viral entry (Luan et al., 2021).An epidemiological study in Japan also indicated that the Alpha variant had an approximately 1.9-2.3-foldhigher transmissibility than the pre-existing virus in the Japanese population (Tanaka et al., 2021).Alpha variants were not only more transmissible than pre-existing SARS-CoV-2 variants but also caused more severe illness and increased mortality (Challen et al., 2021;Grint et al., 2021;Davies et al., 2021).
The Delta variant (B.1.617.2 and AY lineages) that was first detected in India in October 2020 (WHO, 2022b;Mlcochova et al., 2021) began spreading rapidly throughout the country from July to August 2021 causing the 5th wave in Japan (Abe and Arita, 2021;Koyama et al., 2022).The AY.29 which is a sub-lineage of the B.1.617.2, containing C5239T and T5514C mutations, spread around the end of June and early July 2021 and became the predominant strain in Japan, reaching its peak in September 2021 (Abe and Arita, 2021;Koyama et al., 2022).The Delta variant lacks the N501Y, the mutation prominent in the Alpha variant, but carries several mutations within the S protein such as L452R, T478K, and P681R, which confer resistance to monoclonal antibody treatments (Gupta et al., 2021).The P681R replacement also was found to increase fusogenicity and pathogenicity that characterized the specific feature of the Delta variant (Saito et al., 2022).Another mutation within the N protein, R203M, increases viral mRNA delivery and expression, allowing the Delta variant to produce >50-fold more viral particles (Syed et al., 2021).Studies reported that Delta was 63%-167% more transmissible than Alpha in the USA and showed 1.4 times higher transmissibility than Alpha in Japan (Earnest et al., 2022;Ito et al., 2021).
The sixth wave in Japan was caused by the Omicron variants, mainly by BA.1 (i.e., B.1.1.529.1) (Iketani et al., 2022;Desingu et al., 2022).The first case of Omicron infection was reported in "Japan quarantine" on November 30, 2021, from a traveler from Namibia (MHLW, 2021).Following the first case, the rapid spread of the Omicron variant caused the largest increase of COVID-19 cases in Japan and reached its peak in February 2022 (Outbreak.info, 2023).The major strain of the sixth wave shifted from BA.1 to BA.1.1 up to March 2021, which was later replaced by BA.2, a sister lineage of BA.1 (NIID, 2022).The Omicron (BA.1) variant containing over 30 amino acid replacements in the S protein, showed a higher affinity for ACE2 compared with Delta.A marked change in antigenicity increased Omicron's evasion of therapeutic monoclonal and vaccine-elicited polyclonal neutralizing antibodies after two doses (Meng et al., 2022;Okumura et al., 2022).Another study suggested that Omicron has spread more rapidly than the Delta variant in several countries and revealed that Omicron showed lower fusogenicity and attenuated pathogenicity compared to Delta and ancestral SARS-CoV-2 (Suzuki et al., 2022).
In this study, to obtain key insights into the spread of COVID-19 in Japan in more detail, we created a website visualizing the transmission of SARS-CoV-2 haplotypes named SARS-CoV-2 HaploGraph (https://ktym.github.io/covid19/).The SARS-CoV-2 HaploGraph utilized major SARS-CoV-2 lineages for each epidemic wave that were obtained by conducting phylogenetic analyses of SARS-CoV-2 genomes sampled in Japan.The reference haplotype for each lineage was identified, and their mutations were compared with those of other haplotypes to trace the transmission of COVID-19, taking into account the sampled date, location, and the number of sequences stored in the global initiative on sharing all influenza data (GISAID) EpiCoV database as of June 4, 2022 (https://www.gisaid.org;Khare et al., 2021).Next, the haplotypes of the isolates were determined by comparing their genomic variations to the reference haplotypes of the major lineages of the six epidemic waves in Japan.For each haplotype, we calculated its occurrence and observation period for each prefecture of Japan.The SARS-CoV-2 HaploGraph aids in understanding the dynamics of the SARS-CoV-2 variants spreading in Japan.

Results 152
Japan-related SARS-CoV-2 genomes were divided into the following three groups 153 using the GISAID metadata: 1) "Domestic": genomes sampled from patients infected in 154 Japan (284,819 genomes); 2) "Quarantine": genomes sampled from patients infected 155 outside of Japan (10,644 genomes); and 3) "International": genomes sampled outside Japan 156 from patients infected in Japan (92 genomes).For the entries from the COVID-19 outbreak 157 on the cruise ship -Diamond Princess (Sekizuka et al., 2020b), 72 entries were assigned to 158 the "Quarantine".From these data, we used genomes without any undetermined nucleotide 159 base.The numbers of genomes categorized as "Domestic", "Quarantine", and "International" 160 were 251,761, 9,199, and 33 (88.4%, 86.4%, and 35.9% of all genomes), respectively.161 The haplotype analysis was conducted for the major variants of the six epidemic 162 waves in Japan.The results for the first five waves are summarized in The "total genomes" shows the number of GISAID entries after filtering the genomes with 170 any undetermined sequence.The number of genomes of the dominant PANGO IDs of the 171 five waves is listed by their sampled date.Note that the genomes of particular PANGO IDs 172 were detected before and after the wave period in which the particular PANGO ID was 173 predominant.The percentage in the parenthesis indicates the proportion of the genomes of 174 the dominant PANGO ID in the "total genomes" sampled in the corresponding wave period.175 Entries without month, date, or prefecture information are not included in these statistics.An 176 asterisk (*) indicates any numbers in the sub-lineages of the given PANGO IDs.177

1st wave in Japan 178
To obtain the major haplotype in the first wave, we examined the GISAID entries 179 sampled from January to June 2020; the period covers from the beginning until the end of 180 the first wave (Arima et al., 2021).Note that the wave period definition applied in this 181 analysis is slightly different from that noted in Table 1.In total, 4,951 entries were classified 182 into the 3 categories: "Domestic" (4,786 genomes), "Quarantine" (159 genomes), and 183 "International" (6 genomes).We further examined the PANGO IDs of the first wave and 184 found 69, 4, and 18 unique PANGO lineages in each of the "Domestic", "Quarantine", and 185 "International" categories, respectively.After merging the three categories, the number of 186 unique PANGO lineages became 77.The PANGO ID with the largest number of entries 187 (3,057) was B.1.1, which accounted for 61.8% of 4,951 all Japan-related first-wave entries, 188 followed by 7.6% (374) of B.1.1.48.We focused on the B.1.1 and its sub-lineages and 189 investigated the haplotypes of the Japan-related B.1.1 entries of the first wave.For those 190 entries, the number of SNVs ranged from 4 to 17. Two GISAID entries (GISAID ID: EPI_ISL_684814 and EPI_ISL_685167) having the smallest number of SNVs in the "Domestic" B.1.1 lineage contain the same combination of 4 SNVs, of which sampling dates are April 6, 2020, and April 14, 2020, respectively.Other "Domestic" B.1.1 and the sister lineages also contained those 4 SNVs; however, the 4 SNVs were also found in 90.2% (47,138 entries) of non-Japan B.1.1 samples.Thus, no Japan-specific SNVs were found in the "Domestic" B.1.1 lineage.Therefore, it was impossible to distinguish between the B.1.1 strains of the first wave sampled in Japan and those from outside the country using their SNVs.For this reason, SARS-CoV-2 haplotypes of the first wave in Japan were not visualized in the HaploGraph.

Figure 1. SARS-CoV-2 HaploGraph website visualizes the haplotype "1" and
haplotype "1.1" of the second epidemic wave in Japan.On the top menu bar, from the "Change dataset", the dataset for an epidemic wave can be selected.Two datasets are available for each wave: "All" and "50%"."All" contains all haplotypes, and "50%" includes all haplotypes up to the smallest frequency haplotype whose sum exceeds 50% when all frequencies are ordered from highest to lowest.Then, the "Select haplotype(s)", one or more haplotype(s) can be selected: "1" and "1.1" of the second wave shown in this figure.The X-axis or Y-axis indicates the sampling date or prefecture.In each prefecture (47 prefectures in total), the dots indicate the earliest or the latest sampled dates.The dots are connected with a line, the thickness of which is correlated with the number of entries sampled in that prefecture.A schematic representation of the SARS-CoV-2 genome is shown in a colored bar.On the genome bar, the gray diamonds indicate the genomic locations of the mutations contained in the reference haplotype of the selected wave.The grey pins above the genome bar indicate the schematic genomic location of the mutation(s) contained in the selected haplotype(s): "1.1" shown in the grey dots and line in this figure indicates SNVs compared to the reference haplotype of each epidemic wave.The 6 boxes under the genome bar describe: 1) "Haplotype", the wave and haplotype name; 2) "Prefecture", the prefecture where the haplotype was sampled; 3) "First day", the earliest date, and 4) "Last day", the latest date of the entries sampled in the prefecture; 5) "Duration", the time span by day(s); and 6) "Total", the total number entries of the haplotypes sampled in the prefecture.These particulars change when a different prefecture within its "Duration" period is touched by the cursor.When the "Trace" option is on, the earliest entries of different prefectures will be connected by a line in time order.However, it doesn't mean that those earliest entries were related to the spread of the haplotype.In this figure, the earliest or latest entry of the haplotype "1" is surrounded by an open square or an open circle of the same color as the haplotype.Note that this visualization is not by the function of the viewer.The colors for displaying the selected haplotypes are adopted randomly by the viewer.

4th wave in Japan -Alpha
The major variant in the fourth epidemic wave of Japan was reported to be the Alpha (B.1.1.7)(Hirotsu et al., 2021b); however, there were no reports on Japan-specific Alpha variants to our best knowledge.To trace the major Japan-specific Alpha variants, we compared the 51,863 Japan-related Alpha (51,859 "Domestic" and 4 "International" entries) with 1,117,535 non-Japan-related Alpha (1,117,247 non-Japan and 288 "Quarantine" entries) and revealed 4 nucleotide substitutions which were almost exclusively observed in Japan Alpha variants: G17019T (55.5%),C26464T (25.1%),T23659C (24.8%), and C11173T (11.2%).The 2nd and the 3rd SNVs (C26464T and T23659C) co-occurred with the 1st SNV (G17019T) in 28,761 genomes, while the C11173T in 5,783 genomes was independent.Thus, for the fourth wave, we identified two reference haplotypes for the two major independent lineages (34,544 genomes) which accounted for 66.61% of all Japan Alpha genomes (51,863).Among the total 11,788 haplotypes of the Japan Alpha variants, 374 the largest one was the reference haplotype "1" containing G17019T (1,648 entries) and the 375 third largest one was the reference haplotype "2" containing C11173T (722 entries).The 376 earliest and latest entries of haplotype "1" were sampled on March 8, 2021, in Hyogo and on 377 July 1, 2021, in Hyogo; and the earliest and latest entries of haplotype "2" were sampled on 378 February 27, 2021, in Hokkaido and on July 9, 2021, in Fukuoka (Table 4 and Figure 5  The second largest haplotype (1,319 entries) was "1.265.2",obtaining C26464T and T23659C substitutions based on haplotype "1", with the earliest entry sampled on March 13, 2021, in Osaka, and the latest entries sampled on July 20, 2021, in Kanagawa and Ibaraki.
The fourth-largest haplotype (587 entries) was "2.112.5",obtaining A5871G and C2710T substitutions based on haplotype "2", with the earliest entry sampled on June 29, 2021, in Tokyo and the latest entry sampled on September 10, 2021, in Osaka (Table 4 and Figure S3).The top 4 largest haplotypes were sampled in various prefectures as shown in Figure 5 and Figure S3.
In addition, there were 3 haplotypes showing prefectural preference: such as for the from July to September of 2022 and was reported to be a Japan-specific Delta variant (Abe 415 and Arita, 2021;Koyama et al., 2022).The AY.29 is defined by the two nucleotide 416 substitutions: C5239T (a synonymous substitution in ORF1ab) and T5514C (a 417 nonsynonymous substitution causing V1750A in ORF1ab) (Abe and Arita, 2021).A sub-418 lineage of AY.29 called AY.29.1 was further identified, which obtained a nonsynonymous 419 substitution G22081T (Q173H in S protein) compared to AY.29.Another sub-lineage of 420 AY.29 called AY.29.2 was found and characterized by a nonsynonymous substitution 421 A22803G (Q414R in S protein) (GitHub, 2021) (Table 5 and Figure 6).422 423 The genome isolated at the Airport Quarantine Station of Japan, EPI_ISL_1927416, was reported as the ancestor of the AY.29 lineage (Abe and Arita, 2021;Koyama et al., 2022), which we reconfirmed by comparing nucleotide substitution patterns (Table S2) as well as phylogenetic analysis using more recent GISAID data (Figure S5).
Among all 33,917 haplotypes of the fifth wave, 21.26% showed the largest sample size in Tokyo.The latest entry of the "Domestic" delta variant was EPI_ISL_12036503, sampled in Tokyo on March 10, 2022.It belongs to the haplotype "q.2", obtaining 16 SNVs compared to the haplotype "1" (54 SNVs compared to the Wuhan-Hu-1 reference genome) (Table 5 and    Figure 7).During the period shown in Figure 8, 134 sub-lineages of Omicron were detected in Japan.The top five of them were BA.1.1.2(53.9 %), BA.1.1 (12.0 %), BC.1 (also as BA.1.1.1.1)(5.9 %), BA.2 (4.8 %) and BA.2.3.1 (4.0 %), respectively.In addition, other sublineages accounted for 19.4% of the total.There were 6 sub-lineages (BA.2.3, BA.2.10, BA.2.29, BA.1.15,BA.2.3.13, and BA.2.24, in the ascending order of the number of entries) in which more than 1000 entries were detected.There were also 13 sub-lineages in which more than 100 entries were detected.Three of the top five sub-lineages, BA.1.1.2,BA.1.1,and BA.2, were first detected as Japan quarantine strains.Most of the other Omicron sublineages had also been transmitted from overseas and were spreading in Japan.In contrast, the remaining two of the top 5, BC.1 (also as BA.The weekly trends of the Omicron sub-lineages showed that the largest sub-lineage BA.1.1.2became predominant after the 52nd week of 2021 and peaked in the 3rd week of 2022 when it accounted for 73.4% of all entries.From the 12th week of 2022, the entry number of BA.2 and its sub-lineages became larger than that of BA.1 and its sub-lineages.
Thus, during the sixth wave in Japan, there was a shift from the BA.1 and its sub-lineages to the BA.2 and its sub-lineages.

Discussion
The high sequence quality of SARS-CoV-2 genomes enabled us to analyze their nucleotide substitutions and to trace the spread of COVID-19.Here we described the dynamics of SARS-CoV-2 haplotypes in Japan using SARS-CoV-2 HaploGraph, a new web resource for visualization.For the HaploGraph, it is important noting the followings: 1) we used the GISAID database, which restricts users from providing detailed information on nucleotide substitutions which are derived from the database; 2) not all genomes of the SARS-CoV-2 that spread in Japan were sampled, sequenced, and stored in GISAID; 3) quality of genomes could differ depending on the sequencing laboratories; and 4) recombination of variants were not considered.For the first point, according to the GISAID term of use, the data stored in the GISAID should not be displayed or accessed through a separate portal or a network of institutions (https://www.gisaid.org/registration/terms-of-use/; Arita 2021).This policy makes it difficult to provide details on the mutations of the SARS-CoV-2 variants on our SARS-CoV-2 HaploGraph browser.Indeed, the HaploGraph browser is based on the processed statistics of GISAID information that is available at the GitHub website, detail information of which cannot be provided.For the second point, the sampling of SARS-CoV-2 genomes could also be biased depending on the area and dates.Therefore, when and where a variant was detected for the first time is not necessarily the location and date of the first emergence of the variant.On the other hand, the number of COVID-19 patients and the number of sequenced SARS-CoV-2 genomes were well correlated for the periods of the 1st to the 5th waves in Japan (Figure 9; Pearson's correlation coefficient, r = 0.962, P-value < 0.001).This observation suggests that haplotypes visualized by "HaploGraph" could roughly represent the transmission of SARS-CoV-2 in Japan, although more than 90% of SARS-CoV-2 genomes sampled from COVID-19 patients were not sequenced (Figure 9).For the third point, in this study, we used the "complete genomes"genomes without any undetermined and mixed nucleotides-for the identification of all SNVs in each genome.Nonetheless, the sequencing quality of Japan on SARS-CoV-2 genomes is considerably high compared to that of the overseas.For instance, the proportion of the sequences that contain 0 or 10 undefined nucleotides reached 89.4% or 91.1% in the entries sequenced in Japan, while it was 36.0%or 48.0% in the entries sequenced in other countries.This was one important factor that made this study possible with reliable results.
For the fourth point, some variants may emerge though recombination events.It is difficult to distinguish whether mutation(s) emerged by recombination or parallel evolution when the number of mutations is limited.Indeed, many recombination events are reported for the SARS-CoV-2 genomes, particularly for the Omicron variants (Ou et al., 2022).Since various nucleotide substitutions accumulated in the genomes of Omicron variants, recombination of Omicron variants could be easily detected.Before the Omicron strain arose, one of the few examples of recombinant strains reported was the XC variant, a recombination of the Delta and Alpha mutants detected in Japan (Sekizuka et al., 2022).The lack of consideration of recombination is one of the limitations of this study.In this study, we did not analyze R.1 lineage, which was not the main cause of the fourth epidemic wave of Japan, but spread on a considerably large scale (originally assigned as B.1.1.316).The R.1 lineage contains potential escape mutations in the Spike (S) protein receptor-binding domain (E484K) and N-terminal domain (W152L) (Hirotsu et al., 2021a).It was reported that the R.1 was originated from the European strain B.1.1.114from March to April 2020 by acquiring 13 mutations (NIID, 2021b), although there was no evidence showing those mutations were obtained in Japan.The number of R.1 entries sampled in each wave period is summarized in Table S3.It shows that 87.2% of R.1 entries were sampled during the period of the fourth wave.It explains one of the reasons why the percentage of the Alpha entries (PANGO ID: B.1.1.7 and Q.*) analyzed in our study was relatively low in the fourth wave period compared to that of the dominant lineages in other wave periods.
Our study clearly indicated that, although major SARS-CoV-2 lineages differed depending on epidemic waves in Japan, some remaining variants existed in the next or in the further epidemic waves (Tables 2 -5).Such haplotypes were not traceable for each mutation step (Tables 2 -5).This observation suggests that the major variants did not disappear after the epidemic wave and continued to remain in small numbers.This observation was also suggested by other studies (Nabeshima et al., 2021;Ode et al., 2022).
Our haplotype analysis also showed the long-lasting transmission of haplotypes without mutation.For example, the original haplotype of AY.29, which did not have a single mutation, continued to circulate in Japan for as long as 5 months (Figure 6 and Table 5).It is known that SARS-CoV-2 shows relatively lower mutation rate mutation partially owing to nsp14 (Robson et al., 2020, Peacock et al., 2021), although amino acid replacement(s) on nsp14 may affect SARS-CoV-2 mutation rates (Takada et al. 2022).Further, a substantial selection of mutations associated with immune escape and binding affinity to the receptor has characterized genomes of SARS-CoV-2, the trend of which is expected to continue (Kistler et al. 2022;Carabelli et al. 2023).Therefore, it remains important to monitor genomic mutations of SARS-CoV-2.
COVID-19 continues to spread worldwide, and over 14.8 million SARS-CoV-2 genomes were registered in GISAID as of 1/31, 2023 (https://www.gisaid.org;Khare et al., 2021).However, due to the GISAID term of use (https://www.gisaid.org/registration/terms-ofuse/), the non-registered general public cannot use the genome sequences and the related data stored in the GISAID.In addition, even for the registered researchers it is not easy to use the GISAID data due to its huge size and inefficient compression (Kryukov et al., 2022).
Thus, the GISAID data from which the valuable information can be derived, including the transmission patterns, may remain difficult to access.As shown in this study, SARS-CoV-2 HaploGraph allows users to understand a graphical representation of the transmission of COVID-19 in Japan based on SARS-CoV-2 genome information without the need to perform detailed GIASID data analysis.The methodology and platform of SARS-CoV-2 HaploGraph can also be applied to future genomic studies of other infectious diseases.1.

Haplotype Naming System
We obtained the variations for each SARS-CoV-2 genome stored in GISAID by aligning the genome sequences to the reference genome Wuhan-Hu-1 (GenBank ID: NC_045512) using MAFFT version 7.490 with default options (Katoh et al., 2019).Based on the pairwise alignment, mutations on different nucleotide sites were identified as substitution, insertion, or deletion.We then obtained the haplotype of each genome sequence by classifying the single nucleotide variants (SNVs) contained in the genome.Note that continuous indels are treated as one SNV in this study because they could be caused by a single mutation.
For each SARS-CoV-2 epidemic wave in Japan, the reference haplotype was determined as the sequence with an early sampling date with the minimum number of SNVs.
The ID for each haplotype was assigned as follows.First, the reference haplotype was named "1" and if there were more than one reference haplotype in a lineage, we increased the number (i.e., 2, 3, …).Then, when a haplotype acquires one SNV, a number is added and connected to the first number with a dot (i.e., ".").For example, "1.N" (where N is any number) indicates the acquisition of one specific mutation from haplotype 1. Backward mutations when an SNV reverts back to the Wuhan-Hu-1 wild-type haplotype were also counted.In such cases, the haplotype was started by a lowercase "d" before the backward mutation.For example, a mutation "dG26849T" indicates the haplotype contains a backward mutation "G" at position 26849 (i.e., the genotype is the same as that of Wuhan-Hu-1), although the reference haplotype of the lineage has the mutation "T" at the same position.In cases where there were multiple candidates for an immediate ancestor of a haplotype, we selected the oldest haplotype (the one sampled on the earliest date), and the other haplotype(s) are concatenated with a pipe (i.e., "|").If an immediate descendant with a single mutational difference was not found, a two-step forward descendant was searched.When it was found, an "x" was used to indicate the unknown intermediate state.For example, "1.2.x.1" is a haplotype with two SNVs from haplotype 1.2.If two intermediate states were not found, the haplotype would start with a letter in alphabetical order according to the number of SNVs compared to the reference haplotype of the lineage.For example, a haplotype "c.1" has three SNVs compared to the reference haplotype and its ancestors are not traceable within 1 or 2 SNVs till the reference haplotype.

Phylogenetic analysis
Multiple alignments of SARS-CoV-2 genomes were generated using the following two methods: 1) if the number of genomes is over 50,000, we applied minimap2 version 2.24 (Li et al., 2018) to construct pairwise alignment to the Wuhan-Hu-1 reference sequence, then applied gofasta version v1.0.0 (GitHub, 2022) to build a multiple alignment based on the pairwise alignments; 2) if this is not the case, we used MAFFT version 7.490 (Katoh et al., 2019) with --6merpair and --addfragments options to construct the multiple alignments.
Using the multiple alignment, we constructed a maximum-likelihood-based phylogenetic tree using IQtree version 2.1.3(Nguyen et al., 2015) with the following options: -m GTR+I+G -B 1000.The phylogenetic tree was visualized by ggtree version 2.4.1 (Yu et al., 2020).

Haplotype visualization
To visualize the mutation and propagation of each SARS-CoV-2 haplotype, we developed the HaploGraph web application.HaploGraph is designed for tracing when and where each haplotype was firstly or lastly detected, when and where the haplotype was observed during the wave period, and for how long the haplotype was recorded.These haplotypes are distinguished by the accumulated single nucleotide variations as described in the previous section; they can be tracked from the ancestral haplotypes to the descendants.
For each epidemic wave, users can choose one or more haplotypes from the list, select a region, or specify a range of dates of interest on the user interface of the HaploGraph (please see Figure 1 as well).Toggle buttons enable users to show 1) "trace" for indicating propagation of the identical haplotypes, and 2) "duration" for presenting how long the haplotype was observed.The source code of the HaploGraph is available at the GitHub repository (https://github.com/ktym/covid19/)that is implemented with the D3.js (https://d3js.org/) and Bootstrap (https://getbootstrap.com/)libraries.

Figure 3 .
Figure 3. HaploGraph visualization for the haplotypes "1" and "1.7.21.x.3.x.5" of the third epidemic wave.Dots and lines in red and green indicate the reference haplotype "1" and the largest haplotype "1.7.21.x.3.x.5", respectively.The earliest or latest entry of haplotype is indicated by a square or a circle of the corresponding color, respectively.

Figure 5 .
Figure 5. HaploGraph visualization for the haplotypes "1" and "2" of the fourth epidemic wave.Dots and lines in red and yellow indicate the largest haplotype "1" and the third largest haplotype "2", respectively.The earliest or latest entry of each haplotype is surrounded by a square or a circle of the corresponding color.The sampled prefectures and dates of the two haplotypes appear to overlap in a wide range.

Figure 8 .
Figure 8. Weekly trends of the Omicron sub-lineages in Japan.The X-axis represents the time span from the 47th week of 2021 to the 21st week of 2022.The Y-axis represents the percentage of Omicron sub-lineages in Japan.Each color represents an Omicron sublineage which is described on the right.Note that the Omicron sub-lineages having less than 100 entries are all filtered into the "Others" category.

Figure 9 .
Figure 9. Weekly counts for the number of SARS-CoV-2 genomes and COVID-19 patients in Japan.The X-axis represents the time span from the first to the fifth waves in Japan (i.e., January 1, 2020, to December 16, 2021).The light blue and blue colors in the background depict the five wave periods.The left and right Y axes represent the number of genomes stored in GISAID (shown in red) and the number of COVID-19 patients (shown in blue) obtained from OurWorldInData.org,respectively.Note that the scale of the left and right Y axis is 10 times different.Those numbers were counted weekly.

Table 5 . The representative haplotypes of the fifth wave that showed distinctive 424
Dataset SARS-CoV-2 genome sequences and their annotation information used in this study were downloaded from the GISAID EpiCoV database (https://www.gisaid.org)as of June 4, Japan was obtained from the documents of the Ministry of Health, Labour and Welfare (MHLW) of Japan (MHLW, 2022) as summarized in Table