Next-generation sequencing analysis with a population-specific human reference genome

Tomohisa Suzuki; Kota Ninomiya; Takamitsu Funayama; Yasunobu Okamura; Shu Tadaka; the Tohoku Medical Megabank Project Study Group; Kengo Kinoshita; Masayuki Yamamoto; Shigeo Kure; Atsuo Kikuchi; Gen Tamiya; Jun Takayama

doi:10.1266/ggs.24-00112

ABSTRACT

Next-generation sequencing (NGS) has become widely available and is routinely used in basic research and clinical practice. The reference genome sequence is an essential resource for NGS analysis, and several population-specific reference genomes have recently been constructed to provide a choice to deal with the vast genetic diversity of human samples. However, resources supporting population-specific references are insufficient, and it is burdensome to perform analysis using these reference genomes. Here, we constructed a set of resources to support NGS analysis using the Japanese reference genome, JG. We created resources for variant calling, variant effect prediction, gene and repeat element annotations, read mappability and RNA-seq analysis. We also provide a resource for reference coordinate conversion for further annotation enrichment. We then provide a variant calling protocol with JG. Our resources provide a guide to prepare sufficient resources for the use of population-specific reference genomes and can facilitate the migration of reference genomes.

MAIN

Next-generation sequencing (NGS) studies have identified thousands of genes associated with Mendelian diseases and have supported the diagnosis of many patients worldwide (Wright et al., 2018; Bamshad et al., 2019). Due to technological advances and substantial cost reductions, NGS analyses are now routinely used for basic research in human genetics and cell biological analyses, and as a clinical tool (Koboldt et al., 2013; Goodwin et al., 2016).

The reference human genome plays an essential role in NGS analyses (DePristo et al., 2011; Van der Auwera et al., 2013). In clinical analysis, the raw sequence reads must be analyzed with the appropriate reference genome to prepare for downstream analysis, and variant sites can be defined only after comparing the sequence between the reference and the reads from the tested sample. Some studies have shown that different reference sequences yield different results. For example, the difference between using GRCh37 and GRCh38 on germline variant identification amounts to 1.5% and 2.0% in single-nucleotide variants and insertions/deletions, respectively (Li et al., 2021). In other research, NOTCH2NLC repeat expansion, which is associated with neuronal intranuclear inclusion disease, was difficult to detect without the newer GRCh38 because the region of this gene had not been correctly assembled in previous assemblies (Sone et al., 2019). Therefore, the selection of the reference requires careful consideration.

The most commonly used reference genomes today are GRCh37/hg19 and GRCh38/hg38 (Li et al., 2021). They have been maintained and updated by the Genome Reference Consortium (GRC) (Church et al., 2011; Schneider et al., 2017). Although they are indispensable, there are potential problems for some populations. The genetic information used to build the reference genome is biased toward African and European ancestries (Green et al., 2010; Popejoy and Fullerton, 2016; Sirugo et al., 2019). In addition, while multiple donors participated in the project, the reference genome inevitably carried over rare or individual-specific variants, which led to confusion in the interpretations of the variants’ effects (Magi et al., 2015). Because of these biases, GRC references are inadequate for certain populations. Accordingly, several population-specific reference genomes have recently been constructed to provide a choice to deal with the vast genetic diversity, including Korean (Cho et al., 2016; Seo et al., 2016), Chinese (Shi et al., 2016; Du et al., 2019; Lou et al., 2022), Swedish (Ameur et al., 2018), Egyptian (Wohlers et al., 2020), Tibetan (He et al., 2020) and Japanese (Takayama et al., 2021).

Despite the various options for reference genomes, GRC references are still in wide use today. One of the reasons for the slow migration of references is the lack of supporting resources for analysis with population-specific references. While a wealth of knowledge and well-organized resources exist for GRCh37, they cannot be directly applied to other reference genomes due to different coordinate systems and gene positions. It is therefore necessary to transfer information between references using a process known as liftover, or else the resources would have to be recreated, which is very time consuming and costly.

In this work, we construct a set of resources to facilitate NGS analysis with the Japanese reference genome version 2.1.0 (JG2.1.0). We also demonstrate a variant calling pipeline with JG as an example of NGS analysis using a population-specific reference genome.

We have constructed the set of resources described in Table 1 to support NGS analysis with JG2.1.0. These resources contain three types of files: essential resource files needed for liftover and variant calling, a genetic annotation and variant effect prediction database, and index files for RNA-seq data alignment. The resources are provided at the locations listed in Table 1, and an overview of the resource creation workflow is shown in Fig. 1. First, we created chain files, which are required for liftover annotations between genome assemblies, using minimap2-2.24 (Li, 2018, 2021) and transanno v0.24 based on GRCh38. A gene annotation file in GFF3 was lifted over from GRCh38-based GFF3 using Liftoff v1.6 (Shumate and Salzberg, 2021) and was converted to GTF format using GffRead v0.12.7 (Pertea and Pertea, 2020). A refFlat format file was created from the GFF3 file using gff3ToGenePred (the UCSC Genome Browser command line tool). Then, using these files, useful tools and databases for variant analysis were lifted over to JG2.1.0. The Genome Analysis ToolKit (GATK) (DePristo et al., 2011) Resource Bundle, a collection of standard files for working with resequencing data, was created by lifting over using minimap2 and transanno. The cytoBand file based on GRCh38 was lifted over to JG2.1.0 using LiftOver (Hinrichs et al., 2006). CytoBand files are used to draw ideograms of chromosomes in genome viewers, such as Integrative Genomics Viewer (Thorvaldsdóttir et al., 2013). SnpEff and ANNOVAR databases were created using SnpEff v5.1d (Cingolani et al., 2012) and ANNOVAR ver.2019Oct24 (Wang et al., 2010), respectively. The dbNSFP database (Liu et al., 2020) was created based on v4.3 academic branch (v4.3a) using minimap2 and transanno. ClinVar databases were transferred using minimap2 and transanno from GRCh37 and GRCh38, respectively. These databases, which contain a variety of annotation information and variant effect estimates, are often used for variant analysis. In addition, we used RepeatMasker 4.1 to annotate repeat elements. We created genome mappability resources. Genome mappability, also called (k, e)-mappability, is the uniqueness of k-mers for each position, allowing for up to e mismatches. We calculated the (30, 2)-, (36, 0)-, (36, 2)-, (24, 0)-, (24, 2)- and (50, 2)-mappability of JG2.1.0 using Genmap (ver. 1.3.0) (Pockrandt et al., 2020). Besides, we prepared a pre-built index of JG so that one can easily calculate arbitrary (k, e)-mappability using Genmap. We also built two indices for HISAT2 2.2.1 (Kim et al., 2019) and STAR 2.7.11a (Dobin et al., 2013) that are used to align RNA-seq data.

Table 1. JG2.1.0-based resources

Resource	Required data	Software	Brief procedure	JG-based resource files (available at)
Basic resources for variant calling using JG
chain file	jg2.1.0.fa GRCh38.fa	1. minimap2 (ver. 2.24) 2. transanno (v0.24)	1. FASTA (ref, JG) → PAF 2. PAF → chain (JG)	GRCh38_to_JG2.1.0.chain (https://doi.org/10.5281/zenodo.10026166)
GFF3 file	jg2.1.0.fa GRCh38.fa gencode.v39.primary_assembly.annotation.gff3.gz	Liftoff (ver. 1.6)	GFF3 (ref), FASTA (ref, JG) → GFF3 (JG)	GRCh38_to_JG2.1.0.gff3 GRCh38_to_JG2.1.0.gff3.idx (https://doi.org/10.5281/zenodo.10026166)
GTF file	GRCh38_to_JG2.1.0.gff3	Gffread (v0.12.8)	GFF3 (JG) → GTF (JG)	GRCh38_to_JG2.1.0.gtf (https://doi.org/10.5281/zenodo.10026166)
refFlat file (GenePred)	GRCh38_to_JG2.1.0.gff3	Gff3ToGenePred	GFF3 (JG) → refFlat (JG)	GRCh38_to_JG2.1.0.refflat (https://doi.org/10.5281/zenodo.10026166)
GATK Resource Bundle	jg2.1.0.fa human_g1k_v37.fasta (b37) ucsc.hg19.fasta (hg19) Homo_sapiens_assembly38.fasta (hg38) GATK Resource	1. minimap2 (ver. 2.24) 2, 3. transanno (v0.24)	1. FASTA (ref, JG) → PAF 2. PAF → chain 3. GATK Resource (ref), FASTA (ref, JG), chain → GATK Resource (JG)	jg2.1.0-GatkResourceBundle_b37.zip jg2.1.0-GatkResourceBundle_hg19.zip jg2.1.0-GatkResourceBundle_hg38.zip (https://doi.org/10.5281/zenodo.10035718)
JG-based genetic annotation and variant effect prediction database
cytoBand file	cytoBand.txt (based on GRCh38) GRCh38_to_JG2.1.0.chain	LiftOver	cytoBand (ref), chain → cytoBand (JG)	jg2.1.0-cytoBand.txt (https://doi.org/10.5281/zenodo.10032399)
SnpEff database	jg2.1.0.fa GRCh38_to_JG2.1.0.gff3	SnpEff (v5.1d)	FASTA (JG), GFF3 (JG) → SnpEff (JG)	jg2.1.0-snpEffectPredictor.bin (https://doi.org/10.5281/zenodo.10032401)
ANNOVAR database	jg2.1.0.fa GRCh38_to_JG2.1.0.refflat	ANNOVAR (ver. 2019Oct24)	FASTA (JG), refFlat (JG) → ANNOVAR.fasta (JG)	jg2.1.0-ANNOVAR_GRCh38.fa.gz (https://doi.org/10.5281/zenodo.10032403)
dbNSFP database	jg2.1.0.fa GRCh38..fa dbNSFP 4.3a database	1. minimap2 (ver. 2.24) 2, 3. transanno (v0.24)	1. FASTA (ref, JG) → PAF 2. PAF → chain 3. dbNSFP (ref), FASTA (ref, JG), chain → dbNSFP (JG)	jg2.1.0-dbNSFP4.3a.zip (https://doi.org/10.5281/zenodo.10032408)
ClinVar database	jg2.1.0.fa GRCh37.fa GRCh38.fa ClinVar database	1. minimap2 (ver. 2.24) 2, 3. transanno (v0.24)	1. FASTA (ref, JG) → PAF 2. PAF → chain 3. ClinVar (ref), FASTA (ref, JG), chain → ClinVar (JG)	jg2.1.0-ClinVar_GRCh37.zip jg2.1.0-ClinVar_GRCh38.zip (https://doi.org/10.5281/zenodo.10026338)
Gene Mappability	jg2.1.0.fa	1, 2. Genmap (ver. 1.3.0)	1. FASTA (JG) → Pre-built index 2. Pre-built index → (k, e)-mappability	jg2.1.0-Genmap_index.zip (https://doi.org/10.5281/zenodo.10032300) jg2.1.0-Genmap_k_e.zip (https://doi.org/10.5281/zenodo.10032173)
RepeatMasker annotation	jg2.1.0.fa	RepeatMasker (ver. 4.1)	FASTA (JG) → RepeatMasker	jg2.1.0-repeat-masker.out.gz (https://jmorp.megabank.tohoku.ac.jp/datasets/tommo-jg2.1.0-20211208/files/jg2.1.0-repeat-masker.out.gz)
JG-based index files for RNA-seq data alignment
HISAT2 index (HFM index)	jg2.1.0.fa	HISAT2 (ver. 2.2.1)	FASTA (JG) → HFM index	jg2.1.0-HISAT2.zip (https://doi.org/10.5281/zenodo.10521933)
STAR index	jg2.1.0.fa GRCh38_to_JG2.1.0.gtf	STAR (ver. 2.7.11a)	FASTA (JG), GTF (JG) → STAR index	jg2.1.0-STAR.zip (https://doi.org/10.5281/zenodo.10032471)

Abbreviations: ref, reference genome; JG, Japanese reference genome.

Fig. 1. Overview of the creation of JG-based resources. Various annotation and database files were lifted over to JG2.1.0, and resources for repeat element annotation, read mappability and RNA-seq analysis were created.

Next, following the GATK Best Practices workflow (Van der Auwera et al., 2013), we developed a practical variant calling protocol using JG resources. The test dataset was subsampled from the Han Chinese trio reference samples from the Genome in a Bottle consortium (Wang et al., 2019) to reduce computational time and resources. Since paired-end sequencing was used in the analysis, the read data were divided into R1 and R2. We used bwa-mem2 v2.2.1 (Vasimuddin et al., 2019), samtools-1.16.1 (Li et al., 2009), GATK tools 4.3.0.0 (DePristo et al., 2011), python 3, SnpEff and SnpSift 5.1d (Cingolani et al., 2012) and the Java runtime environment (Java 11 or later) in this protocol. The actual procedure and command lines are shown in the Supplementary Information and Supplementary Fig. S1. The software and databases used in this work are available from the websites listed in Supplementary Table S1.

As the technologies have improved and costs have decreased markedly, NGS analysis is increasingly being used for basic research in human genetics and clinical diagnostics (Koboldt et al., 2013; Goodwin et al., 2016). Adequate resources and clear manuals are needed, but in practice, except for JG, such an ecosystem is not yet sufficient. Here, we have constructed resources for NGS analysis with the Japanese reference genome and provided a guide to prepare resources that are sufficient for the use of population-specific reference genomes and that can facilitate the migration of reference genomes. Furthermore, we demonstrated a practical variant calling procedure with trio sequencing data using JG-based resources.

Recently, several population-specific genome assemblies have been constructed to deal with the vast amount of genetic diversity (Cho et al., 2016; Seo et al., 2016; Shi et al., 2016; Ameur et al., 2018; Du et al., 2019; He et al., 2020; Wohlers et al., 2020; Takayama et al., 2021; Lou et al., 2022). The use of these references is expected to provide advantages in NGS analysis, e.g., efficient variant identification in clinical practice. Although the existence of multiple references has been criticized as complicating the interpretation of an analysis (Kaminow et al., 2022), population-specific reference genomes have the potential to improve the accuracy and sensitivity of genetic and medical analyses (Lou et al., 2022). Unfortunately, however, many attempts to construct a population-specific reference have only provided the assembled DNA sequences in FASTA format, so we have to assign contigs to chromosomes. We also need to prepare annotations and various analytical resources.

With the recent advent of long-read technologies, the telomere-to-telomere (T2T) sequence has been constructed (Nurk et al., 2022). The T2T-CHM13 achieved a near-gapless telomere-to-telomere assembly. Although the T2T-CHM13 reference will be the next prevailing reference for human genetics (Aganezov et al., 2022), the T2T-CHM13 genome was derived from a single European cell line and hence has a population bias (Nurk et al., 2022). Another new attempt to address the limitations of current references is a pangenome reference graph (Paten et al., 2017; Rakocevic et al., 2019). The genome graph aims to comprehensively assess human genetic variation by including the vast amount of genetic diversity using graphical representation (Paten et al., 2017). In human genomics, practical application of the pangenome reference is already underway (Liao et al., 2023), but the genetic diversity is still not fully represented, and attempts are being made to construct a population-specific pangenome reference (Gao et al., 2023). Graphical representation may soon replace the conventional references (Paten et al., 2017; Rakocevic et al., 2019; Wang et al., 2022), but conventional linear reference genomes will continue to be used due to their high affinity with previous analyses and ease of interpretation. However, when using a linear population-specific reference genome, numerous accompanying resources must be prepared for analysis, which requires considerable knowledge and effort. Therefore, we constructed the resources necessary to analyze short-read sequencing using JG2.1.0.

There are several limitations in this work. First, the resources we constructed are specific to JG. Although JG can be used for other Asian populations as suggested in our previous paper (Takayama et al., 2021), individual reference genomes and associated resources tailored to each population are more suitable. Therefore, similar resources are needed for each population. In addition, the resources need to be updated regularly to keep up with updates to references and annotations. The development of new technologies, such as T2T, and addition of more annotations requires resource updates as well as addition of some T2T-specific resources of reference genomes. Although it is not easy to keep up with the rapid advancement and maintain comprehensive and up-to-date resources, the concise procedures for resource construction that we have provided in this study will facilitate resource maintenance. We encourage researchers and laboratories to follow these procedures to generate customized resources and to perform appropriate NGS analyses. Second, we did not evaluate the accuracy and efficiency of migrating to JG. Liftover is often considered a lossy process due to the complexity of genome structure and the limited accuracy of chain files (Lowy-Gallego et al., 2019; Ormond et al., 2021; Sheng et al., 2023). Although re-aligning the raw sequence data to another reference genome usually provides the most accurate results, the re-alignment process is generally impractical due to the high computational demand and time consumption and is often not feasible without access to the raw sequence data. The liftover approach is generally a rapid, cost-effective and practical process compared to the re-alignment approach (Zhao et al., 2014; Luu et al., 2020; Shumate and Salzberg, 2021; Park et al., 2023). When used carefully, with the understanding that it does not guarantee completely accurate conversion, liftover has the potential to provide feasible and effective NGS analysis.

Today, with the widespread use of NGS analysis, the population-specific reference is one of the choices to deal with vast genetic diversity. However, references alone are not sufficient to handle the burden of migrating references and running analyses, so we have created a set of resources based on JG and provided them here. The development of an ecosystem for each reference is an important foundation, giving users “analysis-ready” material.

DATA AVAILABILITY

All the resources we have constructed are located at the repository sites listed in Table 1. The variant call procedure using JG-based resources is shown in the Supplementary Information.

ACKNOWLEDGMENTS

We express our sincere gratitude to Dr. Justin Matthew Zook and the Genome in a Bottle Consortium for granting us permission to use and distribute subsampled Han Chinese trio genomic data. We also extend our heartfelt thanks Dr. Xiaoming Liu for allowing us to publish the JG-based dbNSFP database. We greatly appreciate all the volunteers who participated in the Tohoku Medical Megabank project. Some of the computational resources were provided by the Tohoku Medical Megabank Organization supercomputer system supported by the Facilitation of R&D Platform for AMED Genome Medicine Support (Grant Number JP21tm0424601).

REFERENCES

Corresponding author

Register with J-STAGE for free!