Complete chloroplast genome and 45S nrDNA sequences of the medicinal plant species Glycyrrhiza glabra and Glycyrrhiza uralensis

Glycyrrhiza uralensis and G. glabra, members of the Fabaceae, are medicinally important species that are native to Asia and Europe. Extracts from these plants are widely used as natural sweeteners because they are much sweeter than sucrose. In this study, three complete chloroplast genomes and five 45S nuclear ribosomal (nr)DNA sequences of these two licorice species and an interspecific hybrid are presented. The chloroplast genomes of G. glabra, G. uralensis and G. glabra × G. uralensis were 127,895 bp, 127,716 bp and 127,939 bp, respectively. The three chloroplast genomes harbored 110 annotated genes, including 76 protein-coding genes, 30 tRNA genes and 4 rRNA genes. The 45S nrDNA sequences were either 5,947 or 5,948 bp in length. Glycyrrhiza glabra and G. glabra × G. uralensis showed two types of nrDNA, while G. uralensis contained a single type. The complete 45S nrDNA sequence unit contains 18S rRNA, ITS1, 5.8S rRNA, ITS2 and 26S rRNA. We identified simple sequence repeat and tandem repeat sequences. We also developed four reliable markers for the analysis of Glycyrrhiza diversity and for species authentication.


INTRODUCTION
Licorice is a perennial herb belonging to the family Fabaceae. The genus Glycyrrhiza includes about 18 species in Asia, Europe and the Americas. Glycyrrhiza uralensis Fisch. occurs from Central Asia to the northeastern part of China, whereas G. glabra L. is distributed from southern Europe to the northwestern part of China. The roots and stolons of G. uralensis and G. glabra produce some of the most important crude drugs in the world (Gibson, 1978), mainly glycyrrhizin, an oleanane-type triterpene saponin. Glycyrrhiza plants have been used traditionally as anti-inflammatory (Finney and Somers, 1958; Kroes et al., 1997), antiviral (Fiore et al., 2008), antiallergy (Park et al., 2004) and antiulcer treatments (He et al., 2001). Because licorice extracts are approximately 150 times sweeter than sucrose (Kitagawa, 2002), they are also widely used worldwide as a natural sweetener, with an annual value of over US $42 million (Parker, 2006). Because licorice is used medicinally, correct authentication of licorice plant ingredients is needed to ensure their safe use.
Chloroplast (CP) genome sequences are of central importance to plant taxonomy and authentication because they are highly conserved across plant species. The CP genome is composed of a large single-copy region, a small single-copy region and two inverted repeats (IRs) (Gary et al., 1984; Shinozaki et al., 1986; Leseberg and Duvall, 2009). Interestingly, licorice species belong to the inverted repeat-lacking clade (IRLC) (Wojciechowski et al., 2004) of papilionoid legumes, characterized by the loss of one copy of the IR. To date, only the CP genome of G. glabra has been sequenced among the Glycyrrhiza species (Sabir et al., 2014).
The sequence of the 45S nuclear ribosomal DNA (nrDNA), bearing the 18S-5.8S-26S ribosomal RNA genes, provides additional information that can be very useful in plant taxonomy and DNA barcoding (Chen et al., 2014; Techen et al., 2014; Mishra et al., 2016). In particular, the internal transcribed spacer sequences (ITS1 and ITS2) in nrDNA are potential barcodes (Álvarez and Wendel, 2003; Yao et al., 2010). Although these sequences are valuable for the identification of medicinal plants, there is little information about how they compare between Glycyrrhiza species or how polymorphic they are.
In the current study, we analyzed the complete sequences of the CP and nrDNA of two Glycyrrhiza species. In addition, we identified 160 polymorphic sites in the CP genome and 10 polymorphic sites in the nrDNA that are valuable for the identification and authentication of G. glabra and G. uralensis as well as G. glabra × G. uralensis interspecific hybrids. Despite their useful applications as medicinal ingredients and food resources, there is limited information regarding the complete chloroplast genomes and the nrDNA sequences of Glycyrrhiza species. The results of this study provide an insight into the genetic relationships among the various species in the genus Glycyrrhiza.
MATERIALS AND METHODS
Illumina sequencing and de novo assembly of CP and nrDNA
Paired-end (PE) libraries were constructed with insert sizes ranging from 280 to 430 bp, following the manufacturer's specified protocols for the TruSeq PE Cluster Kit (Illumina, San Diego, CA, USA). The PE libraries were sequenced on the Illumina genome analyzer (HiSeq 1000, Illumina) platform at our in-house facility (Genomics Division, National Institute of Agricultural Sciences, Korea). De novo assembly of the CP genome and nrDNA followed the approach described in Kim et al. (2015). In short, low-quality sequences (Phred scores below 20) were trimmed using CLC quality trim software. The remaining high-quality sequences were assembled into contigs using CLC genome assembler beta 4.06 (CLC, Aarhus, Denmark) with an autonomously controlled minimum overlap size of 150-500 bp, at Phyzen (Seongnam, South Korea). The obtained CP genome sequences were assembled using the G. glabra (KF201590) genome as a reference sequence. The assembled nrDNA contigs fully covered the 45S nrDNA cistron unit and partially covered the intergenic spacer sequences.
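The quality-trimming step can be illustrated with a minimal Python sketch. It is not the algorithm implemented in the CLC quality trim software, whose internals are not described here; it only shows the general idea of removing 3' bases whose Phred score falls below 20. The function names and the Phred+33 quality encoding are assumptions made for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_low_quality_tail(sequence, quality_string, threshold=20):
    """Remove bases from the 3' end until a base with Phred >= threshold is reached.

    Simplified illustration only: real trimmers usually apply windowed or
    cumulative-error criteria rather than a plain tail scan.
    """
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
trimmed_seq, trimmed_qual = trim_low_quality_tail("ACGTAC", "III5##")
print(trimmed_seq)  # ACGT
```

In this toy read the last two bases (Q2) are removed, while the base at exactly Q20 is kept because the threshold is applied as "below 20".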
Identification of polymorphisms that can distinguish Glycyrrhiza species
Four PCR primers (Supplementary Table S1) were designed based on CP InDels and nrDNA-specific sequence regions among Glycyrrhiza species. These primers were used to distinguish G. glabra and G. uralensis as well as G. glabra × G. uralensis. The PCR conditions were 4 min at 94 °C followed by 38 cycles of 94 °C for 30 s, 60 °C for 30 s and 72 °C for 15 s, followed by a final extension at 72 °C for 1 min. Gel electrophoresis was performed using a 1% agarose gel, and amplified fragments were stained with a fluorescent dye.
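As a quick check of the cycling program given above, the short Python sketch below adds up the programmed hold times. The function name is hypothetical, and ramp times between temperatures are ignored, so the actual run on a thermal cycler would be somewhat longer.

```python
def pcr_hold_time_seconds(initial_denaturation, cycles, cycle_steps, final_extension):
    """Sum the programmed hold times of a PCR program (ramp times not included)."""
    return initial_denaturation + cycles * sum(cycle_steps) + final_extension


# Values taken from the PCR conditions in the text.
total_seconds = pcr_hold_time_seconds(
    initial_denaturation=4 * 60,   # 4 min at 94 °C
    cycles=38,
    cycle_steps=[30, 30, 15],      # 94 °C 30 s, 60 °C 30 s, 72 °C 15 s
    final_extension=60,            # 1 min at 72 °C
)
print(total_seconds / 60)  # 52.5 (minutes of hold time)
```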

RESULTS AND DISCUSSION
After sequencing, we employed a combination of de novo assembly and reference-guided strategies using Illumina PE reads ranging from 587 to 741 Mbp, which represents approximately 226- to 400-fold CP genome coverage. The complete CP genomes of G. glabra, G. uralensis and G. glabra × G. uralensis were circles of 127,895 bp, 127,716 bp and 127,939 bp, respectively (Table 1). The complete CP gene content and order were identical among the Glycyrrhiza species (Fig. 1). These three CP genomes belong to the IRLC (Wojciechowski et al., 2004) of papilionoid legumes, where the loss of one copy of the IR has occurred. The Glycyrrhiza CP genomes harbor 110 annotated genes: 76 protein-coding genes, 30 tRNA genes and 4 rRNA genes (Table 2). Among these, nine protein-coding and six tRNA genes contain a single intron, while one gene (ycf3) contains two introns. infA, rpl22 and rps16 were absent in Glycyrrhiza species. Two of these genes, infA and rpl22, are also missing from the CP genomes of other legumes (Doyle et al., 1995) but are present in the nucleus (Gantt et al., 1991), and the loss of rps16 from CP DNA in Medicago and Populus has been reported (Ueda et al., 2008). Whole-genome alignments of Glycyrrhiza species with the annotation of G. glabra (KF201590) (Sabir et al., 2014) as a reference using mVISTA revealed their sequence variation (Fig. 2). The whole CP genome alignments showed that the coding regions are more highly conserved than the intergenic regions, as is the case in most angiosperms. Analysis of sequence variation between G. glabra (KF201590) and G. glabra (KU891817) showed 30 single-nucleotide polymorphisms (SNPs) and 24 insertions-deletions (InDels).
These SNPs and InDels may provide valuable information for authenticating Glycyrrhiza species. The CP genome of G. glabra × G. uralensis shared 99.98% and 99.85% nucleotide sequence identity with G. glabra and G. uralensis, respectively. The higher identity with G. glabra is consistent with maternal plastid inheritance (Hagemann et al., 2004) in Glycyrrhiza species.
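The kind of comparison reported above (SNP counts, InDel counts and percent identity) can be sketched in Python for two sequences that have already been aligned. This is only an illustration under stated assumptions: the function name is hypothetical, each contiguous gap run is counted as one InDel, and identity is taken as matching columns divided by alignment length. The text does not state which definitions were used, so the numbers in the study may have been derived differently.

```python
def compare_aligned(seq_a, seq_b, gap="-"):
    """Count SNPs and InDel events and compute percent identity for two
    pre-aligned sequences of equal length (gaps written as '-')."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if not seq_a:
        raise ValueError("alignment is empty")

    matches = snps = indels = 0
    gap_side = None  # which sequence carries the currently open gap run
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            side = "a" if a == gap else "b"
            if side != gap_side:      # a new gap run starts here
                indels += 1
            gap_side = side
            continue
        gap_side = None
        if a == b:
            matches += 1
        else:
            snps += 1

    identity = 100.0 * matches / len(seq_a)
    return {"snps": snps, "indels": indels, "identity": identity}


# Toy alignment (not real Glycyrrhiza data): one substitution, one 2-bp gap.
result = compare_aligned("ACGTAC--GT", "ACCTACGGGT")
print(result)  # {'snps': 1, 'indels': 1, 'identity': 70.0}
```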
The nrDNA sequences were assembled into single contigs that were either 5,947 bp or 5,948 bp in length. Glycyrrhiza glabra and G. glabra × G. uralensis showed two types of nrDNA, while G. uralensis contained a single type of nrDNA (Table 1). The complete nrDNA sequence unit contains 18S rRNA, ITS1, 5.8S rRNA, ITS2 and 26S rRNA (Fig. 3). The average GC content ranged from 53.86% to 53.91% and was thus almost identical among the five nrDNA sequences (Fig. 3).
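GC content of the kind reported above is a simple base count. The following Python sketch shows one common way to compute it; the function name is hypothetical, and ambiguous bases such as N are excluded from the denominator, which is an assumption rather than a documented choice of the study.

```python
def gc_content(sequence):
    """Return GC content in percent, counting only unambiguous A, C, G and T."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / total


# Toy sequence, not real nrDNA data.
print(gc_content("GGCCAATT"))  # 50.0
```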
Repeat sequences in the CP genomes of G. glabra, G. uralensis and G. glabra × G. uralensis were analyzed using Tandem Repeats Finder, version 4.0. A total of 20 unique sequences of tandem repeats were detected in the Glycyrrhiza CP genomes (Supplementary Table S2). The lengths of tandem repeats in the CP genomes ranged from 11 to 39 bp, and most of the tandem repeats appear in two copies. As in Bupleurum falcatum (Shin et al., 2016), most of the tandem repeat sequences were in noncoding regions, with only three genic regions (rps11, rpl20 and ycf1) containing tandem repeat sequences. Tandem repeat sizes identified in Glycyrrhiza CP genomes were invariably less than 40 bp, which is sufficient for illegitimate recombination (Sherman-Broyles et al., 2014).
SSRs, also known as microsatellites, frequently occur in CP genomes. In this study, mononucleotide SSRs were excluded. We identified 350, 349 and 352 SSRs with a length of at least 10 bp in G. glabra, G. uralensis and G. glabra × G. uralensis, respectively (Fig. 4). Among the SSRs, the pentanucleotide SSRs were the most abundant in the CP genomes, accounting for 84% of total SSRs. Di-, tri- and tetranucleotide repeats were composed predominantly of A and T, which reflects AT richness in the CP genomes (Zhang et al., 2011; Yi and Kim, 2012). These SSRs may further serve as genetic markers for phylogenetic and medicinal plant authentication studies (Zhang et al., 2016).
We detected 160 and 10 SNPs from the Glycyrrhiza CP genomes and nrDNAs, respectively (Supplementary Tables S3 and S4). Like SSRs, most SNPs in chloroplast DNA are located in non-coding regions, whereas SNPs in nrDNA were detected in ITS1, ITS2 and 26S. Furthermore, we identified 83 InDels in the Glycyrrhiza CP genomes. PCR primers were designed based on InDels and specific sequence regions (Supplementary Table S1). We successfully amplified four PCR products that can distinguish between G. glabra and G. uralensis (Fig. 5). The CP-based primer pairs ycf3F01/ycf3R01, atpHF01/atpHR01 and ycf2F01/ycf2R01 amplified PCR products in all three Glycyrrhiza CP genomes. The nrDNA-based 5.8SF01/5.8SR01 primer pair, on the other hand, amplified a PCR product only in G. glabra and G. glabra × G. uralensis. These primers will be used as Glycyrrhiza authentication markers.
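The SSR criteria stated above (mononucleotide repeats excluded, total length of at least 10 bp) can be sketched with a regular-expression scan in Python. The software actually used for SSR detection is not named in the text, so this is a generic illustration rather than the authors' procedure. The function names, the motif size range of 2 to 6 bp and the rule that a motif must not itself be a repeat of a shorter unit are assumptions.

```python
import re

MIN_LENGTH = 10  # bp, the minimum SSR length given in the text


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (so 'AA' and
    'ATAT' are rejected, which also excludes mononucleotide runs)."""
    return (motif + motif).find(motif, 1) == len(motif)


def find_ssrs(sequence, min_length=MIN_LENGTH, motif_sizes=range(2, 7)):
    """Return sorted (start, motif, copies) tuples for perfect SSRs.

    Simplified: each motif size is scanned independently with non-overlapping
    matches, so overlapping or compound repeats are not resolved.
    """
    seq = sequence.upper()
    hits = []
    for k in motif_sizes:
        min_copies = -(-min_length // k)  # ceiling of min_length / k
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if not is_primitive(motif):
                continue
            copies = len(match.group(0)) // k
            hits.append((match.start(), motif, copies))
    return sorted(hits)


# Toy sequence (not real CP data): (AT)5, a 12-bp poly-A run, and (TTCTA)2.
toy = "GG" + "AT" * 5 + "CC" + "A" * 12 + "G" + "TTCTA" * 2 + "G"
print(find_ssrs(toy))  # [(2, 'AT', 5), (27, 'TTCTA', 2)]
```

With a 10-bp minimum, a pentanucleotide motif qualifies with only two copies, whereas a dinucleotide motif needs five. The poly-A run in the toy sequence is ignored because its motif is not primitive.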
In this study, the complete Glycyrrhiza CP genomes and nrDNA have been sequenced. These genomes belong to the IRLC of papilionoid legumes, which is characterized by the loss of one copy of the IR. The complete CP genomes of G. glabra, G. uralensis and G. glabra × G. uralensis were 127,895 bp, 127,716 bp and 127,939 bp, respectively. The nrDNA sequences were either 5,947 bp or 5,948 bp. Glycyrrhiza glabra and G. glabra × G. uralensis showed two types of nrDNA, while G. uralensis contained a single type of nrDNA. We developed four reliable markers for the analysis of Glycyrrhiza diversity and for species authentication. This study will open up further avenues of research to develop a better understanding of the molecular ecology and molecular phylogeny within Glycyrrhiza species.

Fig. 5 (caption fragment; beginning not recovered): ... uralensis and G. glabra × G. uralensis, respectively. 1-4 represent the ycf3F01-ycf3R01, atpHF01-atpHR01, ycf2F01-ycf2R01 and 5.8SF01-5.8SR01 primer pairs, respectively. a, b PCR products are derived from CP genomes and nrDNA-based markers, respectively.