Comparative analysis of microsatellite, SNP, and InDel markers in four Rhododendron species based on RNA-seq

Shuzhen Wang; Zhiliang Li; Xudong Guo; Yuanping Fang; Jun Xiang; Weibin Jin

doi:10.1270/jsbbs.18092

Abstract

Rhododendron possesses valuable horticultural and medicinal properties. However, the genetic studies have been hindered due to the lack of genetic markers. Based on RNA-seq, large-scale molecular markers were developed from four Rhododendron species endemic to Dabie Mountains (central China): R. fortunei (5.25 Gb; SSRs, 12,756, one/2.37 kb, 147 types; SNPs, 38,313; InDels, 3,174), R. simsii (5.80 Gb; SSRs, 13,294, one/2.58 kb, 167 types; SNPs, 136,590; InDels, 6,258), R. mariesii (6.53 Gb; SSRs, 15,724, one/2.51 kb, 170 types; SNPs, 44,942; InDels, 4,126), and R. molle (4.35 Gb; SSRs, 10,214, one/2.49 kb, 110 types; SNPs, 77,829; InDels, 3,416). Di-nucleotide repeats were the main type (59.126%–64.314%), and AG/CT repeat (55.18%–61.22%) was the most. In particular, 89 species-specific types had been found. Furthermore, C:G→T:A mutation was the main SNP type (30.475%–34.99%). However, C:G→G:C mutation was the least type in R. fortunei, while T:A→G:C mutation was the least in the other three species. InDels with length of 3 nt was most in R. fortunei, but 1 nt InDels were the main type in the other three species. Twelve microsatellite markers developed from R. simsii reveled high genetic diversity in the four populations, and heterozygote excess was observed. This research would benefit the genetic study, molecular marker-assisted selection, and breeding studies in Rhododendron species.

Introduction

The genus Rhododendron (approximate 1,000 species), a large vascular plant genera, possesses valuable horticultural and medicinal properties (Popescu and Kopp 2013, Wang et al. 2017). Most species of this genus distribute in the Northern Hemisphere, especially in Malaysia, Himalayan region, and South-East Asia (Bhattacharyya 2011). In particular, 650 species are unique to China, and the juncture of Tibetan, Yunnan, and Sichuan provinces is key distribution center of Rhododendron species (Wang et al. 2017, Xu et al. 2017). Additionally, Dabie Mountains (central China) shows to be another diversity distribution center of Rhododendron species due to unique geographic feature (Wang et al. 2017).

Several Rhododendron species have been traditionally used in ancient medical systems, including Chinese, European, Ayurvedic, and North American folk medicine (Popescu and Kopp 2013). Moreover, they are often consumed for teas, jams, timber, insecticides, firewood, and even honey (Saǧiroǧlu et al. 2012). As main components of forest ecosystems, Rhododendron species are vital for ecosystem stability (Wang et al. 2017). Furthermore, R. pulchrum and R. hybridum have been domesticated and cultivated for beautiful flowers, leaves, and even tree shape, which play vital roles in the gardening, agricultural and horticultural sceneries. Therefore, the investigations of Rhododendron species are necessary, not only for biodiversity conservation, but also for sustainable utilization of Rhododendron germplasm resources. However, the genetic studies of Rhododendron species have been hindered due to the lack of genomic resources and genetic markers.

Molecular markers, especially microsatellites (simple sequence repeats, SSRs), SNP (single nucleotide polymorphism), and InDels (insertions and deletions), revealing polymorphisms at DNA level, show great potential in the study of molecular biology, genetics, landscape genetics, and conservation genetics. Microsatellites are powerful tool in genetic research due to dispersal throughout the whole genome, co-dominant inheritance, high reproducibility, high polymorphism, and high level of transferability (Zheng et al. 2013). Moreover, SNPs and InDels developed through high-throughput sequencing, also play vital roles in constructing high-resolution genetic map, analyzing genetic diversity and population structure, as well as investigating life-history evolution and population dynamics (Chopra et al. 2015). The next generation RNA sequencing (RNA-seq) enables the generation of genomic resources easily, therefore offers critical turning point for relevant Rhododendron research. Besides gene discovery and functional genomics studies, abundant molecular markers could also be developed from transcriptome data (Choudhary et al. 2018, Fang et al. 2017, Xiao et al. 2015, 2018, Xing et al. 2017).

In this study, RNA-seq was performed on total RNA extracted from flowers of four Rhododendron species (R. fortunei, R. mariesii, R. simsii, and R. molle) endemic to Dabie Mountains. Based on the transcriptome data, characteristics of SSR, SNP, and InDel markers were clarified. Furthermore, genetic diversity analysis of these four Rhododendron populations was carried out with SSR markers. This research would be contributed to the development of polymorphic markers, molecular marker assisted selection, breeding studies, as well as species conservation and sustainable utilization of the genus Rhododendron.

Materials and Methods

Materials

Blooming flowers of four wild Rhododendron species (five plants per species) were sampled from Dabie Mountains (30°57′20″–31°06′10″N, 116°02′20″–116°10′53″E, 900–1000 m), including R. fortunei, R. mariesii, R. simsii, and R. molle (Fig. 1). Moreover, tender leaves of R. fortunei, R. mariesii, R. simsii, and R. molle populations were also sampled, and each population contained 30 individuals. All these materials were conserved in Huanggang Normal University Herbarium (Huanggang, Hubei, China).

Fig. 1

Flower tissues of four Rhododendron species (A: R. fortunei; B: R. simsii; C: R. mariesii; and D: R. molle).

Transcriptome sequencing and functional annotation

Total RNA was extracted with TRIzol kit (Takara) following the manufacturer’s instructions. The mRNAs were enriched with magnetic beads coated with oligo(dT)₂₅, and reverse transcribed to first-strand cDNA. Then, double-strand cDNA was obtained with random hexamer primers. After addition of sequencing linkers, origin mRNA was digested, and cDNA was amplified by PCR. Fragments of about 500 bp were purified, and the quality and quantity were validated on an Agilent 2100 Bioanalyzer and an ABI StepOnePlus Real-time PCR System, then were used to construct paired-end libraries. The 500 bp libraries were sequenced on an Illumina Hiseq 2500 sequencer with a paired-end (PE) 125 strategy (Wuhan Ice Harbor Biotechnology Co., Ltd., Wuhan, China).

Functional annotation of these unigenes was carried out through the highest BLAST hit against GO (Gene ontology), KOG (euKaryotic Ortholog Groups), KEGG (Kyoto Encyclopedia of Genes and Genomes), Nt (NCBI nucleotide sequences), Nr (NCBI non-redundant protein sequences), and Swiss-Prot databases with the E-value cut off of 1E⁻⁵, as well as against the Pfam (Protein family) database with a hmmscan threshold of 1E⁻³ (Conesa et al. 2005, Kanehisa et al. 2012).

Characteristics of microsatellite markers

Raw data was filtered by FASTX (http://hannonlab.cshl.edu/fastx_toolkit/) and CUTADAPT (https://pypi.python.org/pypi/cutadapt/1.4.2) to remove low-quality sequence (more than 20% of the bases with quality value less than 10), reads with unknown nucleotides over 5%, as well as sequences containing adapters. Clean reads passed the filter threshold were assembled with Trinity software (Grabherr et al. 2011), which were further screened with MISA software (MicroSAtellite, http://pgrc.ipk-gatersleben.de/misa) in search of SSR motifs. In particular, the settings for minimum number of repeats were as follows: mono-nucleotide repeats ≥ 10 repeats; di-nucleotide repeats ≥ 6 repeats; tri-nucleotide repeats, tetra-nucleotide repeats, penta-nucleotide repeats, and hexa-nucleotide repeats all had at least 5 repeats, respectively. In related to compound SSRs, the minimum distance between two SSR motifs was restricted to 100 bp.

Mining of putative SNP and InDel markers

Sequencing data was outputted into SAM (Sequence Alignment/Map) format, which was further converted to BAM format using SAM tools 0.1.18 (Li et al. 2009). Variant calling analysis was carried out using SAMtools mpileup and bcftools with default parameters, resulting in a VCF file (variant call format) (Li et al. 2009). In particular, SNPs with low quality (QUAL < 999, DP < 30) were removed (Danecek et al. 2011). Moreover, high-quality InDel variants (quality scores >50) were also screened.

Genetic diversity analysis based on microsatellite markers

Genomic DNA was extracted from fresh leaves with a modified CTAB (cetyltrimethyl ammonium bromide) method and diluted to 50 ng/μL (Wang et al. 2017). A set of 45 R. simsii unigenes containing (AG/CT)n motifs (n ≥ 20) and enough flanking sequences (more than 100 bp) were randomly chosen for prime pair design with online software Primer 3 (Wang et al. 2017). In total, twenty SSR markers were developed for analysis of population diversity (Supplemental Table 1). The 10 μL SSR-PCR reaction system consisted of 0.2 μM for each primer, 50 ng genomic DNA, and 5 μL 2 × Taq Plus PCR MasterMix (TianGen, Beijing, China). PCR amplification conditions included initial denaturation at 95°C for 10 min, 35 amplification cycles (94°C for 30 s, annealing at optimal temperature for 40 s, and 72°C for 50 s), and a 10 min elongation step at 72°C. The PCR products were separated on 6% (w/v) denaturing polyacrylamide sequencing gels along with DNA marker (20–500 bp, Takara), and the gels were further visualized by silver staining. PCR products were scored manually, and the genetic parameters were calculated with the ARLEQUIN 3.01 software, including number of alleles (Na) per locus, expected heterozygosity (H_E), observed heterozygosity (H_O), and deviation from Hardy-Weinberg equilibrium (HWE) (Excoffier et al. 2005).

Results

RNA-seq and functional annotation of four Rhododendron species

After stringent quality assessment and data filtering, Illumina Hiseq 2500 sequencing produced 20,999,440 clean reads (5.25 gigabases, Gb) for R. fortunei, 23,201,636 (5.80 Gb) for R. simsii, 26,109,041 (6.53 Gb) for R. mariesii, and 17,406,941 (4.35 Gb) for R. molle, respectively (Table 1). Moreover, the clean rates ranged from 99.96% to 99.98%. The rates for Q20 bases (a base with quality value greater than 20) ranged from 96.1% to 97.0%. Raw data were available in NCBI’s Sequence Read Archive under accession numbers of SRR5242503 (R. fortunei), SRR5247112 (R. simsii), SRR5247113 (R. mariesii), and SRR5247114 (R. molle). Short-read sequences were firstly assembled into contigs with Trinity software, and further connected into 59,887, 92,469, 81,710, and 58,263 unigenes for R. fortunei, R. simsii, R. mariesii, and R. molle, respectively (Table 1). The N50 values ranged from 1,372 bp (R. molle) to 1,477 bp (R. mariesii). In related to R. fortunei, 12,824 (21.41%) unigenes ranged from 1,000 bp to 2,000 bp, and 6,153 (10.27%) unigenes had lengths of more than 2,000 bp (Fig. 2A). Additionally, numbers of unigenes ranging from 1,000 bp to 2,000 bp were 14,962 (16.18%), 16,812 (20.58%), and 11,919 (20.46%) for R. simsii, R. mariesii, and R. molle, respectively (Fig. 2B–2D).

Table 1 Transcriptome reads and assembled unigene information for four Rhododendron species

	R. fortunei	R. simsii	R. mariesii	R. molle
Clean reads	20,999,440	23,201,636	26,109,041	17,406,941
Clean bases	5.25 Gb	5.80 Gb	6.53 Gb	4.35 Gb
Clean rate	99.97%	99.98%	99.96%	99.97%
Q20	97.00%	96.80%	96.10%	96.80%
Numbers of unigenes	59,887	92,469	81,710	58,263
N50	1,465 bp	1,465 bp	1,477 bp	1,372 bp
Total number of identified SSRs	12,756	13,294	15,724	10,214
Number of SSR containing sequences	9,108	10,076	11,267	7,644
Number of sequences containing more than 1 SSR	2,724	2,510	3,345	1,965
Number of SSRs present in compound formation	1,585	1,337	1,814	1,058

Fig. 2

Length distribution of unigenes for four Rhododendron species.

Contigs from the four species were pooled and assembled into a non-redundant unigene set, yielding 125,502 unigenes with an average length of 452 bp and an N50 of 540 bp. Totally, 82,623 unigenes (65.834%) could be annotated to at least one database, and 20,124 unigenes (16.035%) were annotated to all seven databases. In particular, 50,104 unigenes (39.923%) showed significant identity to GO database, while ‘metabolic process’ (35,178 unigenes), ‘cell and cell part’ (20,557 unigenes), and ‘catalytic activity’ (26,665 unigenes) represented the largest subcategories under ‘biological process’, ‘cellular component’, and ‘molecular function’ categories, respectively. Furthermore, 40,600 (32.35%) unigenes could be annotated to KOG database, and the clusters of ‘general function prediction’ (7,227 unigenes), ‘posttranslational modification, protein turnover, chaperones’ (4,710 unigenes), ‘signal transduction mechanisms’ (3,613 unigenes), ‘energy production and conversion’ (2,923 unigenes), ‘translation, ribosomal structure and biogenesis’ (2,883 unigenes) comprised the first five largest groups. In addition, 41,014 (32.68%) unigenes matched the KEGG database, which were assigned to 5 main categories: ‘organismal systems’ (11,073 unigenes), ‘metabolism’ (7,168 unigenes), ‘genetic information processing’ (10,265 unigenes), ‘environment information processing’ (7,177 unigenes), and ‘cellular processes’ (5,331 unigenes).

A lot of homologous genes involved in synthesis and participation of flower pigments (anthocyanin, carotenoids, and betalains) were found, including ‘flavonoid biosynthesis’ (GO: 0009813), ‘flavonol biosynthesis process’ (GO: 0051555), ‘anthocyanin-containing compound biosynthesis process’ (GO: 0009718), ‘positive regulation of flavonoid biosynthesis’ (GO: 0009963), ‘anthocyanin accumulation in tissue in response to UV light’ (GO: 0043418), ‘regulation of anthocyanin biosynthesis process’ (GO: 0031540), ‘anthocyanin-containing compound metabolic process’ (GO: 0046283), and ‘carotenoid biosynthesis process’ (GO: 0016117). In total, 59 unigenes involved in anthocyanidin synthesis pathway were identified through searching against Nr database, such as homologs of anthocyanidin synthase (ASN), chalcone isomerase (CHI), chalcone synthase (CHS), dihydroflavonol 4-reductase (DFR), flavonol 3-hydroxylase (F3H), flavonoid 3′-hydroxylase (F3′H), 3′,5′-hydroxylase (F3′5′H), flavonol synthase (FLS), and flavonoid 3-O-glucosyltransferase (UFGT).

Characteristics of microsatellite markers

Totally, 12,756, 13,294, 15,724 and 10,214 SSR loci had been identified with distribution densities of one SSR locus per 2.37 kb, 2.58 kb, 2.51 kb, and 2.49 kb, while 1,585, 1,337, 1,814 and 1,058 SSRs were present in compound formation in transcriptome of R. fortunei, R. simsii, R. mariesii and R. molle, respectively (Table 1). Particularly, di-nucleotide repeats were the main type, with the frequencies ranged from 59.126% (R. mariesii) to 64.314% (R. molle) (Table 2). The mono-nucleotide motifs (19.571%–24.873%) and tri-nucleotide motifs (15.149%–15.759%) were the second and third main types for all four species, respectively. Moreover, penta-nucleotide repeats were the least type for R. fortunei (0.172%) and R. simsii (0.218%), while hexa-nucleotide repeats were the least one for both R. mariesii (0.191%) and R. molle (0.088%).

Table 2 Numbers and percentages of different repeat types in four Rhododendron species

Types	R. fortunei		R. simsii		R. mariesii		R. molle
Types	Number	Percentage	Number	Percentage	Number	Percentage	Number	Percentage
Mono-	2,532	19.849%	3,035	22.830%	3,911	24.873%	1,999	19.571%
Di-	8,165	64.009%	8,044	60.509%	9,297	59.126%	6,569	64.314%
Tri-	1,966	15.412%	2,095	15.759%	2,382	15.149%	1,601	15.675%
Tetra-	46	0.361%	49	0.369%	64	0.407%	22	0.215%
Penta-	22	0.172%	29	0.218%	40	0.254%	14	0.137%
Hexa-	25	0.196%	42	0.316%	30	0.191%	9	0.088%
Total	12,756	100.000%	13,294	100.000%	15,724	100.000%	10,214	100.000%

A total of 147, 167, 170, and 110 different repeat types had been found for R. fortunei, R. simsii, R. mariesii, and R. molle, respectively. For mono-nucleotide repeats, A/T was the main type, ranging from 19.05% (R. fortunei) to 24.186% (R. mariesii). Similarly, the AG/CT (55.18%–61.22%), AC/GT (1.889%–2.309%), and AT/AT (1.165%–1.603%) were the main di-nucleotide repeat units. Furthermore, AAG/CTT (4.014%–4.494%), AGG/CCT (2.821%–3.123%), ACC/GGT (1.959%–2.279%), and AGC/CTG (1.527%–1.610%) were the main tri-nucleotide repeat units. The main tetra-nucleotide repeat was AAAG/CTTT (0.059%–0.083%). However, the main types of penta- and hexa-nucleotide repeats differed largely among these four species (Supplemental Table 2).

Tetra-, penta-, and hexa-nucleotide repeats differed largely in these four Rhododendron species. In particular, 7 tetra-nucleotide repeats and 3 penta-nucleotide repeats were possessed by all four species. No same hexa-nucleotide repeats existed in these four species. Totally, 89 different types had been searched, of which 13, 34, 42 types were tetra-, penta-, and hexa-nucleotide repeats, respectively (Supplemental Table 2). For tetra-nucleotide repeats, ATCC/ATGG only occurred in R. fortunei; ACTC/AGTG and AACC/GGTT were peculiar to R. simsii; AATG/ATTC and AATT/AATT were owned only by R. mariesii; ACGT/ACGT was peculiar to R. molle. In related to penta-nucleotide repeats, 4, 5, 15, and 2 types were peculiar to R. fortunei, R. simsii, R. mariesii, and R. molle, respectively. Moreover, 6, 12, 7, and 3 hexa-nucleotide repeats were specific to R. fortunei, R. simsii, R. mariesii, and R. molle, respectively.

For mono-nucleotide repeats, SSRs with repeat numbers of 9–12 were the most abundant, with the frequencies of 12.25%, 12.89%, 13.36%, and 11.63% for R. fortunei, R. simsii, R. mariesii, and R. molle, respectively (Fig. 3). In related to di-nucleotide repeats, SSRs with 5–8 repeats were the most abundant, followed by these with 9–12 tandem repeats in all four species. No tri-nucleotide repeats with repeat numbers more than 24 existed in R. fortunei. Moreover, tri-nucleotide repeats with repeat numbers of more than 21 were rare in R. mariesii, with the frequency of 0.1%. In related to R. simsii, there were only tri-nucleotide repeats showing less than 16 reiterations. Furthermore, the maximum reiteration of tri-nucleotide repeat motifs was 20 in R. molle. Most of tetra-nucleotide repeats showed repeat numbers of 5–8 in all four species, while only 3 (0.02%) and 1 (0.01%) motifs repeating more than 9 times were found in R. fortunei and R. mariesii, respectively. Only penta-nucleotide repeats with repeat numbers of 5–8 were found in all four species. Furthermore, most hexa-nucleotide repeats repeated 5–8 times (0.09–0.31%), while only a few repeated 9–12 times (0–0.02%) in four species.

Fig. 3

Repeat counts of different SSR motif unit (A: R. fortunei; B: R. simsii; C: R. mariesii; and D: R. molle).

Characteristics of SNP markers

Totally, 38,313, 136,590, 44,942, and 77,829 putative SNPs have been identified in flower transcriptomes of R. fortunei, R. simsii, R. mariesii, and R. molle, respectively (Fig. 4). Additionally, six types had been found, including T:A→C:G, T:A→A:T, T:A→G:C, C:G→T:A, C:G→G:C, and C:G→A:T. In particular, the C:G→T:A SNP was the main type, ranging from 30.475 % (13,696, R. mariesii) to 34.999% (27,239, R. molle). Moreover, the T:A→C:G and T:A→A:T SNPs were the second (25.574%–27.217%) and third most types (17.402%–18.464%), respectively. In transcriptome of R. fortunei, C:G→G:C SNP was the least type, with a frequency of 5.982% (2,292). However, T:A→G:C SNP was the least in transcriptomes of R. simsii (7.514%, 10,263), R. mariesii (7.777%, 3,495), and R. molle (6.742%, 5,247).

Fig. 4

Mutation spectrum of SNP markers in four species (A: R. fortunei; B: R. simsii; C: R. mariesii; and D: R. molle, respectively).

Furthermore, these SNPs could be classified into transitions (purine to purine or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or pyrimidine to purine) (Choudhary et al. 2018). In related to these four Rhododendron species, two types of transitions were found, including T:A→C:G and C:G→T:A, while the rest four types (T:A→A:T, T:A→G:C, C:G→G:C, and C:G→A:T) were classified into transversions. In particular, base transitions were the highest SNP type in all four Rhododendron species, and the percentages ranged from 57.692% to 60.901%, while the percentages of base transversions varied between 39.099% and 42.308%. Base transitions took place frequently in R. fortunei (60.901%), followed by R. molle (60.573%) and R. simsii (58.684%). However, base transversions occurred frequently in R. mariesii (42.3079%), and then was in R. simsii (41.316%).

Characteristics of InDel markers

A set of 3,174, 6,258, 4,126, and 3,416 putative InDel markers, containing “insertion markers” and “deletion markers”, were found in transcriptomes of R. fortunei, R. simsii, R. mariesii, and R. molle, respectively. In all four species, “deletion markers” were slightly more than “insertion markers” (Fig. 5). However, the 6 nt “insertion markers” were more than 6 nt “deletion markers” in all four species. For the 7 nt InDels, “insertion markers” were more than “deletion markers” in R. fortunei, R. simsii, and R. molle. Similarly, “insertion markers” were more than “deletion markers” regarding with both the 3 nt (R. fortunei, R. mariesii, and R. molle) and 4 nt InDels (R. fortunei, R. simsii, and R. mariesii). In R. fortunei and R. mariesii, the 1 nt “insertion markers” were also more than 1 nt “deletion markers”. Furthermore, 13 nt “insertion markers” were a little more than 13 nt “deletion markers” in both R. simsii and R. molle. In particular, numbers of 5 nt, 9 nt, 11 nt, and 19 nt “insertion markers” were equal to that of “deletion markers” in R. molle. Moreover, the 17 nt and 14 nt “insertion markers” were equal to that of “deletion markers” in R. fortunei and R. mariesii, respectively.

Fig. 5

Distribution of InDel markers with different lengths in four Rhododendron species (A: R. fortunei; B: R. simsii; C: R. mariesii; and D: R. molle).

Basically, numbers of InDels decreased with the increase of length. In particular, InDels with lengths of 1–9 nt made up large proportions, with the percentages ranging from 85.167% (R. mariesii) to 90.412% (R. simsii). Moreover, the InDels with lengths of more than 10 nt took up only 13.705%, 9.588%, 14.833%, and 10.158% in R. fortunei, R. simsii, R. mariesii, and R. molle, respectively (Fig. 5). In R. fortunei, InDels with length of 3 nt (623, 19.628%) was the main type, followed by 1 nt (606, 19.093%) and 2 nt InDel types (482, 15.186%). However, the 1 nt InDels (20.383%–26.686%) were the main type, followed by 3 nt (16.990%–20.872%) and 2 nt (15.324%–15.544%) InDels in R. simsii, R. mariesii, and R. molle. In all four Rhododendron species, the 19 nt InDel markers were the least type (10–24, 0.293%–0.582%).

Genetic diversity analysis of four Rhododendron populations

Among the twenty microsatellite markers, twelve were polymorphic in R. simsii population, accounting for 60% (Table 3, Supplemental Table 1, Supplemental Fig. 1). The Na ranged from 3 (RsE-17 and RsE-93) to 9 (RsE-70 and RsE-78), with an average of 5.42 in R. simsii population. Moreover, these markers gave out 2–5 alleles in both R. fortunei and R. mariesii populations, as well as 3–6 in R. molle population. The mean value of alleles were 2.8, 3.08, and 3.25 in R. fortunei, R. mariesii, and R. molle populations, respectively. However, RsE-90 locus was monomorphic in R. fortunei and R. molle populations, while RsE-70 could not also amplify any products in R. molle population, which might due to gene mutation.

Table 3 Size range, numbers of alleles (Na), observed (H_O) and expected (H_E) heterozygosities, and Hardy-Weinberg equilibrium (HWE) after Bonferroni correction (^*P < 0.004) in four different populations

Locus	Size range (bp)	R. fortunei (30 samples)				R. simsii (30 samples)				R. mariesii (30 samples)				R. molle (30 samples)
Locus	Size range (bp)	Na	Ho	H_E	HWE	Na	Ho	H_E	HWE	Na	Ho	H_E	HWE	Na	Ho	H_E	HWE
RsE-17	188–193	2	0.000	0.331	0.001^*	3	0.000	0.622	0.016	2	0.000	0.429	0.143	3	0.000	0.519	0.000^*
RsE-37	202–219	4	0.933	0.692	0.000^*	5	1.000	0.663	0.002^*	2	0.714	0.538	0.511	6	0.737	0.747	0.002^*
RsE-56	217–227	2	0.846	0.508	0.023	5	1.000	0.797	0.001^*	3	0.600	0.568	0.029	5	0.400	0.822	0.025
RsE-65	199–210	2	0.000	0.405	0.000^*	4	0.800	0.742	0.008	2	0.222	0.523	0.171	3	0.000	0.662	0.000^*
RsE-70	179–195	3	0.923	0.551	0.008	9	1.000	0.883	0.000^*	4	0.800	0.763	0.006	–	–	–	–
RsE-77	150–158	3	1.000	0.600	0.001^*	5	1.000	0.763	0.011	4	1.000	0.700	0.039	5	0.938	0.738	0.015
RsE-78	161–179	5	1.000	0.770	0.027	9	1.000	0.864	0.000^*	5	1.000	0.810	0.129	4	1.000	0.684	0.000^*
RsE-81	230–240	3	1.000	0.638	0.001^*	4	0.800	0.789	0.079	3	0.625	0.708	0.054	4	0.733	0.614	0.001
RsE-85	152–162	2	1.000	0.517	0.000^*	7	1.000	0.877	0.000^*	3	1.000	0.574	0.007	3	0.900	0.642	0.000^*
RsE-90	191–201		–	–	–	4	0.500	0.652	0.153	3	0.143	0.560	0.020	–	–	–	–
RsE-93	236–246	3	0.929	0.632	0.000^*	3	1.000	0.680	0.046	3	0.700	0.616	0.024	3	1.000	0.618	0.001
RsE-97	247–261	4	0.933	0.692	0.000^*	7	1.000	0.868	0.001^*	3	1.000	0.637	0.016	3	1.000	0.629	0.000^*
Mean	–	2.8	0.714	0.528	–	5.42	0.842	0.767	–	3.08	0.650	0.619	–	3.25	0.559	0.556	–
sd	–	1.1	0.414	0.199	–	2.02	0.293	0.091	–	0.86	0.340	0.105	–	1.74	0.426	0.259	–

In R. simsii population, the H_O and H_E ranged from 0 to 1 and 0.622 to 0.883, with the mean values of 0.842 and 0.767, respectively. Similarly, heterozygote excess was also observed in the other three populations, as the average values of H_O were all higher than H_E in R. fortunei (0.714 > 0.528), R. mariesii (0.650 > 0.619), and R. molle (0.559 > 0.556) populations (Table 3). Eight loci (RsE-17, RsE-37, RsE-65, RsE-77, RsE-81, RsE-85, RsE-93, and RsE-97) significantly deviated from Hardy-Weinberg equilibrium (Bonferroni-corrected P < 0.05/12 = 0.004) in R. fortunei population. Moreover, six locus were significantly deviated from Hardy-Weinberg equilibrium both in R. simsii population (RsE-37, RsE-56, RsE-70, RsE-78, RsE-85, and RsE-97) and R. molle population (RsE-17, RsE-37, RsE-65, RsE-78, RsE-85, and RsE-97). However, all these twelve locus were in Hardy-Weinberg equilibrium in R. mariesii population, inferring that these four species possessed different ecological adaptabilities.

Discussion

In this research, the deep transcriptome sequencing and de novo assembly of short-read sequences could provide reference transcriptomes for other Rhododendron species, as genomic data on Rhododendron species is very scarce. Among these unigenes, 65.834% could be annotated through homology searches against the already available databases, inferring that the transcriptome assembly formed a sound basis for future Rhododendron research. In particular, the assignment of 39.923% assembled transcripts under different GO categories at the levels of biological process, cellular component, and molecular function confirmed the broad functional diversity. The homologous genes involved in synthesis of flower pigments could be benefit for functional genomics studies in related Rhododendron species. The unannotated transcripts might be species-specific genes, untranslated regions, sequences with unknown domain, and even non-coding RNAs (Xing et al. 2017).

Based on non-redundant unigene sequences, 12,756, 13,294, 15,724 and 10,214 SSR loci had been mined from flower transcriptomes of R. fortunei, R. simsii, R. mariesii, and R. molle, which showed great potential for the study of genetic structure, diversity analysis, construction of genetic map, molecular marker association analysis, and underlying genetic basis of adaptive traits of Rhododendron species (Vasemägi et al. 2005, Zheng et al. 2013). The SSR distribution densities ranged from one SSR locus per 2.37 kb in R. fortunei to 2.58 kb in R. simsii, which were all higher than that of R. latoucheae (2.87 kb/SSR locus), rice (3.4 kb/SSR locus), Amorphophallus (3.63 kb/SSR locus), Triticum aestivum L. em. Thell (5.4 kb/SSR locus), Glycine max (7.4 kb/SSR locus), Arabidopsis (14 kb/SSR locus), and cotton (20 kb/SSR locus) (Cardle et al. 2000, Peng and Lapitan 2005, Xing et al. 2017, Zheng et al. 2013). The differences in SSR density in different species might be due to the software used, the search parameters for exploring SSR markers, as well as the different genome sizes (Varshney et al. 2002, Zheng et al. 2013). In particular, di-nucleotide repeat motifs (59.126%–64.314%) were the main types in these four Rhododendron species, followed by mono-nucleotides (19.571%–24.873%) and tri-nucleotides (15.149%–15.759%), which was in agreement with EST-SSR distributions in Amorphophallus, Cajanus cajan, spruce, Cucurbita pepo, Actinidia, and R. latoucheae (Crowhurst 2004, Dutta et al. 2011, Gong et al. 2008, Rungis et al. 2004, Xing et al. 2017, Zheng et al. 2013). However, the least repeat type differed in the four Rhododendron species: penta-nucleotide repeats in R. fortunei and R. simsii; hexa-nucleotide repeats in R. mariesii and R. molle.

The most abundant di-nucleotide repeat motif was AG/CT (55.18%–61.22%) in these four Rhododendron species, which was the same with that of Momordica charantia, Tall fescue, Ipomoea batatas, and Arachis hypogaea (Liang et al. 2009, Saha et al. 2004, Wang et al. 2010a, 2010b). AAG/CTT was the main tri-nucleotide repeat in R. fortunei, R. simsii, R. mariesii, and R. molle, which were in agreement with that of soybean and Epimedium sagittatum, but were different from that of Amorphophallus (AGG/CCT), wheat (AAC/TTG), maize (CCG/GGC), and barley (CCG/GGC) (Kantety et al. 2002, Zeng et al. 2010, Zheng et al. 2013). Moreover, CCG/CGG, a common SSR type in monocots, is very rare in dicotyledonous plants (Wang et al. 2011). In related to these four Rhododendron species, CCG/CGG occurred with low frequencies of 1.294% in R. fortunei, 1.173% in R. simsii, 1.208% in R. mariesii, and 1.155% in R. molle, respectively.

SNP markers, co-dominant, bi-allelic, and evenly distributed in the whole genome, are highly valuable for molecular and genetic research, as well as modern breeding. However, only a few SNPs have been reported in Rhododendron species. From floral and foliar transcriptome of R. arboreum, 811 high-quality SNPs in 719 contigs have been found, including 55.2% transitions, 33.8% transversions, and 9.2% indels (Choudhary et al. 2018). In this research, the C:G→T:A SNP was the main type in all four species, followed by T:A→C:G and T:A→A:T types, suggesting that these mutation types occurred more frequently in genomes of Rhododendron species. However, the least SNP types differed largely in the four species, inferring different gene mutation and molecular evolution occurred. In particular, base transitions were the highest SNP types, ranging from 57.692% (R. mariesii) to 60.901% (R. fortunei), which were in agreement with that of R. arboreum (60.2%) (Choudhary et al. 2018). For InDel markers, the “deletion markers” were more than “insertion markers” in all four species, inferring that “deletion mutation” took place more frequently in Rhododendron species. In particular, InDels with length of 1–3 nt were the main types.

Widespread and long-lived species possess high genetic diversity (Wang et al. 2013). Among the four Rhododendron populations, R. simsii had the highest genetic diversity (H_E: 0.767 ± 0.091), higher than that of R. decorum (H_E: 0.758 ± 0.048) and R. jinggangshanicum (H_E: 0.642 ± 0.200) (Li et al. 2015, Wang et al. 2013). However, R. fortunei population had the lowest genetic diversity (H_E: 0.528 ± 0.199), which was still higher than that of Daviesia suaveolens (H_E: 0.274 ± 0.056) and Firmiana danxiaensis (H_E: 0.364 ± 0.019) (Chen et al. 2014, Young and Brown 1996). In all four species, heterozygote excess was observed, as outcross took place frequently. However, genetic diversity differed largely, which might be caused by species difference, as well as different adaptabilities to the same environmental selection pressures.

Beside serving as reference transcriptomes for Rhododendron species, these transcriptome data will facilitate gene discovery and functional genomics studies in Rhododendron. The newly identified microsatellite, SNP, and InDel markers possess important implications for constructing high-density genetic linkage maps, assessing germplasm polymorphism and evolution, marker assisted selection, cloning functional gene, as well as protection and utilization of wild Rhododendron germplasm resources.

Acknowledgment

Grants from National Natural Science Foundation of China (NSFC 31500995) and Open fund of Hubei Key Laboratory of Economic Forest Germplasm Improvement and Resources Comprehensive Utilization (2017BX06).

Literature Cited

Bhattacharyya, D. (2011) Rhododendron species and their uses with special reference to Himalayas—a review. Assam University J. Sci. Tech. 7: 161–167.
Cardle, L., L. Ramsay, D. Milbourne, M. Macaulay, D. Marshall and R. Waugh (2000) Computational and experimental characterization of physically clustered simple sequence repeats in plants. Genetics 156: 847–854.
Chen, S.F., M. Li, R.F. Hou, W. Liao, R. Zhou and Q. Fan (2014) Low genetic diversity and weak population differentiation in Firmiana danxiaensis, a tree species endemic to Danxia landform in northern Guangdong, China. Biochem. Syst. Ecol. 55: 66–72.
Chopra, R., G. Burow, A. Farmer, J. Mudge, C.E. Simpson, T.A. Wilkins, M.R. Baring, N. Puppala, K.D. Chamberlin and M.D. Burow (2015) Next-generation transcriptome sequencing, SNP discovery and validation in four market classes of peanut, Arachis hypogaea L. Mol. Genet. Genomics 290: 1169–1180.
Choudhary, S., S. Thakur, R.A. Najar, A. Majeed, A. Singh and P. Bhardwaj (2018) Transcriptome characterization and screening of molecular markers in ecologically important Himalayan species (Rhododendron arboreum). Genome 61: 417–428.
Conesa, A., S. Götz, J.M. Garcíagómez, J. Terol, M. Talón and M. Robles (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21: 3674–3676.
Crowhurst, R.N. (2004) EST-derived microsatellites from Actinidia species and their potential for mapping. Theor. Appl. Genet. 108: 1010–1016.
Danecek, P., A. Auton, G. Abecasis, C.A. Albers, E. Banks, M.A. DePristo, R.E. Handsaker, G. Lunter, G.T. Marth, S.T. Sherry et al. (2011) The variant call format and VCFtools. Bioinformatics 27: 2156–2158.
Dutta, S., G. Kumawat, B.P. Singh, D.K. Gupta, S. Singh, V. Dogra, K. Gaikwad, T.R. Sharma, R.S. Raje, T.K. Bandhopadhya et al. (2011) Development of genic-SSR markers by deep transcriptome sequencing in pigeonpea [Cajanus cajan (L.) Millspaugh]. BMC Plant Biol. 11: 17.
Excoffier, L., G. Laval and S. Schneider (2005) Arlequin (Version 3.01): An integrated software package for population genetics data analysis. Computational and Molecular Population Genetics Lab (CMPG) Institute of Zoology University of Berne, Berne, Switzerland.
Fang, L., J. Tong, Y. Dong, D. Xu, J. Mao and Y. Zhou (2017) De novo RNA sequencing transcriptome of Rhododendron obtusum identified the early heat response genes involved in the transcriptional regulation of photosynthesis. PLoS ONE 12: e0186376.
Gong, L., G. Stift, R. Kofler, M. Pachner and T. Lelley (2008) Microsatellites for the genus Cucurbita and an SSR-based genetic linkage map of Cucurbita pepo L. Theor. Appl. Genet. 117: 37–48.
Grabherr, M.G., B.J. Haas, M. Yassour, J.Z. Levin, D.A. Thompson, I. Amit, X. Adiconis, L. Fan, R. Raychowdhury, Q. Zeng et al. (2011) Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29: 644–652.
Kanehisa, M., S. Goto, Y. Sato, M. Furumichi and M. Tanabe (2012) KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40: D109–D114.
Kantety, R.V., M.L. La Rota, D.E. Matthews and M.E. Sorrells (2002) Data mining for simple sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and wheat. Plant Mol. Biol. 48: 501–510.
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and R. Durbin (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079.
Li, M., S. Chen, S. Shi, Z. Zhang, W. Liao, W. Wu, R. Zhou and Q. Fan (2015) High genetic diversity and weak population structure of Rhododendron jinggangshanicum, a threatened endemic species in Mount Jinggangshan of China. Biochem. Syst. Ecol. 58: 178–186.
Liang, X., X. Chen, Y. Hong, H. Liu, G. Zhou, S. Li and B. Guo (2009) Utility of EST-derived SSR in cultivated peanut (Arachis hypogaea L.) and Arachis wild species. BMC Plant Biol. 9: 35.
Peng, J. and N.L.V. Lapitan (2005) Characterization of EST-derived microsatellites in the wheat genome and development of eSSR markers. Funct. Integr. Geno. 5: 80–96.
Popescu, R. and B. Kopp (2013) The genus Rhododendron: An ethnopharmacological and toxicological review. J. Ethnopharmacol. 147: 42–62.
Rungis, D., Y. Bérubé, J. Zhang, S. Ralph, C.E. Ritland, B.E. Ellis, C. Douglas, J. Bohlmann and K. Ritland (2004) Robust simple sequence repeat markers for spruce (Picea spp.) from expressed sequence tags. Theor. Appl. Genet. 109: 1283–1294.
Sağıroğlu, M., A. Arslantürk, Z.K. Akdemir and M. Turna (2012) An ethnobotanical survey from hayrat (Trabzon) and Kalkandere (Rize/Turkey). Biol. Divers. Conserv. 5: 31–43.
Saha, M.C., M.A.R. Mian, I. Eujayl, J.C. Zwonitzer, L. Wang and G.D. May (2004) Tall fescue EST-SSR markers with transferability across several grass species. Theor. Appl. Genet. 109: 783–791.
Varshney, R.K., T. Thiel, N. Stein, P. Langridge and A. Graner (2002) In silico analysis on frequency and distribution of microsatellites in ESTs of some cereal species. Cell Mol. Biol. Lett. 7: 537–546.
Vasemägi, A., J. Nilsson and C.R. Primmer (2005) Expressed sequence tag-linked microsatellites as a source of gene-associated polymorphisms for detecting signatures of divergent selection in Atlantic salmon (Salmo salar L.). Mol. Biol. Evol. 22: 1067–1076.
Wang, S., Z. Li, W. Jin, F. Xiang, J. Xiang and Y. Fang (2017) Development and characterization of polymorphic microsatellite markers in Rhododendron simsii (Ericaceae). Plant Spec. Biol. 32: 100–103.
Wang, S.Z., L. Pan, K. Hu, C.Y. Chen and Y. Ding (2010a) Development and characterization of polymorphic microsatellite markers in Momordica charantia (Cucurbitaceae). Am. J. Bot. 97: e75–e78.
Wang, X.Q., Y. Huang and C.L. Long (2013) Assessing the genetic consequences of flower-harvesting in Rhododendron decorum Franchet (Ericaceae) using microsatellite markers. Biochem. Syst. Ecol. 50: 296–303.
Wang, Z., B. Fang, J. Chen, X. Zhang, Z. Luo, L. Huang, X. Chen and Y. Li (2010b) De novo assembly and characterization of root transcriptome using Illumina paired-end sequencing and development of cSSR markers in sweetpotato (Ipomoea batatas). BMC Genomics 11: 726.
Wang, Z., J. Li, Z. Luo, L. Huang, X. Chen, B. Fang, Y. Li, J. Chen and X. Zhang (2011) Characterization and development of EST-derived SSR markers in cultivated sweetpotato (Ipomoea batatas). BMC Plant Biol. 11: 139.
Xiao, S., Z. Han, P. Wang, F. Han, Y. Liu, J. Li and Z.Y. Wang (2015) Functional marker detection and analysis on a comprehensive transcriptome of large yellow croaker by next generation sequencing. PLoS ONE 10: e0124432.
Xiao, Z., J. Su, X. Sun, C. Li, L. He, S. Cheng and X. Liu (2018) De novo transcriptome analysis of Rhododendron molle, G. Don flowers by Illumina sequencing. Genes Genomics 40: 591–601.
Xing, W., J. Liao, M. Cai, Q. Xia, Y. Liu, W. Zeng and X. Jin (2017) De novo assembly of transcriptome from Rhododendron latoucheae Franch. using Illumina sequencing and development of new EST-SSR markers for genetic diversity analysis in Rhododendron. Tree Genet. Genomes 13: 53.
Xu, J.J., L.Y. Zhang, B. Zhao and H.F. Shen (2017) Assessment of genetic diversity among six populations of Rhododendron triflorum in Tibet using ISSR and AFLP markers. S. Afr. J. Bot. 108: 175–183.
Young, A.G. and A.H.D. Brown (1996) Comparative population genetic structure of the rare woodland shrub Daviesia suaveolens and its common congener D. mimosoides. Conser. Biol. 10: 1220–1228.
Zeng, S., G. Xiao, J. Guo, Z. Fei, Y. Xu, B.A. Roe and Y. Wang (2010) Development of a EST dataset and characterization of EST-SSRs in a traditional Chinese medicinal plant, Epimedium sagittatum (Sieb. Et Zucc.) Maxim. BMC Genomics 11: 94.
Zheng, X., C. Pan, Y. Diao, Y. You, C. Yang and Z. Hu (2013) Development of microsatellite markers by transcriptome sequencing in two species of Amorphophallus (Araceae). BMC Genomics 14: 490.

Corresponding author

Register with J-STAGE for free!