2016 年 91 巻 2 号 p. 111-125
In quantitative gene expression analysis, normalization using a reference gene as an internal control is frequently performed for appropriate interpretation of the results. Efforts have been devoted to exploring superior novel reference genes using microarray transcriptomic data and to evaluating commonly used reference genes by targeting analysis. However, because the number of specifically detectable genes is totally dependent on probe design in the microarray analysis, exploration using microarray data may miss some of the best choices for the reference genes. Recently emerging RNA sequencing (RNA-seq) provides an ideal resource for comprehensive exploration of reference genes since this method is capable of detecting all expressed genes, in principle including even unknown genes. We report the results of a comprehensive exploration of reference genes using public RNA-seq data from plants such as Arabidopsis thaliana (Arabidopsis), Glycine max (soybean), Solanum lycopersicum (tomato) and Oryza sativa (rice). To select reference genes suitable for the broadest experimental conditions possible, candidates were surveyed by the following four steps: (1) evaluation of the basal expression level of each gene in each experiment; (2) evaluation of the expression stability of each gene in each experiment; (3) evaluation of the expression stability of each gene across the experiments; and (4) selection of top-ranked genes, after ranking according to the number of experiments in which the gene was expressed stably. Employing this procedure, 13, 10, 12 and 21 top candidates for reference genes were proposed in Arabidopsis, soybean, tomato and rice, respectively. Microarray expression data confirmed that the expression of the proposed reference genes under broad experimental conditions was more stable than that of commonly used reference genes. These novel reference genes will be useful for analyzing gene expression profiles across experiments carried out under various experimental conditions.
Gene expression levels are spatiotemporally regulated to maintain and control biological processes, such as developmental events and responses to environmental stimuli. Therefore, monitoring gene expression levels has played an important role in research addressing the molecular mechanisms underlying biological functions. Among the methods for gene expression analysis, reverse-transcription quantitative real-time polymerase chain reaction (RT-qPCR) has been considered the gold standard method to quantify transcript levels because of its wide dynamic range and its high specificity, sensitivity and throughput (Bustin et al., 2005; Hoebeeck et al., 2007; Schmittgen et al., 2008). Whereas comprehensive methods using microarray and next-generation sequencing technologies have been developed for high-throughput analysis of gene expression profiles, RT-qPCR still has advantages in quickness, sensitivity, accuracy and cost in targeting analysis.
While RT-qPCR is accepted as an accurate method, special attention must be paid to errors caused by machine, enzyme and pipet variation (Bustin, 2002; Vandesompele et al., 2002). To eliminate such errors and prevent misinterpretation of results, normalization using a reference gene as an internal control is frequently performed. A reference gene will ideally have been validated to be expressed at a constant level over a range of experimental conditions (Vandesompele et al., 2002). Traditionally, since so-called ‘housekeeping’ genes that play an essential role in the cell, such as in the cytoskeleton and glycolysis, were expected to be expressed stably, they have been used as the reference genes. On the other hand, contrary to the desirability of stable expression for reference genes, it has also been demonstrated that expression of the classic housekeeping genes is actually not always constant (Thellin et al., 1999). This suggests that reference genes should be validated under each experimental condition. However, in principle, evaluating the expression stability of a candidate gene is fairly difficult, unless we have a reliable way to quantify gene expression level without normalization by a reference gene; thus, a circularity problem arises (Vandesompele et al., 2002; Andersen et al., 2004). Similarly, selecting a reference gene simply on the basis that it shows the most stable expression in the samples under analysis is inadequate because the errors to be normalized are totally unknown. This is the reason why a reference gene is needed. Thus, as a practical solution, it is desirable that stable expression of a reference gene should be validated by independent evidence, such as public transcriptomic data.
Several computational tools have been published to help researchers select suitable reference genes. For instance, geNorm evaluates a given set of genes by comparing variation in the expression ratios in all combinations of two genes in a given set of genes and samples (Vandesompele et al., 2002). This method is based on the principle that “the expression ratio of two ideal internal control genes is constant in all samples”, enabling us to avoid the circularity problem (Vandesompele et al., 2002). However, depending on the preselected gene set, geNorm may overestimate the expression stability of genes that are actually not expressed stably (Andersen et al., 2004). Normfinder, another tool addressing the circularity problem, employs a mathematical model to describe expression stability by considering variation both within a sample group and among sample groups (Andersen et al., 2004). In the use of this tool, there is a requirement that the preselected genes are not expressed differently (Andersen et al., 2004). Therefore, before analysis with Normfinder, Andersen et al. (2004) prepared a gene set by genome-wide evaluation of gene expression stability with microarray transcriptomic data using the coefficient of variation (CV) value as an index.
To survey superior reference genes in Arabidopsis thaliana (Arabidopsis), Czechowski and colleagues utilized comprehensive microarray data of seven series, including 323 experimental conditions. Expression stability of all genes on the Affymetrix GeneChip Arabidopsis ATH1 Genome Array (NCBI accession number GPL198) was evaluated with these microarray data using the CV (Czechowski et al., 2005). The top-ranked reference gene candidates in this survey outperformed classic housekeeping genes in expression stability (Czechowski et al., 2005), indicating that comprehensive exploration is essential to gain the best choice for reference genes. Whereas the Arabidopsis ATH1 microarray, the predominant commercial microarray used for expression analysis in Arabidopsis, contains approximately 22,000 gene-specific probe sets (Redman et al., 2004), a total of 33,602 genes are now annotated on the Arabidopsis genome (TAIR10; Lamesch et al., 2012). Hence, it is possible that further superior reference genes exist but have not been identifiable from microarray data in Arabidopsis, which is the major plant model. Furthermore, non-model and minor model plants are less well supported by microarray data because of the poor provision of commercial microarray chips.
RNA sequencing (RNA-seq) is a method for transcriptomic analyses with several advantages over microarray analysis, such as a broader dynamic range, and no dependence on either a probe set or the availability of a commercial microarray chip. In particular, RNA-seq data should be an ideal source for reference gene exploration, since all expressed genes, including any currently unknown, can be detected in principle. Recently, the advantageous features of RNA-seq are driving the accumulation of large-scale gene expression data for various species, including crop species. Public RNA-seq data are therefore becoming useful for exploring reference genes, not only in the major model species but also in crops. Furthermore, the accumulation of comprehensive gene expression data is making it possible to compare gene expression profiles across experimental conditions and species. For instance, a web-based omics database, the Plant Omics Data Center (PODC, http://bioinf.mind.meiji.ac.jp/podc/), has been established with a search function for gene expression network constructed by correspondence analysis (Yano et al., 2006) using the global RNA-seq data of eight plant species: Arabidopsis, Glycine max (soybean), Medicago truncatula (medicago), Oryza sativa (rice), Solanum lycopersicum (tomato), S. tuberosum (potato), Sorghum bicolor (sorghum) and Vitis vinifera (grapevine) (Ohyanagi et al., 2015). When one attempts to perform RT-qPCR analysis to validate the expression pattern found in such global profiling, a reference gene(s) that is usable in broad experimental conditions will be required. The existence of such a broadly available reference gene(s) has been implied by the genome-wide exploration of Arabidopsis reference genes: several genes including AT1G13320 were selected as superior reference genes in multiple experimental series (Czechowski et al., 2005).
Another potential advantage of using RNA-seq data to select reference genes is that RNA-seq data do not require reference gene-based normalization, thus avoiding the circularity problem. An expression level measured by RNA-seq is represented as a value called ‘fragments per kilobase of exon per million mapped fragments’ (FPKM), which is a value normalized by total fragment number and length of each transcript. Therefore, RNA-seq data should be free of the technical errors to be eliminated in RT-qPCR analysis by the use of reference genes (Zhan et al., 2014).
In this study, we initially attempted to explore superior reference genes that can be used under broad experimental conditions in the eight plant species supported by PODC. After characterization of RNA-seq data collected from a public database, we focused on Arabidopsis, soybean, tomato and rice for the exploration, and propose a set of reference genes for each species. Each of the proposed reference genes covered 24–50% of the experimental conditions under which the RNA-seq data were obtained and showed similar or better stability compared with known reference genes.
All available RNA-seq short-read data of Arabidopsis thaliana (Arabidopsis), Glycine max (soybean), Medicago truncatula (medicago), Oryza sativa (rice), Solanum lycopersicum (tomato), Solanum tuberosum (potato), Sorghum bicolor (sorghum) and Vitis vinifera (grapevine) were obtained from the Sequence Read Archive database (SRA, http://www.ncbi.nlm.nih.gov/sra) (Leinonen et al., 2011). Referring to experimental descriptions available in the SRA database, RNA-seq data obtained from technically biased RNA and small RNA libraries were roughly excluded manually. Quality control and mapping of reads, and calculation of gene expression levels, were carried out as previously described (Ohyanagi et al., 2015) with some version updates of tools and genome files: cutadapt (version 1.4.1) (Martin, 2011), tophat2 (version 2.0.12) (Kim et al., 2013) applying the ‘--no-novel-juncs’ option, bowtie2 (version 2.2.3) (Langmead and Salzberg, 2012), and cufflinks software (version 2.2.1) (Trapnell et al., 2013) applying the ‘–u’ option; and potato reference genome and genome annotation (version 4.03) (The Potato Genome Sequencing Consortium, 2011; Sharma et al., 2013). The RNA-seq data were further filtered by mapping rate after the quality control. Gene expression levels were calculated as FPKM values for each gene model (i.e., unique transcript).
After the calculations of FPKM values, RNA-seq data obtained from biased RNA and small RNA libraries were again removed by the following method. First, such libraries were explored by text mining against the experimental descriptions using an in-house PERL script. The PERL script was designed to detect RNA-seq data with a description containing keywords related to small RNA and fractionated RNA: ‘miRNA’, ‘siRNA’, ‘smRNA’, ‘sRNA’, ‘ncRNA’, ‘tRNA’, ‘snRNA’, ‘exRNA’ and ‘piRNA’. The script also detects co-occurrence of ‘RNA’ and ‘short’, ‘micro’, ‘interference’, ‘tasi’, ‘non-coding’, ‘transfer’ or ‘degradome’. Experimental descriptions where the terms were detected by the PERL script were again manually checked to exclude RNA-seq data obtained from biased RNA and small RNA libraries.
The selection of reference gene candidates was performed using an in-house PERL script. To predict possible gene-specific primers, sequences similar to the transcript of selected reference genes were searched for in the transcript database of each species using the blastn function of the NCBI Basic Local Alignment Search Tool (BLAST, version 2.2.30+) (Camacho et al., 2009). If a highly similar sequence was detected by blastn, the possibility of designing gene-specific primer(s) was evaluated. We designed primers against the transcript of the reference gene and the highly similar transcript using the Primer3 web tool (version 0.4.0, http://bioinfo.ut.ee/primer3-0.4.0/) (Untergasser et al., 2012), and tested their gene specificity.
Microarray expression data obtained using the Arabidopsis ATH1 Genome Array and Soybean Genome Array (Affymetrix) were downloaded from the Gene Expression Omnibus database (GEO, http://www.ncbi.nlm.nih.gov/geo/) (Barrett et al., 2013) and the Arabidopsis Information Resource database (TAIR, https://www.arabidopsis.org/) (Lamesch et al., 2012) (see legends to Figs. 5 and 6 for accession numbers of the data sets). Normalization of the data was performed by the MAS5 method using R software (version 3.0.1, https://www.r-project.org/) with the software package ‘affy’ (Gautier et al., 2004). Matching between microarray probes and genes was performed based on blastn searches for two rice microarrays, the Affymetrix Rice Genome Array and the Agilent Rice Gene Expression 4x44K Microarray. Affymetrix probe sets were considered to be gene-specific when all 11 of the consensus probes were perfectly matched (100% identity and 100% coverage) with a single gene. Agilent probes were considered to be gene-specific when there was a single gene showing a perfect match. For Affymetrix Genome Arrays of Arabidopsis, soybean and tomato, probe-gene matches were searched for in annotation lists provided by genome databases: TAIR for Arabidopsis, SoyBase (http://www.soybase.org) for soybean (Grant et al., 2010) and Sol Genomics Network (http://solgenomics.net/) for tomato (The Tomato Genome Consortium, 2012).
Comparison of expression stability between Arabidopsis reference genes proposed in this study and other known reference genes. Microarray expression data investigating responses to cold (GEO accession no. GSE3326 and TAIR submission no. ME00325), heat (ME00339), salt (ME00328) and oxidative stress (ME00340), as well as to treatment with brassinosteroids (ME00335 and ME00352) and gibberellin (ME00343 and ME00350), were used to evaluate the reference genes proposed in this study (Lee et al., 2005; Kilian et al., 2007; Goda et al., 2008). A Coefficients of variation (CVs) calculated across the microarray data are represented in a heatmap with probe identifiers (IDs) and gene annotations. B A boxplot drawn with relative expression levels, which were scaled by division with the mean expression level of each gene (probe). The reference genes proposed in this study are shown in red.
Comparison of expression stability between a novel soybean reference gene and known reference genes. Microarray expression data obtained from different tissues, including leaves, roots and nodules (GSE18822), seeds during development (GSE21598 and GSE26443) and seedlings treated with salt stress (GSE41125), were used to evaluate a novel reference gene of soybean, Glyma.13G041500 (Yang et al., 2010; Asakura et al., 2012). A CVs calculated across the microarray data are represented in a heatmap with probe IDs and gene annotations. B A boxplot drawn with relative expression levels, which were scaled by division with mean expression level of each gene (probe). The reference gene proposed in this study is shown in red.
To explore candidates for reference genes using as many and as varied gene expression data as possible, all RNA-seq data currently available were downloaded from the SRA database. In the repository database, RNA-seq data were arranged in a hierarchical structure with ‘Study’ as the largest structure corresponding to a project including ‘Run(s)’, which is the smallest structure typically corresponding to the individual library prepared from a single biological replicate. The Runs were filtered by RNA type sequenced to select data from an ordinary messenger RNA library, but not technically biased or small RNA libraries, and by mapping rate (> 70%) to control reliability on expression levels; next, Studies where only a single Run remained were excluded. At the step of filtering by mapping rate, 1.8–28.5% of Runs (714 of 2,506 in Arabidopsis, 34 of 915 in soybean, 63 of 995 in tomato, 150 of 878 in rice, 47 of 199 in grapevine, 29 of 128 in medicago, 7 of 130 in potato and 2 of 114 in sorghum) were removed.
After this, while 126, 39, 34 and 33 Studies remained for Arabidopsis, soybean, tomato and rice, only seven or fewer Studies remained for grapevine, potato, sorghum and medicago, suggesting that biological variation is more biased in species with fewer numbers of Studies. Classification of Runs by anatomic origin (e.g., whole plant, root, stem, leaf, flower, aleurone layer and shoot apical meristem) indicated an over-concentration of one or two classes of origin in grapevine (fruits and leaves), potato (tuber), sorghum (roots) and medicago (roots and root nodules) (Fig. 1). Furthermore, the CV values for each gene model of these species were likely to be distributed in a narrower range compared with those of the species with larger numbers of Studies (Fig. 2). Therefore, we focused hereafter on Arabidopsis, soybean, tomato and rice.
Composition of anatomic origins in RNA-seq data used in this study. RNA-seq data were classified by anatomic origin, and the composition is represented in a 100% stacked column chart. The color legend for the classes is shown below the chart. At, Arabidopsis thaliana; Gm, Glycine max; Mt, Medicago truncatula; Os, Oryza sativa; Sb, Sorghum bicolor; Sl, Solanum lycopersicum; St, Solanum tuberosum; Vv, Vitis vinifera.
Distribution of coefficients of variation. The distribution of the CVs of all gene models is represented for each species in this boxplot. The CV for each gene model was calculated from FPKM values of all RNA-seq data used in this study. The y-axis is shown in logarithmic scale. Numbers above the boxes indicate the number of Studies (upper) and Runs (lower, in parentheses). Species abbreviations are as in Fig. 1.
Using the gene expression data selected as described above and CV as an index to evaluate variation in gene expression, reference gene candidates were explored through four steps: (1) if a minimum value of FPKM in a Study was less than 10, the Study was excluded from further analysis for the gene; (2) if CV among all Runs in each Study (CVe) was larger than 0.1, the Study was excluded from further analysis for the gene; (3) if CV among all Runs in all Studies that remained after the second step (CVa) was larger than 0.15, the Study showing the maximum deviation (absolute value calculated by subtraction between the mean of FPKMs among all Runs in the Study and the median of FPKMs among all Runs in all Studies that remained) was excluded and this step was repeated until the CVa became 0.15 or less; and (4) genes were ranked in order of number of Studies retained (Fig. 3).
Flowchart representing the procedure used to select top reference gene candidates. RNA-seq data for gene expression profiles and sample information, such as inclusive relationships between Studies and Runs, were obtained from the Sequencing Read Archive database (http://www.ncbi.nlm.nih.gov/sra). Rounded-corner rectangles, rectangles, and rhombi indicate input and output, processes, and conditional expression.
The first step was performed to control the lower limit of a gene expression level. To determine the lower limit, two matters were taken into account. First, reference genes must show an expression level which is high enough to be quantified easily, or at least without difficulty, in applications, particularly RT-qPCR. To determine a practical threshold to evaluate the ease of quantification, distributions of FPKM values were investigated in 16 genes that have been practically used in RT-qPCR as reference genes by researchers (Abiko et al., 2005; Benschop et al., 2007; Papdi et al., 2008; Streitner et al., 2008; Li et al., 2009; Li et al., 2010; Fontaine et al., 2012; Kudo et al., 2012; Zhang et al., 2013; Wang et al., 2014; Yin et al., 2014; Gonzalez-Cabanelas et al., 2015; Kamada-Nobusada et al., 2015; Patil et al., 2015; Zhai et al., 2015). In 15 of the 16 genes, the 25th percentile of the FPKM values was above 10; the exception was the Arabidopsis PPR gene, all of whose FPKM values were below 10 (Supplementary Fig. S1). This result suggests that around 10 is appropriate as the threshold FPKM value. Second, an FPKM value representing the gene expression level must be reliable. Mortazavi et al. (2008) reported that 1.0 RPKM (equivalent to 1.0 FPKM in this study) or above of a typical transcript is stochastically robust when calculated with 40 million mapped reads. In most of the RNA-seq data used in this study, mapped read number was less than 40 million but more than 4 million. Therefore, instead of the threshold 1.0 given in the previous report (Mortazavi et al., 2008), 10 FPKM is again considered to be an appropriate threshold value. Taken together with the range of FPKM values of practical reference genes, we set the FPKM threshold to 10. After selection with this threshold in the first step, the distributions of CVe and CVa values were narrowed down, although it was not the purpose of this step (Fig. 4, Supplementary Fig. S2), indicating that the low expression levels made a considerable contribution to the large variation in gene expression.
Effectiveness of each selection step for Arabidopsis reference genes. Distributions of coefficient of variation (CV) values of expression levels of a gene model among all Runs in each Study (CVe, left panel) and among all Runs in all Studies (CVa, right panel) remaining after each selection step in Arabidopsis. Enlarged plots for step 2 and step 3 are also shown in the plot area of each panel.
The second and third steps were both implemented to control variation in gene expression levels: the second step judged if a gene is eligible to be used as a reference gene by evaluating its expression stability in the experimental conditions of each Study; and the third step detected Studies where a gene showed nearly the same expression level. Thus, the gene would be usable as a reference gene across the experimental conditions of all Studies remaining after these steps. In a previous report exploring new reference genes using CV as an index, CV values of proposed reference genes ranged from 0.031 to 0.194, with a median of 0.078, in each experimental series (Czechowski et al., 2005). Given these empirical CV values, 0.1 was employed as a threshold CVe for evaluating the expression stability of a gene in a single Study. On the other hand, when evaluating the expression stability of the gene across multiple Studies, a slightly relaxed value of 0.15 was used as the threshold CVa for all FPKM values in the Studies. After the second step, there still remained high CVa values, and the distribution of CVe was not affected by the third step (Fig. 4, Supplementary Fig. S2). These results indicate the importance of the third step to control variation in gene expression across multiple experimental conditions.
All genes remaining after the third step should be usable under the experimental conditions of the remaining Studies. However, to provide information about versatile reference genes, genes were ranked by number of remaining Studies at the fourth step. The top 10 ranked genes, shown in Table 1, are particularly strong candidates for a reference gene that can be used under broader experimental conditions. These 13, 10, 12 and 21 genes of Arabidopsis, soybean, tomato and rice, respectively, include equally ranked genes (e.g., four genes were ranked equal 10th in Arabidopsis). Each of these candidates covered 24%–50% of Studies (Table 1). Collectively, 95 of 126 (75.4%), 23 of 39 (59.0%), 25 of 34 (73.5%) and 16 of 33 (48.5%) Studies of Arabidopsis, soybean, tomato and rice, respectively, were supported by the reference gene candidates listed in Table 1 (Supplementary Tables S1–S4). Except for soybean Glyma.04G076700 and rice Os03g0833300, it would be possible to design a gene-specific primer for the proposed reference genes (Table 1).
To validate our method, we compared the variation in expression levels between our reference gene candidates and existing reference genes that have been previously used or proposed. Public microarray data, which are independent from the RNA-seq data used to select the candidate genes, were used as gene expression data for the validation.
Two of our candidate genes, AT4G26410 and AT1G13320 (Table 1), were examined for the validation in Arabidopsis. Interestingly, these genes had been proposed as reference genes in a previous study: AT4G26410 was recommended for biotic stress conditions and AT1G13320 for developmental, abiotic stress and light conditions, but neither were recommended for hormone treatments (Czechowski et al., 2005). In our analysis, AT4G26410 and AT1G13320 were stably expressed in 50 and 52 of 126 Studies, respectively. Contrary to the previous report, both genes showed stable expression under abiotic stress conditions, namely cold (SRA accession no. RP029896), heat (SRP032366), salt (SRP029598) and oxidative stresses (SRP047297), and also under hormone treatments with brassinosteroids (SRP012153, SRP031626, SRP032274, SRP010642) and gibberellin (SRP010642) (Supplementary Table S1). Public microarray data obtained from similar experiments were collected from the GEO database, and subjected to calculation of relative expression levels and CV values to evaluate the variation among the abiotic stress and hormone treatment conditions.
As Arabidopsis reference genes, GPC2 (AT1G13440), ACT2 (AT3G18780), UBQ10 (AT4G05320), TUB6 (AT5G12250) and EF-1a (AT5G60390) have been traditionally used; and 18 genes (AT1G47770, AT1G58050, AT1G62930, AT2G07190, AT2G28390, AT2G32170, AT3G01150, AT3G32260, AT3G53090, AT4G27960, AT4G33380, AT4G34270, AT4G38070, AT5G08290, AT5G12240, AT5G15710, AT5G46630, AT5G55840) were previously proposed in addition to AT1G13320 and AT4G26410 (Czechowski et al., 2005). Twenty of these 23 existing reference genes were employed in the comparison; EF-1a, AT1G58050 and AT4G27960 were omitted because gene-specific probes for these three are available on the Affymetrix Arabidopsis ATH1 Genome Array (Fig. 5).
The CV values of AT4G26410 (0.149) and AT1G13320 (0.155) were lower than those of the other existing reference genes (0.165–0.514) (Fig. 5A). Moreover, a boxplot visualizing the distribution of expression levels showed narrower distribution for these two genes than for the others (Fig. 5B, red). Thus, the stable expression of AT4G26410 and AT1G13320 under abiotic stress and hormone treatment conditions was confirmed with data sets derived from different platforms, supporting the adequacy of our method. In addition to AT4G26410 and AT1G13320, our results also proposed AT3G42050 as a reference gene under the same experimental conditions (Supplementary Table S1). However, this gene could not be tested since there is no gene-specific probe for AT3G42050 on the Affymetrix Arabidopsis ATH1 Genome Array, as mentioned above (Table 1).
To further validate our method, the highest-ranked soybean gene model Glyma.13G041500.1 was compared with known reference genes as described above for the Arabidopsis reference genes.
As reference genes in soybean, EF1b (Glyma.02g276600), TUB4 (Glyma.03g124400), ACT2 (Glyma.04g215900), TUA5 (Glyma.05g157300), UBQ10 (Glyma.07g199900), CYP2 (Glyma.12g024700) and ACT11 (Glyma.18g290800) have commonly been used (Hu et al., 2009); and UKN2 (Glyma.06g038500), HDC (Glyma.08g050200), UKN1 (Glyma.12g020500), SKIP16 (Glyma.12g051100), TIP41 (Glyma.20g130700), ACTB (Glyma.15g050200) and TUA2 (Glyma.20g136000) were previously proposed (Hu et al., 2009; Nakayama et al., 2014). Among these existing reference genes, TUB4, TUA5, UKN2, SKIP16 and TUA2 were employed in the comparison, because gene-specific probes are available for them on the Affymetrix Soybean Genome Array.
The transcript level of Glyma.13G041500.1 satisfied the criteria on the CV values in 15 Studies, including research on the response to salt stress (SRP042248) and profiling among different tissues, including developmental series of seeds (SRP006767) (Supplementary Table S2). Thus, microarray expression data obtained from samples under a salt stress condition and a control condition (GSE41125), different organs (GSE18822), and developing seeds (GSE21598 and GSE26443) were collected for the comparison. Using these data, variation in expression levels was compared for Glyma.13G041500 and the existing reference genes by CV and boxplotting. Glyma.13G041500 showed a CV value of 0.123, which was the lowest among the genes tested (Fig. 6A), and the distribution of expression levels of Glyma.13G041500 appeared to be narrower than those of other genes (Fig. 6B, red). This observation was confirmed by plotting the expression levels of Glyma.13G041500, TUA5 and SKIP16 in each biological sample (Supplementary Fig. S3). These results showed that expression of Glyma.13G041500 is more stable than that of other genes under the tested experimental conditions. In conjunction with the result of validation in Arabidopsis, we concluded that our method successfully selected reference gene candidates that are stably expressed under broad experimental conditions.
To evaluate the impact of using RNA-seq data to explore reference genes, the availability of gene-specific probes was searched on the major commercial microarray chips: Affymetrix Arabidopsis ATH1 Genome Array for Arabidopsis, Affymetrix Soybean Genome Array for soybean (NCBI accession number GPL4592), Affymetrix Tomato Genome Array for tomato (GPL4741), and Affymetrix Rice Genome Array and Agilent Rice Gene Expression 4x44K Microarray for rice (GPL2025 and GPL6864, respectively). Probes matching the reference gene candidates were first searched, and then cross-matching with other genes was checked to see if the probes worked in a gene-specific manner, as described in MATERIALS AND METHODS. This search revealed that among the 53 reference gene candidates, nine (two from Arabidopsis, six from soybean and one from tomato) were found by using RNA-seq, but not by the major microarrays because of the lack of a gene-specific probe (Table 1). Notably, tomato Solyc03g044140.2 was the highest-ranked reference gene candidate, showing stable expression in 50% of all Studies (Table 1), emphasizing the importance of using RNA-seq data in a comprehensive survey for reference genes.
In this study, using RNA-seq data, we propose a total of 53 genes as candidates for reference genes that show stable expression in a broad spectrum of experimental conditions in Arabidopsis, soybean, tomato and rice (Table 1). Among them, nine genes were found never to be selected by using microarray data obtained with the major commercial chips due to the lack of corresponding gene-specific probes (Table 1). Notably, a tomato gene, Solyc03g044140.2, showed stable expression in half of the Studies examined, and ranked highest in tomato (Table 1). These results demonstrate the benefit of using RNA-seq data for an exhaustive survey of reference genes.
Because of the method employed in this study, our reference gene candidates are varied in their applicable experimental conditions (Supplementary Tables S1–S4). Therefore, it is hard to evaluate the stability of all candidate genes under the various conditions. Nevertheless, all three genes employed for the validation tests are more stably expressed in multiple conditions compared with other existing reference genes even in microarray data sets (Figs. 5 and 6). Since RNA-seq is generally accurate and less affected by technical variation in transcript quantification (Marioni et al., 2008; Fu et al., 2009), we expect that not only the three evaluated genes but also the other highly ranked genes are strong candidates for a utility reference gene that is usable under broad experimental conditions.
In plant species including Arabidopsis, soybean, tomato and rice, efforts have been devoted to exploring and validating reference gene candidates (Czechowski et al., 2005; Jain et al., 2006; Gutierrez et al., 2008; Hu et al., 2009; Hong et al., 2010; Narsai et al., 2010; Dekkers et al., 2012; Ji et al., 2014; Nakayama et al., 2014). Among approximately 90 reference gene candidates mentioned in these previous studies, only two Arabidopsis genes were selected as our reference gene candidates (Table 1, Fig. 5) and the Arabidopsis YLS8 gene (AT5G08290; Czechowski et al., 2005), the tomato TIP41-like protein coding gene (Solyc10g049850.1; Dekkers et al., 2012) and the rice PPP6 gene (Os01g0691700; Ji et al., 2014) were found in the 100 top-ranked genes in this study (data not shown). These results strongly suggest that the reference genes proposed through conventional methods and classically used are not always the best choices, especially in an experiment with broad conditions. Our method is useful to explore reference genes that are suitable for such experiments but rarely selected by conventional methods.
We employed four filtering steps in the selection of reference genes: the first step was to control basal expression level; the second step was to control expression stability in each experiment; the third step was to control the stability over the experiments; and the last step was to rank genes by the number of experiments in which the gene was expressed stably. Among these, the second step removed most of the Studies in Arabidopsis and tomato reference genes, but the first step typically showed comparable contributions to the second step in soybean, as well as a predominant contribution in rice (data not shown). This difference among the species may be caused by biological characteristics of the species in gene expression profile, redundancy of gene models in reference genomes used for short-read mapping, and a variety of experimental conditions in the RNA-seq data.
Among the RNA-seq Studies used in this study, 31, 16, nine and 17 Studies of Arabidopsis, soybean, tomato and rice, respectively, were assigned no reference gene candidate (Supplementary Tables S1–S4). This result shows that there exist cases of experimental conditions for which a suitable reference gene should be explored by focusing on a particular condition. When lower-ranked genes were overlooked, 23, eight, six and four Studies, respectively, were assigned at least one gene satisfying the criteria of CV values used in this study. Furthermore, if 0.2 was applied as the threshold for CVe, which is further relaxed but still within an empirically acceptable range, most of the remaining Studies were assigned a potential reference gene, except for a single Study of Arabidopsis (ERP002233) and soybean (SRP021098), and four Studies of rice (SRP002106, SRP007395, SRP010960 and SRP015433). It seems that these six Studies can be categorized into three classes by features of their biological samples: tissues spatially finely dissected, tissues inoculated with pathogens, and varied organs. For instance, three Studies of Arabidopsis (ERP002233), soybean (SRP021098) and rice (SRP007395) include Runs for libraries prepared from root or seed tissues dissected by laser capture microdissection, manual dissection or fluorescence-activated cell sorting. These observations indicate that a suitable reference gene can be found in most cases, if exploration is conducted in an experimental condition-specific manner, but also that there are cases where no stably expressed gene is found in the experimental condition.
Interestingly, we found that a soybean gene (Glyma. 17G163700.1) and two tomato genes (Solyc03g044140.2 and Solyc02g064700.2) (see Table 1) encode the catalytic α subunit of casein kinase II (CK2A), which is an essential housekeeping protein kinase (Mulekar and Huq, 2015). Also, in Arabidopsis and rice, one of the CK2A genes was moderately highly ranked in our analysis: AT2G23070 ranked 70th, supporting 41 of 126 Studies, and Os07g0114400 ranked 206th, supporting five of 33 Studies. Among these CK2A genes, AT2G23070 encodes a chloroplast-localized CK2A (αcp) (Salinas et al., 2006), and deduced polypeptides encoded by Glyma.17G163700.1 and Solyc02g064700.2 were predicted to be plastid-localized proteins by the WoLF PSORT program (Horton et al., 2007). These results indicate that CK2A genes, including αcp genes, are potentially useful as a reference gene across multiple plant species. Here, it should again be noted that the CK2A-encoding Solyc03g044140.2 is not supported by a gene-specific probe on the Affymetrix Tomato Genome Array.
We excluded grapevine, potato, sorghum and medicago from the comprehensive exploration of reference genes because of their lower sample variation in RNA-seq data (Figs. 1 and 2). An idea for exploring a reference gene in such species is to test genes orthologous to a reference gene used in other species, as previously conducted in several plant species including grapevine, tomato and potato (Reid et al., 2006; Dekkers et al., 2012; Mariot et al., 2015). To briefly check if this might work with our candidates, αcp and other CK2A genes were explored and evaluated as reference genes in the four excluded species. Putative αcp genes of grapevine (VIT_207s0129g00410), sorghum (Sobic.001G080700) and medicago (Medtr4g095400) ranked 13th, 3rd and 34th with three of seven, one of five and two of five Studies remaining, respectively. In potato, whereas no αcp gene was found, a putative cytosolic CK2A gene (PGSC0003DMG400024939) ranked 2nd, with four of six Studies remaining. Thus, one of the CK2A genes may also be suitable as a reference gene in these plants, depending on experimental conditions, suggesting the potency of ortholog-based selection. While a number of plant databases, such as PODC (Ohyanagi et al., 2015), ATTED-II (Obayashi et al., 2014), OryzaExpress (Hamada et al., 2011), eFP Browsers (Winter et al., 2007), UniVIO (Kudo et al., 2013) and qTeller (http://www.qteller.com/), are dealing with plant transcriptomic data, direct information on reference gene candidates has not been provided in the databases. A database implemented with a search function for reference genes, combined with information on orthologs and experimental conditions, would be helpful to explore reference gene candidates in species with insufficient accumulation of transcriptomic data. Our reference genes should also be helpful for administrators of the databases to evaluate expression profiles used or to be used in the databases.
Recently, large-scale association/modeling studies using transcriptomic data have emerged (Nagano et al., 2012; Zou et al., 2012). It will not be surprising if such omics-based association studies reveal a key transcript whose accumulation level is correlated with an important agricultural trait of crop plants in the near future. Once this has been achieved, RT-qPCR may be an essential technique for rapid line selection in crop breeding, and utility reference genes should be useful for that. To improve the accuracy of selection of proper reference genes, and to provide useful information on reference genes for plant science and crop breeding, further accumulation of public RNA-seq data of model and non-model plants, including crops, is desired.
This study was supported in part by MEXT Grants-in-Aid for Scientific Research on Innovative Areas (No. 26113716 to K.Y., No. 23113002 to S.T., No. 23113006 to G.S., K.S. and M.W., No. 23113005 to M.M., No. 23113001 to S.T., G.S., M.W. and M.M.); Scientific Research (A) (No. 25252001 to M.W.) from the Japan Society for the Promotion of Science; MEXT-Supported Program for the Strategic Research Foundation at Private Universities (2014–2018); and Research Funding for Computational Software Supporting Program from Meiji University to K.Y.. Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics.