2016 Volume 66 Issue 1 Pages 100-115
Recent advancements in genomic analysis technologies have opened up new avenues to promote the efficiency of plant breeding. Novel genomics-based approaches for plant breeding and genetics research, such as genome-wide association studies (GWAS) and genomic selection (GS), are useful, especially in fruit tree breeding. The breeding of fruit trees is hindered by their long generation time, large plant size, long juvenile phase, and the necessity to wait for the physiological maturity of the plant to assess the marketable product (fruit). In this article, we describe the potential of genomics-assisted breeding, which uses these novel genomics-based approaches, to break through these barriers in conventional fruit tree breeding. We first introduce the molecular marker systems and whole-genome sequence data that are available for fruit tree breeding. Next we introduce the statistical methods for biparental linkage and quantitative trait locus (QTL) mapping as well as GWAS and GS. We then review QTL mapping, GWAS, and GS studies conducted on fruit trees. We also review novel technologies for rapid generation advancement. Finally, we note the future prospects of genomics-assisted fruit tree breeding and problems that need to be overcome in the breeding.
Recent advancements in genomic analysis technologies allow cost-effective, high-throughput, and high-density genotyping of genome-wide DNA polymorphisms (Davey et al. 2011). These technological advancements have opened up new avenues to promote the efficiency of plant breeding (Chia and Ware 2011, He et al. 2014, Varshney et al. 2014). Genome-wide association studies (GWAS) enable the detection and identification of quantitative trait loci (QTLs) and genes controlling phenotypic variations in a collection of cultivars and germplasm accessions (Brachi et al. 2011, Hamblin et al. 2011). Genomic selection (GS; Meuwissen et al. 2001) enables the selection of superior genotypes based on genomic estimated breeding values (GEBV) derived from the information of genome-wide DNA polymorphisms (Heffner et al. 2009, Jannink et al. 2010, Lorenz et al. 2011). These novel genomics-based approaches evolved from traditional biparental QTL mapping and marker-assisted selection (MAS), but they have a much broader range of application and greater potential than the traditional methods. The advancements in genomic analysis technologies are leading to the active use of genomics-based approaches for plant breeding and genetics research.
Genomics-based approaches can be especially useful in fruit tree breeding (Chia and Ware 2011, Meneses and Orellana 2013, Myles 2013, van Nocker and Gardiner 2014). Conventional breeding of fruit trees has been hampered by their long generation time, large size, extended juvenile phase for seedlings, and a marketable product that cannot be assessed until a seedling is physiologically mature (Luby and Shaw 2001, Rikkerink et al. 2007). Numerous biotic and abiotic factors that affect both the quality and quantity of fruits during the pre- and post-harvest periods also complicate genetic improvement (Rikkerink et al. 2007). In fruit tree species, genomics-based approaches have great potential to break through these barriers. For example, GWAS enables researchers to estimate the positions and effects of QTLs/genes using existing cultivars/lines without preparing a segregating population (Khan and Korban 2012). When phenotypic data are already available for the existing cultivars/lines, the positions and effects of QTLs/genes can be estimated without performing field experiments. Markers that have significant association with target traits can then be used in MAS programs. GS increases the selection accuracy and/or genetic gain per unit time spent in the genetic improvement of fruit quality and yield traits (Kumar et al. 2012b). Selection during the juvenile phase speeds the selection process and filters out a substantial proportion of progeny that will proceed to evaluation in field trials, which accelerates fruit tree breeding via shortened breeding cycles and increased selection intensity (Luby and Shaw 2001, Ru et al. 2015). Thus, GWAS and GS are promising methods for promoting the efficiency of fruit tree breeding.
In this article, we describe the potential of genomics-assisted breeding using novel genomic technologies in fruit tree species. Biparental QTL mapping and MAS of major QTLs will not completely give way to GWAS and GS, and it is important to use the right methods in the right places. We first introduce molecular marker systems and whole-genome sequence data that are available for fruit tree breeding. Next we introduce the statistical methods for biparental linkage and QTL mapping, GWAS, and GS that underpin genomics-assisted breeding, as well as computer programs available to implement the statistical methods. We then review QTL mapping, GWAS research, and GS studies conducted on fruit trees and some other species. Because trees have a long generation time, MAS and GS alone cannot accelerate the breeding of fruit trees. Therefore, we review novel technologies for rapid generation advancement that can be used to accelerate fruit tree breeding. Finally, we note the future prospects of genomics-assisted fruit tree breeding and problems that need to be overcome.
Molecular markers are indispensable for genomics-assisted breeding, and various marker systems have been developed. The availability of fruit tree genome sequences enables researchers to develop genome-wide markers for high-throughput genotyping and to construct high-density genetic maps (Myles 2013). It also improves our understanding of the molecular mechanisms underlying fruit traits and provides important clues concerning the evolution of their complex genomes.
Simple sequence repeat (SSR) markers are designed through both transcript sequences and genomic sequences by using bioinformatics programs; SSR markers are useful tools for gene/QTL mapping, MAS, and diversity analysis (Segura et al. 2008). SSR markers have several advantages as genetic markers: they are codominant, multi-allelic, reproducible, and often amplifiable across species (Miah et al. 2013). Because of this cross-species transferability, SSR markers have been used for comparative mapping and genome synteny analysis in Rosaceae species (e.g., Dirlewanger et al. 2004, Fan et al. 2013, Gasic et al. 2009).
Single nucleotide polymorphism (SNP) markers are also being developed using expressed sequence tags and the genome sequence. These markers are cost-effective in terms of cost per marker and allow for higher throughput screening and higher density mapping compared to SSR markers. SNP arrays such as the Illumina Infinium II system have been developed for fruit tree species (Chagné et al. 2012, Myles et al. 2010). To date, the apple SNP array has been used to create linkage maps (Antanaviciute et al. 2012) and to assist GS (Kumar et al. 2012b) and GWAS (Kumar et al. 2013).
Next-generation sequencing (NGS) technologies have been used for genotyping SNPs via genotyping by sequencing (GBS) methods (Elshire et al. 2011) and restriction site–associated DNA sequencing (Baird et al. 2008). The NGS-based genotyping methods enable the simultaneous detection of thousands of SNPs throughout the genome in mapping populations. An advantage of NGS-based genotyping over SNP arrays is that there is no need for SNP discovery and array design. For newly targeted species or populations, SNP array development is very time-consuming and costly. Because NGS-based genotyping methods can be used without a reference genome (Catchen et al. 2011), they are useful for a range of organisms, including minor fruit trees. Another advantage of the NGS-based genotyping methods is that they are less affected by the ascertainment bias that may influence GWAS and GS (Albrechtsen et al. 2010, Heslot et al. 2013). Although NGS-based genotyping generally produces a relatively large number of SNPs, most SNPs have numerous missing data across samples (Gardner et al. 2014). One solution for this problem is to use an imputation method to fill in the missing genotypes, such as the TASSEL-GBS pipeline (Glaubitz et al. 2014) and Beagle (Browning and Browning 2007).
Fruit tree genome databases that house genomic, genetic, and breeding resources provide an effective platform for breeding programs. It is important for fruit researchers to understand how these data can be used to solve problems in fruit production. Genetic map, marker, and QTL data facilitate the development of markers for genomics-assisted breeding and the discovery of genes underlying important agricultural traits.
Sequencing projects for fruit genomes have been reported at Phytozome (www.phytozome.net), which provides well-controlled microsynteny and gene family evolution data. Tree fruit Genome Database Resources (www.tfgdr.org) is a collection of Rosaceae, Citrus, and Vaccinium bioinformatics resources and software tools (Wegrzyn et al. 2012). The Genome Database for Rosaceae (www.rosaceae.org) is a central repository of genetics and breeding data and analysis tools for Rosaceae. It has a publicly available Breeders’ Toolbox that is integrated with existing genomic and genetic data (Jung et al. 2014). The Citrus Genome Database (www.citrusgenomedb.org) is a community database providing access to citrus genomics, genetics, and breeding research (Jung et al. 2008). A genome database for Vaccinium (www.vaccinium.org) is being developed to house and integrate genomic, genetic, and breeding data for blueberry, cranberry, and other Vaccinium species. The Plant Genome Duplication Database (chibba.agtec.uga.edu/duplication/) is a Web service providing intra-genome or cross-genome syntenic relationships (Lee et al. 2013), which helps to identify conserved genes between species. The development of genetic and genomic resources will facilitate genomics research and genomics-assisted breeding applications.
The objective of linkage analysis is to estimate the recombination frequency between markers or between markers and loci affecting a trait; the analysis includes construction of a linkage map and mapping of genes that determine simply inherited traits, using multi-generation families. In fruit trees, a full-sib family consisting of two parental cultivars (hereafter T1 and T2) and their F1 progeny is commonly used for linkage analysis (e.g., Cristofani et al. 1999, Doucleff et al. 2004, Fernández-Fernández et al. 2012, Hemmat et al. 1994, Yamamoto et al. 2002) because the inclusion of more than two generations is hindered by the long generation time.
In a two-generation family, recombination events occur during gametogenesis in each of the parents and the recombination frequency is estimated based on gametes transmitted from the parents to F1 individuals. Consequently, linkage analysis is conducted separately for T1 and T2, regarding the F1 progeny as a hypothetical half-sib progeny produced from each T1 and T2 and another parent as an anonymous mating partner. Linkage analysis in a half-sib design is carried out in a similar way as for a backcross population derived from crossing two inbred lines, referred to as a BC population. For a half-sib family consisting of T1 and the F1 progeny, regarding T2 as an anonymous mating partner of T1, heterozygous markers or loci in T1 can be used for estimating recombination frequencies. In applying the back-cross analysis to this half-sib family, T1 and T2 are treated as the F1 individual and recurrent parent in the backcross design, respectively.
It should be noted, however, that in a half-sib family it is necessary to distinguish the recombinant and non-recombinant types of gametes transmitted from T1 to F1 progeny with respect to two linked markers because the linkage phase in T1 is usually unknown for the two markers. Consider two linked markers, A and B, and assume that the genotype of T1 is A1A2B1B2, where A1 (B1) and A2 (B2) are two different alleles of A (B). Taking the linkage phase into consideration, there are two possible diplotypes for A1A2B1B2, A1B1/A2B2 and A1B2/A2B1, where a diplotype is a representation of a genotype including the information of linkage phase. If an F1 individual receives a gamete with haplotype A1B1 from T1, this gamete is classified as non-recombinant or recombinant according to the two possible diplotypes of T1, A1B1/A2B2 and A1B2/A2B1, respectively. The diplotype of T1 is inferred based on the number of gametes with haplotypes A1B1 or A2B2, n1, and that of gametes with A1B2 or A2B1, n2, transmitted from T1 to F1 individuals. For example, the diplotype of T1 is inferred as A1B1/A2B2 when n1 > n2 and as A1B2/A2B1 otherwise. Given the diplotype of T1 for linked markers, linkage analysis is performed based on the genotypes of the F1 individuals in the same way as in a backcross design.
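The phase-inference rule described above can be illustrated with a minimal sketch. It assumes that the gametes transmitted from T1 have already been scored as pairs of allele indices (1 or 2) at markers A and B; the function name and the data layout are hypothetical and are not taken from any linkage-analysis package.

```python
from collections import Counter

def infer_phase_and_recombination(gametes):
    """Infer the diplotype of T1 and the recombination frequency.

    gametes: list of (a, b) tuples, where a and b are 1 or 2 and give the
    allele of marker A and marker B transmitted from T1 (hypothetical coding).
    """
    counts = Counter(gametes)
    n1 = counts[(1, 1)] + counts[(2, 2)]  # gametes with haplotype A1B1 or A2B2
    n2 = counts[(1, 2)] + counts[(2, 1)]  # gametes with haplotype A1B2 or A2B1
    total = n1 + n2
    if total == 0:
        raise ValueError("no informative gametes")
    if n1 > n2:
        # The majority class is taken as non-recombinant
        diplotype, recombinants = "A1B1/A2B2", n2
    else:
        diplotype, recombinants = "A1B2/A2B1", n1
    return diplotype, recombinants / total

# Illustrative counts only, not real data
example = [(1, 1)] * 40 + [(2, 2)] * 45 + [(1, 2)] * 8 + [(2, 1)] * 7
print(infer_phase_and_recombination(example))
```

When n1 and n2 are close, the two markers are nearly unlinked and the inferred phase is unreliable, so a rule of this kind is only meaningful for marker pairs showing clear linkage.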
This strategy of linkage analysis for a half-sib family of one parent and the F1 progeny included in a full-sib family is called the pseudo-testcross strategy (Grattapaglia and Sederoff 1994). For linkage analysis of the gametes from T2, the same process is applied. A two-way pseudo-testcross entails conducting pseudo-testcross linkage analyses for each of two parents in turn (Grattapaglia and Sederoff 1994).
Backcross analysis is implemented in most software for linkage analysis, such as MAPMAKER (Lander et al. 1987), AntMap (Iwata and Ninomiya 2006), CARTHAGENE (De Givry et al. 2005), and JoinMap (van Ooijen 2006). Therefore, a linkage map based on the pseudo-testcross design can be constructed for a full-sib family with these software packages. In linkage analysis with a backcross design, the genotype of a marker for an individual in the BC population is recorded as “A” or “H” depending on which of the two parental inbred lines is the origin of the allele transmitted by the F1 to the BC individual. In a pseudo-testcross design, T1 is treated as the F1 of the backcross design, but the parental origins of the two alleles possessed by T1 are usually unknown. Therefore, the following procedure is used for pseudo-testcross analysis with the backcross option in these software packages.
For each marker, the genotypic data of F1 progeny are recorded in two different ways. Assuming that the genotype of T1 at a marker is A1A2, for example, the genotype of an F1 individual is recorded for the marker as “A” if it receives the A1 allele from T1 and as “H” otherwise, considering the genotype A1A2 as A1/A2, where the left side of the slash is the maternal allele and the right side is the paternal allele. Subsequently, for the same marker, the genotype of an F1 individual is once again recorded, but this time exchanging “A” and “H” in the original record of the genotype by regarding A1A2 as A2/A1. Two genotype datasets for a marker thus obtained are treated as those of two different markers, such that the number of markers treated is duplicated. When identifying linkage groups for these duplicated markers, a set of linkage groups consisting of pairs of equivalent linkage groups is obtained. Within a linkage group, the records of genotypic data of markers suitably reflect the diplotype of T1. One linkage group is selected from the pair of equivalent linkage groups to construct the map of markers for the linkage group. By replicating this process for each pair of linkage groups in turn, the linkage maps for all linkage groups are produced.
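The double-coding step can be sketched as follows. This is only an illustration under the assumption that genotypes are stored as lists of "A"/"H" calls per marker; the `_r` suffix for the swapped copy and both function names are hypothetical conventions, not those of MAPMAKER, AntMap, CARTHAGENE, or JoinMap.

```python
def duplicate_marker_codings(genotypes):
    """Return the original coding of each marker plus a copy with A and H swapped.

    genotypes: dict mapping marker name -> list of calls ("A", "H", or any
    other symbol for missing data, which is left unchanged).
    """
    swap = {"A": "H", "H": "A"}
    doubled = {}
    for name, calls in genotypes.items():
        doubled[name] = list(calls)                           # A1A2 read as A1/A2
        doubled[name + "_r"] = [swap.get(c, c) for c in calls]  # read as A2/A1
    return doubled

def mismatch_fraction(x, y):
    """Fraction of individuals whose calls differ at two markers (missing skipped)."""
    valid = {"A", "H"}
    pairs = [(a, b) for a, b in zip(x, y) if a in valid and b in valid]
    return sum(a != b for a, b in pairs) / len(pairs)

# Two markers scored in opposite phase: they look unlinked-or-worse as coded,
# but the swapped copy of m2 matches m1 perfectly.
data = {"m1": list("AAHHAH"), "m2": list("HHAAHA")}
coded = duplicate_marker_codings(data)
print(mismatch_fraction(coded["m1"], coded["m2"]))
print(mismatch_fraction(coded["m1"], coded["m2_r"]))
```

Grouping the doubled marker set then yields each linkage group twice, once in each mirror-image coding, which is why one member of each pair is kept.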
In applying the two-way pseudo-testcross strategy to a full-sib family, we obtain two separate linkage maps for female and male parents. By using multi-allelic markers such as SSR markers with more than two alleles segregating in F1 individuals, markers heterozygous in both parents can be mapped in both linkage maps, which can be used for aligning female and male linkage maps (Maliepaard et al. 1998). By using the software JoinMap (van Ooijen 2006), two parental linkage maps can be integrated.
Many fruit tree traits of economic importance, such as those related to fruit quality and productivity, are quantitative traits, and loci affecting such traits are called QTLs. A statistical method that combines linkage analysis with a statistical model of phenotypic values of a trait, referred to as QTL analysis, is applied for mapping QTLs and estimating their effects. As in linkage analysis, a full-sib family consisting of two parents, T1 and T2, and their F1 progeny, is commonly used as the analyzed population in QTL analysis of fruit trees (e.g., Ban et al. 2014, Kunihisa et al. 2014, Siviero et al. 2006, Weber et al. 2003, Yamamoto et al. 2014). Because the procedure of QTL analysis is partly based on that of linkage analysis, the pseudo-testcross strategy can also be used for the analysis of QTLs in a full-sib family (Grattapaglia et al. 1995), with QTL analysis conducted separately for T1 and T2 and with the F1 progeny regarded as a half-sib progeny obtained from each T1 and T2. When pseudo-testcross analysis is applied to both parents consecutively, this process of QTL analysis is referred to as a two-way pseudo-testcross (Grattapaglia et al. 1995), as in linkage analysis. In pseudo-testcross QTL analysis, heterozygous QTLs in the parent are targets of QTL detection and, using the linkage map of the parent, the analysis is conducted as in the backcross design of QTL analysis in inbred species, except that the haplotypes for marker genotypes of the parent must be inferred. Because interval mapping (Lander and Botstein 1989) is commonly applied to QTL analysis of a BC population in inbred species, we explain the application of this method for pseudo-testcross design.
In interval mapping, any position on a linkage map is tested for the presence of a QTL affecting the trait. A putative QTL, Q, is located at a tested position between two markers, A and B, on a linkage map and the effect of Q is evaluated. Consider a pseudo-testcross strategy of QTL analysis for T1 in a full-sib progeny derived from crossing T1 and T2. We assume that the QTL genotype of T1 at Q is Q1Q2, with Q1 and Q2 being the different alleles of Q. In F1 progeny including n individuals, the phenotypic value of a trait for the ith F1 individual, yi, is expressed as a linear model:
$$y_i = \mu + u_i a + e_i \qquad (1)$$
where μ is the intercept of the model, ui is a covariate indicating the allele of Q transmitted to the ith F1 individual from T1 (taking values 1 and 0 for alleles Q1 and Q2, respectively), a is the difference in allelic effects between Q1 and Q2, and ei is a residual. The allele the F1 individual receives from T1 (i.e., Q1 or Q2) is inferred from the alleles of the flanking markers A and B transmitted from T1 to the F1 individual, where it is assumed that the haplotype of T1 with respect to A and B is known. The model parameters, including a and μ, are estimated such that the parameter values fit well the data of phenotypic values of n F1 individuals under the model. The presence of a QTL at a position of interest is judged by comparing the goodness of model fit between two hypotheses, H1: a ≠ 0 and H0: a = 0, representing the presence and absence of a QTL, respectively (Lander and Botstein 1989). Test statistics such as an LOD score are used for testing H0 versus H1. The effects of QTLs on regions other than a tested position can be included in model (1) to enhance the power of QTL detection and the precision of estimation of QTL position and effect (Jansen 1993, Zeng 1994).
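As a rough illustration of the test of H0 versus H1, the sketch below fits the pseudo-testcross model by ordinary least squares at a single tested position and converts the drop in residual sum of squares into an LOD score. It is a regression approximation, not the maximum-likelihood procedure of Lander and Botstein (1989), and it assumes that the T1 haplotypes are A1-Q1-B1 and A2-Q2-B2 and that crossovers in the two intervals are independent. All function and argument names are hypothetical.

```python
import math

def prob_q1(a_from_hap1, b_from_hap1, r1, r2):
    """P(Q1 transmitted | flanking marker alleles).

    r1: recombination frequency between A and the tested position,
    r2: between the tested position and B (no interference assumed).
    """
    p_given_q1 = ((1 - r1) if a_from_hap1 else r1) * ((1 - r2) if b_from_hap1 else r2)
    p_given_q2 = (r1 if a_from_hap1 else (1 - r1)) * (r2 if b_from_hap1 else (1 - r2))
    return p_given_q1 / (p_given_q1 + p_given_q2)

def regression_lod(y, u):
    """Least-squares fit of y_i = mu + u_i * a + e_i and the resulting LOD score.

    u may hold 0/1 indicators or the probabilities returned by prob_q1.
    """
    n = len(y)
    mean_y = sum(y) / n
    mean_u = sum(u) / n
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    sxy = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, y))
    a = sxy / sxx
    mu = mean_y - a * mean_u
    rss0 = sum((yi - mean_y) ** 2 for yi in y)                   # model under H0: a = 0
    rss1 = sum((yi - mu - a * ui) ** 2 for ui, yi in zip(u, y))  # model under H1
    lod = (n / 2) * math.log10(rss0 / rss1)
    return mu, a, lod

# Illustrative phenotypes only; u = 1 when Q1 is inferred, 0 when Q2 is inferred
y = [10.2, 9.8, 10.1, 9.9, 8.1, 7.9, 8.0, 8.2]
u = [1, 1, 1, 1, 0, 0, 0, 0]
print(regression_lod(y, u))
```

Repeating the calculation along the map, with `prob_q1` recomputed at each tested position, gives an LOD profile whose peaks indicate candidate QTL positions.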
The interval mapping method is implemented in computer programs widely used for QTL analysis, including Mapmaker/QTL (Lincoln and Lander 1990), QTL Cartographer (Wang et al. 2010), R/qtl (Broman et al. 2003), and MapQTL (van Ooijen 2009). Pseudo-testcross analysis of QTLs is performed with the backcross option in these software packages using the data of phenotypes and marker genotypes of the F1 progenies recorded following the haplotype of one of the parents in a full-sib family.
Based on an integrated linkage map constructed from two parental maps, a full-sib analysis of QTLs can be performed to investigate the presence of QTLs heterozygous in one or both parents. Full-sib QTL analysis with interval mapping is implemented in MapQTL (van Ooijen 2009). Assuming that the genotypes of T1 and T2 at a putative QTL, Q, located at a tested position are Q1Q2 and Q3Q4, respectively, the model of interval mapping for full-sib analysis is:
$$y_i = \mu + u_i a + v_i b + e_i \qquad (2)$$
where μ, ui, a, and ei are the same terms as in the pseudo-testcross model (1), vi is a covariate indicating which of the alleles at Q (Q3 or Q4) is transmitted to the ith F1 individual from T2, and b is the difference in the allelic effects between Q3 and Q4. The full-sib analysis with model (2) requires marker haplotypes to be known for T1 and T2. The procedure of inferring marker haplotypes can be incorporated in the QTL analysis (Hayashi and Awata 2004, Knott et al. 1996).
In breeding of fruit trees such as apple, multiple full-sib families obtained from crosses among existing cultivars are established as a breeding population. The information of QTLs obtained from QTL analysis conducted individually for each of the multiple full-sib families can be integrated to increase the reliability of the estimated QTL positions and effects. Such an integrated QTL analysis was performed for fruit quality traits in apple by Costa (2015), where four full-sib families were individually analyzed with model (2), and then QTLs detected in the individual analyses were projected on a consensus map of the four families to elucidate reliable genomic regions of the QTLs. This integrated analysis using the QTL information for multiple families is called MetaQTL analysis (Costa 2015, Veyrieras et al. 2007).
The parental cultivars used for crosses to establish multiple full-sib progenies may be genetically related due to their common ancestral cultivars. By adding such ancestral cultivars, multiple full-sib families are treated as a large complex pedigree. The analysis of such a large pedigree allows the QTL positions and effects to be estimated more precisely.
Bink et al. (2014) performed QTL analysis on a pedigree including 27 full-sib families derived from crosses among 33 parental cultivars, containing a total of 1300 individuals, and all of the ancestral cultivars of the parental cultivars. The ancestral cultivars for which both parents were unknown were regarded as founders in this large pedigree. The authors assumed a biallelic QTL with alleles Q and q and applied a multiple-QTL model, which is expressed for the phenotype of the jth individual of the ith full-sib family, yij, as:
$$y_{ij} = \mu + \sum_{l=1}^{N} u_{ijl} a_l + e_{ij} \qquad (3)$$
where μ is the intercept of the model, N is the number of QTLs included in the model, uijl is the covariate indicating the genotype of the lth QTL for the individual, al is the effect of the lth QTL, and eij is a residual. Bayesian estimation was applied to estimate the model parameters, where the prior probabilities for the QTL number N, the QTL positions and effects, and the QTL genotypes of the founders were taken into consideration. Given the QTL genotypes of the founders, possible allele transmissions from the founders to the parental cultivars through all of the ancestors were simulated. The posterior distributions of the model parameters including N, QTL positions, and QTL effects as well as the QTL genotypes of the founders were obtained through Markov chain Monte Carlo iterations (Bink et al. 2014).
Association analysis is a method for finding an association between markers and loci affecting a trait by assessing the correlation between the genotypes of markers and phenotypes in an analyzed population. By using a large number of markers covering an entire genome, GWAS allows such an association to be tested for most genome segments. In GWAS, genome segments of analyzed individuals are discriminated based on the allelic states of markers located on the segments, whereas in QTL analysis they are discriminated based on their parental origins in the analyzed family. In GWAS, therefore, various populations, including collections of individuals sampled from wild populations, germplasm, and breeding cultivars, can be analyzed in addition to the multi-generation families derived from crossing parental cultivars that are used in QTL analysis.
In both GWAS and QTL analyses, the successful detection of genome regions associated with phenotypes relies on the availability of markers linked to the loci affecting a trait. The linkage between markers and the loci in the genomes of parents is weakened by recombination events occurring as generations advance, but the decay of the linkage is subtle in QTL analysis due to the limited number of recombination events in a few generations. The existence of an association between markers and the loci in an analyzed population caused by linkage or some other factors, such as selection, over the history of the population is called linkage disequilibrium (LD). The length of a genome segment that can be covered by a marker depends on the degree of LD in the genome region in the population. In populations with a higher degree of LD, such as multi-generation families from crossing parental cultivars, the number of markers required for covering the entire genome is small, whereas numerous markers are required for covering the genome in populations with a lower degree of LD, such as a collection of distantly related individuals including germplasm collection. Mapping resolution is also influenced by the degree of LD, with finer resolution as the degree of LD decreases.
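The degree of LD between two markers is often summarized by the squared correlation between their alleles, usually written r². The text above does not name a specific statistic, so the choice of r² here is an assumption, as is the input format (phased haplotypes coded 0/1 at two biallelic loci). A minimal sketch:

```python
def ld_r2(hap_a, hap_b):
    """Squared allelic correlation between two biallelic loci.

    hap_a, hap_b: equal-length lists of 0/1 alleles, one entry per sampled
    haplotype (hypothetical coding; phase is assumed to be known).
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n                                  # allele frequency at locus A
    p_b = sum(hap_b) / n                                  # allele frequency at locus B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n   # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom

print(ld_r2([1, 1, 0, 0], [1, 1, 0, 0]))  # complete LD
print(ld_r2([1, 1, 0, 0], [1, 0, 1, 0]))  # no LD
```

Plotting such values against the physical or genetic distance between marker pairs shows how quickly LD decays in a population, and hence roughly how dense the markers need to be.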
Recently, the information on several thousands to tens of thousands of SNPs distributed across a whole genome at high density has become available for fruit trees. Therefore, GWAS may be an effective method for mapping loci affecting a trait in a collection of cultivars without the need to establish experimentally crossed populations, which is hindered in most fruit trees due to the long generation time, thus replacing family-based QTL analysis (Khan and Korban 2012). Although GWAS is applied to a wide range of populations, factors such as population stratification and cryptic relatedness among individuals may cause an increased rate of false-positives, meaning spurious genotype–phenotype associations (Kumar et al. 2013). Thus, correction terms for population structure and kinship relationships must be included in statistical models of GWAS to control the false-positive rate.
A mixed linear model is a statistical model, which takes population structure and kinship relationship into consideration, and is suitable for GWAS (Yu et al. 2005). In this model, each marker is tested in turn for association with a trait, while incorporating population structure and kinship relationship as fixed and random effects, respectively. Consider a population consisting of n individuals with records of phenotypes and genotypes of p markers covering the entire genome, and denote the phenotypic value of the ith individual as yi (i = 1, 2, …, n). The following model is used to test the lth marker in a mixed linear model:
$$y_i = \mu + z_{il} a_l + \sum_{j} x_{ij} c_j + g_i + e_i \qquad (4)$$
where μ is the intercept of the model, zil is the covariate indicating the genotype of the ith individual at the lth marker (e.g., taking values 0, 1, and 2 for three possible genotypes when the marker is biallelic), al is the effect of the lth marker, xij is the covariate relating cj (i.e., the effect of the jth non-genetic factor) to yi, gi is the genotypic value of the ith individual contributed by polygenic effect not captured by a tested marker, and ei is a residual. In estimating the model parameters, al and cj are treated as fixed effects and gi is treated as a random effect. The influence of the population structure is included in the model as non-genetic effects cj for some j in the model.
The random polygenic effects of n individuals, collectively written as g = (g1, g2, …, gn)’, with the prime symbol representing transpose of a vector, are assumed to follow a multivariate normal distribution with a mean vector 0 and a variance-covariance matrix Aσg2, where A is a kinship matrix with the (j,k)th elements indicating the genetic relationship of a pair of the jth and kth individuals. Kinship relationships are considered in the construction of A using pedigree information or marker genotypes. Assuming that g is estimated for all markers, we can write g = Za, where a is a vector of the effects of all markers and Z is a design matrix relating a to g, with the (i,m)th element zim being the covariate indicating the genotype of the ith individual at the mth marker. Therefore, Z is regarded as the information of marker genotypes. Assuming that a is a random effect following a multivariate normal distribution with mean 0 and variance-covariance matrix Iσa2, with I being an identity matrix, A is expressed as A = ZZ′σa2/σg2, indicating how marker genotypes are used for constructing A (Goddard 2009). Because g is the polygenic effect not captured by the lth marker in model (4), zil and al might be excluded from Z and a. When the number of markers, p, is very large, as is the case for GWAS, the inclusion of al in g provides almost the same result as GWAS with al excluded from g.
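A marker-based kinship matrix can be computed directly from the genotype matrix Z. The sketch below uses one common variant in which each marker is centered by twice its allele frequency and the cross-product is scaled by 2Σp(1 − p), following van Raden (2008); this centering and scaling is an addition to the plain ZZ′ form given above, and the function name is hypothetical.

```python
import numpy as np

def genomic_relationship(Z):
    """Marker-based kinship matrix from an n x p matrix of 0/1/2 genotype codes."""
    Z = np.asarray(Z, dtype=float)
    p = Z.mean(axis=0) / 2.0              # allele frequency of each marker
    W = Z - 2.0 * p                       # center each marker column
    scale = 2.0 * np.sum(p * (1.0 - p))   # puts the matrix on a pedigree-like scale
    if scale == 0:
        raise ValueError("no polymorphic markers")
    return W @ W.T / scale

# Illustrative genotypes for three individuals and three markers
Z = [[0, 1, 2],
     [2, 1, 0],
     [1, 1, 1]]
print(genomic_relationship(Z))
```

Because the columns are centered, every row of the resulting matrix sums to zero, so off-diagonal elements are relationships relative to the average of the analyzed population.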
Using the vector and matrix forms, model (4) is rewritten as:
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{z}_l a_l + \mathbf{g} + \mathbf{e} \qquad (5)$$
where y = (y1, y2, …, yn)’ is a vector of phenotypic values, X is a design matrix relating b to y, b is a vector of fixed effects containing the intercept (μ) and non-genetic effects (cj), zl = (z1l, z2l,…, znl)’ is a vector of the covariates indicating the genotypes of the lth marker for n individuals, g is a vector of the random polygenic effects of n individuals as described above, and e = (e1, e2, …, en)’ is a vector of residuals. Significance testing for the association of each tested marker with the phenotype is based on the estimate of al, âl, for which a statistical test of hypotheses H0: al = 0 versus H1: al ≠ 0 is conducted. In the usual procedure of GWAS with a mixed linear model, the P value of the estimate âl under H0 is calculated and −log10(P value) is plotted against each marker position to detect markers significantly associated with the phenotype; this plot is called a Manhattan plot. When H0 is rejected, the association of a tested marker and phenotype is regarded as significant and a marker associated with the trait is detected. Estimation of parameters in mixed linear model (5) can be conducted with several software packages, including the rrBLUP package for R (Endelman 2011) and TASSEL (Bradbury et al. 2007).
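The single-marker test can be sketched with generalized least squares when the ratio of residual to polygenic variance is treated as known. This is a simplification: packages such as rrBLUP and TASSEL estimate the variance components from the data, whereas here `delta` is supplied by the user, and the P value uses a normal approximation rather than an exact t or F test. The function name, the argument names, and the symbol `delta` are hypothetical.

```python
import math
import numpy as np

def mlm_marker_test(y, X, z, K, delta):
    """GLS estimate and approximate P value for one marker in a mixed linear model.

    y: phenotypes (n,); X: fixed-effect design matrix (n, q-1) including intercept;
    z: genotype codes of the tested marker (n,); K: kinship matrix (n, n);
    delta: assumed ratio of residual variance to polygenic variance.
    """
    y = np.asarray(y, dtype=float)
    W = np.column_stack([np.asarray(X, dtype=float), np.asarray(z, dtype=float)])
    n, q = W.shape
    V = np.asarray(K, dtype=float) + delta * np.eye(n)   # covariance of y up to a constant
    Vinv_W = np.linalg.solve(V, W)
    Vinv_y = np.linalg.solve(V, y)
    WtVinvW = W.T @ Vinv_W
    theta = np.linalg.solve(WtVinvW, W.T @ Vinv_y)       # GLS estimates of (b, a_l)
    resid = y - W @ theta
    sigma2 = resid @ np.linalg.solve(V, resid) / (n - q)  # scale of the covariance
    cov = sigma2 * np.linalg.inv(WtVinvW)
    a_hat = theta[-1]
    stat = a_hat / math.sqrt(cov[-1, -1])
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))       # two-sided, normal approximation
    return a_hat, stat, p_value

# Simulated illustration: one marker with a true effect of 2.0
rng = np.random.default_rng(0)
n = 50
z = rng.integers(0, 3, size=n).astype(float)
y = 1.0 + 2.0 * z + rng.normal(scale=0.1, size=n)
print(mlm_marker_test(y, np.ones((n, 1)), z, np.eye(n), delta=1.0))
```

Applying the function to each marker in turn and plotting −log10 of the returned P values against marker position would give a Manhattan plot of the kind described above.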
By substituting g with Za and assuming that zlal is included in g in model (5), we obtain the modified model:
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{a} + \mathbf{e} \qquad (6)$$
in which the fixed-effect part Xb is unchanged and Za replaces the sum of the tested-marker term and the polygenic effect. All markers are simultaneously fitted using model (6) for GWAS. The number of markers, p, is often much greater than that of analyzed individuals, n, when high-density SNP markers are used. To estimate the effects of all markers, a, simultaneously when p > n, model (6) is either handled as a mixed linear model by treating a as random effects, in which case best linear unbiased prediction (BLUP) is used to estimate a, or fitted by Bayesian estimation with a prior distribution assigned to a.
Iwata et al. (2013a) applied Bayesian estimation for GWAS in a population of Japanese pears using genome-wide SSR markers based on model (5) incorporating a variable selection procedure. To date, many types of Bayesian methods have been proposed (e.g., Bayes A, Bayes B, Bayes Cπ, Bayes Dπ, Bayesian LASSO), which differ in the prior distributions assumed for a and other parameters and in whether the estimation procedure includes variable selection. Several methods of Bayesian estimation for model (6) can be performed with the BGLR package for R (Pérez and de los Campos 2014).
GS is a method of individual selection based on prediction of the genotypic value of a target trait based on the genotypes of genome-wide markers (Meuwissen et al. 2001). From the point of view of artificial selection, the genotypic value is called the breeding value. Breeding values for individuals with marker genotypes in GS are predicted using the same statistical models as in GWAS, where genotypes and phenotypes of a collection of individuals, referred to as the training population, are used for the estimation of model parameters. Breeding values are predicted for selection candidates based only on their genotypes with the prediction model thus constructed.
Using the same notation as in GWAS, the objective of GS is to predict g for selection candidates based on the information of marker genotypes Z. The predicted value of g, denoted as ĝ, is referred to as the genomic estimated breeding value (GEBV). Because g = Za with the effects of all available markers (a), GEBV is obtained as ĝ = Zâ, with â being the estimate of a calculated from model (6) using the marker genotypes and phenotypes of the training population. Alternatively, based on the model obtained by omitting the term zlal from model (5), GEBV ĝ can be calculated directly using a BLUP method without estimating a, where the kinship matrix A is calculated from the marker genotypes Z. The method of BLUP based on a kinship matrix obtained from marker genotypes is referred to as genomic BLUP (GBLUP; van Raden 2008). Bayesian methods are also applied to the estimation of breeding value g, where some prior distributions are assigned to a.
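The relationship ĝ = Zâ can be sketched in a few lines of plain Python. This is a simplified illustration, not the procedure of any cited study: the only fixed effect is an intercept, the shrinkage parameter `lam` is assumed known (in practice it would be estimated, for example by restricted maximum likelihood), and the genotypes and phenotypes are invented. Marker effects are obtained by solving an n × n system built from ZZ', which mirrors the idea of working with a marker-based relationship matrix.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_marker_effects(Z, y, lam):
    """Ridge-type (BLUP-like) estimate of marker effects from training data.

    Z: n x p genotype matrix (allele counts 0/1/2), y: n phenotypes,
    lam: assumed ratio of residual to marker-effect variance.
    """
    n, p = len(Z), len(Z[0])
    col_means = [sum(Z[i][j] for i in range(n)) / n for j in range(p)]
    Zc = [[Z[i][j] - col_means[j] for j in range(p)] for i in range(n)]
    mu = sum(y) / n
    yc = [v - mu for v in y]
    # n x n system (Zc Zc' + lam I) alpha = yc, usable even when p > n.
    K = [[sum(Zc[i][k] * Zc[j][k] for k in range(p)) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, yc)
    a_hat = [sum(Zc[i][j] * alpha[i] for i in range(n)) for j in range(p)]
    return mu, col_means, a_hat


def gebv(Z_new, col_means, a_hat):
    """GEBV = (centered genotypes) x (estimated marker effects)."""
    return [sum((z[j] - col_means[j]) * a_hat[j] for j in range(len(a_hat)))
            for z in Z_new]


# Hypothetical training population: 4 individuals, 6 markers (p > n).
Z_train = [[0, 1, 2, 0, 1, 2], [1, 1, 0, 2, 0, 1],
           [2, 0, 1, 1, 2, 0], [0, 2, 1, 1, 0, 2]]
y_train = [5.1, 4.2, 6.3, 4.8]
# Hypothetical selection candidates with genotypes only.
Z_cand = [[1, 0, 2, 0, 1, 1], [2, 2, 0, 1, 0, 0]]

mu, col_means, a_hat = fit_marker_effects(Z_train, y_train, lam=1.0)
print(gebv(Z_cand, col_means, a_hat))
```

The candidates are ranked using genotypes alone, which is the property that lets GS act before phenotypes become available.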
Kumar et al. (2012b) applied both BLUP and Bayesian LASSO to predict GEBV of fruit quality traits in apple. Instead of using statistical linear models such as model (6), some machine-learning algorithms for the discrimination of objects based on a large number of features, including support vector machine and random forest, also have been applied to predict GEBV in GS (Jannink et al. 2010).
Most agronomically and horticulturally important traits, such as fruit quality, are quantitative and controlled by multiple, sometimes numerous, genes or QTLs. Many genes and QTLs for disease and pest resistance have been reported, including those for scab resistance in apple (Bus et al. 2010) and pear (Pierantoni et al. 2007, Terakami et al. 2006), plum pox virus resistance in apricot (Soriano et al. 2008), brown rot resistance in peach (Pacheco et al. 2014), and downy and powdery mildew resistance in grapevine (Riaz et al. 2011, van Heerden et al. 2014). Various QTLs controlling fruit quality traits (e.g., harvest time, fruit skin color, fruit weight, and sugar content) have also been identified in apple (Kenis et al. 2008, Kunihisa et al. 2014), pear (Yamamoto et al. 2014, Zhang et al. 2013), peach (Eduardo et al. 2011, Martínez-García et al. 2013), and grapevine (Correa et al. 2014).
QTL mapping is commonly performed using a single full-sib family, so the detected QTLs are often unstable across different genetic backgrounds (Kenis et al. 2008). To overcome this limitation of QTL mapping, MetaQTL analysis is used; it is a novel tool that integrates the results of multiple independent QTL mapping experiments (Veyrieras et al. 2007). MetaQTL analyses have been performed for plum pox virus resistance in apricot (Marandel et al. 2009) and fruit quality traits in apple (Costa 2015).
Compared to QTL mapping, GWAS is more suitable for QTL detection in fruit trees because it does not require biparental populations. Generating segregating populations derived from biparental crosses of fruit trees is difficult and costly due to their long juvenile periods (Khan and Korban 2012, Rikkerink et al. 2007). There are still few studies on the genetic determination of quantitative traits in fruit trees using GWAS (Table 1). In apple, Kumar et al. (2013) conducted GWAS using 1200 seedlings of seven full-sib families genotyped for 2500 SNPs to detect significant associations with six fruit traits. Significant associations were found for all six traits, and some of the associated genomic regions coincided with known candidate genes. Iwata et al. (2013a) conducted GWAS using 76 cultivars of Japanese pear genotyped for 162 markers to detect significant associations with nine agronomic traits. Significant associations were found for harvest time, black spot resistance, and spur number. In peach, Cao et al. (2012) conducted GWAS using 104 landrace accessions genotyped with 53 genome-wide SSR markers and found associated markers for 10 traits related to fruit and phenological period. Association studies with candidate genes have been conducted in apple and contributed to finding a gene controlling fruit flesh firmness (Cevik et al. 2010). Significant overlaps between the results of QTL mapping and association studies for disease resistance and fruit quality traits have been reported in apple (Cevik et al. 2010, Kumar et al. 2012b, 2013), pear (Iwata et al. 2013a, Yamamoto et al. 2014), and peach (Cao et al. 2012, Picañol et al. 2013).
SNP, single nucleotide polymorphism; SSR, simple sequence repeat; RAPD-STS, random amplified polymorphic DNA-sequence tagged sites; ACC, 1-aminocyclopropane-1-carboxylate.
Although recent advances in genomics research have provided abundant genomics resources (e.g., genetic and physical maps, QTLs, and numerous molecular markers for many traits), there are few reports of MAS application (Folta and Gardiner 2009, Ru et al. 2015). Issues including technical and economic barriers have prevented the integration of MAS technology into conventional breeding. Choosing reliable markers from among robust markers is difficult for breeders. The LD between a marker and trait locus may differ in every population (Breseghello and Sorrells 2006), and the effect of QTLs may differ among genetic backgrounds and environmental conditions (Li et al. 2003, Liao et al. 2001). In addition, MAS will not always have greater cost-effectiveness than conventional breeding, because the cost of MAS depends on the target traits (e.g., monogenic or polygenic) and the ease of measuring them (Morris et al. 2003). To solve these issues, the RosBREED project funded by the USDA Specialty Crop Research Initiative was established in 2009 (Iezzoni et al. 2010). An eight-stage marker-assisted breeding pipeline was implemented to enable the use of marker-assisted breeding in rosaceous tree fruits. Consequently, MAS has been successfully applied in apple (Edge-Garza et al. 2010, Kellerhals et al. 2011, Peace 2013, Sebolt 2013) and sweet cherry (Haldar et al. 2010).
The application of MAS is increasing, but its technical limitations have also become apparent. MAS is effective for improving traits controlled by a small number of major genes and/or large-effect QTLs (e.g., pest and disease resistance), whereas it is difficult to use for traits controlled by a large number of minor genes, as is the case with many fruit quality traits (Jannink et al. 2010, Kumar et al. 2012a). GS was proposed to overcome these limitations by using whole-genome prediction models based on genome-wide markers (Meuwissen et al. 2001).
Compared to MAS, GS is more suitable for selecting traits controlled by many minor genes, and there is no need to generate crossing populations to develop markers. In GS, selection decisions are based on the predicted GEBV of selection candidates. In recent years, new sequencing technologies have decreased the costs of SNP genotyping and resulted in greater availability of numerous markers, which will increase prediction accuracy. GS was first described in dairy cattle breeding programs (Calus 2010, Hayes et al. 2009) and has been used subsequently in crops (Heffner et al. 2009, Lin et al. 2014).
Because GS has great potential for streamlining plant breeding, it is being eagerly studied for its application to various plant species. The use of GS in plant breeding has been evaluated mainly from two perspectives: (1) deterministic or stochastic simulations; and (2) empirical data analysis. The GS studies of fruit trees have thus far included apple (Kumar et al. 2012a, 2012b), Japanese pear (Iwata et al. 2013a), and grapevine (Fodor et al. 2014); here, we also describe studies in other tree crops including forest trees and oil palm.
Grattapaglia and Resende (2011) conducted deterministic simulations and assessed the impact of the degree of LD, size of the training set, trait heritability, and number of QTLs on the accuracy of GS in forest tree breeding. The degree of LD, which is modeled by effective population size and marker density, had the largest impact on the accuracy. The accuracy of GS was comparable to that of traditional selection based on pedigree-based BLUP (PBLUP) even at a moderate marker density (2 markers/cM) when the effective population size was small (≤30); shortening the breeding cycle by 50% with GS provided an increase of at least 100% in selection efficiency, suggesting a promising effect of GS in tree breeding.
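The arithmetic behind the last figure can be made explicit with the standard expression for genetic gain per unit time (the breeder's equation). The symbols below are conventional and are an assumption of this sketch rather than notation from the cited study: i is the selection intensity, r the selection accuracy, σA the additive genetic standard deviation, and L the length of one breeding cycle.

```latex
\begin{equation*}
\frac{\Delta G}{t} = \frac{i \, r \, \sigma_A}{L},
\qquad
\frac{(\Delta G / t)_{\mathrm{GS}}}{(\Delta G / t)_{\mathrm{PBLUP}}}
= \frac{r_{\mathrm{GS}}}{r_{\mathrm{PBLUP}}} \cdot \frac{L_{\mathrm{PBLUP}}}{L_{\mathrm{GS}}} .
\end{equation*}
```

With equal selection intensity and comparable accuracies (rGS ≈ rPBLUP), halving the cycle length (LGS = LPBLUP / 2) gives a ratio of about 2, that is, an increase of roughly 100% in gain per unit time.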
Iwata et al. (2011) performed stochastic simulations to evaluate the efficiency of GS in forest tree breeding, and their conclusion was consistent with that of Grattapaglia and Resende (2011). That is, by using a base population derived from a limited number (=25) of elite trees (i.e., small effective population size), GS was advantageous over phenotypic selection even for a low-heritability polygenic trait at a moderate marker density (1 marker/cM). The simulations also suggest that updating the prediction model is indispensable for attaining large genetic gain from GS breeding, because the pattern of LD changes with increasing selection cycles.
Denis and Bouvet (2013) performed stochastic simulations to assess the efficiency of GS in eucalyptus breeding, with findings similar to those of Grattapaglia and Resende (2011) and Iwata et al. (2011). That is, GS attained two or three times greater genetic gain per unit time than that of phenotypic selection, although the gain per cycle declined in later breeding cycles. The authors also compared the performance of GS models with and without dominance effects. The model with dominance effects performed better for clone selection when heritability was high and dominance effects were preponderant, but no improvement was detected for parent selection.
Fodor et al. (2014) performed stochastic simulations to assess the efficiency of GS for grapevine breeding. The authors simulated the domestication histories of three grapevine diversity groups and evaluated the accuracy of GS in simulated breeding populations. High accuracy levels were obtained using a core collection covering three diversity groups as a training population, and the highest prediction accuracy was attained with the combination of GWAS and GS.
Wong and Bernardo (2008) simulated GS breeding in oil palm and demonstrated the superiority of GS over marker-assisted recurrent selection and phenotypic selection in terms of genetic gain per unit cost and time. Although their simulation results are suggestive for tree breeding, the results might not be easily generalized because they simulated breeding populations derived from selfing a hybrid between two inbred lines, whereas actual oil palm breeding populations are more complex (Cros et al. 2015).
In fruit trees, empirical data analysis has been conducted in apple (Kumar et al. 2012b) and Japanese pear (Iwata et al. 2013a), as summarized in Table 2. Kumar et al. (2012b) assessed the accuracy of GS by analyzing seven full-sib families with 1120 individuals that were genotyped for 2500 SNPs on the International RosBREED SNP Consortium apple Infinium array v1. The accuracy of GS ranged from 0.68 to 0.89 and was higher than the accuracy of PBLUP for all traits. GS allows the modeling of the variation caused by random sampling of two possible alleles from each parent at each locus during meiosis (called Mendelian sampling), whereas PBLUP cannot take such variation into account. The higher accuracy of GS suggests that it could account for both family effects and Mendelian sampling. Two modeling methods, GBLUP and Bayesian LASSO, were compared, but the difference in the accuracy between GBLUP and Bayesian LASSO was small.
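Accuracy values such as those above are commonly computed as the correlation between predicted values and observed phenotypes (or reference breeding values) for individuals left out of model training. Whether each cited study used exactly this definition is an assumption of the sketch below, and the numbers are invented for illustration.

```python
import math


def prediction_accuracy(predicted, observed):
    """Pearson correlation between predicted and observed values."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


# Hypothetical GEBVs and observed phenotypes of five validation individuals.
gebv_validation = [0.8, -0.3, 0.1, 1.2, -0.9]
observed_validation = [10.9, 9.6, 10.4, 11.0, 9.1]
print(round(prediction_accuracy(gebv_validation, observed_validation), 2))
```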
SNP, single nucleotide polymorphism; RR, ridge regression; rrBLUP, random regression best linear unbiased prediction; LASSO, least absolute shrinkage and selection operator; SSR, simple sequence repeat; RAPD-STS, random amplified polymorphic DNA-sequence tagged sites; ACC, 1-aminocyclopropane-1-carboxylate; GBLUP, genomic best linear unbiased prediction.
Iwata et al. (2013a) conducted an empirical data analysis of GS in Japanese pear. The authors used 76 cultivars genotyped for 162 markers, including 155 SSRs, and phenotyped for nine agronomic traits. Because phenotypes were scored as ordinal categorical data, Bayesian methods for estimating latent continuous variation of phenotypes were employed. The level of accuracy of GS prediction was high (0.75) or medium (0.38–0.61) in seven of the nine traits. For fruit quality traits, no significant association was detected by using GWAS, but GS prediction showed medium levels of accuracy except for one trait. In traits for which a significant association was detected in GWAS, GS predictions were more accurate than those based on significant markers, suggesting that the traits are determined by several minor and medium QTLs as well as major QTLs.
Cros et al. (2015) evaluated the prediction accuracy of GS in oil palm, using two parental populations involved in conventional reciprocal recurrent selection with 131 individuals each and genotyped for 265 SSRs (Table 2). The authors compared five GS modeling methods: GBLUP, Bayesian LASSO, Bayesian ridge regression, BayesCπ, and BayesDπ. The accuracy of GBLUP was significantly higher than that of PBLUP in three of eight traits in one parental population, but it was equal to that of PBLUP in the other parental population. These results suggest that GS could account for family effects and Mendelian sampling terms in the former population, but only family effects in the latter population. Fewer polymorphic markers and lower marker density may have impaired the advantage of GBLUP over PBLUP in the latter population. Differences in accuracy were small among the five modeling methods.
In forest tree species, empirical data analysis of GS has been performed for loblolly pine (M.F.R. Resende et al. 2012a, 2012b, Zapata-Valenzuela et al. 2012, 2013), eucalyptus (M.D. Resende et al. 2012), and white spruce (Beaulieu et al. 2014a, 2014b), as summarized in Table 2. In their pioneering study, M.F.R. Resende et al. (2012a) evaluated the accuracy of GS in loblolly pine, using 800 clonally replicated individuals grown at four sites and genotyped for 4825 SNPs. The accuracy of GS ranged from 0.64 to 0.74 and was comparable with or slightly less than that of selection based on PBLUP. The authors evaluated the accuracy of GS models across ages and environments. The model generated at early ages did not perform well in predicting phenotypes at later ages (6 years). The accuracy of models was highest at the sites where the models were generated but declined at different sites, suggesting that genotype × environment (GE) interactions greatly affect the transferability of models across sites.
M.D. Resende et al. (2012) used two breeding populations to evaluate the accuracy of GS in eucalyptus. The two populations contained 43 and 75 families (738 and 920 individuals, respectively) sampled for GS. The accuracy of GS prediction ranged from 0.74 to 0.88 in one population and from 0.55 to 0.73 in the other. The higher accuracy in the former population was expected because it had a smaller effective population size (11) than the latter one (51). The accuracy of GS was comparable to or greater than that of selection based on PBLUP. GS models showed poor predictability across populations, likely as a result of variable patterns of LD, inconsistent allelic effects, and GE interactions.
The prediction model of GS can also be used for predicting the segregation pattern of target traits in a progeny population. In cross breeding, it is important to select an optimal parental combination that has a high probability of generating offspring with desired characteristics. In fruit trees, the establishment of a segregating population and field evaluation of the population require much time and cost; therefore, systematic planning for selecting an optimal combination becomes even more important. Iwata et al. (2013b) proposed a novel method for predicting the segregation of traits in a progeny population based on GS prediction models and applied the method to Japanese pear data in a proof-of-concept study. Empirical analysis using an actual breeding population and a simulation study based on real marker data suggested that the segregation of target traits can be predicted with reasonable accuracy, especially for a highly heritable trait. The proposed method can provide objective and quantitative criteria for choosing an optimal parental combination and sufficient breeding population size.
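The general idea of predicting progeny segregation from a GS model can be sketched by simulating gametes from two parents and applying estimated marker effects to the simulated offspring. This is a deliberately simplified illustration and not the method of Iwata et al. (2013b): markers are assumed to segregate independently (no linkage), effects are assumed additive, and the parental genotypes, marker effects, and target value are invented.

```python
import random
import statistics


def gamete(parent, rng):
    """Sample one allele (0 or 1) per marker from a parent coded as 0/1/2."""
    alleles = []
    for g in parent:
        if g == 0:
            alleles.append(0)
        elif g == 2:
            alleles.append(1)
        else:  # heterozygote: either allele with probability 1/2
            alleles.append(rng.randint(0, 1))
    return alleles


def simulate_progeny_gebv(parent1, parent2, effects, n_progeny=1000, seed=1):
    """Return predicted genotypic values of simulated progeny of one cross."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_progeny):
        g1, g2 = gamete(parent1, rng), gamete(parent2, rng)
        genotype = [a + b for a, b in zip(g1, g2)]
        values.append(sum(z * e for z, e in zip(genotype, effects)))
    return values


# Hypothetical parents (allele counts at five markers) and marker effects.
parent_a = [1, 2, 0, 1, 1]
parent_b = [1, 0, 2, 1, 2]
effects = [0.6, -0.2, 0.3, 0.9, -0.4]

values = simulate_progeny_gebv(parent_a, parent_b, effects)
target = 1.5  # hypothetical breeding target on the predicted scale
share_above = sum(v > target for v in values) / len(values)
print(statistics.mean(values), statistics.stdev(values), share_above)
```

Comparing the predicted mean, spread, and share of progeny above a target across candidate crosses gives a quantitative basis for choosing parents and for judging how many seedlings a cross would need.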
Most fruit tree crops have a long juvenile phase. For example, apple takes 5–12 years until the transition to the adult phase, whereas peach can start flowering within several years after germination (Hansche 1986, Visser 1964). To reduce the time and cost of fruit breeding, more effective methods for accelerating generation time are desired (van Nocker and Gardiner 2014). Recently, a fast-track breeding system, which is based on controlling the juvenile–adult phase transition by inducing a flowering gene and/or silencing a floral repressor, combined with MAS, was developed to shorten the juvenile phase in fruit crops (Flachowsky et al. 2007, 2011, Wenzel et al. 2013).
In the model dicot Arabidopsis thaliana, many factors involved in floral induction have been identified over the last two decades (Amasino 2010). For fast-track breeding, the most promising gene may be FLOWERING LOCUS T (FT). AtFT encodes a mobile regulator of flowering under the long-day condition, which matches the characteristics of a florigen (Corbesier et al. 2007). The introduction of Citrus FT (CiFT) to trifoliate orange drastically shortened the juvenile phase and resulted in flowering as early as 12 weeks in a greenhouse, whereas the first flowering normally occurs after at least several years (Endo et al. 2005). Similarly, overexpression of MdFT1, an apple FT-like gene, induced precocious flowering in apple (Kotoda et al. 2010). FT-like genes from poplar under the control of the heat-inducible Gmhsp 17.5-E promoter from soybean were also used for floral induction in apple to avoid the disadvantages of constitutive expression (Wenzel et al. 2013).
Several other candidates for floral induction of fruit crops have been identified. FT activates various downstream genes, including a floral meristem identity gene, APETALA1 (AP1; Abe et al. 2005, Wigge et al. 2005). Transgenic citrus (Citrus sinensis × Poncirus trifoliata) with constitutive expression of AP1 produced fertile flowers and fruits in the first year (Peña et al. 2001). LEAFY (LFY) activates AP1 and is also involved in the specification of floral meristem identity (Wagner et al. 1999, Weigel et al. 1992). Similar to AP1 transformants, transgenic citrus expressing LFY also showed the early-flowering phenotype (Peña et al. 2001). TERMINAL FLOWER 1 (TFL1) encodes an FT-like protein, an antagonist to LFY and AP1; it represses the conversion of inflorescence meristems to floral meristems (Liljegren et al. 1999). In apple and pear, the silencing of TFL1-like genes reduced vegetative growth and resulted in early flowering within several years even during in vitro cultivation (Flachowsky et al. 2012, Freiman et al. 2012). FRUITFULL (FUL), a gene closely related to AP1, plays redundant roles in floral meristem identity (Ferrándiz et al. 2000, Litt and Irish 2003). Flachowsky et al. (2007, 2011) used BpMADS4, a FUL-like gene from silver birch, for floral induction of apple and sped up the generation cycle.
In addition to transgenesis, a technique for floral induction of host crops using a plant virus vector has been recently developed. An Apple latent spherical virus (ALSV) vector containing AtFT induced flowering of 30% of apple seedlings 1.5–2.0 months after inoculation (Yamagishi et al. 2011). Virus-induced silencing of MdTFL1-1, an apple homolog of TFL1, caused early flowering in approximately 10% of inoculated apple seedlings (Sasaki et al. 2011). Furthermore, simultaneous induction of AtFT and silencing of MdTFL1 using an ALSV vector more stably induced flowering (90% of seedlings) and resulted in the completion of one life cycle within a year (Yamagishi et al. 2014). In early-flowering transgenic lines, transgenes can be eliminated by segregation before producing the final cultivars (van Nocker and Gardiner 2014). On the other hand, the ALSV vector does not influence the genotype and the floral character of the next generation, as ALSV is rarely detected in the successive progenies (Yamagishi et al. 2014). In the future, plant virus vector–induced transient induction may be applied to other fruit crops and become more promising for accelerating generation times.
GWAS and GS will become increasingly important methods in future fruit tree breeding and genetics, in combination with the increased throughput and decreased cost of genome-wide SNP genotyping and the improved accuracy and power of statistical methods. GWAS can estimate the location and effects of QTLs/genes related to target traits, and the estimates can be further used in MAS, GS (Fodor et al. 2014), and their combinations (van Nocker and Gardiner 2014). GS is also useful for traits controlled by numerous small-effect loci. Even in traits for which no significant association is detected, GS predictions can attain some level of accuracy (Iwata et al. 2013a). In addition to GWAS and GS, generation advancement technologies that promote rapid-cycle breeding with short generation time and small plant size will help to streamline fruit tree breeding.
GWAS and GS require phenotypic data and marker genotypic data for analysis and modeling. Collecting phenotypic data for a large number of cultivars/lines, however, is costly because of the problems that have hampered fruit tree breeding, namely the long juvenile phase and large plant size. As suggested by Myles (2013), researchers need to think about how to increase the throughput of their phenotyping strategies and how to establish populations suitable for genetic mapping, MAS, and GS. One way to collect phenotypic data for numerous cultivars/lines is to use breeding populations, which are routinely developed and evaluated in breeding programs. If phenotypic and marker genotypic data can be routinely gathered in breeding programs, the resultant collection will boost the detection power of GWAS and the accuracy of GS. The data collected from breeding populations also will be useful for elucidating functional genomics in plants (Poland 2015). Another possible way is to use marker genotypic data for identifying the optimal subset of cultivars/lines to phenotype, because marker genotyping is less costly than phenotyping. Careful selection of the subset based on an optimization algorithm will increase the accuracy of GS (Cros et al. 2015). When there is a subpopulation structure in breeding materials, a core collection that includes representative individuals from all subpopulations can be a good training population to increase the accuracy of GS (Fodor et al. 2014). The development of a field-based high-throughput phenotyping system is also necessary for promoting the efficiency of phenotypic data collection in breeding programs (Araus and Cairns 2014, Deery et al. 2014, Fiorani and Schurr 2013, Poland 2015, White et al. 2012).
The relationships between phenotypes and marker genotypes are sometimes strongly affected by non-genetic factors. Fruit quality traits are controlled by complex genetic systems and are readily influenced by environmental conditions (Chagné et al. 2014). Cultivation management, such as rootstock selection, training systems, pruning techniques, and post-harvest treatments, also influence fruit quality and yield (Myles 2013). Thus, genotype × environment × management (GEM) interactions should be taken into account when GWAS and GS are applied to breeding populations. When these interactions have a profound effect, MAS and GS will not be without complications. For example, models generated at one site would not be functional at another site with different environmental conditions (M.F.R. Resende et al. 2012a). Fruit tree species may have an advantage from this viewpoint: because most fruit tree species can be propagated clonally, multi-environmental trials can be performed using a set of the same clones to assess the influence of GE or GEM interactions on GWAS and GS in detail (e.g., M.F.R. Resende et al. 2012a). As noted by Heffner et al. (2009), however, the genotype of any individual is composed of alleles that have been evaluated in a large number of target environments, and thus it may be possible to keep the accuracy of GS high even in the presence of GE interactions if phenotypic data have been collected across many environments.
Another way to deal with GE and/or GEM interactions is to develop mathematical models describing the pattern of the environmental responses of cultivars/lines. An ecophysiological model is a method for simulating the response of plant phenotypes to environmental factors and for describing the pattern of environmental response of cultivars/lines with parameters of the models. These parameters are expected to reflect the genetic characteristics of cultivars/lines, and they can be used for QTL analysis to dissect the varietal variation in the environmental response. In apple, an ecophysiological phenology model was applied to simulate the flowering time and used to calculate the risk of damage by spring frost under climate change (Eccel et al. 2009). This type of approach will be useful for developing new cultivars that can adapt well to future climate change. Because fruit tree cultivars cannot be developed over a short time, it is necessary to develop such new varieties based on a long-term perspective.
The phenotypes of fruit trees are also affected by age-induced factors (i.e., ontogenetic factors) and their interactions with genetic factors. To clarify the genetic, ontogenetic, and environmental effects on phenotypes, it is necessary to separate confounded effects caused by consecutive years of growth (i.e., ontogenetic effect) and climatic years (i.e., environmental effect). A staggered-start experimental design (Loughin 2006) enables us to statistically separate the effects of genetic × ontogenetic and genetic × year interactions. The potential of the staggered-start design for fruit tree research was demonstrated in the genetic dissection of apple tree architecture (Segura et al. 2008).
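The confounding that a staggered start removes can be shown with a small enumeration. The sketch assumes that staggering means planting cohorts in consecutive calendar years and observing each for the same number of years; the details of the design in Loughin (2006) are not reproduced here, and the years are hypothetical.

```python
def staggered_design(start_years, n_years_observed):
    """List (cohort, tree age, calendar year) for every observation."""
    records = []
    for start in start_years:
        for age in range(1, n_years_observed + 1):
            records.append({"cohort": start, "age": age, "year": start + age})
    return records


def years_per_age(records):
    """Map each tree age to the set of calendar years in which it is observed."""
    seen = {}
    for r in records:
        seen.setdefault(r["age"], set()).add(r["year"])
    return seen


# Single planting year: every age is observed in exactly one calendar year,
# so ontogenetic (age) and climatic (year) effects cannot be told apart.
single = years_per_age(staggered_design([2000], 3))

# Three staggered planting years: every age is observed in three different
# calendar years, so the two kinds of effect can be separated statistically.
staggered = years_per_age(staggered_design([2000, 2001, 2002], 3))

print(single)
print(staggered)
```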
Genomics-based approaches and genomics-assisted breeding using these methods will provide significant benefits to the genetic improvement of fruit trees, although there are some obstacles to overcome. Designing breeding programs that make optimal use of genomics-based approaches is an important task for plant breeders, because each species is unique and each requires a tailor-made solution (Lin et al. 2014). Stochastic simulations assuming a certain breeding program (e.g., Denis and Bouvet 2013, Iwata et al. 2011, Yabe et al. 2013, 2014) will be helpful for finding ways to optimize the use of genomics-based approaches in breeding programs under species-specific and/or breeding-program-specific restrictions.
This work was supported by a grant from the Ministry of Agriculture, Forestry, and Fisheries of Japan (Genomics-based Technology for Agricultural Improvement, NGB2010) and by Grants-in-Aid for Scientific Research (A), MEXT (No. 25252002), and JST/CREST.