Genes & Genetic Systems
Special reviews
On the use of kernel approximate Bayesian computation to infer population history
Shigeki Nakagome

2015 Volume 90 Issue 3 Pages 153-162

ABSTRACT

Genetic variation is a product of mutation, recombination and natural selection along with a complex history involving population subdivision, gene flow and changes in population size. Elucidating the evolutionary forces that shape genetic differences among populations is a major objective of evolutionary genetics. Recent advances in high-throughput technology enable genomic data to be obtained from samples at a population-based scale. Further, the growth in computational power has facilitated extensive efforts to develop intensive simulation-based approaches with the aim of analyzing such large-scale data and making inferences about population history. Approximate Bayesian computation (ABC) provides a quantitative way to assess the goodness-of-fit of complex models that are based on previous knowledge and to estimate the parameters of interest that produce the observed data. The practical advantage of ABC is the application of Bayesian inference to any model without the need to derive a likelihood function. ABC has rapidly become popular in ecology and evolutionary studies due to the contribution it has made to improving computational efficiency over the past decade. This review provides a brief overview of the background of ABC, including potential biases in estimation due to the assumptions and approximation involved, followed by an in-depth review of one of the recently developed ABCs, “kernel ABC,” with an explanation of how to overcome these biases. Finally, the application of kernel ABC to the inference of demographic history is summarized.

INTRODUCTION

One of the main goals in evolutionary biology is to make inferences about population history. All living individuals can be traced back to a common ancestor through their genomes. Genetic variation is the product of natural selection and stochastic processes involving demographic events, such as population subdivision, migration and changes in population size. Statistical inference provides useful, quantitative methods for understanding such a complex historical process. Model-based approaches aim to estimate parameters of interest that may explain how the observed data were generated. Several methods have been proposed for demographic and phylogenetic inferences (Cavalli-Sforza and Edwards, 1967; Thompson, 1973; Felsenstein, 1981). In particular, population genetics has been strongly influenced by the development of coalescent theory, which describes a genealogy (i.e., the path from present-day sequences back to their most recent common ancestor) backwards in time under a stochastic model (Fig. 1A) (Kingman, 1982; Tajima, 1983). Bayesian inference based on coalescent theory provides a powerful and flexible framework for model-based approaches (Beaumont and Rannala, 2004; Beaumont et al., 2010) and has been successfully applied to inferring demographic history (Wilson and Balding, 1998; Nielsen and Wakeley, 2001; Drummond et al., 2002; Rosenberg and Nordborg, 2002; Rannala and Yang, 2003; Wilson and Rannala, 2003; Hey and Nielsen, 2007).

Fig. 1.

Illustration of the coalescent process and genealogy. (A) A simplified schematic of DNA evolution. Each row corresponds to a single generation, and each circle denotes a DNA sequence. This is a constant-size model, so every generation contains the same number of sequences. All five sequences colored orange were sampled from the current generation. The parental sequences that lead to the present-day samples are also indicated in orange. A pair of sequences coalesces backwards in time when they share an ancestor. The last coalescence point is the common ancestor of all sequences, namely the most recent common ancestor (MRCA). (B) The genealogy of the five sequences sampled at the present day. The waiting time until the first coalescence among the five sequences is shown as T5. The waiting times of the subsequent coalescence events correspond to T4, T3 and T2. The time to the MRCA (TMRCA) is the sum of T5, T4, T3 and T2. (C) Since neutral mutations do not affect the genealogy, they are superimposed on the tree. The five DNA sequences are denoted A–E. Six mutations (Seg), shown as stars, occur in the genealogy. The site frequency spectrum (SFS) records how many mutations are present on one (ξ1), two (ξ2), three (ξ3) and four (ξ4) sequences in the genealogy. There are three mutations of ξ1, two mutations of ξ2 and one mutation of ξ3.

Bayesian inference consists of three main steps: (1) setting up a model, (2) conditioning the model on observed data, and (3) evaluating the fit of the model (Gelman et al., 2004). The first step is performed by constructing models based on previous knowledge and experience. Such information can be used to define a likelihood function (i.e., the probability of the observed data under a set of parameters) and a prior distribution (i.e., information about a parameter before obtaining the observed data). The second step is to estimate a posterior distribution that is proportional to the product of the likelihood and the prior. The posterior distribution is characterized by a point estimate (mode or mean) and an interval estimate (a set containing the true parameter with a certain probability), which can be used to interpret how the parameter estimate changes once it is conditioned on the observed data. However, several alternative hypotheses are often proposed for the evolutionary history that produced the observed data. Hence, comparing models and evaluating their fit to the observed data is a fundamental step in making inferences about population history. The third step is to evaluate the goodness-of-fit of the models and to identify the most parsimonious scenario. Models are only approximations of the true history and can be improved by re-formulating them (i.e., going back to the first step) and repeating these steps.

One of the difficulties in applying Bayesian inference to evolutionary studies is computing the likelihood function. This difficulty is due to the complexity of the models (e.g., a history involving multiple populations and changes in population size) and of the data (e.g., genome-wide data from multiple populations). In population genetics, a genealogy is treated as a nuisance parameter and the likelihood is averaged over the possible genealogies under a set of parameters (e.g., population size and time of population subdivision) (Rosenberg and Nordborg, 2002). However, if the model of interest is complex, the structure of the genealogy becomes complicated and the space of genealogies compatible with the parameters becomes broader, making it difficult to compute the likelihood function. Further, the explosive growth of high-throughput technology has allowed genomic data to be obtained at a population-based scale. Since previous techniques were based on simple models (e.g., a panmictic population of constant size) and applied to data of limited size (e.g., a single locus or multiple loci from a limited number of samples), these techniques are computationally expensive or completely infeasible for evaluating the likelihood of genomic data. Therefore, simulation-based methods have been developed that can handle such large-scale data without the need to derive the likelihood function (Marjoram and Tavare, 2006).

The development of simulation-based Bayesian computation has been facilitated by coalescent theory, which provides efficient methods to simulate samples of DNA fragments under many genetic and demographic scenarios (Hudson, 1990). Here, θ is a parameter of interest and D is the observed data. The goal is to sample θ from a posterior distribution defined as π(θ|D) ∝ f(D|θ)π(θ), where f(D|θ) is the likelihood and π(θ) is the prior on θ. The simplest Bayesian computation is rejection sampling (Ripley, 1987). This algorithm repeatedly simulates data (D′) using a parameter (θ′) randomly sampled from the prior, and D′ is accepted or rejected with a probability proportional to the likelihood. The rejection method is useful if the likelihood function is tractable.

One application of rejection sampling is to infer the time to the most recent common ancestor (TMRCA) based on the number of segregating sites (Seg) (Tavare et al., 1997). TMRCA is the sum of the coalescent times in a genealogy consisting of n samples (Fig. 1B). Ti is the waiting time until coalescence among i sequences (i = n to 2), and it follows an exponential distribution with rate parameter i(i − 1)/2. The likelihood represents the probability of the observation, Seg, under a set of parameters, T = (Tn, ..., T2), and it follows a Poisson distribution. Thus, the rejection algorithm is composed of two steps (Tavare et al., 1997). First, a set of parameters, t = (tn, tn−1, …, t2), is generated from the exponential distributions, and the total branch length (l) of the genealogy is calculated as l = 2t2 + … + (n − 1)tn−1 + ntn. Second, the simulated genealogy is accepted with a probability equal to the ratio of Poisson point probabilities, Po(lθ̂W/2){Seg} / Po(Seg){Seg}, where Po(λ){Seg} = e^−λ λ^Seg / Seg!. The numerator is the probability of observing Seg mutations on the simulated genealogy, whose mean is lθ̂W/2 with θ̂W = Seg / (1 + 1/2 + … + 1/(n − 1)) (Watterson, 1975); the denominator is the Poisson probability of Seg with mean Seg, which ensures that the ratio does not exceed one. By repeating these steps many times, samples of TMRCA can be obtained from π(T|Seg). For example, a modified version of this rejection sampling algorithm (Slatkin and Rannala, 1997; Tavare, 2004) has been applied to estimate the age of unique event polymorphisms (e.g., insertions of transposable elements) in Bombyx mandarina and B. mori (Nakagome et al., 2013d).
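As an illustration only (not code from the cited studies), the following Python sketch implements this two-step scheme under the assumptions stated above: coalescent waiting times are drawn from their exponential distributions, the total branch length is computed, and each simulated genealogy is accepted with the Poisson ratio; the sample size and observed Seg used in the example call are simply those of Fig. 1C.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tmrca(n, seg_obs, n_iter=100_000):
    """Rejection sampler for TMRCA given the observed number of
    segregating sites (an illustrative sketch of the scheme above)."""
    # Watterson's estimator of theta from the observed data
    theta_w = seg_obs / np.sum(1.0 / np.arange(1, n))
    accepted = []
    for _ in range(n_iter):
        # Coalescent waiting times T_i ~ Exp(rate = i(i-1)/2), i = n, ..., 2
        i = np.arange(n, 1, -1)
        t = rng.exponential(scale=2.0 / (i * (i - 1)))
        total_length = np.sum(i * t)             # total branch length l
        # Acceptance probability Po(l*theta_w/2){Seg} / Po(Seg){Seg}
        lam = total_length * theta_w / 2.0
        log_ratio = seg_obs * (np.log(lam) - np.log(seg_obs)) - lam + seg_obs
        if np.log(rng.random()) < min(0.0, log_ratio):
            accepted.append(np.sum(t))           # TMRCA = T_n + ... + T_2
    return np.array(accepted)

# Example values from Fig. 1C: five sequences with six segregating sites
tmrca_samples = sample_tmrca(n=5, seg_obs=6)
print(tmrca_samples.mean())
```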

Demand for testing complicated models and recent improvements in computational power have facilitated the development of simulation-based, likelihood-free methods for Bayesian inference. Approximate Bayesian computation (ABC) is a flexible framework for estimating a posterior distribution without making explicit use of the likelihood function (Fu and Li, 1997; Tavare et al., 1997; Weiss and von Haeseler, 1998; Pritchard et al., 1999; Beaumont et al., 2002). As the scale of genetic data expands, an approximation must be introduced by summarizing the data into a set of statistics. Summary statistics capture complicated data in a simpler way by reducing the dimensionality of the data. In population genetics, several useful statistics have been developed to evaluate patterns of genetic variation, including the number of segregating sites (Seg), nucleotide diversity (π) (Nei and Li, 1979), Tajima’s D (TD) (Tajima, 1989), and Fu and Li’s D (FLD) and F (FLF) (Fu and Li, 1997). The aim of ABC is to approximate the posterior distribution of the parameters and thereby determine the evolutionary history behind the observed data.

BASIC ALGORITHMS OF ABC

The most basic form of ABC, the rejection algorithm, involves sampling parameters from the posterior distribution approximated using a set of summary statistics, s = (s1, s2, …, sm), where m is the number (i.e., dimension) of summary statistics.

A1. Generate θ′ from π(·).

A2. Simulate D′ using θ′.

A3. Accept θ′ if s(D) = s(D′), and go to A1.

The resulting posterior distribution is π(θ|s(D)), which is an approximation of π(θ|D). Although summary statistics are used to reduce the computational cost, this algorithm requires an enormous amount of time because the simulated statistics must exactly match the observed ones. Another approximation therefore involves a tolerance (η) that defines the degree of similarity between the observed and simulated summary statistics and relaxes the condition for accepting the parameters. Step A3 is replaced with “Accept θ′ if d(s(D), s(D′)) < η, and go to A1”, where d(·,·) is a function measuring the similarity between the two sets of summary statistics. The practical outcome of the ABC rejection algorithm is sampling from π(θ|d(s(D), s(D′)) < η).
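A minimal Python sketch of this tolerance-based rejection algorithm is given below; the simulator, summary function and prior sampler are hypothetical callables standing in for whatever model is being fitted, and Euclidean distance is used for d(·,·) as one possible choice.

```python
import numpy as np

def abc_rejection(s_obs, simulate, summarize, prior_sample,
                  rng, n_sims=100_000, eta=0.1):
    """ABC rejection: keep theta' whenever the distance between the
    simulated and observed summary statistics is below the tolerance eta.
    `simulate`, `summarize` and `prior_sample` are user-supplied
    (hypothetical) callables for the model at hand."""
    accepted = []
    for _ in range(n_sims):
        theta = prior_sample(rng)           # A1: draw theta' from the prior
        data = simulate(theta, rng)         # A2: simulate D' under theta'
        dist = np.linalg.norm(np.asarray(summarize(data)) - np.asarray(s_obs))
        if dist < eta:                      # A3: accept if d(s(D), s(D')) < eta
            accepted.append(theta)
    return np.array(accepted)
```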

To correct discrepancies between the simulated and observed summary statistics, local linear regression was incorporated into ABC (Beaumont et al., 2002). After accepting parameters by ABC rejection, this method assigns weights to the accepted parameters according to the similarity between observed and simulated summary statistics. Each of the weighted parameters can be taken as a random draw from π(θ|s(D)), and the posterior mean of a parameter is given as   

\[
\hat{\mathbb{E}}\left[\,f(\theta)\mid s(\mathcal{D})\,\right] = \frac{\sum_{i=1}^{n}\left\{\theta_i - \hat{\beta}\left(s(\mathcal{D}_i)-s(\mathcal{D})\right)\right\} k_{\eta}\!\left(s(\mathcal{D}_i)-s(\mathcal{D})\right)}{\sum_{i=1}^{n} k_{\eta}\!\left(s(\mathcal{D}_i)-s(\mathcal{D})\right)},
\]
where an Epanechnikov kernel is used to calculate the weights: k_η(t) = η^−1(1 − (t/η)^2) for t ≤ η, and k_η(t) = 0 for t > η. The choice of kernel function is arbitrary, and another function could be used as long as the weights decrease steeply to zero as t increases. Regression ABC can improve the properties of posterior estimates (Beaumont et al., 2002). These rejection and regression ABCs are commonly known as standard ABCs.
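The weighting and local linear adjustment above can be sketched as follows; this is a simplified illustration (not the authors' implementation), and it assumes that the accepted parameters and their summary statistics are already available as arrays and that enough of them fall within the tolerance.

```python
import numpy as np

def regression_abc_mean(theta_acc, s_acc, s_obs, eta):
    """Posterior mean with Epanechnikov weights and a local linear
    regression adjustment (a simplified sketch of regression ABC).
    theta_acc: (n,) accepted parameters; s_acc: (n, m) their summaries."""
    diff = s_acc - s_obs                                       # s(D_i) - s(D)
    t = np.linalg.norm(diff, axis=1)
    w = np.where(t <= eta, (1.0 - (t / eta) ** 2) / eta, 0.0)  # Epanechnikov kernel
    # Weighted least squares for the local linear regression coefficients
    X = np.hstack([np.ones((len(t), 1)), diff])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ theta_acc)
    # Adjust each parameter towards the observed summaries and average
    adjusted = theta_acc - diff @ beta[1:]
    return np.sum(adjusted * w) / np.sum(w)
```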

ABC can also be applied to evaluate the goodness-of-fit of a model. In the case where there are two alternative models, M1 and M2 (e.g., constant population size vs. population expansion, or no migration vs. migration), a Bayes factor (BF), representing how much more likely one model is than the other, is given as

\[
\mathrm{BF} = \frac{P\left(s(\mathcal{D})\mid M_1\right)}{P\left(s(\mathcal{D})\mid M_2\right)},
\]
where P(s(D)|M) is an approximated marginal likelihood of a given model. In ABC, an estimator of the marginal likelihood is given as the acceptance rate under a given model,   
\[
\hat{P}\left(s(\mathcal{D})\mid M\right) = \frac{1}{n}\sum_{i=1}^{n} I\left(s(\mathcal{D}_i)=s(\mathcal{D})\right),
\]
where I(·) is an indicator function and n is the total number of simulations. To remove the need to accept an exact match between the simulated and observed summary statistics and to increase the computational efficiency, the acceptance rate is calculated using the same tolerance for each model under the standard ABCs (Pritchard et al., 1999; Fagundes et al., 2007; Beaumont, 2008).

POTENTIAL BIASES IN ABC-BASED STATISTICAL INFERENCE

Several technical improvements to standard ABCs have increased the accuracy of posterior estimates and computational efficiency (Marjoram et al., 2003; Sisson et al., 2007; Beaumont et al., 2009; Toni et al., 2009). However, most ABC-based methods are subject to biases from (1) a lack of sufficient statistics and (2) the use of non-zero tolerance. Summarizing data into a set of statistics implies disregarding useful information to some extent. Hence, the posterior distribution given summary statistics is not equal to the posterior distribution given the actual observed data (Marjoram and Tavare, 2006). In addition, another layer of approximation is introduced to reduce computational time by setting up positive values of tolerance. The flexibility and caveats of using ABC have been well summarized in several review articles (Marjoram and Tavare, 2006; Beaumont, 2010; Beaumont et al., 2010; Csillery et al., 2010; Sunnaker et al., 2013).

For the first source of bias, accepted parameters are the samples from an approximation of π(θ|D), unless a set of summary statistics is sufficient for the parameters of interest. Since the likelihood function is unknown in ABC, it is almost impossible to obtain sufficient statistics for genetic data. The availability of informative statistics may be limited to particular models. For instance, the number of segregating sites can be useful for estimating mutation rates but provides little information about demographic history, such as population bottlenecks and expansion (Hey and Machado, 2003). An intuitive way to overcome the lack of sufficient statistics is to increase the number of summary statistics and keep the loss of information to a minimum. The approximation of the posterior distribution given data can be improved by adding summary statistics if the new set of summary statistics refines an existing set of statistics and provides more detailed information about the data (Nakagome et al., 2013a). For example, the site frequency spectrum (SFS) is a set of statistics commonly used to summarize the pattern of genetic variation (Fig. 1C). The SFS consists of the site frequencies ξi, defined as the number of segregating sites at which a mutant nucleotide is present on i sequences in the sample (i = 1, …, n). Seg, π, TD, FLD and FLF can all be written as functions of the SFS: Seg = Σ_{i=1}^{n} ξi; π = [Σ_{i=1}^{n−1} i(n − i)ξi]/[n(n − 1)/2]; TD = (π − Seg/a1)/√Var[π − Seg/a1], where a1 = Σ_{i=1}^{n−1} 1/i and Var[·] denotes the estimated variance; FLD = (Seg/a1 − ((n − 1)/n)ξ1)/√Var[Seg/a1 − ((n − 1)/n)ξ1]; and FLF = (π − ((n − 1)/n)ξ1)/√Var[π − ((n − 1)/n)ξ1].
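For illustration, the first three of these statistics can be computed directly from an unfolded SFS as in the sketch below; the variance normalization of Tajima's D is omitted for brevity, and the SFS is assumed to be supplied as ξ1, …, ξn−1 (the ξn class carries no segregating sites).

```python
import numpy as np

def sfs_summaries(xi):
    """Seg, pi and the (unnormalized) numerator of Tajima's D from an
    unfolded SFS xi = (xi_1, ..., xi_{n-1}) for n sampled sequences."""
    xi = np.asarray(xi, dtype=float)
    n = len(xi) + 1
    i = np.arange(1, n)
    seg = xi.sum()                                      # number of segregating sites
    pi = np.sum(i * (n - i) * xi) / (n * (n - 1) / 2)   # nucleotide diversity
    a1 = np.sum(1.0 / i)
    td_num = pi - seg / a1                              # Tajima's D before scaling
    return seg, pi, td_num

# SFS from Fig. 1C (five sequences): xi_1..xi_4 = {3, 2, 1, 0}
print(sfs_summaries([3, 2, 1, 0]))
```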

Another concern when increasing the number of summary statistics is a reduction in the accuracy and stability of ABC. The acceptance rate drops sharply with high-dimensional summary statistics, which may increase the variance of the posterior because only a small number of parameters are accepted (Beaumont et al., 2002; Nakagome et al., 2013a). The region defined by the tolerance becomes a multi-dimensional sphere, and even when the distance between simulated and observed summary statistics is small enough to fall within the tolerance, most of the simulated statistics lie near the boundary of that sphere. Consequently, it would take an enormous amount of time to find parameters that produce summary statistics truly close to the observation (Hastie et al., 2009). To address this, various methods have been developed for choosing informative statistics from among many candidate summary statistics, depending on the model (Joyce and Marjoram, 2008; Wegmann et al., 2009; Blum and Francois, 2010; Nunes and Balding, 2010; Fearnhead and Prangle, 2012).

For the second source of bias, the choice of a positive tolerance involves a trade-off between the bias and the variance of a posterior estimate. With a sufficiently small tolerance, the resulting posterior distribution should be close to the target posterior distribution, π(θ|s(D)), but as the acceptance rate decreases, the small number of accepted samples increases the variance of the posterior distribution. In contrast, a large tolerance allows prior samples to be accepted easily, which reduces the variance but increases the bias toward the prior. Regression ABC was developed to reduce the variance of the posterior estimates without tightening the acceptance condition. Other studies have addressed the bias introduced by non-zero tolerance by proposing schemes that successively reduce the tolerance using a sequential Monte Carlo algorithm (Sisson et al., 2007; Beaumont et al., 2009; Drovandi and Pettitt, 2011). Given measurement error in the data or modeling error in the statistical hypotheses, the optimal tolerance has been shown to be η ≠ 0 (Fearnhead and Prangle, 2012; Wilkinson, 2013). Noisy ABC has been developed to characterize and compensate for the bias from non-zero tolerance by introducing a specific form of error when sampling parameters from the posterior distribution.

INCORPORATING KERNEL METHODS INTO ABC

To minimize the biases involved in the standard ABCs, more summary statistics can be used without any arbitrary selection of the simulated data. Kernel methods provide systematic ways of analyzing high-dimensional data by mapping the data into a high- or potentially infinite-dimensional space called a reproducing kernel Hilbert space (RKHS) (Hofmann et al., 2008; Fukumizu, 2010). In this space, the similarity of summary statistics is evaluated as an inner product, and each inner product is computed by a kernel function such as the Gaussian RBF kernel, defined as k(si, sj) = exp(−‖si − sj‖^2 / (2σ^2)). The framework of ABC integrated with kernel ridge regression, “kernel ABC,” was originally proposed by Fukumizu and coworkers (Fukumizu et al., 2011, 2013) and applied as an inference method for population genetics (Nakagome et al., 2013a). Kernel ABC extends regression ABC conditioned on the observation in two ways: it (1) stabilizes the computation based on high-dimensional summary statistics and (2) allows a non-linear relationship between summary statistics and parameters. First, varying degrees of correlation among summary statistics are an issue when computing regression coefficients; ridge regression can resolve this issue by adding some degree of bias to the regression estimates. Second, there is usually little prior information about the form of the dependency between summary statistics and parameters; a non-linearized ridge regression, “kernel ridge regression,” evaluates the similarity of summary statistics in the RKHS and estimates the regression coefficients within the ridge regression framework. The algorithm of kernel ABC is as follows:

B1. Generate θ′ from π(·).

B2. Simulate D′ using θ′.

B3. Calculate s(D′), and repeat B1–B3 until n simulations have been generated.

B4. Compute 𝔼 [f(θ)|s(D)] by kernel ridge regression.

In this last step (B4), kernel ABC is used to compute a posterior mean from the sum of weighted parameters, and each weight is calculated based on the similarity between the observed and simulated summary statistics.
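A minimal sketch of this weighting step is shown below, assuming a Gaussian RBF kernel and a ridge regularization term; the bandwidth `sigma` and regularizer `lam` are placeholders that would in practice be tuned (e.g., by cross-validation), and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

def kernel_abc_posterior_mean(theta_sims, s_sims, s_obs, sigma=1.0, lam=1e-3):
    """Posterior mean by kernel ridge regression on the summary statistics.
    theta_sims: (n,) or (n, p) simulated parameters;
    s_sims: (n, m) simulated summaries; s_obs: (m,) observed summaries."""
    n = len(theta_sims)
    # Gaussian RBF Gram matrix among the simulated summary statistics
    sq = np.sum((s_sims[:, None, :] - s_sims[None, :, :]) ** 2, axis=-1)
    G = np.exp(-sq / (2.0 * sigma ** 2))
    # Kernel evaluations between each simulation and the observation
    k_obs = np.exp(-np.sum((s_sims - s_obs) ** 2, axis=1) / (2.0 * sigma ** 2))
    # Kernel ridge regression weights: w = (G + n*lam*I)^{-1} k_obs
    w = np.linalg.solve(G + n * lam * np.eye(n), k_obs)
    # The weighted sum of simulated parameters approximates E[theta | s(D)]
    return w @ theta_sims
```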

There are two main advantages of kernel ABC. First, the posterior mean given the observed summary statistics is estimated without rejection. Hence, the number of simulations required is much smaller (e.g., 10,000 or 20,000) than in the standard ABCs (e.g., one million simulations to accept 100 or 1,000 samples). In addition, kernel ABC is able to accurately estimate posterior means with a small number of simulations, so it requires less computational time than other ABCs (Nakagome et al., 2013a). Second, the estimator of the posterior mean in kernel ABC is a statistically consistent estimator of the posterior mean given the observed summary statistics, 𝔼[f(θ)|s(D)] (Nakagome et al., 2013a). Therefore, kernel ABC avoids the potential bias introduced by the use of a tolerance (see above) and improves the approximation of the true posterior mean 𝔼[f(θ)|D] by using high-dimensional summary statistics.

This idea of kernel methods has been extended to model selection based on an approximated Bayes factor (aBF) (Osada et al., 2013). Model selection with standard ABC requires a long time to accept enough simulations when high-dimensional summary statistics are used. Therefore, Osada et al. (2013) proposed an aBF that takes the ratio of P(s(D)|M1) to P(s(D)|M2), each computed from a kernel density estimate, P̂(s(D)|M) = (1/n) Σ_{i=1}^{n} k(s(Di), s(D)). The approximated marginal likelihood (aML) is interpreted as the average similarity between s(D) and s(Di) under a given model. Therefore, the kernel method likely provides a useful way to improve the framework of Bayesian analysis (i.e., setting up a model, conditioning the model on observed data, and evaluating the fit of the model).
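Under the same assumptions as the previous sketch (Gaussian kernel with a placeholder bandwidth), the aML of a model could be computed roughly as the average kernel similarity between simulated and observed summaries, and the aBF as the ratio of two such values.

```python
import numpy as np

def approx_marginal_likelihood(s_sims, s_obs, sigma=1.0):
    """Approximated marginal likelihood (aML): the average Gaussian-kernel
    similarity between simulated and observed summary statistics."""
    sq = np.sum((np.asarray(s_sims) - np.asarray(s_obs)) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2.0 * sigma ** 2)))

# Approximated Bayes factor for model M1 against model M2:
# abf = (approx_marginal_likelihood(s_sims_m1, s_obs) /
#        approx_marginal_likelihood(s_sims_m2, s_obs))
```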

THE USE OF HIGH-DIMENSIONAL SUMMARY STATISTICS IN KERNEL ABC

A large number of summary statistics can provide detailed information about the data. The site frequency spectrum (SFS) is a refinement of the number of segregating sites (Seg) (Fig. 1C). For example, in Fig. 1, the genealogy has Seg = 6, while SFS = {3,2,1,0,0} records how many mutations occur on the external branches leading directly to each sampled sequence and on the internal branches (i.e., branches shared among some of the sequences). To evaluate how the use of a large number of summary statistics improves the estimation of posterior means, two examples are given in this section, one under a simple and one under a complex demographic model.

The first model is the constant population size model (Fig. 2A). The observed data are simulated under a population size of 10,000 with a mutation rate of 2.5 × 10^−9 (per bp per generation), a sample of 100 sequences, and a 100-kb non-recombining region. The sequence data are summarized into Seg = 49 or SFS = {28,6,4,3,2,1,5}, in which the SFS is divided into seven bins with frequencies of 0–8%, 8–16%, 16–24%, 24–32%, 32–40%, 40–48% and 48–100%. The parameter of interest is the population size (N), and the prior is given as the log-normal distribution (LN) N ~ LN(10000, 10000^2). Under the constant size model, the likelihood function is analytically tractable, and the true posterior mean, 𝔼[N|D], is estimated to be 10,498. However, the posterior estimates based on kernel ABC differ between the high-dimensional (SFS) and low-dimensional (Seg) summary statistics: the posterior mean given the SFS is much closer to the true estimate than the posterior mean given Seg (Nakagome et al., 2013a). Seg is the sum of the SFS (see “POTENTIAL BIASES IN ABC-BASED STATISTICAL INFERENCE”), and the SFS is more likely to correctly capture differences among data generated under different values of N. Therefore, the estimate from the SFS is more accurate than that from Seg. This example highlights how using high-dimensional summary statistics can improve the accuracy of posterior estimates.
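The collapsing of a full SFS into these seven frequency bins can be sketched as follows; this is an illustrative helper with bin edges taken from the text and with boundary handling (right edge inclusive) chosen arbitrarily.

```python
import numpy as np

def bin_sfs(xi, edges=(0.0, 0.08, 0.16, 0.24, 0.32, 0.40, 0.48, 1.0)):
    """Collapse an unfolded SFS xi = (xi_1, ..., xi_{n-1}) into the
    frequency bins 0-8%, 8-16%, ..., 48-100% used in the example."""
    xi = np.asarray(xi, dtype=float)
    n = len(xi) + 1
    freqs = np.arange(1, n) / n            # derived-allele frequency of each class
    bins = np.zeros(len(edges) - 1)
    for f, count in zip(freqs, xi):
        j = np.searchsorted(edges, f, side="left") - 1   # right edge inclusive
        bins[min(j, len(bins) - 1)] += count
    return bins
```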

Fig. 2.

(A) A simple model of constant population size of 10,000 (Nakagome et al., 2013a). The posterior means in the table below (A) are estimated by Bayesian computation with the full data (D) and by kernel ABC with the SFS (sSFS) or Seg (sSeg), respectively. (B) A complex model of population divergence in which an ancestral population of size 100,000 (NANC) split into two populations, each with the same size as NANC, 25,000 generations ago (Osada et al., 2013). The numbers of sequences are 30 and 18 in populations A and B, respectively. The sequence data are summarized into the haplotype frequency spectrum (HFS) as well as the SFS. The bin sizes of the SFS and HFS are defined as 0, 1–4, 5–8, 9–12 and 13–30 for population A and 0, 1–3, 4–6, 7–9 and 10–18 for population B. The posterior means estimated by kernel ABC for the four parameters are shown in the table below (B).

The second example is a complex model that assumes that population A split from population B (Fig. 2B). This model was constructed as an example of macaque demography (Osada et al., 2013) and consists of four parameters: the population sizes (NA, NB and NANC) and the time of population divergence (T). The DNA sequence data include 26 unlinked loci, each 700 bp long, from the two populations (nA = 30 and nB = 18). The observed data are simulated under the following conditions: constant population size (NA = NB = NANC = 100,000), T = 25,000 generations ago, a mutation rate of 7.2 × 10^−9, and a recombination rate of 7.3 × 10^−9. A one-dimensional SFS may reflect the pattern of mutation within each population, while the amount of genetic variation shared between the two populations could also be useful for estimating when the population split occurred. Further, the effects of recombination are incorporated into this model, so haplotype-based information may also be necessary to make inferences about this complex history. Thus, the observed data are summarized into a two-dimensional SFS and a haplotype frequency spectrum (HFS) that represent how many mutations and haplotypes are specific to each population and how many are shared between the populations. The posterior means are almost consistent with the true values, NANC = 102,064, NA = 104,553, NB = 98,028, and T = 25,602, even though the prior means are twice as large as the true values (Osada et al., 2013). The number of summary statistics may need to be increased in line with the complexity of the model, and kernel ABC is powerful enough to handle such high-dimensional statistics and untangle a complex evolutionary history.

APPLICATION OF KERNEL ABC TO THE INFERENCE OF DEMOGRAPHIC HISTORY

One of the main goals in evolutionary biology is to understand the demographic processes that natural populations have experienced over time. The application of kernel ABC has yielded new insights into the evolution of wild animals and human populations. Two such examples include the divergence of brown bears and polar bears (Nakagome et al., 2013c) and the migration of humans into the Ryukyu Islands, which are the southernmost islands of the Japanese archipelago (Sato et al., 2014).

The speciation of polar bears from brown bears is of considerable interest to evolutionary geneticists. Their phylogenetic relationship is discordant between mitochondrial DNA (mtDNA) and autosomal DNA (aDNA). A paraphyletic model in which the polar bear lineage is derived from brown bear diversity is supported by mtDNA sequences (Talbot and Shields, 1996), while a study based on multiple loci in aDNA sequences proposed a monophyletic model in which polar bears split from the ancestral lineage of all brown bears (Hailer et al., 2012). Further, the majority of the discordance is due to stochastic processes in which ancestral polymorphisms and subsequent incomplete lineage sorting resulted in a unique gene genealogy (Nakagome et al., 2013b). These studies highlight the general issues in inferring evolutionary history when different genes produce different trees. This genealogical discordance represents a classic conundrum in ecology and evolutionary biology of whether phylogenetics or population genetics should be applied to DNA sequence data. Nakagome et al. (2013b) employed population genetics approaches based on the coalescent model. The estimates from kernel ABC show a large effective population size of mtDNA in brown bears and highlight significant deviation in the effective population size ratio of mtDNA to aDNA from the theoretically expected ratio (Nakagome et al., 2013c). This suggests that ancestral brown bear populations were structured by behavioral traits including female philopatry and male-biased migration (Fig. 3A). Thus, ancestral polymorphisms and sex-biased migration are likely to contribute to conflicting branching patterns of brown bears and polar bears across aDNA genes and mtDNA.

Fig. 3.

Applications of kernel ABC to the inference of population history. (A) Evolutionary history of brown bears and polar bears (Nakagome et al., 2013c). The initial divergence of the brown bear population occurred 1.37 million years ago (MYA). Geographic lineages of brown bears gradually diverged from the ancestral lineage of the other brown bears and polar bears and expanded into broad geographic regions. Afterwards, the polar bear lineage emerged from the ancestral lineage at 0.31 MYA and differentiated into the arctic region. During this process, male brown bears migrated among the sub-populations (male-biased migration), while female brown bears remained in their sub-populations (female philopatry). Thus, geographic differentiation of brown bear lineages is evident in the mtDNA genealogy, although these patterns were likely homogenized in aDNA genealogies by male-biased migration between geographic regions (Nakagome et al., 2008, 2013c). (B) Demographic models of population divergence between people living on Okinawa and the Miyako Islands (Sato et al., 2014). The two populations are assumed to have split at different times (T) sampled from different priors: T ~ LN(50, 50^2) in the recent model, T ~ LN(100, 100^2) in the intermediate model and T ~ LN(400, 400^2) in the ancient model. Migration is also incorporated into the models. The table presents log10-scaled aBFs calculated as the ratios of the models to the model with the highest aML (“Recent” with migration rate “0”). The models “Recent” with “0” and “Intermediate” with “10^−3” show significantly higher aMLs than the other models. Assuming one generation is approximately 25 years, the divergence of the Okinawa and Miyako populations is likely to have occurred 1,800 or 2,600 years ago, supporting the “replacement” model.

It is also of great interest to understand how human populations have expanded from continents to islands since the human migration out of Africa. Contemporary island populations may be descendants of the ancient people who left their oldest bones on the island (“population continuity”), or they may be recent migrants who came from continents and replaced the ancient people (“replacement”). The peopling of the Ryukyu Islands is one of the models used to address this issue. Sato et al. (2014) found that the people living on the Miyako Islands had genetically diverged from those on Okinawa Island. Meanwhile, 26,000-year-old human remains have been found on the Miyako Islands (Sakura, 1985). If the contemporary people on the Miyako Islands are direct descendants of these ancient people, then the divergence between the Miyako and Okinawa populations should have occurred within the Pleistocene. Sato et al. (2014) proposed three models in which the divergence occurred in recent, intermediate or ancient times (Fig. 3B), and the recent divergence model was significantly more likely than the other models based on the aBF. Incorporating migration between the two populations would be expected to push the divergence time substantially further back. However, although marginal likelihoods were estimated for several models with weak and strong migration, recent divergence with no or weak migration was still strongly supported by model selection (Sato et al., 2014). Assuming that one generation is approximately 25 years, the divergence of the Okinawa and Miyako populations is likely to have happened 1,800 or 2,600 years ago. Therefore, the first inhabitants of the Ryukyu Islands during the Pleistocene were likely replaced by recent migrants.

CONCLUSION

In conclusion, ABC is a flexible and powerful method for Bayesian inference. There is no need to formulate a likelihood function, and ABC can be applied to any complex model. However, caution is required for the reliable use of ABC because of potential biases introduced by summarizing the data into statistics and by the choice of tolerance. Kernel ABC enables high-dimensional summary statistics to be used without any arbitrary selection of the simulated data. The recent development of high-throughput technology has dramatically reduced the cost of collecting large data sets, and population-scale genomic data have become available for both model and non-model organisms. Challenges remain in extracting useful information from these data and analyzing them statistically and quantitatively. ABC is a promising statistical framework, and the application of kernel methods to ABC is a starting point for addressing these questions and for understanding the complex history of demographic processes and of adaptation to local environments.

ACKNOWLEDGMENTS

Thanks to Shuhei Mano and two anonymous reviewers for helpful comments on this manuscript. The author was supported by the Japan Society for the Promotion of Science Postdoctoral Fellowships for Research Abroad (26-541).

REFERENCES
 
© 2015 by The Genetics Society of Japan