Genes & Genetic Systems
Online ISSN : 1880-5779
Print ISSN : 1341-7568
ISSN-L : 1341-7568
Full papers
A new inference method for detecting an ongoing selective sweep
Naoko T. FujitoYoko Satta Toshiyuki HayakawaNaoyuki Takahata
Author information
JOURNAL OPEN ACCESS FULL-TEXT HTML
Supplementary material

2018 Volume 93 Issue 4 Pages 149-161

Details
ABSTRACT

A simple method was developed to detect signatures of ongoing selective sweeps in single nucleotide polymorphism (SNP) data. Based largely on the traditional site frequency spectrum (SFS), the method additionally incorporates linkage disequilibrium (LD) between pairs of SNP sites and uniquely represents both SFS and LD information as hierarchical “barcodes.” This barcode representation allows the identification of a hitchhiking genomic region surrounding a putative target site of positive selection, or a core site. Sweep signals at linked neutral sites are then measured by the proportion (Fc) of derived alleles within the hitchhiking region that are linked in the derived allele group defined at the core site. In measuring Fc or intra-allelic variability in an informative way, certain conditions for derived allele frequencies are required, as illustrated with the human ST8SIA2 locus. Coalescent simulators with and without positive selection are used to assess the false-positive and false-negative rates of the Fc statistic. To demonstrate its power, the method was further applied to the LCT, OCA2, EDAR, SLC24A5 and ASPM loci, which are known to have undergone positive selection in human populations. Overall, the method is powerful and can be used to identify core sites responsible for ongoing selective sweeps.

INTRODUCTION

Population genomics has identified a large number of genetic variants that are associated with either global or local adaptations of species, in particular of anatomically modern humans (e.g., Li et al., 2014; Schrider and Kern, 2017). As reviewed in Lachance and Tishkoff (2013), Vitti et al. (2013) and Fan et al. (2016), there are several approaches for detecting positive Darwinian selection from DNA polymorphisms. One approach is based on the allele frequency distribution or the site frequency spectrum (SFS) and uses either summary statistics (e.g., Watterson, 1975; Tajima, 1989; Fu and Li, 1993; Fay and Wu, 2000; Zeng et al., 2006; Field et al., 2016) or the entire allele frequency distribution (e.g., Sawyer and Hartl, 1992; Bustamante et al., 2001; Nielsen et al., 2005; Gutenkunst et al., 2009; Pavlidis et al., 2013). However, when positive selection is ongoing, the allele frequency at a selected site may not be sufficiently high, which naturally results in reduced statistical power of the SFS-based methods. It is also known that the inference of positive selection made by these SFS-based methods tends to be muddled by the confounding effect of demography if they attempt to detect deviations from the standard neutral model of constant population size. In addition, these methods often assume no recombination within a locus but free recombination between loci or their loose linkage, and are extremely sensitive to these assumptions on recombination as linkage inflates the variance of allele frequencies (Bustamante et al., 2001).

Li (2011) proposed an alternative method that uses the maximum frequency of derived alleles in a sample as an indicator of unbalanced coalescent trees. This method relies on the fact that the probability of unbalanced basal branches is extremely low under neutrality and is independent of temporal changes in population size (see also Yang et al., 2018 for an extension to a varying size population, and Ferretti et al., 2017, who provided a systematic analysis of the impact of the waiting times and the branching order of coalescent events on the SFS). One challenge, however, arises from the fact that the required imbalance must be strong in a large sample. In addition, the power of the method is naturally weakened when a selective sweep is partial and incomplete.

The long-range haplotype (LRH) test is also commonly used for detecting sweep signals (Sabeti et al., 2002; Vitti et al., 2013). This approach relies on undisrupted long-range associations with neighboring polymorphisms, or the relationship between an allele’s frequency at a focal site and the extent of linkage disequilibrium (LD) surrounding it (Sabeti et al., 2002), and may be classified into “haplotype” tests in contrast to SFS tests (Zeng et al., 2006). However, it appears that LRH-based methods tend to miss such signals that are localized in small genomic regions due to nearby recombination hotspots.

In the present paper, we develop a new method that permits the detection of ongoing selective sweeps. Although a sweep can be either “hard” (Maynard Smith and Haigh, 1974; Kaplan et al., 1989) or “soft” (Innan and Kim, 2004; Hermisson and Pennings, 2005; Przeworski et al., 2005), we here presume a soft sweep for generality as it can accommodate the theoretical framework to the case of a single variant on which positive selection starts to act. In particular, we are concerned with the situation under which the size of two basal branches in genealogy may not be particularly distorted, but wherein one descendant allelic lineage has expanded rapidly. We demonstrate by simulation that such lineage-specific expansion is seldom observed under the standard neutral model (a low false-positive rate) as well as even under certain demographic models of neutral variants subjected to population bottleneck and expansion. Although we do not address the reverse problem that positive selection may severely bias estimates of demographic parameters (for which readers may refer to Schrider et al., 2016), we examine whether our method can equally exhibit a low false-negative rate when positive selection in fact operates. For this purpose, we conduct coalescent simulation with selection and further apply our method to well-known cases of selective sweeps.

AN INFERENCE METHOD

We consider S bi-allelic single nucleotide polymorphism (SNP) sites in a sample of n homologous chromosomes randomly taken from a population. For simplicity, these SNP sites are numbered from the left end of a genomic region under study. To summarize such data, it is convenient to define an indicator matrix {χkl}, the (k,l) element of which takes the value of 1 if the k-th SNP site of the l-th chromosome (1 ≤ kS and 1 ≤ ln) possesses a derived allele, otherwise being assigned as 0 (Table 1). The number of derived alleles at the k-th SNP site in a sample is then counted simply as   

n k = l=1 n χ kl . (1)
If nk = 1, the site k is a singleton (or the site has only one derived allele segregating in a sample) and the total number of such sites across the sampled genomic region is denoted by ξ1. Likewise, if nk = 2, the site k is a doubleton (or the site has two segregating derived alleles) and the total number of such sites is denoted by ξ2. Here we follow Fu (1995), who used the symbol ξi (1 ≤ in−1) to express the random number of SNP sites with each exhibiting i derived alleles segregating in a sample: ξi = the number of SNP sites at each of which nk = i for 1 ≤ kS. By definition, the total number of segregating sites is equal to the assumed number of SNP sites and expressed as S= i=1 n-1 ξ i and the total number of derived alleles is expressed as L= i=1 n-1 ξ i where ξ i =i ξ i is the scaled SFS (Fu, 1995; Ronen et al., 2013). Under the standard neutral model of constant effective population size Ne, the mean of ξi is given by E{ξi } = θ/i for all possible values of i (Sawyer and Hartl, 1992; Charlesworth and Charlesworth, 2012) and the mean of ξ i by E{ ξ i }=θ where θ = 4Neu and u is the mutation rate per region per generation. The means of S and L therefore reduce to E{ S }=θ i=1 n-1 1/i=θ a n and E{L} = θ(n−1) (Watterson, 1975; Zeng et al., 2006). Moreover, we have E{Π} = θ for the pairwise mean of nucleotide differences, defined as Π= 2 n( n-1 ) i=1 n-1 i( n-i ) ξ i (Nei and Li, 1979; Tajima, 1989) and E{H} =θ for H= 2 n( n-1 ) i=1 n-1 i 2 ξ i (Fay and Wu, 2000; Ferretti et al., 2017). Notably, these mean values are all independent of recombination rates, although the variances are not.

Table 1. Example of the site frequency spectrum {ξi} = (ξ1 = 2, ξ2 = 3, ξ3 = 2, ξ4 = 2, ξ5 = 0) with the total number of segregating sites S = 9 and the sample size n = 6.
SNP sites
core
123456789
chr 1101101111
chr 2001001111
chr 3000001110
chr 4010110011
chr 5000010001
chr 6000000000
nk112223344
nrk102103332
gk343453345

The core SNP site (r) is arbitrarily set at r = 6 with nr = 3. The values of nk and nrk for 1 ≤ k ≤ 9 determine the barcodes at these nine SNP sites, retaining their spatial positions. The conditions for recombinants are nrk > 0, nk > nrk, nr > nrk and n > gk = nk + nrnrk as in Eq. (3a) in the text.

In a large sample, it is likely that ξi = 0 for many large i values. For this reason, we bin individual SNP sites into the following eight classes, although the number of such frequency classes depends on sample size n. Because of its preponderance, the first class (c1) consists solely of singletons and is expected to have the mean of E{ξ1} = θ (Fu and Li, 1993). The second class (c2) consists of doubletons and tripletons with E{ξ2 + ξ3} = 5θ/6. Likewise, the remaining classes are c3 with E{ i=4 9 ξ i }θ , c4 with E{ i=10 25 ξ i }θ , c5 with E{ i=26 68 ξ i }θ , c6 with E{ i=69 185 ξ i }θ , c7 with E{ i=186 503 ξ i }θ , and c8 with E{ i=504 1007 ξ i }=0.69θ if n = 1008 as for the East Asia meta-population in the 1000 Genomes Project database (1000 Genomes Project Consortium, 2015). However, if n ≥ 1369 and ≥ 3720, we additionally need classes up to c9 and c10, respectively, whereas if n = 68, we need only the first five classes. In any case, we sort each individual SNP site k or SFS into these classes with nearly equal weight. This strategy has been previously utilized, for example, in estimating the distribution of fitness effects for amino acid-altering mutations (Eyre-Walker et al., 2006). In our method, however, this hierarchical binning of SFS plays an essential role in extracting information on “intra-allelic variability” (Slatkin and Rannala, 1997). In this context, it is instructive to recall that the higher the allele frequency, the older the mean age of an allele (Kimura and Ohta, 1973; Slatkin and Rannala, 2000).

In human populations, the expected rough uniformity of SFS over the eight classes is often severely violated. The SFS generally reveals an excess of rare as well as intermediate or high allele frequencies, thereby supporting demographic models of recent population growth and past bottlenecks, in particular in non-African populations (e.g., Schaffner et al., 2005; Gutenkunst et al., 2009; Liu and Fu, 2015; Terhorst et al., 2017). Alternatively, positive selection may leave similar signatures at linked neutral sites (e.g., Kaplan et al., 1989; Braverman et al., 1995; Hermisson and Pennings, 2005; Evans et al., 2006; Ronen et al., 2013; Schrider and Kern, 2017). In the following analysis, we inquire whether and how we can isolate the effect of positive selection from that of demographic causes.

We pay special attention to one particular SNP site, henceforth termed a core site or the r-th SNP site, which might be a target of positive selection. This core site is assumed to be given or of interest, and does not require any a priori knowledge about its evolutionary aspects. Our aim is to examine whether there exists a genomic region hitchhiking and in linkage disequilibrium (LD) with derived alleles at a core site. As a rule, the tighter the linkage, the stronger the hitchhiking effect; also, the broader the genomic region, the more recent the positive selection. There are n r = l=1 n χ rl derived alleles at a core site. Using these nr derived alleles and nnr = l=1 n (1- χ rl ) ancestral alleles at a core site, we divide our sample of n homologous chromosomes into two mutually exclusive groups: derived and ancestral allele groups. The derived allele frequency at the core site is equal to fr = nr/n and the frequency class (cr) to which the core site belongs is also specified by nr. For example, if the site is located at r = 6 as in Table 1, we have n6 = 3, which belongs to c2, and divide the sample into the derived allele group of (1,2,3) chromosomes and the ancestral allele group of (4,5,6) chromosomes. This definition of allele groups and the following procedure can be extended, without any change, to the case where more than one SNP site determines two core haplotypes. Figure 1 shows one such example in which two core haplotypes, CGC vs. non-CGC, are defined at three SNP sites (rs3759916, rs3759915 and rs3759914) that are located in the promoter region of the human sialyltransferase (ST8SIA2) locus (Fujito et al., 2018). This gene is schizophrenia-associated and expressed preferentially in the brain, with the level being largely determined by the promoter SNPs. It is suggested that the expression level is a genetic determinant of schizophrenia risk and a non-risk SNP type (CGC-type) has significantly reduced promoter activity. In Asia and Europe, the major promoter type is not CGC but TCT.

Fig. 1.

Barcodes for the numbers of derived alleles (C6 A and C7 A) and r2 values for the extent of linkage disequilibrium (C6 B and C7 B) in classes c6 and c7 at the ST8SIA2 locus. Magenta bars represent the number (nrk) of derived alleles at the k-th SNP site that are linked in the core CGC haplotype, whereas gray bars represent the number (nknrk) of derived alleles that are linked in the core non-CGC haplotypes, as described in the text. The height of each bar nk in c7 ranges from 186 to 503 and that of each magenta bar ranges from 0 to 349 according to the number of CGC chromosomes in East Asians. The abscissa spans approximately 54 kb and the location of the core CGC sites is indicated by a single asterisk at the bottom (Fujito et al., 2018).

Next, we compute the number (nrk) of derived alleles at the k-th SNP site that simultaneously occur with derived alleles at the core site r. This number becomes non-zero only when both χkl = 1 and χrl = 1 hold true for some chromosome l. The nrk is thus counted as   

n rk = l=1 n χ rl χ kl (2)
and provides a measure of pairwise LD between the core site and any other neighboring SNP site. The nk in Eq. (1) ranges from 1 to n−1 and the nrk in Eq. (2) ranges from 0 to nk or nr, whichever is smaller. In Fig. 1, these numbers are represented as “barcodes” that depict SNP information by two-tone colored bars.

The barcode representation of nk and nrk differs from the traditional SFS in that it preserves spatial information about SNP sites and is stratified in eight layers that are on average in proportion to their ages. Figure 1 depicts derived alleles in c6 and c7 of ST8SIA2. In a central region in c7, 12 equally tall red bars for nrk derived alleles are linked in the derived allele group; these sites may also contain a small number of derived alleles that are linked in the ancestral allele group (gray bars for nknrk derived alleles). These barcodes indicate that the central region is in tight linkage with respect to the core site and the 12 SNP sites show 12 mutations that have accumulated in the common ancestral lineage of nrk chromosomes. Later, it is more rigorously determined that this central LD region is approximately 16 kb long and ranges from 24 kb to 40 kb distant from the left end. When the central region is extended to left and right regions with each being approximately 19 kb long, it immediately becomes apparent that derived alleles in these extended regions tend to be linked with not only derived but also ancestral alleles at the core site. Under the infinite-sites model with recombination (Kimura, 1969), there are only two possibilities that can result in such allele configurations at two sites. One possibility is that derived alleles at the k-th site are of ancient origin and can therefore appear together with the derived alleles at the core site (e.g., k = 8 in Table 1). In this case, we have nrk = nr < nk. The other is that derived alleles at the k-th site experienced recombination with respect to the core site. In this case, we have nrk < nr and nk. More thoroughly, the conditions for the latter case of recombination are that all of l=1 n χ rl χ kl , l=1 n (1- χ rl ) χ kl , l=1 n χ rl (1- χ kl ) and l=1 n (1- χ rl )(1- χ kl ) are strictly positive as in the four-gamete test (Hudson and Kaplan, 1985):   

n rk >0,    n k > n rk ,    n r > n rk       and      n> g k = n k + n r - n rk . (3a)
In the example given in Table 1, only the two sites k = 4 and 9 satisfy these conditions with respect to the core site r = 6. In the barcode representation, the first two inequalities in Eq. (3a) are indicated by a two-tone colored bar at the k-th site, whereas the third is indicated by the difference between two red bars at the core site and the k-th site, and the last is not depicted explicitly but corresponds to the case where nnr ancestral alleles at the core site are greater in number than derived alleles in a gray bar at the k-th site. This allele association between the core site r and the k-th site is commonly measured by the following LD formulae:   
D= f rk - f r f k       or       r 2 = D 2 f r f k ( 1- f r ) ( 1- f k ) (3b)
where frk = nrkn, fk = nkn and fr = nrn as before. The above “r” in r2 (Hill and Robertson, 1968) should not be confused with the same symbol r used for the position of a core site or for the recombination rate given in the Discussion section. The |D| or r2 in Eq. (3b) is useful to grasp an outline of a core region: it constitutes a high LD region that includes a core site and exhibits a low level of derived allele polymorphism. Figure 1 also plots pairwise r2 values in c6 and c7, revealing that the central region exhibits strong LD with the core site and is bounded by abruptly declined LD regions on both sides. Of importance is the examination of LD in individual classes within each of which SNPs are age-related and tend to disclose the linkage relationships. To further delimit the boundaries of the core region, we carried out exhaustive window analyses of Fc and determined the region that showed the minimum value.

In Fig. 2, we draw the entire barcode representation of the same genomic region as in Fig. 1 with the CGC haplotype and the alternative TCT haplotype as a respective core, and the regions on both sides are used for comparison with the core region and the size is taken as being almost the same as the core region. Two features emerge. Signals for tight linkage in the core region are characteristic to the CGC haplotype group, whereas several SNP sites in the TCT haplotype group fall under suspicion of recombination. It is therefore consistent with the view that the extent of polymorphism within the TCT haplotype group is relatively high and the age is relatively old. Conversely, the pattern and extent of derived allele polymorphism in the CGC haplotype group are qualitatively and quantitatively different from those in the TCT haplotype group, suggesting rapid lineage-specific expansion of the former. This difference is captured by contrasting values of Fc, which is defined below.

Fig. 2.

Entire barcode representation of the SNPs in a 54-kb region at the ST8SIA2 locus. The three core SNPs are located in the right middle (an asterisk in each class) and define the CGC and TCT haplotype groups with frequency 349/1008 = 35% and 515/1008 = 51%, respectively, in East Asians (the 1000 Genomes Project database). Note that according to the copy number of derived alleles at a site, the vertical scale differs greatly among different classes: c1 (top) to c8 (bottom).

Provided that we have identified a core region, we are now in a position to quantify such lineage-specific expansion by an appropriate measure. We define Fc as the ratio of the number of derived alleles in the derived allele group to that in the entire sample, where:   

F c = kc n rk kc n k (4)
for nrk and nk at all the SNP sites in certain frequency classes (c). The numerator and denominator are calculated in the same genomic region so that one can act as an internal control of the other. However, it is strongly recommended to exclude not only the own class (cr) to which the core site belongs, but also the higher-than-own classes altogether, because these classes contain overwhelming numbers of derived alleles that accumulated in the common stem lineage of a derived allele (haplotype) group and overshadow useful information regarding the internal structure of descendant lineages. For this reason, we may rewrite Eq. (4) as F<cr to express explicitly that it is computed in all the classes lower than the core allele frequency class. This Fc or F<cr quantifies intra-allelic variability in the sense of Slatkin and Rannala (1997, 2000) without invoking any prior knowledge regarding intra-allelic genealogy. It is shown by simulation that the Fc statistic in Eq. (4) satisfactorily reflects genealogical structure within a derived allele group. It is important to realize that Fcfr if no linkage relationship is expected between the core site and all other sites under consideration.

Table 2 shows the estimates of Fc in each of the eight frequency classes of ST8SIA2 among East Asians in the 1000 Genomes Project database. These F ˆ c are obtained separately in the central core region and, for comparison, in the neighboring region of 38 kb length as well (the rightmost column in Table 2). Here and subsequently, a caret over a statistic indicates an estimate. As the copy number of the CGC haplotype is 349 in the present sample of n = 1008, the core haplotype belongs to c7 so that Eq. (4) is applied to all the classes lower than c7. Our estimate of F ˆ < c 7 = 41 2802 0.015 implies that within c1 through c6, the CGC haplotype group contains only 1.5% of derived alleles relative to the whole in the corresponding frequency classes. Comparison of this estimate with F ˆ < c 7 = 1513 6305 =0.240 in the neighboring region indicates that although the latter is not yet close to the CGC haplotype frequency in East Asians (0.35), it is much higher than that in the core region, suggesting that both sides are apparently in loose linkage with the core haplotype. It is also noted that F ˆ < c 7   ≈ 0.015 in the CGC haplotype group is in sharp contrast to F ˆ < c 8 = 0.349 in the TCT haplotype group (Fig. 2). Although the large F ˆ < c 8 is due partly to the high core frequency f ˆ r = 0.51 of the TCT haplotype, the rest is attributed to a large number of TCT-linked derived alleles in c6 and c7. In contrast, the extremely low F ˆ < c 7   value in the CGC haplotype stems from the virtual absence of CGC-linked derived alleles that belong to c4 to c6 despite abundant non-CGC-linked derived alleles (Table 2). In terms of genealogy, the deficiency of F ˆ c 4 to F ˆ c 6 corresponds to the presence of a large cluster without any solid family relationship among its members after they originated from a common ancestor. We take this unusually unstructured pattern and lack of derived alleles as evidence for rapid expansion of the CGC haplotype lineage. Exceptions are a few SNP sites at which there are one and two CGC-linked derived alleles in total in c5 and c6, respectively (Table 2). As these CGC-linked derived alleles occur together with a large number of non-CGC-linked derived alleles at the same SNP sites, recombination is the most likely cause of these minor associations. Thus, the central core region we have just identified may not be perfectly free from recombination. Such imperfection is practically inescapable in actual data analyses and thus prompts us to be conservative in testing a null hypothesis.

Table 2. Proportions of derived alleles that are linked to the CGC haplotype (nrk) at the human ST8SIA2 locus. Data are taken from East Asians (n = 1008) in the 1000 Genomes Project database. F c = kc n rk / kc n k where c stands for specified classes of allele frequencies.
F ˆ c Entire 54-kb region1Central 16-kb region2Both sides3
F ˆ c 1 61/180=0.3415/51=0.2946/129=0.36
F ˆ c 2 44/154=0.2915/42=0.3629/112=0.26
F ˆ c 3 46/242=0.188/104=0.0838/138=0.28
F ˆ c 4 77/441=0.170/64=0.0077/377=0.20
F ˆ c 5 81/355=0.231/102=0.0180/253=0.32
F ˆ c 6 1245/7735=0.162/2439=0.0011243/5296=0.24
F ˆ c 7 12312/29184=0.424198/5853=0.728114/23331=0.35
F ˆ c 8 7799/23454=0.332541/9091=0.285258/14363=0.37
Total21665/61745=0.356780/17746=0.3814885/43999=0.34
1  The 54-kb region is composed of the central region and both side regions.

2  For the definition of the core (central) region, see the text.

3  Both sides implies the left and right side from the core. The size of each side region is approximately the same as that of the core.

Equally importantly, the observed F ˆ c values in low frequency classes are much higher than those in high frequency classes (Table 2). In fact, we have F ˆ c 1-2 = 30 93 =0.323 in classes c1−2 in contrast to F ˆ c 3-6 = 11 2709 =0.004 in classes c3−6. This increased level of rare variants may stem from the fact that when new mutations begin to accumulate within the derived allele group, the rate of return to equilibrium is faster in lower frequency classes than in higher frequency classes (Satta et al., 2018). Alternatively, the rate of return to equilibrium may suffer from recombination and/or the phasing problem for rare variants unless long-read sequencing is applied (cf. Field et al., 2016). In either case, inclusion of such rare variants inflates the estimate of F ˆ < c 7   to the level of 0.015. Before going further, we simulated a model of selective sweeps and computed F c 1-2 and F c 3-6 separately to evaluate the effect of new mutations. We found that the ratio of F c 1-2 to F c 3-6 seldom exceeds the observed level of about 79 for a wide range of recombination rates (data not shown). Henceforth, it appears that the large value of F ˆ c 1-2 relative to F ˆ c 3-6 can be attributed to inaccurate haplotype phasing. Moreover, it is worth mentioning that under the standard neutral model, the F < c 7 statistic is fairly insensitive to rare variants. For instance, the 1% threshold value (0.0346) of F c 3-6 obtained under neutrality is approximately the same as 0.0384 of F < c 7 . This reflects the fact that under such a null equilibrium model, each frequency class contributes almost equally to F < c 7 . In other words, F c 3-6 and F < c 7 exhibit similar false-positive rates in rejecting neutrality (see below), and the use of F ˆ < c 7   instead of F ˆ c 3-6 may make the test conservative, depending on the quality of sequence data we use.

TESTING NEUTRALITY

To evaluate the false-positive rate (the probability α of a type I error) in our method, we performed ms (Hudson, 2002) and/or fastsimcoal (Excoffier and Foll, 2011) using the standard neutral model of constant Ne together with several demographic models proposed for modern human prehistory (e.g., Schaffner et al., 2005; Gutenkunst et al., 2009; Liu and Fu, 2015; Terhorst et al., 2017). In doing so, we restricted our examination to demographic models inferred for the East Asian or European meta-population and ignored migration with other continental populations. Moreover, these models differ from each other in detail. For instance, the model of Terhorst et al. (2017) is somewhat different from that of Schaffner et al. (2005) in the strength and timing of bottlenecks even if we discretize the population size estimated in the former. Accordingly, here we use two simplified versions of Schaffner et al. (2005) and Terhorst et al. (2017) that are subsequently referred to as SFD and TFD, respectively (Supplementary Fig. S1).

In simulating a core LD region of ST8SIA2, we assume no recombination and set an estimated mutation rate in a core region ( θ ˆ ) or the observed number of segregating sites ( S ˆ ). The choice between θ and S does not matter under the standard neutral model as long as S = θan approximately holds and both parameters are sufficiently large (e.g., S ≥ 100); however, the choice may make a difference under the SFD and TFD models. For these models of changing population size, we use the observed S value to avoid unnecessary complications. First, we examined the F < c 7 statistic under the standard neutral model and carried out 1,000 replications with a specified range of derived allele frequencies at a core site (0.35 ± 0.015). The distribution of F < c 7 becomes unimodal with the mean and standard deviation (SD) of 0.26 ± 0.15 (Fig. 3). The threshold value of F < c 7 for α = 0.01 is 0.038 and the observed value of F ˆ < c 7   ≈ 0.015 is significantly small with α < 0.001. We can thus reject the standard neutral model with a very low false-positive rate.

Fig. 3.

Distribution of F < c 7 (1 ≤ nk ≤ 185) under the standard neutral model, provided that S = 146 and the frequency (fr) of derived alleles at a core site is within 0.35 ± 0.015. With over 1,000 replications, the mean and standard deviation of Fc are 0.255 ± 0.147.

Under the SFD model, we obtained the distribution of F < c 7 with the mean and SD of 0.24 ± 0.18, finding that it becomes somewhat broad and skewed toward 0. Because of this and if the SFD model is treated as a null hypothesis, the false-positive rate increases; however, the threshold value of F < c 7 for α = 0.01 is as large as 0.026 and F ˆ < c 7   ≈ 0.015 is still significantly small (α < 0.002). Likewise, under the TFD model, we obtained the distribution of F < c 7 with the mean and SD of 0.24 ± 0.19 and found that the threshold value for α = 0.01 is 0.017 and that α < 0.004 for F ˆ < c 7   . Hence, changing population size indeed increases the false-positive rate, but the effect on the Fc statistic is surprisingly small. Conversely, if the SFD and TFD models are treated as alternatives, the power of the Fc statistic defined by 1 − β (the probability β of a type II error) is too low to detect the effect in comparison with the standard neutral model. In fact, setting α = 0.05 with the corresponding threshold value of F < c 7 = 0.07, we found that β > 0.8 under the SFD and TFD models. This high probability of a type II error or the low power reflects the broad and largely overlapping distributions of Fc between the standard neutral model of constant size and demographic models of changing population size (Fig. 4). In our subsequent analyses, we thus use the traditional SFD model as a null hypothesis rather than the standard neutral model unless otherwise specified.

Fig. 4.

Distribution of Fc under the Schaffner et al. (2005) fluctuating drift (SFD) model (blue) in comparison with that under the standard neutral model (gray): fr = 0.35, S = 146, no recombination, and n = 1008. The command lines in ms (Hudson, 2002) are: ./ms 1008 4000 -s 160 for the standard neutral model and ./ms 1008 4000 -s 160 -eN 0.001 0.077 -eN 0.004745 0.007 -eN 0.004995 0.077 -eN 0.0084975 0.006 -eN 0.0087475 0.24 -eN 0.0425 0.125 for the SFD model.

UNDER A SELECTIVE SWEEP

To examine the statistical power of Fc when a selective sweep is an alternative, we carried out coalescent simulation with selection at a single site that incorporates a selected allelic lineage into the neutral genealogy background. We used discoal, which generates either deterministic or stochastic trajectories for selective sweeps (Kern and Schrider, 2016); readers may also refer to mbs by Teshima and Innan (2009) and MSMS by Ewing and Hermisson (2010). For a given value of the population-scaled selection intensity Nes, discoal efficiently simulated soft sweeps for a given initial frequency f0 and a given current frequency fr of a selected allele (Supplementary Fig. S2).

To imitate a putative soft sweep at ST8SIA2 (Fujito et al., 2018), we set f0 as 0.008 (the CGC haplotype frequency in Africans) and fr as 0.35 (the CGC haplotype frequency in East Asians). As expected, the distribution of Fc becomes narrow, and for Nes = 200 and θ = 20, it concentrates in a small range of 0.00053−0.21 with the mean and SD of 0.020 ± 0.024 (Fig. 5). The threshold value for α = 0.01 is accordingly as small as 0.001. To evaluate the power, we further examined various values of Nes. When compared with the mean and SD of Fc (0.24 ± 0.18) for Nes = 0, they sharply decrease to 0.077 ± 0.042 for Nes = 50, 0.050 ± 0.028 for Nes = 100, and 0.039 ± 0.029 for Nes = 150 (Supplementary Fig. S3). Figure 6 shows the power of the Fc statistic against Nes for three type I error rates (α = 0.1, 0.05, and 0.01 under the null hypothesis of the standard neutral model). For α = 0.01, the Fc statistic can exhibit the power of 1 − β = 0.64 if Nes is 100. As Nes increases up to 200, the power further increases to > 0.78 by proportionally reducing the type II error rate β. Conversely, if Nes is ≤ 10, the power remains poor and the Fc statistic likely fails to detect such weak selection signals.

Fig. 5.

Distribution of the Fc statistic under a selective sweep model (pink) in comparison with the SFD model (blue). The parameters and conditions for a soft sweep by genic selection are set as follows: n = 1008, 1,000 replications, region size = 16 kb, no recombination, θ = 20, Nes = 200, initial frequency f0 = 0.008 and current frequency fr = 0.35. The command line in discoal (Kern and Schrider, 2016) is: ./discoal 1008 1000 16000 -t 20 -ws 0 -a 400 -x 0.5 -f 0.008 -c 0.35.

Fig. 6.

Power of the Fc statistic (ordinate) against Nes (abscissa) and the three curves for three α values of 0.1 (green), 0.05 (blue) and 0.01 (red). These are evaluated by discoal with the same command line as in Fig. 5. Genic selection follows stochastic frequency trajectories with selection intensity s and begins at initial frequency f0 at time t0 and ends at current frequency fr at time tr. The time required to change the variant frequency from f0 to fr is well approximated by t= t r - t 0 2 s ln ( 1- f 0 ) f r ( 1- f r ) f 0 8.4 s .

We further applied our method to three human genetic loci that are known to have undergone hard selective sweeps. The best known is perhaps the locus for the adaptation of lactase (LCT) persistence that has occurred independently in Africans and Europeans (Tishkoff et al., 2007). Surveying a 1-Mb region surrounding rs4988235 (Supplementary Fig. S4A), we found that although the whole region under the hitchhiking effect may exceed 500 kb, a 50-kb central region is in extremely tight linkage and exhibits marked deficiency of core-linked derived alleles in the European meta-population. Because the current frequency (   f ˆ r ) of the core derived allele is 0.51, the core allele group belongs to c8 and the F ˆ < c 8 within the 50-kb core region is calculated as 0.006. To assess the statistical significance of the F ˆ < c 8 with the   f ˆ r , we simulated the SFD model for the European meta-population and a hard sweep model as an alternative hypothesis. The SFD simulation with S = 263 observed in the core region yielded a threshold value of F < c 8 for α = 0.01 at 0.032, which indicates that the observed value of F ˆ < c 8 = 0.006 is too small to be compatible with the SFD model (α < 0.001). Conversely, the sweep model of genic selection with θ = San = 35 and Nes = 200 yielded β = 0.218 for α = 0.01, suggesting that the hard sweep on LCT is much stronger than expected from Nes = 200 (Fig. 7).

Fig. 7.

Cumulative distribution of the Fc statistic under the neutral SFD model (blue) and the hard sweep model with Nes = 100, 200 and 300 for LCT, OCA2, EDAR and ASPM. Simulation with 1,000 replications is performed for a fixed S value under the neutral SFD model, whereas it is performed for a specified θ value under the sweep model. A red triangle at the bottom of each panel shows an observed Fc value.

As the second example, we studied oculocutaneous albinism type II (OCA2) eye color variation (rs1800414) with   f ˆ r = 0.60 in Asians (e.g., Sturm and Duffy, 2012). This derived allele is also associated with skin pigmentation and acts as a skin-lightening allele under epistatic interactions with the melanocortin-1 receptor gene. As with LCT, we first performed the barcode analysis for an approximately 1-Mb region surrounding rs1800414 (Supplementary Fig. S4B). The core region becomes 49 kb long and yields F ˆ < c 8 = 0.086. We simulated the SFD model with the   f ˆ r and S = 320 for the East Asian meta-population as well as the same sweep model with θ = 43 and Nes = 200. For α = 0.01, the threshold value of F < c 8 is 0.041, which is smaller than the observed value of F ˆ < c 8 = 0.086 (Fig. 7). Conversely, the F ˆ < c 8 becomes too high to reject the SFD model at the 1% level. Thus the statistic on OCA2 is only suggestive of positive selection.

As the third example, we examined the gene encoding ectodysplasin A receptor (EDAR), which is associated with Asian hair thickness (Fujimoto et al., 2008). At the core site 1540 (rs3827760) within the coding region, the Asian-specific derived allele C resulted in a non-synonymous change and increased its frequency to   f ˆ r = 0.85−0.90 in CHB (Han Chinese Beijing) and JPT (Japanese Tokyo) populations, most likely after East Asians diverged from Europeans. The slow rate of LD decay measured by extended haplotype homozygosity (EHH) in Sabeti et al. (2002) and high extents of population differentiation measured by FST both strongly supported an ongoing selective sweep (Fujimoto et al., 2008). Consistent with this, we found a 73-kb core region with F ˆ < c 8 = 0.196 (Supplementary Fig. S4C) and noted that this relatively high value of the F ˆ < c 8 is correlated with the high incidence of the 1540C allele in the Chinese and Japanese populations. The simulation revealed that for α = 0.01, the threshold value of F < c 8 is 0.233 under the SFD model with S = 543 and β = 0.107 under the sweep model with θ = 72 and Nes = 200 (Fig. 7). Thus, the F ˆ < c 8 is significantly small (α = 0.004) and supports the notion that the hard sweep on EDAR is not much weaker than expected from Nes = 200.

DISCUSSION

In our illustrative example, we examined the significance of the Fc statistic under the assumption of a fixed core allele frequency fr = 0.35. However, as with EHH (Sabeti et al., 2002) and other statistics of the LRH (Vitti et al., 2013), Fc also depends noticeably on fr. It is therefore necessary to carry out appropriate simulation studies to evaluate the statistical significance of observed Fc and/or to examine the empirical distribution for a number of its realized values in a genome (see Fujito et al., 2018 for comparison of various methods in their statistical power when applied to ST8SIA2). Nevertheless, to obtain a rough idea, we considered it useful to tabulate the fr dependency of threshold values of Fc and other related statistics under the standard neutral model. In making such a table, we further assumed fairly large values of θ or S to reduce stochastic errors that might be induced by low mutation rates and, as a consequence, become more appropriate for EDAR with a larger core region than that of ST8SIA2. Table 3 shows that the threshold value of Fc is very sensitive to fr while simultaneously illustrating the atypical characteristic of LCT with   f ˆ r = 0.51 and F ˆ < c 8 = 0.006. In general, if fr ≤ 0.6 or fr ≥ 0.7, the mean ( F ¯ c ) of Fc is greater or smaller than the median, and the distribution is positively or negatively skewed accordingly.

Table 3. fr dependency of the threshold values of the Fc statistic for specified type I error rates α and the same dependency of the mean and standard deviation (SD). As a window for a specified value of fr must be set, it is assumed to be f r ± f r ( 1- f r ) /n . Under the assumption of the standard neutral model with n = 1008 and θ = 100 or S = 1000, the Fc is computed for <c6 when fr = 0.1, for <c7 when fr = 0.2 to 0.4 and for <c8 when fr ≥ 0.6. The exception is the case of fr = 0.5, for which the average of Fcs is taken over realized values for <c7 and <c8. For each set of parameters, the number of replications is 1000. The differences between the two tables can likely be attributed to replication errors.
θ = 100
αfr
0.10.20.30.40.50.60.70.80.9
0.010.0090.0170.0370.0640.0740.0960.1470.2450.387
0.050.0160.0280.0560.0970.1160.1510.2190.3160.517
0.10.0210.0370.0730.1210.1430.1810.2630.3820.574
0.50.0540.1010.1680.2580.2790.3450.4440.5770.746
mean0.0720.1270.1970.2830.3000.3450.4400.5620.723
SD0.0620.0960.1200.1450.1360.1220.1320.1300.111
S = 1000
αfr
0.10.20.30.40.50.60.70.80.9
0.010.0100.0170.0390.0710.0820.1090.1250.2220.428
0.050.0170.0290.0620.1010.1170.1590.2060.3270.527
0.10.0230.0390.0750.1220.1430.1930.2540.3860.585
0.50.0530.0990.1720.2620.2800.3430.4380.5780.753
mean0.0680.1270.2070.2900.3000.3510.4360.5660.733
SD0.0530.0970.1320.1460.1320.1210.1360.1330.106

Although the distribution of Fc is generally very broad under neutrality, it may be claimed that F ¯ c < fr can be expected only when recombination is rare or absent, because only particular branches in a coalescent tree can produce a cluster of chromosomes with specified frequency fr (Fu, 1995; Li, 2011; Ferretti et al., 2017; Yang et al., 2018) and mutations in the common stem lineage are not counted in Fc. To illustrate this, let us consider a simple case of n = 4 (Supplementary Fig. S5). For fr = 3/4, we can uniquely determine the tree topology in which only one suitable internal branch satisfies the frequency condition. Assuming the number of nucleotide changes is ideally in proportion to the average coalescent branch length, we can compute F ¯ < c 3 = 0.69 rather than 3/4. Similarly, for fr = 1/2, we obtain F ¯ c 1 = 0.38 rather than 1/2. Hence, F ¯ c is slightly smaller than fr in a coalescent tree. However, it appears that the difference becomes even larger for a large sample size n and for a high current frequency fr (Table 3). Alternatively, if recombination occurs frequently, F ¯ c should approach fr because a core site and any other site tend to be in linkage equilibrium regardless of frequency classes. Simulation confirmed this assertion (Supplementary Fig. S6). Conversely, the observation of a large difference between fr and Fc thus constitutes a first favorable signal of LD being worthy of further scrutiny.

When a derived allele at a core site increases in frequency and is eventually fixed in a population, we can no longer compute the Fc statistic. SLC24A5 is one of two known loci strongly affecting light skin pigmentation in Europeans and shows a fixed derived allele in the CEU population (Utah residents with Northern and Western European ancestry; Olalde et al., 2014). Although the variant is fixed in the population, the hierarchical barcodes of an approximately 20-kb region surrounding rs1426654 still indicate marked deficiency of SNPs in c6 and c7 in addition to a notable recovery of polymorphism in the lower classes (data not shown). As these features are likely footprints of a complete selective sweep, it is both worthwhile and possible to quantify such characteristics by some other measures than Fc. Kimura et al. (2007) considered two measures of the LRH and developed a method for when a selected allele has reached fixation in a local population (see also Maynard Smith and Haigh, 1974; Kaplan et al., 1989; Ronen et al., 2013; Schrider and Kern, 2017). Alternatively, we may apply the Fc statistic in a polymorphic population such as South East Asians (SAS), although this may not work for the case of local adaptation. In SAS, we actually found a 134-kb core region with   f ˆ r = 0.60 and F ˆ < c 8 = 0.136. Although the probability of a type I error (α = 0.03) may not be sufficiently small, this finding suggests that positive selection has acted on SLC24A5 in SAS as well.

In most simulations, we have assumed no recombination within a core region, not addressing the effect of recombination on neighboring regions. Clearly, it is worth demonstrating the effect on both the barcode representation and the Fc statistic. For this purpose, we ran discoal with recombination under θ = 20, Nes = 200 and   f ˆ r = 0.35. Although it is necessary to assume a rather small sample size of n = 200 or less for technical reasons, the c6 barcode thus generated reveals patterns similar to the c7 barcode in Fig. 1 (Supplementary Fig. S7). We also confirmed that the F < c 6 becomes approximately 0.01 in the central region, but increases to 0.34 on both sides. In short, simulation with selection and recombination can reproduce the pattern and level of polymorphism that resemble those observed in some human genomic regions under selective sweeps. More intriguingly, we found that the Fc statistic is little affected even when the recombination rate (r) per region is fairly high (e.g., ρ = 4Ner ≥ 10). This unexpected robustness certainly depends on how to evaluate intra-allelic variability under selective sweeps. As mentioned earlier, the Fc statistic is intended to exclude SNP mutations with nknrk = nr at site k because they have accumulated in the stem lineage of a derived allele group. However, recombination, if it occurs in one or a few lineages in the group, may act so as to incorporate some of those stem-lineage mutations into intra-allelic variability and substantially enhance it: this is because although the number of remaining unrecombined lineages is smaller than nr, they can be still large in size (large nrk). Fortunately, it is likely that SNP sites thus created within the derived allele group remain in the same frequency class as the core site cr. In this case, the Fc statistic in Eq. (4) simply ignores those nrk mutations from intra-allelic variability.

Next, we briefly discuss one issue concerning the distinction between hard sweeps and soft sweeps, the latter being the more dominant mode of adaptation (Schrider and Kern, 2017). As a soft sweep may drive multiple standing variants at linked neighboring sites from the beginning and retain the polymorphism to some extent even if selection has been completed, it is in general more challenging to detect a soft sweep than a hard sweep. Peter et al. (2012) concluded that whereas LCT and EDAR are subjected to hard sweeps, ASPM (abnormal spindle-like microcephaly-associated) (Mekel-Bobrov et al., 2005) is likely affected by a soft sweep. Owing to the similarity to ST8SIA2, we examined a core SNP site in ASPM with   f ˆ r = 0.41 in Europeans and found that there exists a 26-kb core region with F ˆ < c 7 = 0.052. Under the assumption of a hard sweep with θ = 20 and Nes = 200, contrary to Peter et al. (2012), the threshold value of F < c 7 becomes 0.029 for α = 0.01 and the corresponding value of β becomes 0.186 (Fig. 7). These results may be taken as evidence for positive selection on ASPM, though not very strong. However, unlike that for a hard sweep, the power (1 − β) for detecting a soft sweep depends strongly on the initial frequency f0 of a selected allele (Innan and Kim, 2004; Hermisson and Pennings, 2005; Przeworski et al., 2005). For this reason, we examined the distribution of Fc for various values of f0 under the conditions of θ = 20, n = 1008 and Nes = 200. Although the power is still as high as 0.74 when f0 = 0.01, it sharply declines as f0 further increases: 0.26 for f0 = 0.05, 0.12 for f0 = 0.1, and almost 0 for f0 = 0.2, which is nearly halfway to the current frequency fr. Without knowing the initial frequency of a selected allele, as is usually the case, it seems almost impossible to make definite conclusions about the power of a method for detecting soft sweeps (but see Satta et al., 2018 for one possibility). Nevertheless, if ASPM showed more unambiguous non-neutrality (i.e., α < 0.01) concerning intra-allelic variability, it could have also been suggested that the initial frequency of the selected allele was no greater than 1% when a soft sweep began to operate; such a low initial frequency appears consistent with the geographic distribution of the ASPM variant, which is confined to Eurasia.

As a final caveat, whereas we have shown that changing population size does not markedly affect the Fc statistic, the demographic models examined here are not exhaustive. Many other demographic models or events potentially influence the Fc statistic. These include range expansion with so-called “surfing” in a wave front that may mimic a selective sweep and be relevant to the present context (Edmonds et al., 2004; Klopfstein et al., 2006). However, this and other demographic causes such as population structure and admixture are beyond the scope of the present paper.

Source code for Fc statistic

The source code (python) for the Fc statistic is available freely from our web site (https://sites.google.com/site/sattalab/research).

ACKNOWLEDGMENTS

We thank Dr. Andy D. Kern for his kind and prompt response to our inquiry and request to utilize discoal. We also thank the editor and reviewers for their constructive criticisms, and Editage (www.editage.jp) for English language editing on early versions of this manuscript. This work was supported in part by the Japan Society for Promotion of Science (JSPS) grant 16H04821 to Y. S., the JSPS grants 23570271, 25101705 and 16K07535 to T. H., and the Scientific Research on Innovative Areas, a MEXT Grant-in-Aid Project FY2016-2020 to N. T.

REFERENCES
 
© 2018 by The Genetics Society of Japan
feedback
Top