A new inference method for detecting an ongoing selective sweep

A simple method was developed to detect signatures of ongoing selective sweeps in single nucleotide polymorphism (SNP) data. Based largely on the traditional site frequency spectrum (SFS), the method additionally incorporates linkage disequilibrium (LD) between pairs of SNP sites and uniquely represents both SFS and LD information as hierarchical “barcodes.” This barcode representation allows the identification of a hitchhiking genomic region surrounding a putative target site of positive selection, or a core site. Sweep signals at linked neutral sites are then measured by the proportion (Fc) of derived alleles within the hitchhiking region that are linked in the derived allele group defined at the core site. In measuring Fc or intra-allelic variability in an informative way, certain conditions for derived allele frequencies are required, as illustrated with the human ST8SIA2 locus. Coalescent simulators with and without positive selection are used to assess the false-positive and false-negative rates of the Fc statistic. To demonstrate its power, the method was further applied to the LCT, OCA2, EDAR, SLC24A5 and ASPM loci, which are known to have undergone positive selection in human populations. Overall, the method is powerful and can be used to identify core sites responsible for ongoing selective sweeps.


INTRODUCTION
Population genomics has identified a large number of genetic variants that are associated with either global or local adaptations of species, in particular of anatomically modern humans (e.g., Li et al., 2014;Schrider and Kern, 2017).As reviewed in Lachance and Tishkoff (2013), Vitti et al. (2013) and Fan et al. (2016), there are several approaches for detecting positive Darwinian selection from DNA polymorphisms.One approach is based on the allele frequency distribution or the site frequency spectrum (SFS) and uses either summary statistics (e.g., Watterson, 1975;Tajima, 1989;Fu and Li, 1993;Fay and Wu, 2000;Zeng et al., 2006;Field et al., 2016) or the entire allele frequency distribution (e.g., Sawyer and Hartl, 1992;Bustamante et al., 2001;Nielsen et al., 2005;Gutenkunst et al., 2009;Pavlidis et al., 2013).However, when positive selection is ongoing, the allele frequency at a selected site may not be sufficiently high, which naturally results in reduced statistical power of the SFS-based methods.It is also known that the inference of positive selection made by these SFS-based methods tends to be muddled by the confounding effect of demography if they attempt to detect deviations from the standard neutral model of constant population size.In addition, these methods often assume no recombination within a locus but free recombination between loci or their loose linkage, and are extremely sensitive to these assumptions on recombination as linkage inflates the variance of allele frequencies (Bustamante et al., 2001).Li (2011) proposed an alternative method that uses the maximum frequency of derived alleles in a sample as an indicator of unbalanced coalescent trees.This method relies on the fact that the probability of unbalanced basal branches is extremely low under neutrality and is independent of temporal changes in population size (see also Yang et al., 2018 for an extension to a varying size population, and Ferretti et al., 2017, who provided a systematic analysis of the impact of the waiting times and the branching order of coalescent events on the SFS).One challenge, however, arises from the fact that the required imbalance must be strong in a large sample.In addition, the power of the method is naturally weakened when a selective sweep is partial and incomplete.
The long-range haplotype (LRH) test is also commonly used for detecting sweep signals (Sabeti et al., 2002;Vitti et al., 2013).This approach relies on undisrupted longrange associations with neighboring polymorphisms, or the relationship between an allele's frequency at a focal site and the extent of linkage disequilibrium (LD) surrounding it (Sabeti et al., 2002), and may be classified into "haplotype" tests in contrast to SFS tests (Zeng et al., 2006).However, it appears that LRH-based methods tend to miss such signals that are localized in small genomic regions due to nearby recombination hotspots.
In the present paper, we develop a new method that permits the detection of ongoing selective sweeps.Although a sweep can be either "hard" (Maynard Smith and Haigh, 1974;Kaplan et al., 1989) or "soft" (Innan and Kim, 2004;Hermisson and Pennings, 2005;Przeworski et al., 2005), we here presume a soft sweep for generality as it can accommodate the theoretical framework to the case of a single variant on which positive selection starts to act.In particular, we are concerned with the situation under which the size of two basal branches in genealogy may not be particularly distorted, but wherein one descendant allelic lineage has expanded rapidly.We demonstrate by simulation that such lineage-specific expansion is seldom observed under the standard neutral model (a low falsepositive rate) as well as even under certain demographic models of neutral variants subjected to population bottleneck and expansion.Although we do not address the reverse problem that positive selection may severely bias estimates of demographic parameters (for which readers may refer to Schrider et al., 2016), we examine whether our method can equally exhibit a low false-negative rate when positive selection in fact operates.For this purpose, we conduct coalescent simulation with selection and further apply our method to well-known cases of selective sweeps.

AN INFERENCE METHOD
We consider S bi-allelic single nucleotide polymorphism (SNP) sites in a sample of n homologous chromosomes randomly taken from a population.For simplicity, these SNP sites are numbered from the left end of a genomic region under study.To summarize such data, it is convenient to define an indicator matrix {χ kl }, the (k,l) element of which takes the value of 1 if the k-th SNP site of the l-th chromosome (1 ≤ k ≤ S and 1 ≤ l ≤ n) possesses a derived allele, otherwise being assigned as 0 (Table 1).The number of derived alleles at the k-th SNP site in a sample is then counted simply as (1) If n k = 1, the site k is a singleton (or the site has only one derived allele segregating in a sample) and the total number of such sites across the sampled genomic region is denoted by ξ 1 .Likewise, if n k = 2, the site k is a doubleton (or the site has two segregating derived alleles) and the total number of such sites is denoted by ξ 2 .Here we follow Fu (1995), who used the symbol ξ i (1 ≤ i ≤ n − 1) to express the random number of SNP sites with each exhibiting i derived alleles segregating in a sample: ξ i = the number of SNP sites at each of which n k = i for 1 ≤ k ≤ S. By definition, the total number of segregating sites is equal to the assumed number of SNP sites and expressed as and the total number of derived alleles is expressed as i is the scaled SFS (Fu, 1995;Ronen et al., 2013).Under the standard neutral model of constant effective population size N e , the mean of ξ i is given by E{ξ i } = θ /i for all possible values of i (Sawyer and Hartl, 1992;Charlesworth and Charlesworth, 2012) and the mean of c where θ = 4N e u and u is the mutation rate per region per generation.The means of S and L therefore reduce to and E{L} = θ (n − 1) (Watterson, 1975;Zeng et al., 2006).Moreover, we have E 3 ^` T for the pairwise mean of nucleotide differences, defined Nei and Li, 1979;Tajima, 1989) and Fay and Wu, Table 1.Example of the site frequency spectrum {ξ i } = (ξ 1 = 2, ξ 2 = 3, ξ 3 = 2, ξ 4 = 2, ξ 5 = 0) with the total number of segregating sites S = 9 and the sample size n = 6.
SNP sites core The core SNP site (r) is arbitrarily set at r = 6 with n r = 3.The values of n k and n rk for 1 ≤ k ≤ 9 determine the barcodes at these nine SNP sites, retaining their spatial positions.The conditions for recombinants are n rk > 0, n k > n rk , n r > n rk and n > g k = n k + n r -n rk as in Eq. (3a) in the text.Detecting ongoing selective sweeps 2000; Ferretti et al., 2017).Notably, these mean values are all independent of recombination rates, although the variances are not.In a large sample, it is likely that ξ i = 0 for many large i values.For this reason, we bin individual SNP sites into the following eight classes, although the number of such frequency classes depends on sample size n.Because of its preponderance, the first class (c 1 ) consists solely of singletons and is expected to have the mean of E{ξ 1 } = θ (Fu and Li, 1993).The second class (c 2 ) consists of doubletons and tripletons with E{ξ 2 + ξ 3 } = 5θ/6.Likewise, the remaining classes are c 3 with  -Walker et al., 2006).In our method, however, this hierarchical binning of SFS plays an essential role in extracting information on "intra-allelic variability" (Slatkin and Rannala, 1997).In this context, it is instructive to recall that the higher the allele frequency, the older the mean age of an allele (Kimura and Ohta, 1973;Slatkin and Rannala, 2000).
In human populations, the expected rough uniformity of SFS over the eight classes is often severely violated.The SFS generally reveals an excess of rare as well as intermediate or high allele frequencies, thereby supporting demographic models of recent population growth and past bottlenecks, in particular in non-African populations (e.g., Schaffner et al., 2005;Gutenkunst et al., 2009;Liu and Fu, 2015;Terhorst et al., 2017).Alternatively, positive selection may leave similar signatures at linked neutral sites (e.g., Kaplan et al., 1989;Braverman et al., 1995;Hermisson and Pennings, 2005;Evans et al., 2006;Ronen et al., 2013;Schrider and Kern, 2017).In the following analysis, we inquire whether and how we can isolate the effect of positive selection from that of demographic causes.
We pay special attention to one particular SNP site, henceforth termed a core site or the r-th SNP site, which might be a target of positive selection.This core site is assumed to be given or of interest, and does not require any a priori knowledge about its evolutionary aspects.Our aim is to examine whether there exists a genomic region hitchhiking and in linkage disequilibrium (LD) with derived alleles at a core site.As a rule, the tighter the linkage, the stronger the hitchhiking effect; also, the broader the genomic region, the more recent the positive selection.There are n r r l l n ¦ F 1 derived alleles at a core site.Using these n r derived alleles and ancestral alleles at a core site, we divide our sample of n homologous chromosomes into two mutually exclusive groups: derived and ancestral allele groups.The derived allele frequency at the core site is equal to f r = n r /n and the frequency class (c r ) to which the core site belongs is also specified by n r .For example, if the site is located at r = 6 as in Table 1, we have n 6 = 3, which belongs to c 2 , and divide the sample into the derived allele group of (1,2,3) chromosomes and the ancestral allele group of (4,5,6) chromosomes.This definition of allele groups and the following procedure can be extended, without any change, to the case where more than one SNP site determines two core haplotypes.Figure 1 shows one such example in which two core haplotypes, CGC vs. non-CGC, are defined at three SNP sites (rs3759916, rs3759915 and rs3759914) that are located in the promoter region of the human sialyltransferase (ST8SIA2) locus (Fujito et al., 2018).This gene is schizophrenia-associated and expressed preferentially in the brain, with the level being largely determined by the promoter SNPs.It is suggested that the expression level is a genetic determinant of schizophrenia risk and a non-risk SNP type (CGC-type) has significantly reduced promoter activity.In Asia and Europe, the major promoter type is not CGC but TCT.
Next, we compute the number (n rk ) of derived alleles at the k-th SNP site that simultaneously occur with derived alleles at the core site r.This number becomes non-zero only when both χ kl = 1 and χ rl = 1 hold true for some chromosome l.The n rk is thus counted as and provides a measure of pairwise LD between the core site and any other neighboring SNP site.The n k in Eq.
(1) ranges from 1 to n − 1 and the n rk in Eq. ( 2) ranges from 0 to n k or n r , whichever is smaller.In Fig. 1, these numbers are represented as "barcodes" that depict SNP information by two-tone colored bars.
The barcode representation of n k and n rk differs from the traditional SFS in that it preserves spatial information about SNP sites and is stratified in eight layers that are on average in proportion to their ages.Figure 1 depicts derived alleles in c 6 and c 7 of ST8SIA2.In a central region in c 7 , 12 equally tall red bars for n rk derived alleles are linked in the derived allele group; these sites may also contain a small number of derived alleles that are linked in the ancestral allele group (gray bars for n k − n rk derived alleles).These barcodes indicate that the central region is in tight linkage with respect to the core site and the 12 SNP sites show 12 mutations that have accu-mulated in the common ancestral lineage of n rk chromosomes.Later, it is more rigorously determined that this central LD region is approximately 16 kb long and ranges from 24 kb to 40 kb distant from the left end.When the central region is extended to left and right regions with each being approximately 19 kb long, it immediately becomes apparent that derived alleles in these extended regions tend to be linked with not only derived but also ancestral alleles at the core site.Under the infinite-sites model with recombination (Kimura, 1969), there are only two possibilities that can result in such allele configurations at two sites.One possibility is that derived alleles at the k-th site are of ancient origin and can therefore appear together with the derived alleles at the core site (e.g., k = 8 in Table 1).In this case, we have n rk = n r < n k .The other is that derived alleles at the k-th site experienced recombination with respect to the core site.In this case, we have n rk < n r and n k .More thoroughly, the conditions for the latter case of recombination are that all of are strictly positive as in the fourgamete test (Hudson and Kaplan, 1985): (3a) In the example given in Table 1, only the two sites k = 4 and 9 satisfy these conditions with respect to the core site r = 6.In the barcode representation, the first two inequalities in Eq. (3a) are indicated by a two-tone colored bar at the k-th site, whereas the third is indicated by the difference between two red bars at the core site and the k-th site, and the last is not depicted explicitly but corresponds to the case where n − n r ancestral alleles at the core site are greater in number than derived alleles in a gray bar at the k-th site.This allele association between the core site r and the k-th site is commonly measured by the following LD formulae: where f rk = n rk ⁄n, f k = n k ⁄n and f r = n r ⁄n as before.The above "r" in r 2 (Hill and Robertson, 1968) should not be confused with the same symbol r used for the position of a core site or for the recombination rate given in the Discussion section.The |D| or r 2 in Eq. ( 3b) is useful to grasp an outline of a core region: it constitutes a high LD region that includes a core site and exhibits a low level of derived allele polymorphism.Figure 1 also  (Fujito et al., 2018).Detecting ongoing selective sweeps plots pairwise r 2 values in c 6 and c 7 , revealing that the central region exhibits strong LD with the core site and is bounded by abruptly declined LD regions on both sides.Of importance is the examination of LD in individual classes within each of which SNPs are age-related and tend to disclose the linkage relationships.To further delimit the boundaries of the core region, we carried out exhaustive window analyses of F c and determined the region that showed the minimum value.In Fig. 2, we draw the entire barcode representation of the same genomic region as in Fig. 1 with the CGC haplotype and the alternative TCT haplotype as a respective core, and the regions on both sides are used for comparison with the core region and the size is taken as being almost the same as the core region.Two features emerge.Signals for tight linkage in the core region are characteristic to the CGC haplotype group, whereas several SNP sites in the TCT haplotype group fall under suspicion of recombination.It is therefore consistent with the view that the extent of polymorphism within the TCT haplotype group is relatively high and the age is relatively old.Conversely, the pattern and extent of derived allele polymorphism in the CGC haplotype group are qualitatively and quantitatively different from those in the TCT haplotype group, suggesting rapid lineagespecific expansion of the former.This difference is captured by contrasting values of F c , which is defined below.
Provided that we have identified a core region, we are now in a position to quantify such lineage-specific expansion by an appropriate measure.We define F c as the ratio of the number of derived alleles in the derived allele group to that in the entire sample, where: for n rk and n k at all the SNP sites in certain frequency classes (c).The numerator and denominator are calcu- lated in the same genomic region so that one can act as an internal control of the other.However, it is strongly recommended to exclude not only the own class (c r ) to which the core site belongs, but also the higher-than-own classes altogether, because these classes contain overwhelming numbers of derived alleles that accumulated in the common stem lineage of a derived allele (haplotype) group and overshadow useful information regarding the internal structure of descendant lineages.For this reason, we may rewrite Eq. ( 4) as F c r

<
to express explicitly that it is computed in all the classes lower than the core allele frequency class.This F c or F c r < quantifies intraallelic variability in the sense of Slatkin andRannala (1997, 2000) without invoking any prior knowledge regarding intra-allelic genealogy.It is shown by simulation that the F c statistic in Eq. ( 4) satisfactorily reflects genealogical structure within a derived allele group.It is important to realize that F c ≈ f r if no linkage relationship is expected between the core site and all other sites under consideration.
Table 2 shows the estimates of F c in each of the eight frequency classes of ST8SIA2 among East Asians in the 1000 Genomes Project database.These Fc are obtained separately in the central core region and, for comparison, in the neighboring region of 38 kb length as well (the rightmost column in Table 2).Here and subsequently, a caret over a statistic indicates an estimate.As the copy number of the CGC haplotype is 349 in the present sample of n = 1008, the core haplotype belongs to c 7 so that Eq. ( 4) is applied to all the classes lower than c 7 .≈ 0.015 in the CGC haplotype group is in sharp contrast to F c < 8 = 0.349 in the TCT haplotype group (Fig. 2).Although the large F c < 8 is due partly to the high core frequency fr = 0.51 of the TCT haplotype, the rest is attributed to a large number of TCT-linked derived alleles in c 6 and c 7 .In contrast, the extremely low F c < 7 value in the CGC haplotype stems from the virtual absence of CGC-linked derived alleles that belong to c 4 to c 6 despite abundant non-CGClinked derived alleles (Table 2).In terms of genealogy, the deficiency of Fc 4 to Fc 6 corresponds to the presence of a large cluster without any solid family relationship among its members after they originated from a common ancestor.We take this unusually unstructured pattern and lack of derived alleles as evidence for rapid expansion of the CGC haplotype lineage.Exceptions are a few SNP sites at which there are one and two CGC-linked derived alleles in total in c 5 and c 6 , respectively (Table 2).As these CGC-linked derived alleles occur together with a large number of non-CGC-linked derived alleles at the same SNP sites, recombination is the most likely cause of these minor associations.Thus, the central core region we have just identified may not be perfectly free from recombination.Such imperfection is practically inescapable in actual data analyses and thus prompts us to be conservative in testing a null hypothesis.
Equally importantly, the observed Fc values in low frequency classes are much higher than those in high frequency classes (Table 2).In fact, we have ˆ.
F c 1 2 30 93 0 323 in classes c 1 − 2 in contrast to ˆ. F c 3 6 11 2709 0 004 in classes c 3 − 6 .This increased level of rare variants may stem from the fact that when new mutations begin to accumulate within the derived allele group, the rate of return to equilibrium is faster in lower frequency classes than in higher frequency classes (Satta et al., 2018).Alternatively, the rate of return to equilibrium may suffer from recombination and/or the phasing problem for rare variants unless long-read sequencing is applied (cf.Field et al., 2016).In either case, inclusion of such rare variants inflates the estimate of F c < 7 to the level of 0.015.Before going further, separately to evaluate the effect of new mutations.We found that the ratio of F c 1 2 − to F c 3 6 − seldom exceeds the observed level of about 79 for a wide range of recombination rates (data not shown).Henceforth, it appears that the large value of Fc 1 2 − relative to Fc 3 6 − can be attributed to inaccurate haplotype phasing.Moreover, it is worth mentioning that under the standard neutral model, the F c < 7 statistic is fairly insensitive to rare variants.For instance, the 1% threshold value (0.0346) of

TESTING NEUTRALITY
To evaluate the false-positive rate (the probability α of a type I error) in our method, we performed ms (Hudson, 2002) and/or fastsimcoal (Excoffier and Foll, 2011) using the standard neutral model of constant N e together with several demographic models proposed for modern human prehistory (e.g., Schaffner et al., 2005;Gutenkunst et al., 2009;Liu and Fu, 2015;Terhorst et al., 2017).In doing so, we restricted our examination to demographic models inferred for the East Asian or European metapopulation and ignored migration with other continental populations.Moreover, these models differ from each other in detail.For instance, the model of Terhorst et al. (2017) is somewhat different from that of Schaffner et al. (2005) in the strength and timing of bottlenecks even if we discretize the population size estimated in the former.Accordingly, here we use two simplified versions of Schaffner et al. (2005) and Terhorst et al. (2017) that are subsequently referred to as SFD and TFD, respectively (Supplementary Fig. S1).
In simulating a core LD region of ST8SIA2, we assume no recombination and set an estimated mutation rate in a core region ( θ ) or the observed number of segregating sites ( Ŝ ).The choice between θ and S does not matter under the standard neutral model as long as S = θ a n approximately holds and both parameters are sufficiently large (e.g., S ≥ 100); however, the choice may make a difference under the SFD and TFD models.For these models of changing population size, we use the observed S value to avoid unnecessary complications.First, we examined the F c < 7 statistic under the standard neutral model and carried out 1,000 replications with a specified range of derived allele frequencies at a core site (0.35 ± 0.015).The distribution of F c < 7 becomes unimodal with the mean and standard deviation (SD) of 0.26 ± 0.15 (Fig. 3).The threshold value of F c < 7 for α = 0.01 is 0.038 and the observed value of F c < 7 ≈ 0.015 is significantly small with α < 0.001.We can thus reject the standard neutral model with a very low false-positive rate.
Under the SFD model, we obtained the distribution of F c < 7 with the mean and SD of 0.24 ± 0.18, finding that it becomes somewhat broad and skewed toward 0. Because of this and if the SFD model is treated as a null hypothesis, the false-positive rate increases; however, the threshold value of F c < 7 for α = 0.01 is as large as 0.026 and F c < 7 ≈ 0.015 is still significantly small (α < 0.002).Likewise, under the TFD model, we obtained the distribution of F c < 7 with the mean and SD of 0.24 ± 0.19 and found that the threshold value for α = 0.01 is 0.017 and that α < 0.004 for F c < 7 .Hence, changing population size indeed increases the false-positive rate, but the effect on the F c statistic is surprisingly small.Conversely, if the SFD and TFD models are treated as alternatives, the power of the F c statistic defined by 1 − β (the probability β of a type II error) is too low to detect the effect in comparison with the standard neutral model.In fact, setting α = 0.05 with the corresponding threshold value of F c < 7 = 0.07, we found that β > 0.8 under the SFD and TFD models.This high probability of a type II error or the low power reflects the broad and largely overlapping distributions of F c between the standard neutral model of constant size and demographic models of changing population size (Fig. 4).In our subsequent analyses, we thus use the traditional SFD model as a null hypothesis rather than the standard neutral model unless otherwise specified.

UNDER A SELECTIVE SWEEP
To examine the statistical power of F c when a selective sweep is an alternative, we carried out coalescent simulation with selection at a single site that incorporates a selected allelic lineage into the neutral genealogy background.We used discoal, which generates either deterministic or stochastic trajectories for selective sweeps (Kern and Schrider, 2016); readers may also refer to mbs by Teshima and Innan (2009) and MSMS by Ewing and Hermisson (2010).For a given value of the populationscaled selection intensity N e s, discoal efficiently simulated soft sweeps for a given initial frequency f 0 and a given current frequency f r of a selected allele (Supplementary Fig. S2).
To imitate a putative soft sweep at ST8SIA2 (Fujito et al., 2018), we set f 0 as 0.008 (the CGC haplotype frequency in Africans) and f r as 0.35 (the CGC haplotype frequency in East Asians).As expected, the distribution of F c becomes narrow, and for N e s = 200 and θ = 20, it concentrates in a small range of 0.00053 − 0.21 with the mean and SD of 0.020 ± 0.024 (Fig. 5).The threshold value for α = 0.01 is accordingly as small as 0.001.To evaluate the power, we further examined various values of N e s.When compared with the mean and SD of F c (0.24 ± 0.18) for N e s = 0, they sharply decrease to 0.077 ± 0.042 for N e s = 50, 0.050 ± 0.028 for N e s = 100, and 0.039 ± 0.029 for N e s = 150 (Supplementary Fig. S3). Figure 6 shows the power of the F c statistic against N e s for three type I error rates (α = 0.1, 0.05, and 0.01 under the null hypothesis of the standard neutral model).For α = 0.01, the F c statistic can exhibit the power of 1 − β = 0.64 if N e s is 100.As N e s increases up to 200, the power fur-ther increases to > 0.78 by proportionally reducing the type II error rate β.Conversely, if N e s is ≤ 10, the power remains poor and the F c statistic likely fails to detect such weak selection signals.
We further applied our method to three human genetic loci that are known to have undergone hard selective sweeps.The best known is perhaps the locus for the adaptation of lactase (LCT) persistence that has occurred independently in Africans and Europeans (Tishkoff et al., 2007).Surveying a 1-Mb region surrounding rs4988235 (Supplementary Fig. S4A), we found that although the whole region under the hitchhiking effect may exceed 500 kb, a 50-kb central region is in extremely tight linkage and exhibits marked deficiency of core-linked derived alleles in the European meta-population.Because the current frequency ( f r ˆ) of the core derived allele is 0.51, the core allele group belongs to c 8 and the F c < 8 within the 50-kb core region is calculated as 0.006.To assess the statistical significance of the F c < 8 with the f r ˆ, we simulated the SFD model for the European meta-population and a hard sweep model as an alternative hypothesis.The SFD simulation with S = 263 observed in the core region yielded a threshold value of F c < 8 for α = 0.01 at 0.032, which indicates that the observed value of F c < 8 = 0.006 is too small to be compatible with the SFD model (α < 0.001).Conversely, the sweep model of genic selection with θ = S⁄a n = 35 and N e s = 200 yielded β = 0.218 for α = 0.01, suggesting that the hard sweep on LCT is much stronger than expected from N e s = 200 (Fig. 7).
As the third example, we examined the gene encoding ectodysplasin A receptor (EDAR), which is associated with Asian hair thickness (Fujimoto et al., 2008).At the core site 1540 (rs3827760) within the coding region, the Asianspecific derived allele C resulted in a non-synonymous change and increased its frequency to f r ˆ = 0.85 − 0.90 in CHB (Han Chinese Beijing) and JPT (Japanese Tokyo) populations, most likely after East Asians diverged from Europeans.The slow rate of LD decay measured by extended haplotype homozygosity (EHH) in Sabeti et al. (2002) and high extents of population differentiation measured by F ST both strongly supported an ongoing selective sweep (Fujimoto et al., 2008).Consistent with this, we found a 73-kb core region with F c < 8 = 0.196 (Supplementary Fig. S4C) and noted that this relatively high value of the F c < 8 is correlated with the high incidence of the 1540C allele in the Chinese and Japanese populations.The simulation revealed that for α = 0.01, the threshold value of F c < 8 is 0.233 under the SFD model with S = 543 and β = 0.107 under the sweep model with θ = 72 and N e s = 200 (Fig. 7).Thus, the F c < 8 is significantly small (α = 0.004) and supports the notion that the hard sweep on EDAR is not much weaker than expected from N e s = 200.tion and acts as a skin-lightening allele under epistatic interactions with the melanocortin-1 receptor gene.As with LCT, we first performed the barcode analysis for an approximately 1-Mb region surrounding rs1800414 (Supplementary Fig. S4B).The core region becomes 49 kb long and yields F c < 8 = 0.086.We simulated the SFD model with the f r ˆ and S = 320 for the East Asian metapopulation as well as the same sweep model with θ =

DISCUSSION
In our illustrative example, we examined the significance of the F c statistic under the assumption of a fixed core allele frequency f r = 0.35.However, as with EHH (Sabeti et al., 2002) and other statistics of the LRH (Vitti et al., 2013), F c also depends noticeably on f r .It is therefore necessary to carry out appropriate simulation studies to evaluate the statistical significance of observed F c and/ or to examine the empirical distribution for a number of its realized values in a genome (see Fujito et al., 2018 for comparison of various methods in their statistical power when applied to ST8SIA2).Nevertheless, to obtain a rough idea, we considered it useful to tabulate the f r dependency of threshold values of F c and other related statistics under the standard neutral model.In making such a table, we further assumed fairly large values of θ or S to reduce stochastic errors that might be induced by low mutation rates and, as a consequence, become more appropriate for EDAR with a larger core region than that of ST8SIA2.Table 3 shows that the threshold value of F c is very sensitive to f r while simultaneously illustrating the atypical characteristic of LCT with f r ˆ = 0.51 and F c < 8 = 0.006.In general, if f r ≤ 0.6 or f r ≥ 0.7, the mean (F c ) of F c is greater or smaller than the median, and the distribution is positively or negatively skewed accordingly.
Although the distribution of F c is generally very broad under neutrality, it may be claimed that F c < f r can be expected only when recombination is rare or absent, because only particular branches in a coalescent tree can produce a cluster of chromosomes with specified frequency f r (Fu, 1995;Li, 2011;Ferretti et al., 2017;Yang et al., 2018) and mutations in the common stem lineage are not counted in F c .To illustrate this, let us consider a simple case of n = 4 (Supplementary Fig. S5).For f r = 3/4, we can uniquely determine the tree topology in which only one suitable internal branch satisfies the frequency condition.Assuming the number of nucleotide changes is ideally in proportion to the average coalescent branch length, we can compute F c < 3 = 0.69 rather than 3/4.Similarly, for f r = 1/2, we obtain F c 1 = 0.38 rather than 1/2.Hence, F c is slightly smaller than f r in a coalescent tree.However, it appears that the difference becomes even larger for a large sample size n and for a high current frequency f r (Table 3).Alternatively, if recombination occurs frequently, F c should approach f r because a core site and any other site tend to be in linkage equilibrium regardless of frequency classes.Simulation confirmed this assertion (Supplementary Fig. S6).Conversely, the observation of a large difference between f r and F c thus constitutes a first favorable signal of LD being worthy of further scrutiny.
When a derived allele at a core site increases in frequency and is eventually fixed in a population, we can no longer compute the F c statistic.SLC24A5 is one of two known loci strongly affecting light skin pigmentation in Europeans and shows a fixed derived allele in the CEU population (Utah residents with Northern and Western European ancestry; Olalde et al., 2014).Although the variant is fixed in the population, the hierarchical barcodes of an approximately 20-kb region surrounding rs1426654 still indicate marked deficiency of SNPs in c 6 and c 7 in addition to a notable recovery of polymorphism in the lower classes (data not shown).As these features are likely footprints of a complete selective sweep, it is both worthwhile and possible to quantify such characteristics by some other measures than F c .Kimura et al. (2007) considered two measures of the LRH and developed a method for when a selected allele has reached fixation in a local population (see also Maynard Smith and Haigh, 1974;Kaplan et al., 1989;Ronen et al., 2013;Schrider and Kern, 2017).Alternatively, we may apply the F c statistic in a polymorphic population such as South East Asians Table 3. f r dependency of the threshold values of the F c statistic for specified type I error rates α and the same dependency of the mean and standard deviation (SD).As a window for a specified value of f r must be set, it is assumed to be f f f n r r r r 1 / .Under the assumption of the standard neutral model with n = 1008 and θ = 100 or S = 1000, the F c is computed for < c 6 when f r = 0.1, for < c 7 when f r = 0.2 to 0.4 and for < c 8 when f r ≥ 0.6.The exception is the case of f r = 0.5, for which the average of F c s is taken over realized values for < c 7 and < c 8 .For each set of parameters, the number of replications is 1000.The differences between the two tables can likely be attributed to replication errors.In SAS, we actually found a 134-kb core region with f r ˆ = 0.60 and F c < 8 = 0.136.Although the probability of a type I error (α = 0.03) may not be sufficiently small, this finding suggests that positive selection has acted on SLC24A5 in SAS as well.
In most simulations, we have assumed no recombination within a core region, not addressing the effect of recombination on neighboring regions.Clearly, it is worth demonstrating the effect on both the barcode representation and the F c statistic.For this purpose, we ran discoal with recombination under θ = 20, N e s = 200 and f r ˆ = 0.35.Although it is necessary to assume a rather small sample size of n = 200 or less for technical reasons, the c 6 barcode thus generated reveals patterns similar to the c 7 barcode in Fig. 1 (Supplementary Fig. S7).We also confirmed that the F c < 6 becomes approximately 0.01 in the central region, but increases to 0.34 on both sides.In short, simulation with selection and recombination can reproduce the pattern and level of polymorphism that resemble those observed in some human genomic regions under selective sweeps.More intriguingly, we found that the F c statistic is little affected even when the recombination rate (r) per region is fairly high (e.g., ρ = 4N e r ≥ 10).This unexpected robustness certainly depends on how to evaluate intra-allelic variability under selective sweeps.As mentioned earlier, the F c statistic is intended to exclude SNP mutations with n k ≥ n rk = n r at site k because they have accumulated in the stem lineage of a derived allele group.However, recombination, if it occurs in one or a few lineages in the group, may act so as to incorporate some of those stem-lineage mutations into intra-allelic variability and substantially enhance it: this is because although the number of remaining unrecombined lineages is smaller than n r , they can be still large in size (large n rk ).Fortunately, it is likely that SNP sites thus created within the derived allele group remain in the same frequency class as the core site c r .In this case, the F c statistic in Eq. ( 4) simply ignores those n rk mutations from intra-allelic variability.
Next, we briefly discuss one issue concerning the distinction between hard sweeps and soft sweeps, the latter being the more dominant mode of adaptation (Schrider and Kern, 2017).As a soft sweep may drive multiple standing variants at linked neighboring sites from the beginning and retain the polymorphism to some extent even if selection has been completed, it is in general more challenging to detect a soft sweep than a hard sweep.Peter et al. (2012) concluded that whereas LCT and EDAR are subjected to hard sweeps, ASPM (abnormal spindle-like microcephaly-associated) (Mekel-Bobrov et al., 2005) is likely affected by a soft sweep.Owing to the similarity to ST8SIA2, we examined a core SNP site in ASPM with f r ˆ = 0.41 in Europeans and found that there exists a 26-kb core region with F c < 7 = 0.052.Under the assumption of a hard sweep with θ = 20 and N e s = 200, contrary to Peter et al. (2012), the threshold value of F c < 7 becomes 0.029 for α = 0.01 and the corresponding value of β becomes 0.186 (Fig. 7).These results may be taken as evidence for positive selection on ASPM, though not very strong.However, unlike that for a hard sweep, the power (1 − β) for detecting a soft sweep depends strongly on the initial frequency f 0 of a selected allele (Innan and Kim, 2004;Hermisson and Pennings, 2005;Przeworski et al., 2005).For this reason, we examined the distribution of F c for various values of f 0 under the conditions of θ = 20, n = 1008 and N e s = 200.Although the power is still as high as 0.74 when f 0 = 0.01, it sharply declines as f 0 further increases: 0.26 for f 0 = 0.05, 0.12 for f 0 = 0.1, and almost 0 for f 0 = 0.2, which is nearly halfway to the current frequency f r .Without knowing the initial frequency of a selected allele, as is usually the case, it seems almost impossible to make definite conclusions about the power of a method for detecting soft sweeps (but see Satta et al., 2018 for one possibility).Nevertheless, if ASPM showed more unambiguous non-neutrality (i.e., α < 0.01) concerning intra-allelic variability, it could have also been suggested that the initial frequency of the selected allele was no greater than 1% when a soft sweep began to operate; such a low initial frequency appears consistent with the geographic distribution of the ASPM variant, which is confined to Eurasia.As a final caveat, whereas we have shown that changing population size does not markedly affect the F c statistic, the demographic models examined here are not exhaustive.Many other demographic models or events potentially influence the F c statistic.These include range expansion with so-called "surfing" in a wave front that may mimic a selective sweep and be relevant to the present context (Edmonds et al., 2004;Klopfstein et al., 2006).However, this and other demographic causes such as population structure and admixture are beyond the scope of the present paper.
Source code for F c statistic The source code (python) for the F c statistic is available freely from our web site (https://sites.google.com/site/sattalab/research).

Fig. 1 .
Fig. 1.Barcodes for the numbers of derived alleles (C6 A and C7 A) and r 2 values for the extent of linkage disequilibrium (C6 B and C7 B) in classes c 6 and c 7 at the ST8SIA2 locus.Magenta bars represent the number (n rk ) of derived alleles at the k-th SNP site that are linked in the core CGC haplotype, whereas gray bars represent the number (n k − n rk ) of derived alleles that are linked in the core non-CGC haplotypes, as described in the text.The height of each bar n k in c 7 ranges from 186 to 503 and that of each magenta bar ranges from 0 to 349 according to the number of CGC chromosomes in East Asians.The abscissa spans approximately 54 kb and the location of the core CGC sites is indicated by a single asterisk at the bottom(Fujito et al., 2018).

Fig. 2 .
Fig. 2. Entire barcode representation of the SNPs in a 54-kb region at the ST8SIA2 locus.The three core SNPs are located in the right middle (an asterisk in each class) and define the CGC and TCT haplotype groups with frequency 349/1008 = 35% and 515/1008 = 51%, respectively, in East Asians (the 1000 Genomes Project database).Note that according to the copy number of derived alleles at a site, the vertical scale differs greatly among different classes: c 1 (top) to c 8 (bottom).

F c 3 6 −
obtained under neutrality is approximately the same as 0.0384 of F c < 7 .This reflects the fact that under such a null equilibrium model, each frequency class contributes almost equally to F c < 7 .In other words, F c 3 6 − and F c < 7 exhibit similar false-positive rates in rejecting neutrality (see below), and the use of F c < 7 instead of Fc 3 6 − may make the test conservative, depending on the quality of sequence data we use.

Fig. 3 .
Fig. 3. Distribution of F c < 7 (1 ≤ n k ≤ 185) under the standard neutral model, provided that S = 146 and the frequency (f r ) of derived alleles at a core site is within 0.35 ± 0.015.With over 1,000 replications, the mean and standard deviation of F c are 0.255 ± 0.147.
Our estimate of ˆ.

Table 2 .
Proportions of derived alleles that are linked to the CGC haplotype (n rk ) at the human ST8SIA2 locus.Data are taken from East Asians (n = 1008) in the 1000 Genomes