Two-dimensional site frequency spectrum for detecting, classifying and dating incomplete selective sweeps

The two-dimensional site frequency spectrum (2D SFS) was investigated to describe the intra-allelic variability (IAV) maintained within a derived allele (D) group that has undergone an incomplete selective sweep against an ancestral allele group. We observed that recombination certainly muddles the ancestral relation-ships of allelic lineages between the two allele groups; however, the 2D SFS reveals intriguing signatures of recombination as well as the genealogical structure of the D group, particularly the size of a mutation and the time to the most recent common ancestor (TMRCA). Coalescent simulations were performed to achieve powerful and robust 2D SFS-based statistics with special reference to accurate evalua-tion of IAV, significance of recombination effects, and distinction between hard and soft selective sweeps. These studies were extended to a case wherein an incomplete selective sweep is no longer in progress and ceased in the recent past. The 2D SFS-based method was applied to 100 intronic linkage disequilibrium regions randomly chosen from the East Asian population of modern humans to examine the P value distributions of the summary statistics under the null hypothesis of neutrality in a nonequilibrium demographic model. We argue that about 96% of intronic variants are non-adaptive with a 10% false discovery rate. Further-more, this method was applied to six genomic regions in Eurasian populations that were claimed to have experienced recent selective sweeps. We found that two of these genomic regions did not have significant signals of selective sweeps, but the remaining four had undergone hard and soft sweeps and were dated, in terms of TMRCA, after the major out-of-Africa dispersal of modern humans.


INTRODUCTION
Based on the site frequency spectrum (SFS) of single nucleotide polymorphisms (SNPs) in the linkage disequilibrium (LD) region of a focal (core) site under positive Darwinian selection, we previously proposed an inference method that represents SFS and LD information as hierarchical barcodes and measures sweep signals by a newly defined statistic, F c (Fujito et al., 2018a). If a selective sweep is incomplete (Sabeti et al., 2002;Pickrell et al., 2009;Peter et al., 2012;Ferrer-Admetlla et al., 2014;Field et al., 2016), derived and ancestral alleles coexist and segregate at a core site in the sample, and these core alleles define two allele (or haplotype) groups in the extended LD region, similar to that in the long-range haplotype (LRH) test (Vitti et al., 2013;Fan et al., 2016). Polymorphism within a derived allele (D) group under positive selection was termed intra-allelic variability (IAV) by Rannala (1997, 2000), and F c evaluates it as the proportion of derived alleles in the D group relative to those in the entire sample. It was shown that F c is robust not only to the temporally changing demography by measuring relative IAV, but also to recombination by excluding high-frequency derived alleles. However, while the procedure of exclusion is simple, it is entirely heuristic. Thus, it is desirable to further develop relevant theories and methods. More importantly, although recombination certainly obscures the allelic genealogy in a sample, the outcome is intimately related to the genealogical structure of IAV, especially the family size or the size of a mutation that occurs in a branch with a specified number of descendant lineages (Fu, 1995).
A selective sweep can be either hard (Maynard Smith and Haigh, 1974;Kaplan et al., 1989;Stephan et al., 1992;Braverman et al., 1995) or soft (Orr and Betancourt, 2001;Innan and Kim, 2004;Hermisson and Pennings, 2005;Przeworski et al., 2005). A hard sweep is driven by a single beneficial haplotype, whereas a soft sweep is driven by multiple beneficial haplotypes that may form because of recurrent mutation/migration or as standing variations (Pennings and Hermisson, 2006;Teshima et al., 2006;Akey, 2009;Scheinfeldt et al., 2009;Peter et al., 2012). Fujito et al. (2018a) considered a soft sweep for generality, but they could not develop a method that can distinguish between the two modes of adaptation. Using 11 summary statistics of selection tests such as those in Tajima (1989), Fay and Wu (2000) and Sabeti et al. (2002), Pybus et al. (2015) developed a machine learning framework to detect and classify hard sweeps in human populations. Also, using a machine learning method trained for the estimated population mutation rate (Watterson, 1975;Nei and Li, 1979), haplotype homozygosity (Garud et al., 2015;Harris et al., 2018, for unphased genotype data) and LD (Kelly, 1997), Schrider and Kern (2017) found that soft sweeps prevail in recent human adaptation (Ronen et al., 2013;Sheehan and Song, 2016). Schrider and Kern's method could infer the mode of selection with robustness to nonequilibrium demography and was applied to the case of complete selective sweeps. However, to identify incomplete selective sweeps, different summaries and/or theoretical approaches are feasible or even preferable, as discussed by Grossman et al. (2010) and many others (Vitti et al., 2013;Ferrer-Admetlla et al., 2014;Ronen at al., 2015;Fan et al., 2016;Schrider and Kern, 2016;Akbari et al., 2018).
Incomplete selective sweeps may provide an unprecedented opportunity to distinguish effects of positive selection from those of demography. Among many demographic causes, allelic "surfing" Klopfstein et al., 2006) argues against positive selection invoked for the gene regions of ASPM and MCPH1 that are associated with human brain development (Kouprina et al., 2004;Evans et al., 2005;Mekel-Bobrov et al., 2005). Surfing is a phenomenon caused by a few genetically homogeneous individuals on a wave front in a geographically expanding population, who can produce disproportionate numbers of descendants by chance. Nielsen et al. (2007) doubted if it would be possible to develop methods that can distinguish between allelic surfing and positive selec-tion with high accuracy. However, unlike positive selection, surfing should not distinguish between derived and ancestral alleles at a core site. Moreover, in the case of modern humans, surfing does not appear to be an ongoing phenomenon; rather, it is very likely associated with the past range expansions that might occur after the out-of-Africa dispersal, as presumed in Currat et al. (2006). In this case, it is interesting to determine if any incomplete selective sweeps are really ongoing. These two aspects of the target allele and timing of positive selection, at least in principle, allow the possibility of validating or invalidating the hypothesis of allelic surfing.
Apart from the above-mentioned question, a time period during which positive selection exerted or terminated its action has gained considerable attention in particular relation to human prehistory. For instance, using a rejection sampling method, Przeworski (2003) estimated the time since the fixation of a beneficial allele at a locus such as FOXP2 (Enard et al., 2002;Coop et al., 2008;Ayub et al., 2013). Based on the draft sequence of the Neanderthal genome, Green et al. (2010) identified a number of nonsynonymous substitutions that have been fixed in ancestral modern humans. Recently, using a cross-population composite likelihood ratio, Racimo (2016) inferred complete selective sweeps that occurred after the split of modern humans from Neanderthals, but before the split of Africans and Eurasians. For incomplete selective sweeps, Itan et al. (2009) and Peter et al. (2012) provided age estimates of selected alleles in an approximate Bayesian computation (ABC) framework (Chen et al., 2015, from haplotype lengths by a hidden Markov model). Furthermore, based on the comparison of SFS in multiple population genomes, Tournebize et al. (2018) recently developed a method to detect and date past and recent hard sweeps. However, to the best of our knowledge, there are few studies about the cessation of its action during the intermediate process of a selective sweep and the time elapsed since then (Garud et al., 2015;Hunter-Zinck and Clark, 2015;Pybus et al., 2015;Ferretti et al., 2017).
In this paper, we investigate a two-dimensional SFS (2D SFS) to describe IAV under an incomplete selective sweep. By coalescent simulations (Hudson, 2002;Kern and Schrider, 2016), we show how mutation, recombination, genetic drift and two modes of selective sweeps form the pattern and level of a 2D SFS. We focus on the family size and the time to the most recent common ancestor (TMRCA) of the derived alleles within a D group, as the genealogy is clearly not a star-shaped one, even under a strong selective sweep (Ferretti et al., 2017;Yang et al., 2018, and references therein). Similarly, we examine several classical and newly defined summary statistics that may facilitate the identification of the genealogical structure of IAV, evaluate their statistical performance, and apply some of them to a random sample of genomic regions from the database (1000GD) of the 1000Genomes Project Consortium (2015 as well as six genomic regions that are claimed to have undergone incomplete selective sweeps. Furthermore, we study the persistence time of the detectable signatures of an incomplete selective sweep that ceased in the past.

THEORIES AND METHODS
2D SFS The 2D SFS is formulated to extract information about an incomplete selective sweep that is inscribed as biallelic SNPs in a genomic region sampled from a single panmictic diploid population. Two allele groups, the D group of size m (1 ≤ m < n) and the ancestral allele (A) group of size n − m where n is the entire sample size, are defined by the derived/ancestral state of a predetermined focal site or a core site, with the derived allele frequency being f r = m/n. These groups are extended to an LD region or a core region surrounding it (Fujito et al., 2018a). The 2D SFS for linked neutral polymorphism in the core region is denoted by the matrix {φ i,j }, where the (i,j) element represents the number of segregating sites at which i (0 ≤ i ≤ m) derived alleles are found in the D group and j (0 ≤ j ≤ n − m) derived alleles are found in the A group. The total number of derived alleles is 1 ≤ k = i + j < n. We are concerned only with segregating sites in a sample and ignore φ 0,0 and φ m,n − m or set them to 0. φ i,0 denotes the number of segregating sites at which i derived alleles can be found only in the D group. For 1 ≤ i < m and 1 ≤ j ≤ n − m, the presence of φ i,j > 0 implies that, under the assumption of the infinitesites model (Kimura, 1969), these segregating sites have experienced recombination. Note that when φ i,n − m > 0, there are only three different types of gametes specified by the core site and a linked site, despite the fact that recombination between the sites is required to produce them. In other words, when we distinguish between the derived and ancestral alleles, the presence of four different types of gametes is not a necessary condition for the involvement of recombination.
The SFS can capture rich information about the SNP sites (Sawyer and Hartl, 1992;Bustamante et al., 2001). The entire distribution was used in the composite likelihood methods (Kim and Stephan, 2002;Nielsen et al., 2005;Williamson et al., 2007) as well as in inference methods about population demographic histories (Liu and Fu, 2015;Lapierre et al., 2017;Terhorst et al., 2017). However, the SFS is usually very sparse. As a 2D SFS is even more sparse, it is sensible and convenient to consider compressed versions simultaneously: , denote the one-dimensional (1D) SFS for the entire sample and the D group subsample, respectively.
In addition to the entire SFS, we consider various summary statistics, which include D by Tajima (1989), H by Fay and Wu (2000), E by Zeng et al. (2006), K instead of the original symbol L by Ferretti et al. (2017), and F c by Fujito et al. (2018aFujito et al. ( , 2018b. However, if D, H, E and K are applied to the entire sample, the statistical power may be weakened in the present case of incomplete or partial selective sweeps. We therefore apply them to the D group subsample and re-evaluate their statistical behavior in this circumstance. The pattern and extent of polymorphism within the D group are the key features to be estimated for detecting incomplete selective sweeps. One such feature is captured by the F c statistic, which can be rewritten in terms of {φ i,j } as ( Each summation is taken over all values of i and j in 0 ≤ i ≤ k m , 0 ≤ j ≤ min(n − m, k m ) and 1 ≤ i + j ≤ k m , where k m is the largest number of the k = i + j derived alleles in the frequency class c r − 1. Here, c r at a core site with m derived alleles (k m < m) is determined under the standard neutral model of constant size, hereafter referred to as the standard model (Fujito et al., 2018a). In a sample of size 503 < n < 1,367, k m takes one of 1, 3, 9, 25, 68, 185 and 503 in the seven possible frequency classes, with each having nearly equal numbers of segregating sites, and we set c r = 1 if m = 1, c r = 2 if 1 < m ≤ 3, etc. For the promoter region of ST8SIA2 with n = 1,008 and m = 349 in the East Asian (EAS) population (1000GD), we set c r = 7 from 185 < m ≤ 503; thus, k m = 185 in class 6. Using k m , we can rewrite the denominator in Formula (2) as The two vector components φ i,0 and  ζ i in Formula (1b) are most informative and convenient to describe IAV. We denote the maximum value of i by i max , which satisfies φ i,0 > 0. φ i,0 > 0 indicates two possibilities. One for a wide range of i with φ i,0 > 0 (or a large i max ) results from a soft sweep, wherein more than one distinct derived haplotype with respect to a core site has been subjected to positive selection since it began to act. As recombination is unlikely to generate φ i,0 > 0 under the circumstances discussed in this study (see below), this would provide strong evidence for soft sweeps, regardless of the  ζ i values. However, φ i,0 > 0 for some i values may occasionally be created by the cumulative effects of recombination, genetic drift and insufficient sampling processes (Fig. 1). It is then necessary to examine if  ] i ! 0 holds, because it is an unambiguous sign of recombination with respect to the core site, as discussed above and in the four-gamete test (Hudson and Kaplan, 1985), and cannot be produced directly by any form of selective sweep. The rationale behind this is as follows. Suppose that a strong selective sweep is ongoing. Under this ideal situation, the genealogy among the derived alleles in the D group would be less structured with shortened basal branches, and their family size manifested by mutations would remain small in most cases. Recombination between the D and A groups can therefore make  ζ i positive only for restricted i values in 1 d d i i max and m i i m max d , provided that these maximum family sizes ( i max ± ) revealed by recombination are much smaller than m/2. The former case for 1 d d i i max can occur for mutations that originate from the A group, but are subsequently recombined with the derived core alleles. On the other hand, the latter case for m i i m max d can occur for mutations that are shared by all the lineages in the D group (Fig. 1). In either case, the recombining unit size must be identical to the number of descendant lineages of that recombining branch in the D group. Conversely, if a reciprocal relationship, such as   ] ] i mi , is observed, it can be considered as a manifestation of the family size within the D group. Unfortunately, this exact relationship can be observed only when the population recombination rate of ρ = 4N e r is limited (N e and r refer to the effective population size and the recombination rate per core region per generation, respectively). It is possible for  ] m i ! 0 to be created even if i individual recombination events involve only singleton branches. As this is very unlikely for  ζ i with small values of i,  ] m i declines more slowly than  ζ i as i increases. Therefore,  ζ i with small values of i reflects the genealogical structure of IAV more accurately than the reciprocal  ] m i . In addition, the absolute value of  ζ i may differ greatly from that of  ] m i because the relevant mutations involved are asymmetric in number, even within the same A group. In fact,  ] m i ! 0 is produced only by mutations in the stem lineage of the D group, whereas  ] i ! 0 can be produced by a much wider range of mutations that have accumulated in the A group.
Overall, in the absence of recombination events, we have  ] i 0 exactly for 1 ≤ i < m, irrespective of the mode of selective sweep. Under a hard sweep, φ i,0 > 0 is concentrated in a narrow range of 1 ≤ i ≤ i max , whereas under a soft sweep, φ i,0 > 0 spreads over a wide range of i; however, there is a gray zone in between. In the presence of recombination,  Based on the different implications of φ i,0 and  ζ i outlined above, we divide the total number of segregating sites (S ζ ) within the D group into two components, S ζ0 and S ζ1 , as follows: (1 ≤ c r ≤ 8, depending on the derived core allele frequency in the 1000GD). However, no-one has studied φ i,0 and  ζ i or S ζ0 and S ζ1 , separately. We also divide the total number of derived alleles (L ζ ) within the D group into two components, L ζ0 and L ζ1 , as follows: ] ] ] M ] ¦ ¦ ¦ where L ζ is the sum of the scaled SFS (Fu, 1995) of ] ] i i i c within the D group and is expected to be small under strong selective sweeps. However, it is desirable to exclude the possibility of either background selection (Charlesworth and Charlesworth, 2012 and references therein) or an intrinsically low population mutation rate in a core region (θ = 4N e u, where u is the neutral mutation rate per region per generation). To have such internal control, we may also compute the number of derived alleles in the entire sample (  Edges and haplotype bars are in green for the D group and reddish-brown for the A group. Mutations are marked by colored stars along the edges of the graph. The red star shows a positively selected mutation at a core site. Similarly, the flesh-colored star shows a mutation that is specific to the D group but is affected by recombination. Two recombination events are highlighted by the two horizontal lines with double arrows wherein reciprocal recombinant haplotypes (above and below each arrow) segmentally keep the original edge and mutation colors.
standardize the IAV components in Formula (3b): . and (4) We emphasize that the unstandardized L ζ0 is also important. Under the assumption of a molecular clock, Thomson et al. (2000) and Tang et al. (2002) used L ξ /n as an estimator of the scaled TMRCA (ut G = θ T G , where T G = t G /4N e ) of a random sample of size n. We note that L ζ /n is the average number of mutations per lineage that have accumulated during t G generations. Hudson (2007) provided the variance formula, V ut G , of ut G . We can analogously estimate the scaled TMRCA (ut D = θT D , where T D = t D /4N e ) of the D group from the average number of mutations per lineage (C) that accumulated during t D generations. However, it is essential to exclude recombinant SNP sites so that we define the estimator as follows: Furthermore, following Hudson (2007) We can use either B + C or L ξ /n as an estimate of ut G . Under the standard model, we expect E{L ξ } = θ(n − 1) (Watterson, 1975;Fu, 1995), where E{ } is an expectation operator, and E{B + C} = θ (n − 1)/n. In Formulae (5a and b), we also note that u differs from region to region because u = μl, where μ is the mutation rate per site and l is the length of a (core) region. The ρ insensitivity of ut D and V ut D is crucial when applied to autosomal regions. This property will be examined later by coalescent simulations. Next, we are concerned with the average number of derived alleles per segregating site in three categories: Provided that there is at least one segregating site in each category, these G statistics are indices for the mul-tiplicity of derived alleles per segregating site within the D group. In particular, G c0 can be referred to, by definition, as the average family size in the D group. All three categories should be small under a hard sweep if recombination is rare. Conversely, relatively large values of G c0 and G c1 signify a soft sweep and recombination, respectively. The expected value of G ξ = L ξ /S ξ in the entire sample can be approximated by / , we have E{G ξ } ≈ (n − 1)/a n , which is greater than 100 for n = 1,000, or each segregating site in this large sample is represented by more than 100 derived alleles on average. Thus, the G statistics are rated "small" if they are much smaller than 100 in the 1000GD.
In addition to G c0 , we consider the following index of the IAV multiplicity: represents the probability of observing i derived alleles at a segregating site in a sample of size n (Kim and Stephan, 2002;Charlesworth and Charlesworth, 2012). G c0 is more sensitive to φ i,0 with large values of i than γ (i m ), although G c0 and γ (i m ) are closely related. For a hard sweep with f r = 0.6 or m = 600 in a sample of size n = 1,000, γ (i m ) = 0.1 will result in i m = 6, irrespective of ρ, and G c0 may become about 6-7. However, both G c0 and γ (i m ) depend strongly on the number of singletons (φ 1,0 ). If there are too many singletons, these values are kept small, which is undesirable for distinguishing between hard and soft sweeps. Therefore, we also examine and may interchangeably use an alternative γ * ( ) i m that excludes φ 1,0 from the definition given in Formula (8). Under the standard model, 1 , which become 0.59 and 0.69, respectively, if i m = 10 and m = 600. Similarly, we examine G c0 * that excludes singletons while calculating G c0 . As a soft sweep is defined by the presence of more than one distinct lineage (or haplotype) in the D group, we may impose a criterion in terms of relatively large values of γ * (10) and G c0 * . Many inference procedures for selective sweeps are based on differences among the unbiased estimators of θ (Tajima, 1989;Fay and Wu, 2000;Zeng et al., 2006Zeng et al., , 2017Ferretti et al., 2017). Under the standard model, we can obtain an estimate of θ as θ s = ξ 1 from the observed number of singletons (Fu and Li, 1993;Fu, 1995) or one of the following formulae: Among six possible pairwise combinations of the four estimators in Formulae (9), four difference statistics have been proposed: is the standardizing coefficient, which can be expressed as follows: with λ n and κ n specific to the Ω statistic (Table 4 in Ferretti et al., 2017, but with the numerator n − 2 in κ n D being read as n + 2, as in Tajima, 1989). To quantify IAV based on these statistics, we simply replace ξ k by ζ i and n by m in Formulae (9) and (10). We denote the latter as D ζ , H ζ , E ζ and K ζ . Under a hard sweep and in the absence of recombination, it is expected that θ w > θ π > θ L > θ H , and therefore, D ζ < 0, E ζ < 0, H ζ > 0, K ζ > 0. However, the mode of selective sweeps and the presence of recombination can greatly alter these expectations.
The case of ST8SIA2 Before going further, let us exemplify these statistics with actual data: the promoter region of the EAS ST8SIA2 locus (Fujito et al., 2018a(Fujito et al., , 2018b. The core site consists of three promoter SNPs and divides the whole sample into CGC and non-CGC haplotype groups. The core region is ca. 15 kb long (Chr 15, 92924515-92939701 in GRCh37), which is 1-3 kb shorter than that in the previous study (Fujito et al., 2018a) because of the exclusion of apparent recombinant sites located near the 5' and 3' ends. For the EAS sample in the 1000GD, we obtained the following values: We also obtained a small maximum family size of i max = 4 from φ 4,0 = 1 and φ i > 4,0 = 0 or γ (i m ) = 0 for i m ≥ 5, demonstrating the reduced IAV in this genomic region. Consistent with these findings, the number of segregating sites (S ζ ) within the CGC haplotype group is much smaller than S ξ = 132 in the entire sample. All the observed values of F c , L c0 , G c0 and D ζ are small at highly significant levels of P < 0.001 in the simulated standard model. We observed that the small F c value results in part from the exclusion of the derived alleles that belong to c r = 7, or more precisely in the frequency range of m i i m max d in Formula (2). If they had been included, F c would have been as large as L c . Therefore, the exclusion procedure in F c is legitimate because these derived alleles originated in the A group and were deceptively incorporated into the IAV by recombination. The G c1 and L c1 ≈ 40L c0 values show significant effects of recombination on the IAV. The negative D ζ and H ζ values are potentially useful to quantify an incomplete selective sweep, even though D ξ and H ξ for the entire sample lack their statistical powers ( Table S5 in Fujito et al., 2018a). D ζ < 0 reflects an excess of rare alleles and a lack of intermediate frequency alleles. H ζ < 0 reflects a lack of intermediate frequency alleles and an excess of high frequency alleles. In the present case, an excess of high frequency alleles results from recombination, unlike their gradual increase toward fixation by positive selection. This is reflected 3. Finally, the selective sweep for the CGC haplotype group is so strong and recent that we may estimate the TMRCA (t D = C/μl) of the CGC group as 12.2 ± 3.9 thousand years (ky) from C = 32/349 = 0.092, the mutation rate per site per year of μ = 0.5 × 10 − 9 (Scally and Durbin, 2012), and θ φ = 0.183 (B = 18 and L ξ /n = 15.6, a slightly smaller value than B + C). Although the CGC haplotype originated in Africa long ago, positive selection began to exert pressure on a single ancestral haplotype lineage in East Asia around 12 ky ago (Fujito et al., 2018b). Therefore, the core region of ST8SIA2 provides a solid example of a hard sweep using standing variations (Przeworski et al., 2005;Schrider and Kern, 2016), namely hardening soft sweeps (Wilson et al., 2014).

SIMULATED 2D SFS
Modes of selective sweeps and effects of recombination We present a simulated 2D SFS in Supplementary Fig. S1 for a small sample of n = 68, under the conditions of the presence or absence of recombination and hard or soft sweeps for the D group (ms by Hudson, 2002; discoal by Kern and Schrider, 2016). The methods for testing neutrality and examining statistical power are the same as described in Fujito et al. (2018a) and are not repeated here. An alternative representation of {φ i,j }, as presented in Table 1, demonstrates that in the absence of recombination, only φ i,0 , φ 0,j and φ m,j elements can have non-zero values, or polymorphism is represented only by the three edges of the matrix {φ i,j }. This is because the D group is always monophyletic with respect to the A group and φ i,n − m = 0 holds true for all possible values of i. Conversely, the presence of only three different types of gametes along with φ i,n − m > 0 is a definite indication of recombination. If the D and A groups are reciprocally monophyletic, we also expect φ m,j = 0 for 1 ≤ j < n − m. In any case, the D and A groups evolve "independently" in the absence of recombination. It is recombination that for some values of i in 1 ≤ i < m and requires a 2D SFS to completely describe the pattern and level of polymorphism for incomplete selective sweeps. Figure 2 shows that for hard sweeps, the expected values of φ i,0 and  ζ i divided by the number of their respective singletons decrease more rapidly than those of 1/i expected in the standard model (Fu, 1995) as i increases. φ i,0 and  ζ i are subcritical, and they indicate that alleles in the D group do not manifest large family sizes of lineages that share particular mutations. The declining pattern of  ζ i depends on the recombination rate ρ. Contrastingly, φ i,0 is almost unchangeable by recombination. This remarkable robustness results from the fact that although recombination may occasionally contribute to φ i,0 > 0, its effect is restricted to i in 1 ≤ i ≤ i max for some small number of i max . In fact, φ i,0 > 0 for large values of i can be realized only by particular mutations in an ancestral recombination graph (Griffiths, 1991;Griffiths and Marjoram, 1997) together with fortuitous loss or no sampling of recombinant haplotypes in the A group. Under ordinary circumstances, φ i,0 therefore remains unaffected even by frequent recombination. Thus, t D and the variance in Formula (5)    For a selective sweep, we used discoal  for n = 68, m = 34, θ = 100, a = 2N e s = 400, f r = 0.5 and ρ = 0. The subscripts i and j represent the numbers of derived alleles in the D and A groups, respectively, and k = i + j.
longer reflecting the structure accurately. In fact,  ] i ! 0 for m i i m max d can occur much more frequently than 1/i, irrespective of f r , indicating that it is related to the recombination processes between the D and A groups and not the coalescent processes with mutations within the D group.
Statistical power To further quantify φ i,0 ,  ζ i and their related summary statistics, we performed extensive coalescent simulations. The results of these simulations are summarized semi-quantitatively in Table 2, which gives the empirical criteria to evaluate the significance of sweep signals, classify the mode of selective sweep, and assess the effects of recombination. The family size is independent of ρ, as it should be, but is strongly dependent on the mode and strength of the selective sweeps. Supplementary Table S1 presents 18 summary statistics for IAV under hard sweeps when θ = 20, n = 1,000, a = 2N e s = 500 (s: genic selection intensity), and f r = 0.3, 0.6 and 0.9. As mentioned above, φ i,0 is insensitive to ρ and so are the related S ζ0 , L ζ0 and G c0 . In contrast, none of the four ζ-based estimators of θ are insensitive to ρ. This sensitivity is also reflected in H ζ , E ζ and K ζ , all of which can take positive and negative values as ρ changes. However, there is an exception to this large fluctuation. D ζ can keep a negative value within the examined parameter range because it depends on θ w and θ π , and not on θ L and θ H , which are much more sensitive to ρ than the former two. In other words, recombination generates superficial signals on θ L and θ H through  ] i ! 0 for large values of i. These results suggest that we cannot use H ζ , E ζ or K ζ without knowing the accurate values of ρ in actual data analyses. Figure 3 shows the receiver operating characteristic (ROC) curves comparing the sensitivity (true-positive rate or power, 1 − β) and 1 -specificity (false-positive rate, α) of F c , L c , L c0 , G c and G c0 , as well as those of D ζ , H ζ , E ζ and K ζ . The former five are tested in left-tailed tests and the latter four are tested in two-tailed tests (Fig. 2 in Ferretti et al., 2017). Each ROC curve was constructed from 1,000 simulations under strong selection (a = 500) and 1,000 simulations under near neutrality (a = 1, f 0 = 0.5 and f r = 0.6) to facilitate random sampling of the derived alleles with a specified frequency (f r ). Thus, our ROC curves are conservative compared to those when pure neutrality is assumed as a null hypothesis, and a comparison between pure neutrality (by ms) and near neutrality (by discoal) indicates visible effects of a = 1 on the 2D SFS ( Supplementary Fig. S3). The area under an ROC curve, often denoted as AU(RO)C, measures the discriminative ability of a statistic. For hard sweeps (Fig.  3A) and with ρ = 0, F c , L c0 and G c0 can discriminate the hypotheses well and may be categorized as excellent, as a rule of thumb. Among the four difference statistics, D ζ outperforms H ζ , E ζ and K ζ . However, if ρ = 50, L c0 and G c0 outperform F c , while H ζ and K ζ outperform E ζ and D ζ . Recombination reduces the power of F c because this statistic is not completely free from the effects of recombinant haplotypes. This is also reflected in the relatively low power of G c and L c . The powers of H ζ , K ζ and E ζ become even greater when ρ = 50 than when ρ = 0. This unexpectedly enhanced power results from the increased values of ζ i for large values of i because of the high recombination rate. Overall, for hard sweeps, (i) it is possible to infer the maximum and average family sizes among D group lineages; (ii) to evaluate IAV, it is recommended to use φ i,0 and closely related summary statistics, such as G c0 , L c0 and F c ; and (iii) the power of some difference statistics is extremely sensitive to recombination, and, thus, caution is required when interpreting the application results. Regarding (iii), an alternative may be to recalculate the difference statistics based on φ i,0 rather than ζ i .
A soft sweep can be modeled by a high initial frequency of the D group, such as f 0 = 0.1 (Ferrer-Admetlla et al., 2014;Kern and Schrider, 2016), and it generates SNP sites with φ i,0 > 0 for multiple intermediate values of i. Although effects of recombination on the pattern of  ζ i are similar to those of hard sweeps, IAV for   soft sweeps can be considerably high such that φ i,0 and  ζ i spread broadly in the 1 ≤ i < m range. For the soft sweep in Fig. 2B and in a specified range of 1 ≤ i ≤ 30, for example, we observed γ (10) ≈ 0.18 for ρ = 50, 20 and 0. These values are significantly larger than those for the hard sweeps shown in Fig. 2A, i.e., γ (10) ≈ 0.03 for ρ = 50, 20 and 0. Supplementary Table S2 presents the 18 statistics for soft sweeps. As mentioned above, φ i,0 -related statistics are remarkably robust to recombination, although this does not guarantee their ability to identify soft sweeps. In fact, the absolute values are substantially larger than those for the corresponding hard sweeps. With increased φ i,0 and  ζ i for intermediate values of i, all other statistics are significantly affected. The ROC curves for f 0 = 0.1, f r = 0.6 and ρ = 0 (Fig. 3B) show lower discriminative abilities of F c , L c0 and G c0 and almost no power of D ζ , H ζ , E ζ and K ζ for low cutoff values. Detecting soft sweeps strongly depends on the typically unknown parameter f 0 . If we reduce the value of f 0 from 0.1 to 0.05, the powers of some statsitics are recovered. If we increase the value of f r from 0.6 to 0.9, the powers of some statistics are recovered as well.
The  (Wilde et al., 2014), and the same is true for T and T D . T is a major factor that affects the power of a statistic wherein f 0 becomes a more crucial determinant than f r (for the case of f r = 1, see Innan and Kim, 2004;Przeworski et al., 2005). The increased extent of the power varies largely among the statistics used, i.e., F c , L c0 and G c0 are rather robust to changes in f 0 and f r , whereas D ζ and E ζ are sensitive, but H ζ is insensitive. Generally, the power of detecting soft sweeps of f 0 = 0.1 by a single SFS statistic may not be sufficiently high for a given low false-positive rate α. Readers may refer to Nielsen et al. (2005), Zeng et al. (2006Zeng et al. ( , 2007, Grossman et al. (2010), Ayub et al. (2013), Sheehan and Song (2016) and Schrider and Kern (2017) for improved methods that combine several population genetic summary statistics and related issues.
Past incomplete selective sweeps Any method would lose the ability to reject the null model in favor of a selective sweep with the time (τ in units of 4N e generations) elapsed since the fixation of a beneficial allele (Simonsen et al., 1995;Kim and Stephan, 2002;Przeworski, 2003). As sweep signals dissipate rapidly due to new mutations and recombination, the detectable period τ of a selective sweep may not greatly exceed 0.1 (Przeworski, 2003;Ferretti et al., 2017). In the case of modern humans, this may amount to 4,000 generations or 10 5 years if N e = 10 4 and the generation time is 25 years. Similarly, we are concerned with the possibility that positive selection would cease before a selective sweep is completed, presumably because of the ever-changing physical and/or cultural environment. We want to determine the smallest value of τ that can be statistically distinguishable from τ = 0 (ongoing sweep) and the statistic that is most suitable for this distinction. If τ > 0, the condition for a past hard sweep becomes T D ≤ T + τ, and that for a past soft sweep becomes T D > T + τ. During the time period of τ (forward in time), the D group frequency changes from f r to f r (τ) by genetic drift. Therefore, in a sample taken at τ, we observe the D group alleles with a number that may differ greatly from nf r = m and may range from 0 (loss) to n (fixation in the sample). Theoretically, a large number of coalescent events can occur even during a time period as short as τ = 0.05. Tavaré (1984) and Takahata and Nei (1985) derived two seemingly different formulae for the probability g n,k (2τ) of the number (k) of distinct ancestors 2τ units of time ago in a sample of size n wherein time τ is measured in units of 2N e generations (Chen, 2012 for the case of time-varying population sizes). Wakeley (2009) highlighted their equivalence, provided that the denominator in his formula (3.42) is read as k!(i -k)!n (i) rather than i!(i -k)!n (i) . g n,k (2τ) with n = 100 or more shows that k at τ = 0.05 would be concentrated between 15 and 25 ( Supplementary Fig. S4), thereby suggesting that these post-sweep coalescences substantially alter IAV. We examined F c , L c0 , G c0 and D ζ to determine how their ability to detect ongoing selective sweeps declines with τ, and we measured their power at τ = 0.01, 0.03 and 0.05 when a type I error rate α was specified as 0.01. Figure 4 shows that for hard sweeps, the F c power fades most rapidly, followed by L c0 . However, the G c0 power remains high for a relatively long time, and the D ζ power exbihits a complicated pattern. The primary change in these statistics results from the newly created φ i,0 > 0 for the relatively large values of i or increased i max due to coalescence in an elongated neutral phase of a post-sweep ( Supplementary Fig. S5). S ζ0 and L ζ0 increase proportionally with τ, as does L c0 = L ζ0 /L ξ ; however, G c0 = L ζ0 / S ζ0 increases more slowly and retains its power for a relatively long time. It is rather unexpected that G c0 does not easily distinguish between ongoing and past selective sweeps. In contrast, as the power of F c reduces to 50% or less even at τ = 0.01, it would be most useful to evaluate if τ > 0. The presence of recombination enhances this decline dramatically.
Overall, if we observe significantly small values of F c , L c0 and G c0 together with small values of i max and γ (i m ), we identify an ongoing hard sweep. If we observe significantly small values of i max , γ (i m ) and G c0 but relatively large values of F c as compared to L c0 , we identify a past hard sweep. The same conclusions are applied to the case of soft sweeps (Fig. 4). However, the powers of F c , L c0 , G c0 and D ζ for soft sweeps may be low even at τ = 0 and approach the specified level of α rapidly with τ. In hard and soft sweeps, recombination plays a significant role in diminishing the powers of F c and D ζ , whereas G c0 and L c0 are somewhat resistant.

APPLICATIONS TO HUMAN GENOMES
A genomic landscape revealed by 100 intronic SNPs In this section, we render a genomic landscape with a 2D SFS, although it is not exhaustive. By filtering biallelic SNP sites in the 1000GD by autosomal introns and Global MAF (ca. 40%), we randomly chose > 100 core sites, and examined LD and 2D SFS in the EAS population (see Supplementary Table S3 for their SNP reference

sequence (rs) numbers).
To facilitate the determination of a core region for each core site, we used a distance similar to the "half length" of LD (Reich et al., 2001), which is the distance from a core site at which LD drops below 1/2. However, we assumed a more tightly linked core region than expected by the half length. For this, we used r 2 (Hill and Robertson, 1968), took the average over SNP sites with frequencies greater than 0.1f r in a 1-kb non-overlapping window, and determined the distance at which the averaged r 2 value drops below 3/4. Two such distances on both sides of a core site mark the boundaries of the core region. The size determined in this way varies greatly among the core regions. In many cases, the averaged r 2 value decreases below 3/4 even in the first 1-kb window from a core site (for instance, rs1484214), and in other cases it extends over a 150-kb region (for instance, rs7541057). In total, we excluded 45 sites with < 1-kb core regions and added the same number of sites to keep the total number of 100 core sites. We confirmed that those 45 excluded SNPs do not show any signals of selective sweeps.
Assuming near neutrality of the 100 core variants, no recombination within a core region, and the nonequilibrium demography of Schaffer et al. (2005), we computed the P values of F c , L c0 , G c0 and D ζ , and examined the uniformity of their distributions by the chi-squared and Kolmogorov-Smirnov (KS with n = 100) tests as well as other multiple testing approaches (Fig. 5). The distribution for F c is fairly uniform and does not reject any core site ( F df 9 2 5 0 . with P ≈ 0.8 and g max = 0.089 with P > 0.05 in the KS test). Similarly, we obtained F df 8 2 13 95 . (P > 0.05) and g max = 0.144 (0.05 > P > 0.01) for L c0 , F df 9 2 15 0 . (P < 0.05) and g max = 0.106 (P > 0.05) for G c0 , and F df 7 2 19 1 . (P < 0.01) and g max = 0.151 (0.05 > P > 0.01) for D ζ . Thus, there are some indications against the null hypothesis. The multiple testing of F c , L c0 , G c0 and D ζ (Benjamini and Hochberg, 1995) consistently rejects zero, five, eight and eight core sites, respectively, with a 10% false discovery rate (FDR). Such different performances among the statistics result partly from their different sensitivities to recombination. The least sensitive are G c0 and L c0 , the second least sensitive is F c , and the Fig. 5. Histogram and cumulative distributions of 100 P values of F c , L c0 , G c0 and D ζ at biallelic intronic SNP sites randomly chosen from EAS in the 1000GD. A significant departure from uniform distribution occurs for P < 0.1 except for F c (see text). most sensitive is D ζ . If L c0 , G c0 and D ζ are used, at most one core site is erroneously rejected and the remaining four to seven support false null hypotheses. Thus, with a 10% FDR, 137 140 145 95 97~% = of intronic variants in human genomes are consistent with selective neutrality (Kimura, 1968). We also applied the same multiple testing to Fisher's combined P values of F c , L c0 and G c0 , and obtained 10 loci for P < 0.01. This is consistent with the result of the independent tests, but it is overstated because of positive dependence among the three statistics. Regarding the proportion π 0 of the true null hypotheses among all the hypotheses (Storey and Tibshirani, 2003;Benjamini et al., 2006), we estimated π 0 to be 0.90 from the multiple testing and histogram (Fig. 5). The most significant deviation from neutrality is observed at an intron variant C/T (rs1587725) that is located in a serine/threonine kinase 32 (STK32B) and associated with an autosomal recessive skeletal dysplasia. The C frequency in the African population is < 12%, but approximately 80% in the non-African populations.
It is worth noting that these conclusions are based on the null hypothesis, which assumes the absence of recombination in addition to neutrality. Recombination affects the distributions of F c , L c0 , G c0 and D ζ by decreasing S ζ0 and contrastingly increasing S ζ1 in IAV ( Supplementary  Fig. S6). Although such effects of recombination are small for the case of selective sweeps with low levels of IAV, they can be substantial for the case of neutrality with relatively high levels of IAV. In the case of neutrality, recombination increases the expected values of F c and D ζ so that their observed values become relatively small. For the same reason, recombination decreases the expected values of L c0 and G c0 or makes the distributions skew toward 0. Thus, in the presence of recombination, F c -and D ζ -based tests become too sensitive, but L c0 -and G c0 -based tests become conservative. It is difficult to choose an appropriate recombination rate when testing a null hypothesis. However, the fact that the P values of the four statistics are approximately uniformly distributed under the null hypothesis of no recombination suggests that the effect of recombination is not substantial in the core region with r 2 ≥ 3/4 despite a large number of segregating sites that indicate recombination. Simulated r 2 values for various recombination rates are presented in Supplementary Fig. S7.
Six selected regions We also examined six genomic regions encompassing the genetic loci of lactase (LCT) (Enattah et al., 2002;Tishkoff et al., 2007), ectodysplasin A receptor (EDAR) Fujimoto et al., 2008;Kimura et al., 2009;Kamberov et al., 2013), abnormal spindle-like microcephaly (ASPM) , microcephalin-1 (MCPH1) , neuregulin-1 (NRG1), and a disintegrin and metalloproteinase domain17 (ADAM17) (Pickrell et al., 2009). We selected these regions as representatives to test the power of our statistics. The six regions include two well-established cases, LCT (rs4988235) and EDAR (rs3827760), for positive selection, and two controversial cases, ASPM (rs41310927) and MCPH1 (rs930557), in relation to an LRH-based outlier approach (Yu et al., 2007) as well as allelic surfing . Further, to extend our previous study on ST8SIA2 (Fujito et al., 2018b), the regions include NRG1 (rs3924999) and ADAM17 (rs2709591), which are variants that are known to be associated with the risks of schizophrenia and various psychiatric phenotypes (Pickrell et al., 2009). The six core regions were determined in the same way as described in the 100-intron genomic landscape (Supplementary Fig. S8) and the observed values of several statistics are summarized in Table 3. First, we examined the level of the IAV through F c and L c0 , the multiplicity through γ * (i m ) and G c0 * , and the roles of recombination through G c1 and L c1 ( Table 2). The P values were computed in the null hypothesis of pure neutrality in the nonequilibrium demographic model of the EAS or European (EUR) population (Schaffner et al., 2005) with the number γ * (10), G c0 * 0 *** , 2.8 *** 0.08 *** , 5.5 *** 0.33, 13 0.27, 10.2 ** 0.31, 13.5 ** 0, 3.3 The P values of F c , L c0 , i max , γ * (10) and G c0 * for each gene were obtained by ms (Hudson, 2002) using the observed S ξ value and demographic model by Schaffner et al. (2005); *** P < 0.001, ** 0.001 ≤ P < 0.01, * 0.01 ≤ P < 0.05 (see Supplementary Table S3 for the full table). NS: no evidence for a selective sweep. of segregating sites being fixed to an observed one. Second, we estimated the TMRCA (t D ) in the D group from the observed C value, the mutation rate μ, and the 3/4 length (l) core region.
For LCT in the EUR population, the core region became l = 100 kb long (Chr 2, 136608646-136708564). If we instead use the 1/2 length, the core spans a 600-kb genomic region. When l = 100 kb, there are 76 segregating sites within the D group, which are 14% ( < f r = 51%) of the total in the same region. The IAV level is very low and provides the strongest evidence for a selective sweep. The estimated maximum and average family sizes are so small that the majority of the S ζ0 = 66 sites are singletons (G c0 3 * < ), and there is no segregating site with a multiplicity as high as 10 ≤ i or γ (10) ≤ γ * (10) = 0. Thus, the IAV in the T-13910 group-100% association with lactase persistence in the Finnish populationis unambiguously classified as an ongoing hard sweep (Table 2; Peter et al., 2012;Ronen et al., 2015). Nevertheless, there are additional S ζ1 = 10 segregating sites and L ζ1 = 2,056 derived alleles that are linked to the D group by recombination. This non-negligible effect of recombination is quantified by G c1 and L c1 . Further, we obtained t D = 3,280 ± 480 years from C = 0.16 as well as V ut D u 5 7 10 4 . with θ φ = 0.33 (B = 52 and L ξ /n = 55.8, a slightly larger value than B + C). As mentioned earlier, this estimate of t D for a hard sweep provides a lower bound of the time period (t) of the selected phase. The simulation showed that t can be much longer than t D (data not shown). Our estimate of t D is as short as a recent onset of positive selection in Bronze Age Europeans (Allentoft et al., 2015), shorter than 4,000 years based on ancient DNA (Mathieson et al., 2015), much shorter than 7,500 years based on the ABC approach (Itan et al., 2009), but within a wide range of allele ages roughly estimated by LRH conservation (Bersaglieri et al., 2004).
We chose the core site of EAS EDAR at the nonsynonymous variant site 1540. The 3/4 length core region becomes 65 kb long, while the 1/2 length core region spans a region of > 500 kb (Chr 2, 109465127-109530021). There are 290 segregating sites within the high-frequency D group (f r = 0.87), which are about 59% of the total in the same region. The IAV level is as low as L c0 = 0.0091, but F c = 0.257 is higher than 0.196 in our previous estimate based on a different core region (Fujito et al., 2018a). The IAV multiplicity with i max = 48, G c0 5 5 * .

≈
, and γ * (10) < 0.1 is low, but higher than that in LCT. Including the two sites with i = 120 and j = 48, and with i = 384 and j = 17, the family size in the D group is fairly large, and recombination apparently plays a substantial role in increasing G c1 and L c1 . From C = 0.60 and θ φ = 1.19 (B = 58 and L ξ /n = 58.3, in agreement with B + C), we estimated t D ≈ 18.5 ± 2.5 ky. The derived core allele may not have originated in the main ancestral populations of East Asians (Mathieson et al., 2015). Yet all these features are best classified as a hard sweep that has been operating at least over the past 18.5 ky (t > t D ).
The core region of ASPM in the EUR population is 165 kb long (Chr 1, 196990327-197155021). As this region is more than five times longer than that in our previous analysis (Fujito et al., 2018a), it includes a large number of segregating sites in the D group. The IAV level is inflated in terms of L c0 and F c . The IAV multiplicity is accordingly high and exhibits a fairly large family size (i max = 187, i max * = 143, G c0 13 * ≈ , and γ * (10) ≈ 0.33). Despite the relatively low frequency, f r = 0.41, significant roles of recombination are apparent. These features are consistent with previous work (Yu et al., 2007) that showed that the region is not identified empirically as an outlier in the LRH test. The estimates of C = 3.38 and θ φ = 5.91 (B = 113 and L ξ /n = 106, a slightly smaller value than B + C) led to t D ≈ 41 ± 7.9 ky.
Some genetic variants of the MCPH1 gene rapidly increased their frequencies in Eurasia. The 3/4 length core region in the EAS population is 32 kb long (Chr 8, 6286532-6318958) and there are 288 segregating sites (65% of the total) in the D group that show significantly low levels of F c and L c0 However, the core region exbihits relatively large family sizes (i max = 85, i max * = 76 , G c0 10 * ≈ , and γ * (10) = 0.27) and abundant recombinant haplotypes. These features are classified as an ongoing soft sweep with signficant effects of recombination. From C = 1.0 and θ φ = 2.0 (B = 64 and L ξ /n = 61.7, a slightly smaller value than B + C), we have t < t D = 62.5 ± 13.2 ky, indicating that the selection began to operate after the major out-of-Africa dispersal (Bae et al., 2017;Mondal et al., 2019). However, in the EUR population, the IAV pattern and level differ significantly. The core region becomes 30 kb long (Chr 8, 6284798-6309364), and it shows a high IAV level and multiplicity with large values of F c and L c0 as well as those of i max = 802, i max * = 9 , G c0 141 * ≈ , and γ * (10) = 0.47. i max has the largest possible value when m = 803, and even if this φ 802,0 = 4 is excluded, there still exist four sites of φ 801,0 = 3, φ 458,0 = 1, φ 343,0 = 1, and φ 340,0 = 1. The estimate of C = 8.7 in the D group (θ φ = 2.35, B = 37 but L ξ /n = 37.3) shows t D = 580 ± 180 ky, which corresponds to a time much earlier than the origin of modern humans (Bae et al., 2017). There is no indication of any selective sweep for the MCPH1-derived allele at rs930557 in the EUR population.
ADAM17 converts NRG1 to its active form in the NRG-ERBB4 signaling pathway, and both genomic regions exhibit iHS selection signals in the EAS population (Pickrell et al., 2009). The core region of ADAM17 is 26 kb long (Chr 2, 9490015-9515668). The IAV level is low in terms of F c and L c0 , but the IAV multiplicity is high (i max = 213, i max * = 3, G c0 13 5 * .

≈
, and γ * (10) = 0.31). The large value of i max results from the single occurrence of φ 213,0 = 1. If that is excluded, the value of i max drops substantially, but is still as large as 48. These features may well be classified as an ongoing soft sweep. Using C = 0.64 and θ φ = 1.16 (B = 24 and L ξ /n = 24.3, in agreement with B + C), we have t D = 49 ± 20 ky ( > t). On the other hand, the core region of NRG1 in the EAS population becomes only 3 kb long (Chr 8, 32451286-32454158), the shortest among the six genes and containing only 13 segregating sites (72% of the total) in the D group. The small core region shows a relatively small family size (i max = 7, i max * = 6 , G c0 3 3 * .

=
, and γ * (10) = 0), but it results in a broad distribution of any statistic and a large type I error rate. We have 17 ± 7.1 ky from C = 0.025 and θ φ = 0.05 (B = 2 and L ξ /n = 1.8, in good agreement with B + C). There is no statistically significant indication of a selective sweep for NRG1.
Of the 18 summary statistics mentioned in this paper, we have paid special attention to each of F c , L c0 and G c0 to infer significant signals of positive selection, the mode of selective sweeps, and the role of recombination. In certain circumstances, however, we may wish to evaluate the overall significance level, which is expressed by combined tests rather than individual ones. Ayub et al. (2013) used the aforementioned Fisher's combined method. Alternatively, we may follow Zeng et al. (2006Zeng et al. ( , 2007, who proposed a compound test for multiple statistics such as D ξ and H ξ , or Grossman et al. (2010), who developed a composite method of multiple signals. Of these three, as Fisher's method is straightforward, we used it here again even though the F c -, L c0 -and G c0 -based tests are positively correlated with each other and accordingly the small combined P value is generally overstated. We obtained F df 6 2 = 41.4 (P=2.4×10 −7 ) for LCT, F df 6 2 =34.1 (P=6.4×10 −6 ) for EDAR, F df 6 2 =29.3 (P=5.2×10 −5 ) for MCPH1, F df 6 2 = 25.6 (P = 2.7 × 10 − 4 ) for ADAM17, F df 6 2 = 13.3 (P = 0.038) for NRG1 and F df 6 2 = 10.2 (P = 0.117) for ASPM. Not surprisingly, these significance levels measured by the combined P values agree well with our conclusions based on individual tests (Table 3).
We compared the above results with those of nSL (Ferrer-Admetlla et al., 2014) using selscan software (version 1.2.0a) for common biallelic variants (MAF > 1%). Specifying an optional command of "--max-extendnsl" and checking the significant decay of the EHH curves, we set the stopping condition for nSL computations as 1,500 loci. We observed that the normalized nSL is significantly large for LCT (P < 0.0001), EDAR (P = 0.0003) and ADAM17 (P = 0.0096); marginal for NRG1 (P = 0.029); and non-significant for ASPM (P = 0.418) and MCPH1 (P = 0.263). These P values for LCT, EDAR, ASPM and ADAM17 agree with our results, but those for MCPH1 and NRG1 somewhat differ. The difference in MCPH1 probably results from different sensitivities to coupled effects of recombination and a soft sweep. As the MCPH1 core region exbibits the highest proportion (S ζ1 /S ζ ) of segregating sites that underwent recombina-tion (Table 3), nSL may be confused by the high frequencies of recombinant haplotypes. The difference in NRG1 likely results from a small core region that reduced the detection power of our method.

CONCLUSIONS
In this study, we have shown that the pattern and level of IAV under an incomplete selective sweep can be satisfactorily described by 2D SFS. This allows the separation of segregating sites that have experienced recombination from those that have not. The former and latter classes of segregating sites are expressed by internal and one edge elements and by two or three marginal vectors of the 2D SFS matrix, respectively. Recombination affects genealogical relationships, but nevertheless reveals the family size of a mutation in a derived allele or haplotype group. Based on these features and findings, we have developed 2D SFS-based statistics that can have high power, be robust with respect to recombination and nonequilibrium demography, and be used to distinguish not only between hard and soft sweeps, but also between ongoing and past incomplete selective sweeps. However, any sample can lack indispensable recombination signals, thereby leading to erroneous conclusions, particularly when distinguishing the effects of recombination from those of a soft selective sweep. By applying 2D SFS to 100 intronic regions randomly sampled from the EAS population, we have concluded that about 96% of random core SNPs are selectively neutral (Kimura, 1968). In addition, among the six Eurasian genetic loci with known unusual patterns and levels of DNA polymorphism, we have classified two as ongoing hard sweeps, two as ongoing soft sweeps, and two as nonsignificant. We have also demonstrated that 2D SFS is useful to date an incomplete selective sweep in relation to the TMRCA in a selected allele group. Command line: ./discoal2 1000 1000 16000 -t 20 -ws 0 -x 0.5 -c 0.30 -a 500 -r .

Supplementary
Supplementary