Genes & Genetic Systems
Online ISSN : 1880-5779
Print ISSN : 1341-7568
ISSN-L : 1341-7568
Short communication
The amount of DNA polymorphism when population size changes linearly
Takuya Takahashi, Fumio Tajima
JOURNAL OPEN ACCESS FULL-TEXT HTML

2017 Volume 92 Issue 1 Pages 55-57

ABSTRACT

Population size is one of the main factors that determine the amount of DNA polymorphism. We examined a model under which the population size changed linearly. Because of the simplicity of this model, we could analytically obtain the expectation of nucleotide diversity, E(π), and the expectation of the amount of DNA polymorphism, E(θ), based on the number of segregating sites. The results suggest that E(π) is larger than E(θ) when the population size decreased and that E(π) is smaller than E(θ) when the population size increased. The expected time to the most recent common ancestor could also be obtained under this model.

MAIN

The amount of genetic variation in a population is determined by many factors, such as the population size, mutation rate and natural selection. The amount of genetic variation at the DNA level (DNA polymorphism) can be estimated from the number of segregating sites and nucleotide diversity in a sample of DNA sequences. The number of segregating sites is one of the most important indicators which quantify the amount of DNA polymorphism (Watterson, 1975). This number as well as nucleotide diversity is known to be influenced by change in population size (Tajima, 1989b). The distribution of pairwise nucleotide differences among a sample of DNA sequences is also known to be influenced by change in population size (Slatkin and Hudson, 1991).

Tajima (1989b) has studied the case where the population size changed drastically. On the other hand, Slatkin and Hudson (1991) have examined the exponential model of population growth and Fu (1997) has examined the logistic model of population growth. To better understand the effect of change in population size on the amount of DNA polymorphism, we consider a new model in which the population size changes linearly. Because of the simplicity of this model, we can analytically obtain the expected number of segregating sites and the expectation of nucleotide diversity by investigating the branch lengths in the genealogical tree.

We consider a random mating population currently consisting of N(0) diploid individuals. In this model we assume that the population size changes linearly. Namely, we assume that the population size t generations ago was N(t) = N(0) − αt. Here α > 0 corresponds to an increasing population and α < 0 corresponds to a decreasing population. We also assume that the DNA sequence of interest has many nucleotide sites, that no DNA recombination takes place within the sequence, and that every newly arisen mutation is selectively neutral. In this study we use the coalescent theory or the theory of gene genealogy (Kingman, 1982; Hudson, 1983; Tajima, 1983; Wakeley, 2008).

To calculate the expectation of the number of segregating sites (Sn) in a sample of n sequences, we examine the expected length of each branch which appears in the genealogical tree. Let Pn(t) be the probability that n sequences had n−1 common ancestral sequences t generations ago and the divergence took place t−1 generations ago. Then, this probability is approximately given by   

$$P_n(t) = \frac{\binom{n}{2}}{2N(t)} \prod_{i=1}^{t-1} \left( 1 - \frac{\binom{n}{2}}{2N(i)} \right). \tag{1}$$
Further approximation gives   
$$P_n(t) = \frac{\binom{n}{2}}{2N(t)} \exp\left( -\int_0^t \frac{\binom{n}{2}}{2N(x)}\,dx \right). \tag{2}$$
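As a rough numerical check of how close the two forms are, the following sketch evaluates the product in (1) and the exponential in (2) for a linearly changing population. The function names and the parameter values (n = 5, N(0) = 10,000, α = 0.1, t = 1,000) are illustrative assumptions and are not taken from the paper. The integral in (2) is evaluated in closed form for N(x) = N(0) − αx with α ≠ 0.

```python
from math import comb, exp, log


def pop_size(t, n0, alpha):
    """Population size t generations ago under the linear model."""
    return n0 - alpha * t


def p_product(t, n, n0, alpha):
    """Equation (1): discrete-generation product form."""
    c = comb(n, 2)
    no_coalescence = 1.0
    for i in range(1, t):
        no_coalescence *= 1.0 - c / (2.0 * pop_size(i, n0, alpha))
    return c / (2.0 * pop_size(t, n0, alpha)) * no_coalescence


def p_exponential(t, n, n0, alpha):
    """Equation (2) for the linear model (alpha != 0).

    Uses the integral of 1/N(x) from 0 to t, which equals
    -log(N(t)/N(0)) / alpha when N(x) = N(0) - alpha*x.
    """
    c = comb(n, 2)
    integral = -log(pop_size(t, n0, alpha) / n0) / alpha
    return c / (2.0 * pop_size(t, n0, alpha)) * exp(-c * integral / 2.0)


if __name__ == "__main__":
    args = (1000, 5, 10_000, 0.1)  # t, n, N(0), alpha: illustrative values
    print(p_product(*args), p_exponential(*args))
```

With population sizes in the thousands the two values should agree closely; the gap is expected to widen as N(t) becomes small relative to the number of pairs of sequences.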
Substituting N(t) = N(0) − αt into (2), we obtain   
$$P_n(t) = \frac{\binom{n}{2}}{2}\, N(0)^{-\frac{\binom{n}{2}}{2\alpha}} \left\{ N(0) - \alpha t \right\}^{\frac{\binom{n}{2}}{2\alpha} - 1}. \tag{3}$$
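The substitution rests on one elementary integral. It is written out here for α ≠ 0 as an intermediate step that the text leaves implicit:

```latex
\int_0^t \frac{dx}{N(0)-\alpha x}
  = -\frac{1}{\alpha}\,\ln\frac{N(0)-\alpha t}{N(0)},
\qquad\text{so}\qquad
\exp\left(-\frac{\binom{n}{2}}{2}\int_0^t \frac{dx}{N(x)}\right)
  = \left\{\frac{N(0)-\alpha t}{N(0)}\right\}^{\binom{n}{2}/(2\alpha)} .
```

Multiplying this factor by the prefactor of (2), the binomial coefficient divided by 2{N(0) − αt}, lowers the exponent of N(0) − αt by one and gives the form in (3).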
Using this probability distribution, we can obtain the mean time until the first coalescence, which is given as follows:   
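$$\text{If } \alpha \ge 0,\quad E(t) = \int_0^{N(0)/\alpha} t\,P_n(t)\,dt = \frac{2N(0)}{2\alpha + \binom{n}{2}} \tag{4a}$$

$$\text{If } 0 > \alpha > -\binom{n}{2}\Big/2,\quad E(t) = \int_0^{\infty} t\,P_n(t)\,dt = \frac{2N(0)}{2\alpha + \binom{n}{2}} \tag{4b}$$

$$\text{If } \alpha \le -\binom{n}{2}\Big/2,\quad E(t) = \int_0^{\infty} t\,P_n(t)\,dt = \infty \tag{4c}$$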
It is noted that in this model the population size was zero N(0)/α generations ago when α > 0. Equation (4c) indicates that if $\alpha \le -\binom{n}{2}/2$, E(t) is no longer definable. In the rest of this paper, we will only consider the case of α > −0.5, where $\alpha > -\binom{n}{2}/2$ is satisfied for every possible n. As we shall see later, the method to obtain E(Sn) necessitates the calculation of E(t) for every n. Thus, this model cannot be applied to the case where the population decreased at an extremely rapid pace (α ≤ −0.5).
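The closed form in (4a) can be checked by integrating t P_n(t) numerically. The sketch below does this for α > 0 with a plain midpoint rule. The function names, the step count and the choice of integration rule are assumptions made for illustration. The density is written in a rearranged but algebraically identical form of (3), which avoids overflow when the exponent is large.

```python
from math import comb


def coalescence_density(t, n, n0, alpha):
    """Equation (3), rearranged as C/(2N(t)) * (N(t)/N(0))**(C/(2*alpha)).

    Here C is the binomial coefficient n choose 2. Requires alpha != 0.
    """
    c = comb(n, 2)
    size = n0 - alpha * t
    return c / (2.0 * size) * (size / n0) ** (c / (2.0 * alpha))


def mean_time_closed(n, n0, alpha):
    """Equations (4a)/(4b): valid for alpha > -C(n,2)/2."""
    return 2.0 * n0 / (2.0 * alpha + comb(n, 2))


def mean_time_numeric(n, n0, alpha, steps=200_000):
    """Midpoint-rule integral of t*P_n(t) over (0, N(0)/alpha), alpha > 0."""
    if alpha <= 0:
        raise ValueError("this sketch only integrates the alpha > 0 case")
    upper = n0 / alpha  # time at which the population size reaches zero
    h = upper / steps
    total = 0.0
    for j in range(steps):
        t = (j + 0.5) * h
        total += t * coalescence_density(t, n, n0, alpha)
    return total * h


if __name__ == "__main__":
    print(mean_time_numeric(3, 10_000, 0.1), mean_time_closed(3, 10_000, 0.1))
```

For α < 0 the upper limit is infinite, so a numerical check would need a truncation point or a change of variable; that case is left out of this sketch.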

Sn depends on the sample size. When the sample size is two, the expected number of segregating sites can be obtained from E(t) for n = 2. Namely, we have   

$$E(S_2) = 2v \times \frac{2N(0)}{2\alpha + 1} = \frac{4N(0)v}{2\alpha + 1}, \tag{5}$$
where v is the mutation rate per sequence per generation. The expectation of the average number of pairwise nucleotide differences (k) is the same as the expected number of segregating sites for n = 2, so that we have   
$$E(k) = \frac{4N(0)v}{2\alpha + 1}. \tag{6}$$
Nucleotide diversity (π) is defined as the number of pairwise nucleotide differences per nucleotide site, so that the expectation of π is given by   
$$E(\pi) = \frac{4N(0)u}{2\alpha + 1}, \tag{7}$$
where u is the mutation rate per nucleotide site per generation. When the sample size is three, we consider two events. In the first event, three sequences have two common ancestral sequences for the first time t generations ago. In the second event, these two sequences have one common ancestral sequence. Then, the expected number of segregating sites can be obtained from   
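$$E(S_3) = \int_0^C \left\{ \frac{4N(t)v}{2\alpha + 1} + 3vt \right\} P_3(t)\,dt, \tag{8}$$
where C is N(0)/α for α ≥ 0 or ∞ for 0 > α > −1/2. Noting N(t) = N(0) − αt and $\int_0^C t\,P_3(t)\,dt = 2N(0)/(2\alpha + 3)$, we have
$$\begin{aligned}
E(S_3) &= 4\left\{ N(0) - \frac{2N(0)\alpha}{2\alpha + 3} \right\} v\,\frac{1}{2\alpha + 1} + \frac{6N(0)v}{2\alpha + 3} \\
&= 4N(0)v \left( \frac{3}{2\alpha + 3} \cdot \frac{1}{2\alpha + 1} + \frac{1}{2} \cdot \frac{3}{2\alpha + 3} \right).
\end{aligned} \tag{9}$$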
In this way, we can obtain the expected number of segregating sites in a sample of n sequences, which is given by   
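$$\begin{aligned}
E(S_n) &= \int_0^C \left\{ 4N(t)v \sum_{i=1}^{n-2} \left( \frac{1}{i} \prod_{k=i+1}^{n-1} \frac{\binom{k}{2}}{2\alpha + \binom{k}{2}} \right) + nvt \right\} P_n(t)\,dt \\
&= 4\left\{ N(0) - \frac{2N(0)\alpha}{2\alpha + \binom{n}{2}} \right\} v \sum_{i=1}^{n-2} \left( \frac{1}{i} \prod_{k=i+1}^{n-1} \frac{\binom{k}{2}}{2\alpha + \binom{k}{2}} \right) + \frac{2N(0)nv}{2\alpha + \binom{n}{2}} \\
&= 4N(0)v \sum_{i=1}^{n-1} \left( \frac{1}{i} \prod_{k=i+1}^{n} \frac{\binom{k}{2}}{2\alpha + \binom{k}{2}} \right).
\end{aligned} \tag{10}$$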
Equation (10) can be proved by mathematical induction on n, following the pattern of (5) and (9). The amount of DNA polymorphism per nucleotide site can be estimated by
$$\theta = S_n \Big/ \left( m \sum_{i=1}^{n-1} \frac{1}{i} \right), \tag{11}$$
where m is the number of nucleotide sites in a sequence. Since u is equal to v/m, the expectation of θ is given by   
$$E(\theta) = 4N(0)u \sum_{i=1}^{n-1} \left( \frac{1}{i} \prod_{k=i+1}^{n} \frac{\binom{k}{2}}{2\alpha + \binom{k}{2}} \right) \Bigg/ \sum_{i=1}^{n-1} \frac{1}{i}. \tag{12}$$
Numerical examples of E(π) and E(θ) are shown in Table 1, where N(0) = 10,000 and u = 2.5 × 10⁻⁸ were used. When the population size was constant (α = 0), both E(π) and E(θ) are 0.001 as expected. When the population size decreased linearly (α < 0), both E(π) and E(θ) are larger than 0.001. On the other hand, both E(π) and E(θ) are smaller than 0.001 when the population size increased linearly (α > 0). We also notice that E(π) > E(θ) when α < 0 and that E(π) < E(θ) when α > 0. This suggests that Tajima's D statistic (Tajima, 1989a) tends to be positive when the population size decreased linearly (α < 0) and negative when the population size increased linearly (α > 0). This is consistent with the results obtained from the sudden change model (Tajima, 1989b).
Table 1. Expected amounts of DNA polymorphism (π and θ) and the expected time to the MRCA in a sample of n sequences, where N(0) = 10,000 and u = 2.5 × 10⁻⁸ were used

α       E(π)      E(θ), n = 10   E(θ), n = 100   E(Tn), n = 10   E(Tn), n = 100
−0.3    0.00250   0.00232        0.00188         105,730         121,627
−0.2    0.00167   0.00159        0.00140         66,340          75,039
−0.1    0.00125   0.00122        0.00115         47,171          52,570
0       0.00100   0.00100        0.00100         36,000          39,600
0.1     0.00083   0.00085        0.00090         28,771          31,288
0.2     0.00071   0.00074        0.00082         23,758          25,578
0.3     0.00063   0.00066        0.00076         20,106          21,457
0.5     0.00050   0.00055        0.00068         15,194          15,983
1       0.00033   0.00039        0.00055         9,147.5         9,403.6
2       0.00020   0.00025        0.00043         4,911.9         4,956.7
5       0.00009   0.00013        0.00029         1,998.5         1,999.7
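The E(π) and E(θ) columns of Table 1 follow directly from (7) and (12). The sketch below is one way to evaluate them; the function names are illustrative assumptions, and the parameter values are those stated in the table caption.

```python
from math import comb


def expected_pi(n0, u, alpha):
    """Equation (7); requires alpha > -0.5."""
    return 4.0 * n0 * u / (2.0 * alpha + 1.0)


def branch_sum(n, alpha):
    """Sum over i of (1/i) times the product over k = i+1..n, as in (10) and (12)."""
    total = 0.0
    product = 1.0
    # Walk i from n-1 down to 1 so the product over k = i+1..n
    # is extended by one factor per step.
    for i in range(n - 1, 0, -1):
        k = i + 1
        product *= comb(k, 2) / (2.0 * alpha + comb(k, 2))
        total += product / i
    return total


def expected_theta(n, n0, u, alpha):
    """Equation (12): E(theta) based on the number of segregating sites."""
    harmonic = sum(1.0 / i for i in range(1, n))
    return 4.0 * n0 * u * branch_sum(n, alpha) / harmonic


if __name__ == "__main__":
    n0, u = 10_000, 2.5e-8  # values used for Table 1
    for alpha in (-0.3, -0.1, 0.0, 0.1, 1.0, 5.0):
        pi = expected_pi(n0, u, alpha)
        th10 = expected_theta(10, n0, u, alpha)
        th100 = expected_theta(100, n0, u, alpha)
        print(f"{alpha:5.1f}  {pi:.5f}  {th10:.5f}  {th100:.5f}")
```

The sign of E(π) − E(θ) computed this way gives the direction in which Tajima's D is expected to lean for a given α, although the expectation of D itself is not computed here.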

The expected time, E(Tn), to the most recent common ancestor (MRCA) in a sample of n sequences can be obtained in the same way as the expected Sn was obtained. When the sample size is two, from (4a) and (4b) we have   

$$E(T_2) = \frac{2N(0)}{2\alpha + 1}. \tag{13}$$
When the sample size is three, we have   
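$$\begin{aligned}
E(T_3) &= \int_0^C \left\{ \frac{2N(t)}{2\alpha + 1} + t \right\} P_3(t)\,dt \\
&= 2\left\{ N(0) - \frac{2N(0)\alpha}{2\alpha + 3} \right\} \frac{1}{2\alpha + 1} + \frac{2N(0)}{2\alpha + 3} \\
&= 2N(0) \left( \frac{3}{2\alpha + 3} \cdot \frac{1}{2\alpha + 1} + \frac{1}{2\alpha + 3} \right).
\end{aligned} \tag{14}$$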
In this way, we can obtain the expected time to the MRCA in a sample of n sequences as   
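$$\begin{aligned}
E(T_n) &= \int_0^C \left\{ 2N(t) \sum_{i=2}^{n-1} \left( \frac{1}{\binom{i}{2}} \prod_{k=i}^{n-1} \frac{\binom{k}{2}}{2\alpha + \binom{k}{2}} \right) + t \right\} P_n(t)\,dt \\
&= 2\left\{ N(0) - \frac{2N(0)\alpha}{2\alpha + \binom{n}{2}} \right\} \sum_{i=2}^{n-1} \left( \frac{1}{\binom{i}{2}} \prod_{k=i}^{n-1} \frac{\binom{k}{2}}{2\alpha + \binom{k}{2}} \right) + \frac{2N(0)}{2\alpha + \binom{n}{2}} \\
&= 2N(0) \sum_{i=2}^{n} \left( \frac{1}{\binom{i}{2}} \prod_{k=i}^{n} \frac{\binom{k}{2}}{2\alpha + \binom{k}{2}} \right).
\end{aligned} \tag{15}$$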
Equation (15) can be proved by mathematical induction on n, following the pattern of (13) and (14).
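An independent way to check the closed forms is to simulate genealogies under the linear model. The sketch below is a Monte Carlo illustration and not the method used in the paper. It draws each waiting time by inverting the no-coalescence probability implied by (2), and returns the time to the MRCA together with the total branch length, whose expectation multiplied by v gives E(Sn). The function names, the replicate count and the seed are assumptions.

```python
import math
import random


def simulate_genealogy(n, n0, alpha, rng):
    """Draw one genealogy; return (time to MRCA, total branch length).

    With k lineages at time t, the population size at the next
    coalescence is N(t) * U**(2*alpha/C), where C = k(k-1)/2 and
    U is uniform on (0, 1].
    """
    t = 0.0
    total_length = 0.0
    for k in range(n, 1, -1):
        c = k * (k - 1) / 2.0
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        if alpha == 0:
            wait = -math.log(u) * 2.0 * n0 / c  # constant-size case
        else:
            size_now = n0 - alpha * t
            size_next = size_now * u ** (2.0 * alpha / c)
            wait = (size_now - size_next) / alpha
        total_length += k * wait
        t += wait
    return t, total_length


def simulate_means(n, n0, alpha, replicates=20_000, seed=1):
    """Average time to MRCA and total branch length over many genealogies."""
    rng = random.Random(seed)
    sum_t = 0.0
    sum_len = 0.0
    for _ in range(replicates):
        t, length = simulate_genealogy(n, n0, alpha, rng)
        sum_t += t
        sum_len += length
    return sum_t / replicates, sum_len / replicates


if __name__ == "__main__":
    print(simulate_means(10, 10_000, 0.1))
```

With n = 10, N(0) = 10,000 and α = 0.1, the simulated mean time to the MRCA should fall near the Table 1 value of 28,771 generations, up to sampling error.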

Numerical examples are shown in Table 1. When the population size was constant (α = 0), E(Tn) = 4N(0)(1−1/n) as expected (Tajima, 1983). When the population size increased very rapidly (α ≥ 2), the expected time to the MRCA is close to N(0)/α generations. As mentioned earlier, in this model the population size was zero N(0)/ α generations ago when α > 0.
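The E(Tn) columns of Table 1 can be evaluated from (15) with a short loop. The function name below is an illustrative assumption, and the sketch assumes α > −0.5 as in the text.

```python
from math import comb


def expected_tmrca(n, n0, alpha):
    """Equation (15): expected time to the MRCA, for alpha > -0.5."""
    total = 0.0
    product = 1.0
    # Walk i from n down to 2 so the product over k = i..n
    # grows by one factor per step.
    for i in range(n, 1, -1):
        product *= comb(i, 2) / (2.0 * alpha + comb(i, 2))
        total += product / comb(i, 2)
    return 2.0 * n0 * total


if __name__ == "__main__":
    n0 = 10_000
    for alpha in (0.0, 1.0, 2.0, 5.0):
        bound = n0 / alpha if alpha > 0 else float("inf")
        print(alpha, expected_tmrca(10, n0, alpha), bound)
```

Printing N(0)/α next to E(Tn) makes the comparison in the text easy to see: for large α the expected time to the MRCA approaches the time at which the population size reaches zero.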

Although the model presented in this paper might not be realistic, it may be useful for understanding the effect of change in population size on the amount of DNA polymorphism. This model can be applied not only to increased populations but also to decreased populations.

REFERENCES
 
© 2017 by The Genetics Society of Japan