2015 Volume 90 Issue 3 Pages 123-131
Mutations that have occurred in human genomes provide insight into various aspects of evolutionary history such as speciation events and degrees of natural selection. Comparing genome sequences between human and great apes or among humans is a feasible approach for inferring human evolutionary history. Recent advances in high-throughput or so-called ‘next-generation’ DNA sequencing technologies have enabled the sequencing of thousands of individual human genomes, as well as a variety of reference genomes of hominids, many of which are publicly available. These sequence data can help to unveil the detailed demographic history of the lineage leading to humans as well as the explosion of modern human population size in the last several thousand years. In addition, high-throughput sequencing illustrates the tempo and mode of de novo mutations, which are producing human genetic variation at this moment. Pedigree-based human genome sequencing has shown that mutation rates vary significantly across the human genome. These studies have also provided an improved timescale of human evolution, because the mutation rate estimated from pedigree analysis is half that estimated from traditional analyses based on molecular phylogeny. Because of the dramatic reduction in sequencing cost, sequencing on-demand samples designed for specific studies is now also becoming popular. To produce data of sufficient quality to meet the requirements of the study, it is necessary to set an explicit sequencing plan that includes the choice of sample collection methods, sequencing platforms, and number of sequence reads.
The whole-genome sequences of humans and great apes have been cataloged (Chimpanzee Sequencing and Analysis Consortium, 2005; International Human Genome Sequencing Consortium, 2004; Locke et al., 2011; Prufer et al., 2012; Scally et al., 2012). This sequence information has helped to clarify the evolutionary history of the lineage leading to modern humans in the last millions of years. Additionally, high-throughput sequencing technologies have significantly advanced in the last five years, leading to the further encouragement of these studies at the whole-genome level. While these sequencers produce very short reads (e.g., 125 bp paired-end reads with the Illumina HiSeq SBS Kit v4), they generate hundreds of gigabase pairs of sequence information in a single sequencing run. Resequencing methodology, which utilizes the characteristics of high-throughput sequencers (DePristo et al., 2011; Stratton, 2008), has revealed the whole-genome sequence variants of more than 1,000 individuals of modern humans including a considerable number of rare variants (The 1000 Genomes Project Consortium, 2012). This massive amount of sequence information provides a fine-scale picture of the demographic histories of ancient modern humans during the last tens of thousands of years (Mathieson and McVean, 2014; O’Connor et al., 2015). In addition, resequencing techniques have successfully sequenced the genomes of archaic humans and ancient modern humans whose DNAs were extracted from fossils (Fu et al., 2014; Meyer et al., 2012; Prufer et al., 2014).
High-throughput sequencing has also clarified how mutations occurred in the human genome, directly providing a timescale for human evolutionary studies from the sequence data. Effective population size N and speciation time T are scaled by μ, the number of mutations per nucleotide per generation, which is estimated based on sequence comparison. The value of μ in the human genome had been estimated based on a phylogenetic approach, calculating nucleotide sequence divergence between orthologous sequences of human and chimpanzee with the fossil record as a calibration (Eyre-Walker and Keightley, 1999; Nachman and Crowell, 2000). On the other hand, high-throughput sequencing directly counts the number of mutations per nucleotide per generation by calling de novo mutations in the human germline based on a whole-genome comparison between parents and children (Campbell et al., 2012; Conrad et al., 2011; Kong et al., 2012; Michaelson et al., 2012; Roach et al., 2010). These studies have also updated our understanding of the tempo and mode of de novo mutations, leading to an elucidation of aspects of human evolutionary history ranging from speciation between humans and great apes to mitosis and meiosis in the germline taking place at this moment.
High-throughput DNA sequencing is becoming a core technology for genetic studies owing to the rapid increase in its throughput and the progress of in silico post-sequencing analysis methods (Koboldt et al., 2013). In addition, a dramatic reduction in sequencing cost facilitates on-demand high-throughput sequence production optimized to an experimental plan. Here, I introduce recent progress in our understanding of the tempo and mode of spontaneous mutations in the human and great ape genomes based on high-throughput sequencing. I then review the achievements of human evolutionary studies based on massive sequence information. Finally, I consider how we handle large-scale sequence information for human evolutionary studies.
Analysis of the enormous amount of data yielded by high-throughput sequencers has demonstrated newly occurring mutations that were identified by comparing the genome sequences of parents and offspring. Differences between these genome sequences are formed by de novo mutations, which arose in the parents’ germlines and were inherited by their children. Based on such pedigree genome sequencing, the de novo mutation rate in the human nuclear genome was estimated as approximately 0.96–1.3 × 10⁻⁸ per nucleotide per generation (Table 1) (Besenbacher et al., 2015; Campbell et al., 2012; Conrad et al., 2011; Kong et al., 2012; Michaelson et al., 2012; Roach et al., 2010). This mutation rate is half the rate estimated from a molecular phylogenetic approach using processed pseudogenes of human and chimpanzee (2.5 × 10⁻⁸ per nucleotide per generation; Table 1) (Nachman and Crowell, 2000).
Nachman and Crowell (2000) estimated the mutation rate from the equation k = 2μT/g + 4μN, assuming a nucleotide sequence difference between human and chimpanzee (k) of 0.0133, a speciation time (T) between these species of 5 million years ago, an ancestral population size at speciation (N) of 10,000, and a generation time (g) of 20 years. However, recently discovered ancestral hominid fossils have pushed the speciation time back (see WHOLE-GENOME SEQUENCES DISENTANGLE ANCESTRAL HUMAN EVOLUTIONARY HISTORY), and population genetic analyses at the whole-genome scale have indicated a larger ancestral population size than that used in the mutation rate estimation. In addition, the nucleotide divergence (k = 0.0133) included the insertion-deletion rate, whereas only single-nucleotide changes should be considered as nucleotide divergence for comparing the mutation rates between the phylogenetic and high-throughput sequencing approaches. These corrections are likely to yield a lower mutation rate even with the phylogenetic approach. Indeed, assuming k of 0.012 after excluding indels, N ranging from 59,300 to 122,000 (Hara et al., 2012; Scally et al., 2012), T at 6 million years ago, and a generation time of 20–25 years, I calculated μ as 1.10–1.67 × 10⁻⁸, which is comparable with the estimates based on high-throughput sequencing (Table 1).
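The arithmetic behind these two estimates can be checked with a minimal sketch that solves k = 2μT/g + 4μN for μ. The function and variable names are illustrative and not taken from any cited study; the parameter values are those quoted in the text.

```python
def mutation_rate(k, t_years, n_anc, g_years):
    """Solve k = 2*mu*T/g + 4*mu*N for mu (per nucleotide per generation)."""
    return k / (2.0 * t_years / g_years + 4.0 * n_anc)

# Original parameter set of Nachman and Crowell (2000), as quoted in the text.
original = mutation_rate(k=0.0133, t_years=5e6, n_anc=10_000, g_years=20)

# Revised parameters: k without indels, larger N, older T, g of 20 or 25 years.
revised = [
    mutation_rate(k=0.012, t_years=6e6, n_anc=n, g_years=g)
    for n in (59_300, 122_000)
    for g in (20, 25)
]

print(f"original: {original:.2e}")
print(f"revised:  {min(revised):.2e} to {max(revised):.2e}")
```

Under these inputs the original setting gives about 2.5 × 10⁻⁸ and the revised settings span about 1.10–1.67 × 10⁻⁸, matching the values stated above.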
Prior to the study by Nachman and Crowell, a similar mutation rate (2.36 × 10⁻⁸ per nucleotide per generation) was estimated based on synonymous substitutions in the major histocompatibility complex loci of human, apes and Old World monkeys (Satta et al., 1993). The discrepancy of mutation rates between this study and the pedigree genome analyses can be explained by a decrease of the mutation rate in the great ape lineage, known as hominoid slowdown (Elango et al., 2006; Kim et al., 2006; Li and Tanimura, 1987). This variation in mutation rates is likely because of the longer generation time of hominoids than that of Old World monkeys (Gage, 1998; Langergraber et al., 2012), leading to an increase in the number of replication errors as the number of cell divisions in gametogenesis increases (Li and Tanimura, 1987). A comparison of whole-genome sequences among human, great apes and rhesus macaque indicated that mutation rates in the hominid lineages were approximately half that in the lineage of the Old World monkeys (Scally and Durbin, 2012). For these reasons, it is plausible to use the mutation rate based on pedigree data for elucidating human evolution within the last millions of years.
For identifying de novo mutations based on high-throughput sequencing, candidate mutation sites are usually treated with extreme caution by employing one-by-one validation such as Sanger resequencing and hybridization capture (Conrad et al., 2011; Kong et al., 2012), even when de novo mutations have been identified with very deep coverage. This is because the sequencing error rate is still non-negligible (e.g., 8.16 × 10⁻⁴ per base pair; Roach et al., 2010). However, discrepancies in estimated mutation rates were observed between whole-genome and exome sequence analyses. Exome sequencing of hundreds of families showed a mutation rate (1.3–2.2 × 10⁻⁸ per nucleotide per generation; Table 1) (Neale et al., 2012; O’Roak et al., 2012; Sanders et al., 2012) that was higher than rates based on whole-genome sequencing. The difference can be explained by more frequent de novo mutations in coding regions than in intergenic regions, which reflects the higher mutation rate in GC-rich coding regions than in intergenic regions (O’Roak et al., 2012; Sanders et al., 2012).
The large contribution of paternal de novo mutations (Table 1) can be explained by the large number of opportunities for replication errors to occur during cell divisions in spermatogenesis. In addition, spermatids in older individuals have experienced more cell divisions than those in younger ones, leading to more mutations caused by DNA replication errors (Crow, 2000). These observations lead to the idea that the number of de novo mutations transmitted to the children’s genomes should increase with paternal age. A project in Iceland sequenced the whole genomes of 78 trios consisting of 219 individuals, which included families with hereditary Autism Spectrum Disorders (ASD) and schizophrenia. The results showed a positive correlation between numbers of de novo mutations in the offspring genomes and paternal age: two mutations arise per year and the number of mutations doubles every 16.5 years (Kong et al., 2012). This observation explained the increased risks of ASD and schizophrenia with increasing paternal age (Kong et al., 2012). The results are also consistent with the correlation between paternal age and the birth rate of children affected with chondrodystrophia, which was noted by Weinberg more than 100 years ago (reviewed in Crow, 2000). Although maternal age did not correlate with the point mutation rate, the project revealed that the numbers of recombinations in the offspring’s genomes increased with maternal age (Kong et al., 2004).
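The two figures reported for the paternal age effect can be read as a linear slope (two mutations per year) and an exponential doubling time (16.5 years); treating them as two alternative models is an assumption of this sketch. The reference point used below (60 mutations at a paternal age of 30) is a hypothetical placeholder chosen for illustration, not a value from the text.

```python
def linear_model(age, ref_age, ref_count, slope_per_year=2.0):
    """Expected de novo mutations if the count grows by a fixed number per year."""
    return ref_count + slope_per_year * (age - ref_age)

def exponential_model(age, ref_age, ref_count, doubling_years=16.5):
    """Expected de novo mutations if the count doubles every `doubling_years`."""
    return ref_count * 2.0 ** ((age - ref_age) / doubling_years)

# Hypothetical reference point, for illustration only.
REF_AGE, REF_COUNT = 30, 60

for age in (20, 30, 40, 50):
    lin = linear_model(age, REF_AGE, REF_COUNT)
    exp = exponential_model(age, REF_AGE, REF_COUNT)
    print(f"paternal age {age}: linear {lin:.1f}, exponential {exp:.1f}")
```

Both models agree at the reference age by construction and diverge for older fathers, which is why the choice of model matters mainly at the extremes of the age range.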
A similar mutation rate to that in modern humans was observed in chimpanzees (Table 1): whole-genome resequencing of a chimpanzee family showed a mutation rate (0.46 × 10⁻⁹ per nucleotide per year) equivalent to that of the human genome (Venn et al., 2014). On the other hand, the paternal contribution of de novo mutations in chimpanzee (89%), as well as the paternal age effect (three mutations per year), was relatively high compared to human (Table 1). The high bias of the mutation rate toward males may be because more sperm is produced in chimpanzees due to higher sperm competition (Møller, 1989).
Comparison of extant and ancient modern human genomes enables us to infer mutation rate based on molecular phylogeny over a shorter time scale than that of human and chimpanzee. Based on the inference of additional mutations in the genomes of extant humans compared to the genome of an ancient modern human, a male from a population that lived in western Siberia 45,000 years ago, the de novo mutation rate was inferred to be 0.44–0.63 × 10⁻⁹ per nucleotide per year (Table 1) (Fu et al., 2014). This result is consistent with those from the human pedigree genome sequencing (Table 1).
Single-cell genome sequencing, accomplished by whole-genome amplification, has illustrated genomic variation across gametes as well as somatic cells (Blainey and Quake, 2014). Eight single sperm cells from a 40-year-old Caucasian donor were collected and sequenced individually (Wang et al., 2012). The number of de novo mutations per nucleotide per generation based on the sperm genome sequencing was more than twice that calculated from human pedigree genome sequencing. This is because pedigree genome sequencing provides an averaged mutation rate of a parental diploid while sperm genome sequencing identifies mutations in a paternal haploid. In addition, the donor in the sperm genome sequencing (40 years old; Table 1) was older than the paternal donors in the pedigree sequencing experiments (up to 33.6 years old; Table 1), which is likely to have resulted in more mutations in the sperm genomes. Based on the number of paternal mutations, I calculated the number of mutations per genome per cell division in spermatogenesis and found that these values were comparable among the sperm and pedigree genome sequencing experiments (Table 1).
The nucleotide mutation rate is not always homogeneous throughout the genome. A well-known example is the hypermutability of CpG dinucleotides, which have a 15-fold higher mutation rate than other sites (Elango et al., 2008), because of the spontaneous deamination of methylated cytosine in a CpG into thymine (Bird, 1980). Recently, non-random distribution of mutation rate across the human genome has been examined in detail based on high-throughput sequencing (Campbell et al., 2012; Kong et al., 2012; Michaelson et al., 2012). De novo mutations identified in the genomes of ten families, each composed of parents and a pair of identical twins, revealed that DNase I hypersensitivity, high GC content, predominant nucleosome occupancy, high recombination rate, and rich simple- and trinucleotide-repeats are associated with increased mutation rate (Michaelson et al., 2012).
Analyses based on large-scale sequence data have indicated that non-random variation of nucleotide mutation rates is also associated with genomic structural characters. Comparative genomics between humans and other primates revealed that neutral mutations were abundant in open chromatin (Martincorena and Luscombe, 2013) as well as highly transcribed regions (Park et al., 2012). In addition, mapping SNV distribution and replication timing to human chromosomes revealed that point mutations were abundant in genomic regions that are replicated late (Koren et al., 2012; Stamatoyannopoulos et al., 2009). One explanation for this is a shortage of time for DNA repair (Stamatoyannopoulos et al., 2009). Moreover, variation in mutation rates has been observed even between chromosomes. The high mutation rates of the mitochondrial genome and Y chromosome are well-known examples. The mutation rate of human mtDNA (1.67 × 10⁻⁸ per nucleotide per year; Table 1) (Soares et al., 2009) is more than ten times as high as that of the human nuclear genome, possibly because of the highly oxidative environment within the mitochondrion (Galtier et al., 2009). For the Y chromosome, an excess of cell divisions in male gametogenesis (Crow, 2000; Miyata et al., 1987) is thought to increase the mutation rate. In addition, nucleotide sequence divergence between humans and chimpanzees differs across the autosomes, implying that the mutation rate is different among the autosomes in the human and chimpanzee lineages (Hodgkinson and Eyre-Walker, 2011). This idea was supported by a coalescence analysis using the human and great ape genomes. While the averages of estimated τHC (= μTHC) and τHCG (= μTHCG), where THC and THCG represent speciation times between human and chimpanzee and between the ancestor of human/chimpanzee and gorilla, respectively, varied among the human chromosomes, the averages of τHC and τHCG of the human chromosomes were highly correlated (Fig. 1A) (Hara et al., 2012). One simple explanation for this observation is that mutation rates differ across chromosomes through mechanisms that have not yet been clarified (Hara et al., 2012).
Fig. 1. Variation of genomic mutations across the human genome at the chromosomal level. (A) Variation of τHC (= μTHC) and τHCG (= μTHCG) across the human chromosomes, where THC and THCG represent speciation times between human and chimpanzee and between the ancestor of human/chimpanzee and gorilla, respectively, and μ represents mutation rate. Autosomes and sex chromosomes are colored in blue and orange, respectively. τHC and τHCG were significantly correlated with each other (correlation coefficient, R = 0.906; p-value = 2.58 × 10⁻⁹). This correlation is simply explained by the variation of mutation rate across the chromosomes assuming a simple allopatric speciation between human and chimpanzee. The figure is reproduced from Hara et al. (2012). (B) Ultramicro inversion rates across the chromosomes. A regression line was inferred using autosomes and the X chromosome. While the Y chromosome possessed an extraordinary inversion rate, the others showed a negative correlation between inversion rate and chromosome size (R = 0.650, p-value = 7.82 × 10⁻⁴). These ultramicro inversions were identified in whole-genome alignments between human and chimpanzee by an identification method modified from Hara and Imanishi (2011). The negative correlation between the ultramicro inversion rate and chromosomal size can be explained by higher rates of recombination, a plausible source of inversion, in smaller chromosomes (Jensen-Seaman et al., 2004): frequent recombinations may cause chromosome contraction, leading to size reduction of the chromosomes (Nam and Ellegren, 2012).
As well as point mutations, genomic structural mutation rates vary across the human genome. In the Y chromosome, for example, repeat-rich structure increases the structural mutation rate (Repping et al., 2006). Also, large-scale copy number variation (CNV) data demonstrated that CNVs are abundant in repeat motifs, recombination hotspots, tandem arrays of duplicated sequences (Campbell and Eichler, 2013), and the genomic regions that are replicated in the early and late S phases (Koren et al., 2012). In addition, ultramicro inversions, minute-scale inversions ranging from 5 to 125 bp, were identified from human-chimpanzee genome alignments, and their frequencies vary in different human chromosomes (Fig. 1B) (Hara and Imanishi, 2011). These ultramicro inversions are considered to be generated by homologous recombination during meiosis (Hara and Imanishi, 2011). A negative correlation exists between inversion rates and chromosomal size, which is consistent with higher recombination rates in smaller chromosomes in human and other mammalian genomes (Jensen-Seaman et al., 2004).
Sequence comparison based on even two reference genomes of closely related species can illustrate the process of speciation and the history of ancestral populations. This is because genomes are separated into regions by a large number of recombination events during evolution, and variation in nucleotide sequence divergence across the genome accounts for different demographic histories (Osada, 2014). Based on coalescent theory, speciation processes of the lineage leading to humans have been inferred at multiple loci using sequences of three or more species. In these studies, population size and speciation time parameters, θ = 4μN and τ = μT/g, at all branching points are calculated simultaneously based on maximum likelihood, Bayesian and Hidden Markov Model frameworks. Recent progress in hominid genome sequencing has enabled us to infer human evolutionary history using the whole genomes of human and great apes (Hara et al., 2012; Locke et al., 2011; Mailund et al., 2012; Prado-Martinez et al., 2013; Scally et al., 2012; Yamamichi et al., 2012). Hara et al. (2012), for example, inferred speciation processes in the lineage leading to humans based on the MCMC framework by building whole-genome alignments among human, chimpanzee, gorilla and orangutan. In some of these studies, speciation times were scaled with the newly inferred mutation rate based on human pedigree genome resequencing (Scally and Durbin, 2012). Assuming a mutation rate of around 0.5 × 10⁻⁹ per nucleotide per year (Conrad et al., 2011; Lynch, 2010; Roach et al., 2010), the speciation time between human and chimpanzee was estimated as around 6–7 million years ago (Hara et al., 2012; Langergraber et al., 2012; Scally et al., 2012), which was consistent with the estimated speciation time based on the fossil record (Brunet et al., 2002, 2005). On the other hand, these speciation times were older than those from previous studies that employed a higher mutation rate based on phylogenetics (Hobolth et al., 2007; Locke et al., 2011; Takahata and Satta, 1997).
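Because τ = μT/g, a speciation time in years follows as T = τ/(μ/g), where μ/g is the mutation rate per nucleotide per year. The sketch below shows how the choice of rate rescales T. The value of τ is a hypothetical placeholder, chosen only so that a rate of 0.5 × 10⁻⁹ per year falls inside the 6–7 million year range quoted above; the doubled rate stands in for an older phylogeny-based calibration and is likewise an assumption.

```python
def speciation_time_years(tau, mu_per_year):
    """Convert tau (= mu * T / g, in substitutions per site) to T in years."""
    return tau / mu_per_year

TAU_HC = 3.25e-3  # hypothetical human-chimpanzee tau, for illustration only

t_pedigree = speciation_time_years(TAU_HC, 0.5e-9)   # pedigree-based rate
t_phylogeny = speciation_time_years(TAU_HC, 1.0e-9)  # assumed doubled rate

print(f"pedigree-based rate: {t_pedigree / 1e6:.2f} million years")
print(f"doubled rate:        {t_phylogeny / 1e6:.2f} million years")
```

Halving the assumed mutation rate doubles the inferred speciation time for the same τ, which is the scaling behind the older dates obtained with pedigree-based rates.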
Coalescence analyses have also been used for examining processes of speciation. Some studies using whole-genome alignments of human and chimpanzee suggested simple allopatric speciation between the species (Hara et al., 2012; Yamamichi et al., 2012). This was incompatible with a previously proposed introgression hypothesis for ancestral human and chimpanzee (Patterson et al., 2006), in which introgression was observed only in the X chromosome. However, the possibility of introgression remains open to discussion even when the human-chimpanzee speciation process is examined using the whole-genome sequences. Human-chimpanzee speciation scenarios were tested employing a Hidden Markov Model framework, and the results favored a model of gradual migration of the two new species after their initial separation (Mailund et al., 2012). On the other hand, the results of a recent comparative genomic analysis of human and chimpanzee were incompatible with Patterson et al.’s introgression hypothesis described above (Dutheil et al., 2015).
Detailed inference of the demographic history of human populations is feasible owing to whole-genome sequencing of individuals. By tracing alignments of individual genomes from several human populations, analyses based on the pairwise sequentially Markovian coalescent model revealed the time to the most recent common ancestor of each locus separated by ancestral recombination events (Li and Durbin, 2011). This method also provided effective population sizes and their changes over time as well as the timing of bottlenecks (Li and Durbin, 2011). This approach has been applied to various studies for unveiling population histories in the hominoids, including analysis using archaic human genomes (Meyer et al., 2012; Pagani et al., 2015; Prado-Martinez et al., 2013; Prufer et al., 2014; Schiffels and Durbin, 2014).
Rare variants in the human genome, defined as variants with a minor allele frequency of less than 1% in a population, often represent newly derived mutations and thus have been used for clarifying the demographic histories of modern humans (Coventry et al., 2010; The 1000 Genomes Project Consortium, 2012; Gravel et al., 2011). Although sequencing a substantial number of individuals is required to verify the rareness of the variants (Keinan and Clark, 2012), large-scale sequencing projects targeting thousands of individuals have enabled rare-variant analyses. The 1000 Genomes Project provided more than 20 million novel rare mutations, which have been widely used to infer population structures between and within populations as well as the demographic history of the populations (Colonna et al., 2014; The 1000 Genomes Project Consortium, 2012; Gravel et al., 2011; Mathieson and McVean, 2014). Similarly, target resequencing of 202 genes in more than 14,000 individuals revealed that the abundance of rare variants reflects the geography and demography of European populations (Nelson et al., 2012). Currently, exome sequencing is the best approach to sequence thousands of individuals with consistent coverage for certain regions across the genome (Fu et al., 2013; Tennessen et al., 2012). Sequencing of more than 6,500 human exomes revealed that most of the deleterious mutations in the coding regions emerged 5,000–10,000 years ago (Fu et al., 2013). This result suggested that deleterious mutations have been spreading rapidly in the human population due to its recent explosive increase. Another study using these exome data revealed the fine-scale population structure of European-American populations (O’Connor et al., 2015).
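The 1% definition, and the reason it demands large samples, can be made concrete with a short sketch. The function names and the allele counts used here are illustrative assumptions, not values from any cited dataset.

```python
def minor_allele_frequency(alt_count, total_alleles):
    """Frequency of the less common allele at a biallelic site."""
    freq = alt_count / total_alleles
    return min(freq, 1.0 - freq)

def is_rare(alt_count, total_alleles, threshold=0.01):
    """True for a polymorphic site whose minor allele frequency is below threshold."""
    maf = minor_allele_frequency(alt_count, total_alleles)
    return 0.0 < maf < threshold

def min_observable_frequency(n_diploid):
    """Smallest nonzero allele frequency observable among n diploid individuals."""
    return 1.0 / (2 * n_diploid)

# Illustrative counts among 1,000 diploid individuals (2,000 alleles).
print(is_rare(3, 2000))    # MAF 0.0015
print(is_rare(30, 2000))   # MAF 0.015
print(min_observable_frequency(50))
```

With 50 diploid individuals a singleton already has a sample frequency of 1%, so no variant in such a sample can be observed below the threshold; this is one way to see why sequencing many individuals is needed to verify rareness.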
When conducting population genetics analyses using large-scale sequence information, it is necessary to evaluate the outputs in multiple ways. To identify variations or mutations based on genomic sequence comparison, assessments are required to see if the nucleotides aligned in a column are truly homologous. This is because mismatches in alignments do not necessarily correspond to mutations: they may be merely alignment errors. A large number of whole-genome alignment tools have been developed and recently benchmarked in a competition called Alignathon (Earl et al., 2014), where substantial differences were observed between the alignment methods using sequences artificially generated based on hominid phylogeny. This competition framework is expected to accelerate the improvement of alignment performances, leading to an expansion of alignable regions and thus identification of more mutations in the human genome. Post-alignment processing is also important to obtain valid genetic variation. The first and necessary step is to discard ambiguously aligned sites (e.g., aligned sites near gaps) that include abundant false positives of the variants. An optional step is to remove hypermutable regions such as CpG sites and apparently hypermutable sites that reflect genomic changes other than simple point mutations and indels. These filtering steps are also helpful for identifying genomic structural mutations such as ultramicro inversions and biased gene conversions (Capra et al., 2013; Hara and Imanishi, 2011). However, it is likely that the discarded sites will include considerable numbers of true positives and therefore that the results obtained from such filtered alignments will be underestimates. The de novo mutation rate based on the human pedigree genome sequencing is expected to be the lower limit due to the discarding of suspicious alignments (Campbell and Eichler, 2013). 
When comparative analyses are carried out using whole-genome alignments of human and great apes, this pedigree-based mutation rate will still be applicable. This is because ambiguously aligned sites, which can include true positive mismatches, need to be filtered from whole-genome alignments, and the filtered alignments are then modeled with the lower mutation rate rather than the true rate.
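The first filtering step described above, discarding aligned sites near gaps, can be sketched for a pairwise alignment as follows. The flank width and the toy sequences are hypothetical; real pipelines operate on whole-genome alignment files and apply further filters such as CpG masking.

```python
def count_mismatches(seq_a, seq_b, flank=2):
    """Count mismatches after masking gap columns and `flank` columns on each side.

    Returns (mismatches, retained_columns) for two aligned, equal-length strings
    in which '-' marks a gap.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(seq_a)
    masked = set()
    for i in range(n):
        if seq_a[i] == "-" or seq_b[i] == "-":
            masked.update(range(max(0, i - flank), min(n, i + flank + 1)))
    kept = [i for i in range(n) if i not in masked]
    mismatches = sum(1 for i in kept if seq_a[i] != seq_b[i])
    return mismatches, len(kept)

# Toy alignment with one gap and three mismatches, for illustration only.
a = "ACGTACGT-ACGTTCGT"
b = "ACGAACGTTACCTACGT"

for flank in (0, 2, 3):
    print(flank, count_mismatches(a, b, flank))
```

In this toy case widening the flank from 2 to 3 columns removes one of the three mismatches along with the ambiguous sites, illustrating how stricter filtering can discard true positives and push divergence estimates downward.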
Large-scale sequence information from high-throughput sequencers is obtainable by sequencing original samples as well as by downloading data from public bio-archives. In both cases, the credibility of results relies on the validity of the sequencing strategy, from sample collection to sequencing platforms, as well as sequence quality. Sequence data from poorly planned experiments yield little information, no matter how the sequences are subsequently analyzed. For example, adequate numbers of individuals must be sequenced to confirm the rareness of variants (Keinan and Clark, 2012). To reduce problems caused by technical errors, all sampling for an exome study must be performed using the same target, and all libraries should be sequenced using the same sequencing platform (Clark et al., 2011; Tennessen et al., 2011). In addition, sequence coverage needs to be high to distinguish variants from sequencing errors (Clark et al., 2011; Tennessen et al., 2011).
Understanding variation of the mutation rate across the human genome leads to establishing evolutionary models for specific genomic regions as well as accurate identification of sites that are under natural selection. A mutation rate based on exome rather than whole-genome sequencing should be applied to evolutionary analyses that use coding regions because of the difference in estimated mutation rates. Some regions with extraordinarily high mutation rates such as indel hotspots may lead to an incorrect interpretation of an evolutionary event unless homoplasy is carefully considered (Kvikstad and Duret, 2014). For this reason, both tempo and mode of de novo mutations should be analyzed in detail for fine-scale evolutionary studies. Sequencing of cancer genomes is a possible candidate for such a purpose because of their high mutation rate (Pleasance et al., 2010). However, the mutation spectrum of cancer genomes is considerably different from that of the germline: SNV density in cancer genomes correlates with some features of somatic cell chromatin organization such as H3K9me3 modification levels, while SNP density in populations correlates with other features (Schuster-Bockler and Lehner, 2012). Therefore, a model system for collecting various germline mutations effectively should be established in the near future. This system will directly identify de novo mutations that occur during spermatogenesis, which can be separated from natural selection during mating and are an ultimate source in generating the diversity of life.