Genes & Genetic Systems
Online ISSN : 1880-5779
Print ISSN : 1341-7568
ISSN-L : 1341-7568
Short communications
AMAP: A pipeline for whole-genome mutation detection in Arabidopsis thaliana
Kotaro Ishii, Yusuke Kazama, Tomonari Hirano, Michiaki Hamada, Yukiteru Ono, Mieko Yamada, Tomoko Abe

2016 Volume 91 Issue 4 Pages 229-233

ABSTRACT

Detection of mutations at the whole-genome level is now possible by the use of high-throughput sequencing. However, determining mutations is a time-consuming process due to the number of false positives provided by mutation-detecting programs. AMAP (automated mutation analysis pipeline) was developed to overcome this issue. AMAP integrates a set of well-validated programs for mapping (BWA), removal of potential PCR duplicates (Picard), realignment (GATK) and detection of mutations (SAMtools, GATK, Pindel, BreakDancer and CNVnator). Thus, all types of mutations such as base substitution, deletion, insertion, translocation and chromosomal rearrangement can be detected by AMAP. In addition, AMAP automatically distinguishes false positives by comparing lists of candidate mutations in sequenced mutants. We tested AMAP by inputting already analyzed read data derived from three individual Arabidopsis thaliana mutants and confirmed that all true mutations were included in the list of candidate mutations. The result showed that the number of false positives was reduced to 12% of that obtained in a previous analysis that lacked a process of reducing false positives. Thus, AMAP will accelerate not only the analysis of mutation induction by individual mutagens but also the process of forward genetics.

MAIN

Using high-throughput sequencing (HTS) technologies, whole-genome re-sequencing can now be performed in organisms whose genomes have already been sequenced. Whole-genome re-sequencing enables the rapid identification of genes responsible for mutant traits in many model organisms, including yeast (Edwards and Gifford, 2012), zebrafish (Bowen et al., 2012; Obholzer et al., 2012), Caenorhabditis elegans (Minevich et al., 2012), Arabidopsis thaliana (Schneeberger et al., 2009; Ashelford et al., 2011; Austin et al., 2011; Uchida et al., 2011) and rice (Abe et al., 2012; Fekih et al., 2013). This approach has accelerated forward genetic studies.

In forward genetic studies, appropriate mutagens need to be selected for obtaining the mutants of interest. Chemical mutagens such as ethyl methanesulfonate (EMS) are widely used for inducing mutations and the above-described methods are suitable for detecting EMS-induced mutations, such as base substitution. On the other hand, various kinds of ionizing radiation, including fast-neutron and heavy-ion beam radiation, can be used as effective mutagens, and have traditionally been believed to induce diverse mutations, including base substitution, deletion, insertion and chromosomal rearrangement. These mutations can now be identified at the whole-genome level using HTS. In A. thaliana, fast-neutron-induced mutations were revealed to be mainly base substitutions and small deletions (Belfield et al., 2012). We have previously identified the mutation spectrum of the heavy-ion beam in A. thaliana as comprising base substitutions, deletions, insertions and chromosomal rearrangements (Hirano et al., 2015). The size of deletions increases with increasing value of the linear energy transfers (LETs) of heavy-ion beams (Kazama et al., 2011, 2013; Hirano et al., 2012). Whole-genome identification using HTS confirmed that base substitutions, deletions, insertions and chromosomal rearrangements were induced at the whole-genome level (Hirano et al., 2015). Thus, it is now possible to determine any type of mutation using HTS.

However, for detection of whole-genome mutations, different programs need to be used for each target mutation. Base substitutions or small insertions/deletions are detected by SAMtools (Li et al., 2009) and GATK (McKenna et al., 2010), while large deletions or chromosomal rearrangements are detected by Pindel (Ye et al., 2009), BreakDancer (Chen et al., 2009) and CNVnator (Abyzov et al., 2011). An additional problem in mutation detection using HTS is that the output lists of candidate mutations generated by these programs contain a number of false positives. The possible causes of false positives include mismapping of sequencing reads or SNPs between the original accession and the accession used in mutation induction. Belfield et al. (2012) and Hirano et al. (2015) confirmed all candidate mutations using a genome browser, but this is a very time-consuming process.

In this study, we have developed a novel pipeline, AMAP (automated mutation analysis pipeline), which conducts an integrated set of mutation analyses (mapping, removal of potential PCR duplicates and detection of mutations) using several programs. In AMAP, false positives are automatically determined by searching for repetitive sequences near candidate mutations and by comparing the lists of candidate mutations among sequenced mutants. We tested AMAP using HTS data that were previously analyzed by Hirano et al. (2015). HTS analysis using AMAP will accelerate forward genetics and gene function analysis.

AMAP consists of Perl scripts and requires the following software: BWA (ver. 0.6.2, Li and Durbin, 2009), Picard (ver. 1.114, http://broadinstitute.github.io/picard), RepeatMasker (ver. open-4.0.5, http://www.repeatmasker.org), SAMtools (ver. 0.1.19), GATK (ver. 3.2.2), Pindel (ver. 0.2.4t), BreakDancer (ver. 1.4.5), CNVnator (ver. 0.3) and SnpEff (ver. 3.6, Cingolani et al., 2012). The A. thaliana reference genome sequence (in FASTA format), gene sets data (in GTF format) and variation data (in VCF format) are also required and are available in EnsemblPlants (http://plants.ensembl.org).

The workflow of AMAP is shown in Fig. 1. AMAP is designed to accept paired-end read data generated by the HiSeq sequencing system (Illumina, Cambridge, UK). When sequencing reads in FASTQ format obtained from multiple mutants are input into AMAP, mapping by BWA and removal of PCR duplicates by Picard are performed automatically, as in Hirano et al. (2015). In addition, AMAP realigns reads using GATK to refine the mapping. Mutation detection is then conducted automatically using SAMtools, Pindel and BreakDancer. AMAP also detects short indels and SNPs using GATK, and copy number variants using CNVnator. Mutation candidates in the mitochondrion or plastid are excluded. For the results of GATK and SAMtools, known information about SNPs is added by SnpEff. The output files of all mutants generated from each program are merged into a single file.

Automatic removal of false positives risks removing true mutations as well, which would create false negatives. Thus, AMAP does not delete the estimated false positives but only adds flags to them so that they can easily be distinguished in the output files. In the SAMtools and GATK outputs, mutation candidates whose positions are covered by fewer than five or more than 1,000 reads are marked as false positives. In the GATK output, SNP candidates that failed to pass the filter “QUAL < 30.0 || QD < 5.0” and indel candidates that failed to pass the filter “QUAL < 10.0” or “MQ0 ≥ 4 && ((MQ0 / (1.0 * DP)) > 0.1)” are also marked as false positives. In the CNVnator output, mutation candidates with a depth (average number of covered reads at both ends of the mutation region) of less than five or more than 1,000 are marked as false positives. In the Pindel and BreakDancer outputs, mutation candidates with a depth of less than five or more than 1,000, or those in which the ratio of the number of reads supporting the mutation to the depth is less than 0.1, are marked as false positives. Mutation candidates commonly detected in at least two mutants are evaluated by AMAP as false positives stemming from pre-existing polymorphisms, although AMAP works properly with a single mutant input without this function. AMAP also considers mutation candidates in or around (±10 bp) repetitive sequences detected by RepeatMasker as false positives. AMAP is available on GitHub (https://github.com/ion-beam-breeding/AMAP).
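The flagging rules described above can be illustrated with a minimal Python sketch. This is not AMAP's code (AMAP consists of Perl scripts); the `Candidate` record, the flag names and the function names are hypothetical, and only the thresholds are taken from the text. The sketch also assumes that a GATK candidate matching one of the quoted filter expressions is the one that gets flagged, and that DP is greater than zero.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

MIN_DEPTH = 5        # candidates covered by fewer reads are flagged
MAX_DEPTH = 1000     # candidates covered by more reads are flagged
MIN_SUPPORT = 0.1    # minimum supporting-reads / depth ratio (Pindel, BreakDancer)
REPEAT_MARGIN = 10   # bp on either side of a RepeatMasker interval


@dataclass
class Candidate:
    """Hypothetical record for one mutation candidate (not AMAP's format)."""
    mutant: str
    chrom: str
    pos: int
    depth: int
    support: Optional[int] = None  # supporting reads, if the caller reports them
    flags: List[str] = field(default_factory=list)


def near_repeat(chrom: str, pos: int,
                repeats: Dict[str, List[Tuple[int, int]]]) -> bool:
    """True if pos lies in, or within REPEAT_MARGIN bp of, a repeat interval."""
    return any(start - REPEAT_MARGIN <= pos <= end + REPEAT_MARGIN
               for start, end in repeats.get(chrom, []))


def gatk_filtered(kind: str, qual: float, qd: float, mq0: int, dp: int) -> bool:
    """Evaluate the quoted GATK expressions; True means the call is flagged."""
    if kind == "SNP":
        return qual < 30.0 or qd < 5.0
    # Indel: low quality, or many mapping-quality-zero reads relative to depth.
    return qual < 10.0 or (mq0 >= 4 and mq0 / (1.0 * dp) > 0.1)


def flag_candidates(candidates, repeats):
    """Attach flags to suspected false positives; no candidate is deleted."""
    # Record which mutants report a candidate at each site.
    mutants_at_site = {}
    for c in candidates:
        mutants_at_site.setdefault((c.chrom, c.pos), set()).add(c.mutant)

    for c in candidates:
        if c.depth < MIN_DEPTH or c.depth > MAX_DEPTH:
            c.flags.append("depth")
        if (c.support is not None and c.depth > 0
                and c.support / c.depth < MIN_SUPPORT):
            c.flags.append("low_support")
        if len(mutants_at_site[(c.chrom, c.pos)]) >= 2:
            c.flags.append("shared")   # likely pre-existing polymorphism
        if near_repeat(c.chrom, c.pos, repeats):
            c.flags.append("repeat")
    return candidates


# Made-up example data, for illustration only.
calls = [
    Candidate("m1", "1", 1000, depth=40),
    Candidate("m2", "1", 1000, depth=35),            # same site, second mutant
    Candidate("m1", "2", 5000, depth=3),             # too shallow
    Candidate("m1", "3", 200, depth=50, support=2),  # 2/50 supporting reads
    Candidate("m2", "4", 108, depth=30),             # 8 bp past a repeat end
    Candidate("m2", "5", 900, depth=30, support=10),
]
repeats = {"4": [(50, 100)]}
for c in flag_candidates(calls, repeats):
    print(c.mutant, c.chrom, c.pos, c.flags or "unflagged")
```

Keeping flags as a list on each record mirrors the stated design choice of marking rather than deleting candidates, so a flagged call can still be inspected in a genome browser.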

Fig. 1.

Flowchart of AMAP. Read sequences from multiple mutants were input in the FASTQ format. Mutation analyses by GATK, SAMtools, Pindel, BreakDancer and CNVnator were performed on each mutant. The results of each mutation analysis for all mutants were merged into a single TSV output file.

To test AMAP, the sequencing reads obtained from three mutants (Hirano et al., 2015) isolated after Ar-ion irradiation (50 Gy, LET = 290 keV μm⁻¹) were input. In the test, to avoid discrepancies caused by differences in program versions, the same versions of BWA (ver. 0.5.9), Picard (ver. 1.55) and SAMtools (ver. 0.1.16) as used in Hirano et al. (2015) were applied. The GATK and CNVnator outputs were confirmed using Integrative Genomics Viewer (IGV; ver. 2.3, Robinson et al., 2011).

Read files from the three Ar-ion-induced mutants that were previously analyzed by Hirano et al. (2015) were re-analyzed using AMAP. The resulting outputs generated by AMAP are shown in Tables 1 and 2. In the previous study, SAMtools, Pindel and BreakDancer detected an average of 16,521, 8,927 and 3,626 mutation candidates per mutant, respectively. However, only 149, 17 and 35 mutations detected by SAMtools, Pindel and BreakDancer, respectively, in the three mutants combined were confirmed by IGV; that is, 99.8% of the mutation candidates were false positives (Hirano et al., 2015). By contrast, AMAP output an average of 3,493, 23 and 60 mutation candidates per mutant from SAMtools, Pindel and BreakDancer, respectively, which corresponds to 12% of the total mutation candidates obtained in the previous study (Table 1). In addition, all the true mutations confirmed by Hirano et al. (2015) were detected by AMAP except for two mutations (possibly because of the version upgrade of Pindel).

Table 1. Numbers of mutation candidates output by SAMtools, Pindel and BreakDancer, and the numbers after filtration by AMAP
Mutant line  SAMtools                                    Pindel                                      BreakDancer
             Without filtering*  AMAP output  PMC** (%)  Without filtering*  AMAP output  PMC** (%)  Without filtering*  AMAP output  PMC** (%)
Ar-57-al1    16,181              3,454        21         7,712               13           0.17       5,504               35           0.64
Ar-365-as1   16,939              3,563        21         10,336              28           0.27       1,334               57           4.3
Ar-443-as1   16,442              3,462        21         8,733               28           0.32       4,040               89           2.2
*  Calculated from Hirano et al. (2015).

**  Percentage of mutation candidates after filtration by AMAP in those output by each software.
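As a worked check of the PMC column and of the overall 12% figure, the following Python sketch recomputes the percentages from the counts in Table 1. The variable and function names are illustrative only and are not part of AMAP.

```python
# Candidate counts per mutant line: (without filtering, AMAP output), from Table 1.
table1 = {
    "SAMtools": {
        "Ar-57-al1": (16181, 3454),
        "Ar-365-as1": (16939, 3563),
        "Ar-443-as1": (16442, 3462),
    },
    "Pindel": {
        "Ar-57-al1": (7712, 13),
        "Ar-365-as1": (10336, 28),
        "Ar-443-as1": (8733, 28),
    },
    "BreakDancer": {
        "Ar-57-al1": (5504, 35),
        "Ar-365-as1": (1334, 57),
        "Ar-443-as1": (4040, 89),
    },
}


def pmc(before: int, after: int) -> float:
    """Percentage of candidates remaining after filtration by AMAP."""
    return 100.0 * after / before


for program, lines in table1.items():
    for line, (before, after) in lines.items():
        # Two significant figures, as in the PMC column.
        print(f"{program:12s} {line:11s} PMC = {pmc(before, after):.2g}%")

# Pooling all programs and mutant lines gives the overall reduction.
total_before = sum(b for lines in table1.values() for b, _ in lines.values())
total_after = sum(a for lines in table1.values() for _, a in lines.values())
print(f"overall: {pmc(total_before, total_after):.0f}% of candidates remain")
```

Because every program was run on the same three mutants, the ratio of pooled totals equals the ratio of per-mutant averages quoted in the text.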

Table 2. Numbers of mutation candidates output from GATK and CNVnator, and the accuracy rates of mutation detection
Mutant line  GATK                                    CNVnator
             BS        DEL       INS       AR* (%)   DEL      DUP      AR* (%)
Ar-57-al1    82 (66)   180 (11)  110 (3)   22        44 (0)   4 (0)    0
Ar-365-as1   41 (19)   182 (15)  134 (2)   10        14 (0)   5 (0)    0
Ar-443-as1   84 (24)   187 (12)  110 (5)   11        11 (2)   7 (4)    33

Numbers in parentheses indicate mutation candidates confirmed by IGV. BS: base substitution; DEL: deletion; INS: insertion; DUP: duplication.

*  Accuracy rate: the percentage of the numbers of mutation candidates visually confirmed using IGV in those output by AMAP.
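The accuracy rate defined above can be recomputed from the counts in Table 2. The short Python sketch below does this per mutant line and then pools the three lines; the names used are illustrative and not part of AMAP.

```python
# (candidates output by AMAP, candidates confirmed by IGV), from Table 2.
# GATK tuples are BS, DEL, INS; CNVnator tuples are DEL, DUP.
table2 = {
    "Ar-57-al1": {"GATK": [(82, 66), (180, 11), (110, 3)],
                  "CNVnator": [(44, 0), (4, 0)]},
    "Ar-365-as1": {"GATK": [(41, 19), (182, 15), (134, 2)],
                   "CNVnator": [(14, 0), (5, 0)]},
    "Ar-443-as1": {"GATK": [(84, 24), (187, 12), (110, 5)],
                   "CNVnator": [(11, 2), (7, 4)]},
}


def accuracy_rate(pairs) -> float:
    """Percentage of output candidates that were visually confirmed."""
    output = sum(o for o, _ in pairs)
    confirmed = sum(c for _, c in pairs)
    return 100.0 * confirmed / output


for line, programs in table2.items():
    for program, pairs in programs.items():
        print(f"{line:11s} {program:9s} AR = {accuracy_rate(pairs):.0f}%")

# Pool the three mutant lines to obtain one rate per program.
pooled = {
    program: accuracy_rate([p for line in table2.values() for p in line[program]])
    for program in ("GATK", "CNVnator")
}
print({program: round(rate) for program, rate in pooled.items()})
```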

Detection of SNPs and short indels with GATK was not performed in the previous study (Hirano et al., 2015). Thus, we confirmed all mutation candidates detected by GATK using IGV. In the current analysis, 156 mutations were detected by GATK (Supplementary Tables S1 and S2), and 125 of these were identical to the mutations identified by SAMtools (Supplementary Table S1). The other 31 mutations were identified exclusively by GATK (Supplementary Table S2). On the other hand, SAMtools identified 24 mutations that were not detected by GATK in this study (Supplementary Table S3). The differences in the detected mutations may be due to differences in the algorithms of the two programs. Parameter tuning of each program by users to fit their own sequence data may minimize the detection of program-specific mutation candidates, although it may also increase the number of false positives.
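The partition into shared, GATK-only and SAMtools-only mutations is a set comparison. The Python sketch below shows the idea on made-up calls keyed by chromosome, position and alleles (the key layout is an assumption, not AMAP's output format), and then checks that the reported counts add up: 125 + 31 = 156 GATK mutations, and 125 + 24 = 149 SAMtools mutations.

```python
# Hypothetical confirmed calls, keyed by (chromosome, position, ref, alt).
gatk = {("1", 1200, "G", "A"), ("2", 3400, "CT", "C"), ("5", 900, "A", "AT")}
samtools = {("1", 1200, "G", "A"), ("2", 3400, "CT", "C"), ("3", 7700, "T", "G")}

shared = gatk & samtools          # detected by both programs
gatk_only = gatk - samtools       # detected exclusively by GATK
samtools_only = samtools - gatk   # detected exclusively by SAMtools

print(len(shared), len(gatk_only), len(samtools_only))

# Counts reported in the text.
reported = {"shared": 125, "gatk_only": 31, "samtools_only": 24}
gatk_total = reported["shared"] + reported["gatk_only"]
samtools_total = reported["shared"] + reported["samtools_only"]
print(gatk_total, samtools_total)
```

Keying on the alleles as well as the position treats two different changes at the same site as distinct calls; whether that matches how the supplementary tables were compiled is not stated in the text.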

CNVnator detected copy number variations in the Ar-443-as1 mutant. Confirmation by IGV revealed that all of these copy number variations were identical to those detected by Array-CGH in Hirano et al. (2015). In addition, a heterozygous deletion in the region 10,285,036–10,307,586 on chromosome 5 was identified, which could not be detected by either Pindel or BreakDancer. Thus, incorporation of CNVnator into AMAP improves the efficiency of mutation detection. We checked the mutation candidates detected by GATK and CNVnator using IGV and confirmed that 14% and 7% of the candidates, respectively, were true mutations (Table 2).

In this study, we developed a new pipeline, AMAP, for mutation detection at the whole-genome level, which integrates a set of well-validated open access programs. AMAP reduces the number of false-positive mutation candidates to 12% of those reported in a previous study (Hirano et al., 2015). This reduction enables high-throughput detection of whole-genome mutations. AMAP can analyze the sequencing reads derived from back-crossed populations to carry out mutation induction and mapping of genes responsible for the mutant phenotype as described earlier (Ashelford et al., 2011; Uchida et al., 2011). Moreover, AMAP can be applied to other model organisms if their reference sequences are available. These techniques will accelerate forward genetic studies. Finally, it should be mentioned that the candidate mutations output by AMAP still include false positives. Therefore, confirmation of the mutations using a genome browser is still required. Also, the possibility of false negatives should be considered because the same mutations may conceivably be induced independently in different mutants.

ACKNOWLEDGMENTS

This research was supported by the Council for Science, Technology and Innovation (CSTI), Cross-ministerial Strategic Innovation Promotion Program (SIP), “Technologies for creating next-generation agriculture, forestry and fisheries” (funding agency: Bio-oriented Technology Research Advancement Institution, NARO); by the Japan Society for the Promotion of Science (JSPS) through the ‘Funding Program for Next Generation World-Leading Researchers (NEXT Program)’ to T. A. (GR096) and through a Grant-in-Aid for Scientific Research (B) (Y. K., No. 25292009); by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) through KAKENHI (T. A., No. 221S0002); and by the RIKEN Biomass Engineering Program.

REFERENCES
 
© 2016 by The Genetics Society of Japan