Improvement in the Accuracy of Multiple Sequence Alignment Program MAFFT

Kazutaka Katoh; Kei-ichi Kuma; Takashi Miyata; Hiroyuki Toh

doi:10.11234/gi1990.16.22

Abstract

In 2002, we developed and released a rapid multiple sequence alignment program, MAFFT, designed to handle large numbers of sequences (up to ~5,000) and long sequences (~2,000 aa or ~5,000 nt) in a reasonable time on a standard desktop PC. In terms of accuracy, however, the previous versions of MAFFT (v.4 and lower) were outperformed in several benchmark tests by ProbCons and TCoffee v.2, both of which were released in 2004. Here we report a recent extension of MAFFT that aims to improve accuracy at as little cost in calculation time as possible. The extended version of MAFFT (v.5) has new iterative refinement options, G-INS-i and L-INS-i (collectively denoted as [GL]-INS-i in this report). These options use a new objective function that combines the weighted sum-of-pairs (WSP) score with a score similar to COFFEE, derived from all pairwise alignments. We discuss the improvement in accuracy brought by this extension, mainly using two very recently released benchmark tests, BAliBASE v.3 (for protein alignments) and BRAliBase (for RNA alignments). According to BAliBASE v.3, the overall average accuracy of L-INS-i was higher than that of the other methods released in 2004, although the differences among the most accurate methods (ProbCons, TCoffee v.2 and the new options of MAFFT) were small. The advantage in accuracy of [GL]-INS-i became greater for alignments consisting of ~50-100 sequences. Exploiting this feature of MAFFT, we also examined another possible approach to improving accuracy: incorporating homolog information collected from databases. The [GL]-INS-i options are applicable to aligning up to ~200 sequences, but not to thousands of sequences, because of their time and space complexities.
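
To make the idea of a combined objective concrete, the following is a minimal Python sketch of scoring an alignment by a weighted sum-of-pairs term plus a COFFEE-like consistency term. It is an illustration under stated assumptions, not the formula used by MAFFT: the match/mismatch/gap values, the unweighted and unnormalized consistency count, the `mix` parameter, and all function names are hypothetical and chosen only for readability.

```python
from itertools import combinations

GAP = "-"

def pair_score(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Toy column score (assumed values, not a real substitution matrix)."""
    if a == GAP and b == GAP:
        return 0.0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def wsp_score(msa, weights):
    """Weighted sum-of-pairs: sum over sequence pairs of w_i * w_j * column scores."""
    total = 0.0
    for i, j in combinations(range(len(msa)), 2):
        pair_total = sum(pair_score(a, b) for a, b in zip(msa[i], msa[j]))
        total += weights[i] * weights[j] * pair_total
    return total

def aligned_pairs(row_i, row_j):
    """Residue index pairs (pos_i, pos_j) that share a column in two gapped rows."""
    pairs, pi, pj = set(), 0, 0
    for a, b in zip(row_i, row_j):
        if a != GAP and b != GAP:
            pairs.add((pi, pj))
        pi += a != GAP  # advance the residue counter only on non-gap characters
        pj += b != GAP
    return pairs

def consistency_score(msa, library):
    """COFFEE-like term (simplified): count residue pairs in the multiple
    alignment that also occur in the library of pairwise alignments."""
    total = 0
    for i, j in combinations(range(len(msa)), 2):
        total += len(aligned_pairs(msa[i], msa[j]) & library.get((i, j), set()))
    return total

def combined_objective(msa, weights, library, mix=1.0):
    """Hypothetical combination: WSP term plus mix * consistency term."""
    return wsp_score(msa, weights) + mix * consistency_score(msa, library)
```

Here `library` maps a sequence index pair `(i, j)` to the set of residue pairs aligned in a separately computed pairwise alignment of those two sequences. An iterative refinement procedure would repeatedly realign subsets of the sequences and keep a change only when such an objective increases.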
