Genome Informatics
Online ISSN : 2185-842X
Print ISSN : 0919-9454
ISSN-L : 0919-9454
MUSCA: An Algorithm for Constrained Alignment of Multiple Data Sequences
Laxmi ParidaAris FloratosIsidore Rigoutsos
Author information
JOURNAL FREE ACCESS

1998 Volume 9 Pages 112-119

Details
Abstract

Given a set of N sequences, the Multiple Sequence Alignment problem is to align these N sequences, possibly with gaps, that brings out the best commonality of the N sequences. MUSCA1 is a two-stage approach to the alignment problem by identifying two relatively simpler sub-problems whose solutions are used to obtain the alignment of the sequences. We first discover motifs in the N sequences and then extract an appropriate subset of compatible motifs to obtain a “good” alignment. The motifs of interest to us are the irredundant motifs which are only polynomial in the input size. In practice, however, the number is much smaller (sub-linear). Notice that this step aids in a direct N-wise alignment, as opposed to composing the alignments from lower order (say pairwise) alignments and the solution is also independent of the order of the input sequences; hence the algorithm works very well while dealing with a large number of sequences. The second part of the problem that deals with obtaining a good alignment is solved using a graph-theoretic approach that computes an induced subgraph satisfying certain simple constraints. We reduce a version of this problem to that of solving an instance of a set covering problem, thus offer the best possible approximate solution to the problem (provided P≅NP). Our experimental results, while being preliminary, indicate that this approach is efficient, particularly on large numbers of long sequences, and, gives good alignments when tested on biological data such as DNA and protein sequences. We introduce the the notion of an alignment number K (2 ≤KN), a user-controlled parameter, that lends a useful flexibility to the aligning program: this additional requirement constrains the alignment to have at least K sequences agree on a character, whenever possible, in the alignment. The usefulness of the alignment number is corroborated by the users who view this as a natural constraint while dealing with a large number of sequences.

Content from these authors
© Japanese Society for Bioinformatics
Previous article Next article
feedback
Top