Genome Informatics
Online ISSN : 2185-842X
Print ISSN : 0919-9454
ISSN-L : 0919-9454
A Clustering Method for Molecular Sequences based on Pairwise Similarity
H. Matsuda, T. Ishihara, A. Hashimoto

1996 Volume 7 Pages 23-32

Abstract

This paper presents a method for clustering a large, mixed set of uncharacterized sequences produced by genome projects. As the clustering measure, we use a fast approximation of sequence similarity (the FASTA score). However, when two sequences have diverged substantially during evolution, FASTA sometimes underestimates their similarity compared with the rigorous Smith-Waterman algorithm. In addition, the distance derived from the similarity score may not be a metric, since the triangle inequality may fail when sequences have a multi-domain structure. To cope with these problems, we introduce a new graph structure, called a p-quasi complete graph, for describing a cluster of sequences with a confidence measure. We prove that a restricted version of the p-quasi complete graph problem (given a positive integer k, decide whether a graph contains a 0.5-quasi complete subgraph of size at least k) is NP-complete. We therefore outline an approximation algorithm for clustering a set of sequences into subsets corresponding to p-quasi complete graphs. The effectiveness of the method is demonstrated by applying it to the clustering of Escherichia coli protein sequences.
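
The abstract does not define a p-quasi complete graph, so the following is only a minimal sketch under a stated assumption: a vertex set is taken to be p-quasi complete when every vertex is linked to at least a fraction p of the other vertices in the set. The function names, the score values, and the threshold are hypothetical and are not taken from the paper. The toy data mimics the multi-domain case mentioned above, where sequence B shares one domain with A and another with C, while A and C are dissimilar.

```python
def build_graph(scores, threshold):
    """Link two sequences when their pairwise score reaches the threshold.

    `scores` maps a pair of sequence names to a similarity score
    (hypothetical values standing in for FASTA scores).
    """
    adj = {}
    for (a, b), score in scores.items():
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if score >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def is_p_quasi_complete(adj, members, p):
    """Check the ASSUMED definition: each vertex in `members` is adjacent
    to at least p * (len(members) - 1) other vertices in `members`."""
    members = set(members)
    n = len(members)
    if n <= 1:
        return True
    required = p * (n - 1)
    return all(len(adj.get(v, set()) & members) >= required for v in members)


# Hypothetical scores: A-B and B-C are similar, A-C is not (multi-domain B).
scores = {("A", "B"): 120, ("B", "C"): 110, ("A", "C"): 20}
adj = build_graph(scores, threshold=100)

half = is_p_quasi_complete(adj, ["A", "B", "C"], 0.5)
full = is_p_quasi_complete(adj, ["A", "B", "C"], 1.0)
```

Under this assumed definition, the three sequences form a 0.5-quasi complete set but not a complete one (p = 1), which illustrates how a relaxed density requirement could keep multi-domain relatives in one cluster even when the triangle inequality fails. This check only verifies a candidate set; it is not the approximation algorithm outlined in the paper.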

© Japanese Society for Bioinformatics