A Clustering Method for Molecular Sequences based on Pairwise Similarity

H. Matsuda; T. Ishihara; A. Hashimoto

doi:10.11234/gi1990.7.23

Abstract

This paper presents a method for clustering a large, mixed set of uncharacterized sequences produced by genome projects. As the clustering measure, we use a fast approximation of sequence similarity (the FASTA score). However, when two sequences have diverged substantially during evolution, FASTA sometimes underestimates their similarity compared with the rigorous Smith-Waterman algorithm. In addition, the distance derived from the similarity score may not be a metric, because the triangle inequality may fail when the sequences have a multi-domain structure. To cope with these problems, we introduce a new graph structure, called a p-quasi complete graph, for describing a cluster of sequences with a confidence measure. We prove that a restricted version of the p-quasi complete graph problem (given a positive integer k, decide whether a graph contains a 0.5-quasi complete subgraph of size ≥ k) is NP-complete. We therefore present the outline of an approximation algorithm for clustering a set of sequences into subsets corresponding to p-quasi complete graphs. The effectiveness of the method is demonstrated by clustering Escherichia coli protein sequences.
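The abstract does not spell out the definition of a p-quasi complete graph. The sketch below assumes a degree-based reading: a vertex set of size n is p-quasi complete if every vertex is adjacent to at least p(n - 1) of the other vertices in the set. It also assumes that an edge stands for a pair of sequences whose similarity score exceeds some chosen threshold. The function name and the example graphs are hypothetical and are not taken from the paper; this only checks a candidate cluster and is not the approximation algorithm itself.

```python
def is_p_quasi_complete(vertices, edges, p):
    """Check the assumed degree condition for a candidate cluster.

    vertices: iterable of hashable sequence identifiers
    edges:    iterable of (u, v) pairs judged similar (assumed thresholded scores)
    p:        confidence parameter in [0, 1]; p = 1 requires a complete graph
    """
    vs = set(vertices)
    n = len(vs)
    if n <= 1:
        return True  # a single vertex trivially satisfies the condition

    # Keep only distinct, undirected edges with both endpoints in the cluster.
    inside = {frozenset((u, v)) for u, v in edges
              if u in vs and v in vs and u != v}

    degree = {v: 0 for v in vs}
    for edge in inside:
        for v in edge:
            degree[v] += 1

    # Assumed definition: every vertex is linked to at least p * (n - 1) others.
    return all(d >= p * (n - 1) for d in degree.values())


# A 4-cycle: each vertex has degree 2 out of 3 possible neighbours.
cycle = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
# A path: the two end vertices have degree 1 only.
path = [("a", "b"), ("b", "c"), ("c", "d")]

print(is_p_quasi_complete("abcd", cycle, 0.5))  # True
print(is_p_quasi_complete("abcd", cycle, 1.0))  # False
print(is_p_quasi_complete("abcd", path, 0.5))   # False
```

Under this reading, lowering p tolerates missing edges, which is one way a cluster could survive the occasional underestimated FASTA score mentioned above.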
