Genome Informatics
Online ISSN : 2185-842X
Print ISSN : 0919-9454
ISSN-L : 0919-9454
Evaluating Distance Functions for Clustering Tandem Repeats
Suyog Rao, Alfredo Rodriguez, Gary Benson

2005, Volume 16, Issue 1, Pages 3-12

Abstract

Tandem repeats are an important class of DNA repeats, and much research has focused on their efficient identification [2, 4, 5, 11, 12], their use in DNA typing and fingerprinting [6, 16, 18], and their causative role in trinucleotide repeat diseases such as Huntington disease, myotonic dystrophy, and Fragile-X mental retardation. We are interested in clustering tandem repeats into groups, or families, based on sequence similarity so that their biological importance may be further explored. Clustering tandem repeats requires a notion of pairwise distance, which we obtain by alignment. In this paper we evaluate five distance functions used to produce those alignments: Consensus, Euclidean, Jensen-Shannon Divergence, Entropy-Surface, and Entropy-weighted. Analyzing and comparing these functions is important because the choice of distance metric forms the core of any clustering algorithm. We employ a novel method to compare alignments and thereby compare the distance functions themselves. We rank the distance functions using two cluster validation techniques: Average Cluster Density and Average Silhouette Width. Finally, we propose a multi-phase clustering method that produces good-quality clusters. In this study, we analyze clusters of tandem repeats from five sequences: Human Chromosomes 3, 5, 10, and X, and C. elegans Chromosome III.
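
The abstract names the distance functions and validation measures without defining them. As a rough illustration only, the sketch below gives the generic textbook forms of two of the distances (Euclidean and Jensen-Shannon divergence) and of the Average Silhouette Width. It assumes that each alignment column is summarized as a nucleotide frequency distribution over A, C, G, and T. That representation, and the helper names `column_profile`, `euclidean`, `jensen_shannon`, and `average_silhouette_width`, are assumptions made here for illustration and are not taken from the paper, whose exact definitions may differ.

```python
import math

ALPHABET = "ACGT"  # assumed nucleotide alphabet; gap characters are ignored


def column_profile(column):
    """Relative frequency of each nucleotide in one alignment column.

    `column` is a string such as "AAGA"; it must contain at least one
    character from ALPHABET, otherwise the division below fails.
    """
    counts = [column.count(base) for base in ALPHABET]
    total = sum(counts)
    return [c / total for c in counts]


def euclidean(p, q):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def _kl(p, m):
    # Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0.
    return sum(a * math.log2(a / b) for a, b in zip(p, m) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def average_silhouette_width(dist, labels):
    """Mean silhouette value over all items.

    `dist` is a symmetric distance matrix (list of lists) and `labels`
    assigns each item to a cluster. Assumes at least two clusters.
    """
    n = len(labels)
    clusters = set(labels)
    widths = []
    for i in range(n):
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:
            # Singleton cluster: silhouette is conventionally set to 0.
            widths.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance to the item's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / n
```

For example, `jensen_shannon(column_profile("AAAA"), column_profile("CCCC"))` compares two columns with no shared nucleotides, while two identical columns give a divergence of zero. A silhouette width near 1 indicates compact, well-separated clusters, and values near 0 or below indicate overlapping or misassigned items.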

© Japanese Society for Bioinformatics