Proceedings of the symposium of Japanese Society of Computational Statistics
Online ISSN : 2189-583X
Print ISSN : 2189-5813
ISSN-L : 2189-5813
26
N-gram spectral analysis on genomes and biological language models (Session 7C (IASC-ARS))
Yong Kheng GOH, Foo Weng LIM, Yean Ling LEO
CONFERENCE PROCEEDINGS FREE ACCESS

Pages 199-202

Abstract
It has been suggested previously that genomic sequences show characteristics typical of natural-language texts, such as characteristic "signature" word usage. Algorithms from natural-language processing may therefore be applied to genomic sequences as a means of clustering and of building phylogenetic trees. Following this approach, statistical N-gram analysis has been applied to the comparative analysis of whole-genome sequences of several organisms. It could be shown that a few particular DNA N-grams are abundant in one organism but occur very rarely in others, thereby serving as genome signatures. By counting the occurrences of different N-grams, one can define signature vectors of a genetic text, such as contrast value and usage departure. From the contrast value vectors, phylogenetic trees could be built using linguistic similarity measures such as correlation.
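The counting-and-correlating idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not define contrast value or usage departure, so this sketch does not attempt them: as a stand-in, it assumes plain relative N-gram frequencies as the signature vector and Pearson correlation as the similarity measure. The function names (`ngram_frequencies`, `pearson`, `similarity_matrix`) and the toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def ngram_frequencies(sequence, n):
    """Relative frequency of every possible DNA N-gram, in a fixed order.

    Assumption: this plain frequency vector stands in for the paper's
    signature vectors, whose exact definitions the abstract does not give.
    """
    sequence = sequence.upper()
    # Count overlapping windows of length n.
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    # Fixed lexicographic ordering so vectors from different genomes align.
    keys = ["".join(p) for p in product(ALPHABET, repeat=n)]
    # Windows containing symbols outside ACGT (e.g. N) are ignored.
    total = sum(counts[k] for k in keys)
    return [counts[k] / total if total else 0.0 for k in keys]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def similarity_matrix(genomes, n):
    """Pairwise correlation between the N-gram vectors of named sequences."""
    vectors = {name: ngram_frequencies(seq, n) for name, seq in genomes.items()}
    names = list(vectors)
    matrix = [[pearson(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix


if __name__ == "__main__":
    # Toy sequences for illustration only, not real genomes.
    toy = {
        "org_a": "ATATATATGCGCATATAT",
        "org_b": "ATATATGCATATATATGC",
        "org_c": "GGCCGGCCGGCCAAGGCC",
    }
    names, matrix = similarity_matrix(toy, n=2)
    for name, row in zip(names, matrix):
        print(name, [round(value, 3) for value in row])
```

One way to go from such a matrix toward a tree, again as an assumption rather than the authors' procedure, is to convert each correlation r into a distance such as 1 - r and pass the result to a standard hierarchical clustering routine.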
© 2012 Japanese Society of Computational Statistics