N-gram spectral analysis on genomes and biological language models (Session 7C (IASC-ARS))

Yong Kheng GOH; Foo Weng LIM; Yean Ling LEO

doi:10.20551/jscssymo.26.0_199

Abstract

It has previously been suggested that genomic sequences show characteristics typical of natural-language texts, such as characteristic "signature" word usage. Algorithms from natural-language processing may therefore be applied to genomic sequences as a means of clustering and of building phylogenetic trees. Following this approach, statistical N-gram analysis has been applied to the comparative analysis of whole-genome sequences of several organisms. It can be shown that a few particular DNA N-grams are abundant in one organism but occur very rarely in other organisms, and thereby serve as genome signatures. By counting the occurrences of different N-grams, one can define signature vectors of a genetic text, such as contrast value and usage departure. From the contrast value vectors, phylogenetic trees can be built using linguistic similarity measures such as correlation.
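
The counting and comparison steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define contrast value or usage departure, so this sketch substitutes plain relative N-gram frequencies as the signature vector; that substitution, the function names `ngram_frequencies` and `pearson`, and the toy sequences are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def ngram_frequencies(seq, n):
    """Relative frequency of each overlapping DNA N-gram, in a fixed order.

    Assumption: plain relative frequency stands in for the paper's
    signature vectors, whose definitions are not given in the abstract.
    """
    seq = seq.upper()
    # Slide a window of width n over the sequence, one base at a time.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Fixed lexicographic ordering over all 4**n possible N-grams, so that
    # vectors from different sequences are directly comparable.
    keys = ["".join(p) for p in product(ALPHABET, repeat=n)]
    # N-grams containing symbols outside ACGT (e.g. N) are ignored.
    total = sum(counts[k] for k in keys)
    return [counts[k] / total if total else 0.0 for k in keys]


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy sequences, far shorter than any real genome.
    genome_a = "ATGCGCGCGATATCGCGC"
    genome_b = "ATATATTTAAATATGCAT"
    vec_a = ngram_frequencies(genome_a, 2)
    vec_b = ngram_frequencies(genome_b, 2)
    print(pearson(vec_a, vec_b))
```

Pairwise correlations computed this way could be converted to distances (for example, one minus the correlation) and passed to a standard tree-building procedure; the abstract does not say which procedure was used.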
