Abstract
It has been suggested previously that genomic sequences show characteristics typical of natural-language texts, such as characteristic "signature" word usage. Algorithms from natural-language processing may therefore be applied to genomic sequences as a means of clustering and of building phylogenetic trees. Following this approach, statistical N-gram analysis has been applied to the comparative analysis of whole-genome sequences of several organisms. It could be shown that a few particular DNA N-grams are abundant in one organism but occur very rarely in other organisms, thereby serving as genome signatures. By counting the occurrences of different N-grams, one can define signature vectors of a genetic text, such as contrast value and usage departure. From the contrast value vectors, phylogenetic trees could be built by using linguistic similarity measures such as correlation.
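The counting and comparison steps described above can be illustrated with a minimal Python sketch. It is not the method of this work: the definitions of contrast value and usage departure are not given here, so the sketch substitutes plain relative N-gram frequencies as the signature vector (an assumption) and compares two vectors with the Pearson correlation. The function names `ngram_frequencies` and `pearson`, and the toy sequences, are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def ngram_frequencies(sequence, n):
    """Relative frequency of every possible DNA N-gram of length n.

    Overlapping windows are counted; windows containing symbols other
    than A, C, G, T are ignored. This is a stand-in signature vector,
    not the contrast value or usage departure named in the abstract.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    # Fixed ordering of all 4**n N-grams so vectors are comparable.
    all_ngrams = ["".join(p) for p in product("ACGT", repeat=n)]
    total = sum(counts[g] for g in all_ngrams)
    return [counts[g] / total if total else 0.0 for g in all_ngrams]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Toy sequences for illustration only; real input would be whole genomes.
    genome_a = "ATGCGCGATATATCGCGCGATATAT"
    genome_b = "ATATATATCGATATATATATCGATA"
    vec_a = ngram_frequencies(genome_a, 3)
    vec_b = ngram_frequencies(genome_b, 3)
    print(pearson(vec_a, vec_b))
```

A pairwise matrix of such similarities, converted to distances (for example 1 minus the correlation), is the kind of input a standard tree-building procedure would take; which procedure the authors used is not stated here.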