2015 Volume 90 Issue 1 Pages 43-53
Unsupervised data mining capable of extracting a wide range of information from big sequence data without prior knowledge or particular models is highly desirable in an era of big data accumulation for research on genes, genomes and genetic systems. By handling oligonucleotide compositions in genomic sequences as high-dimensional data, we have previously modified the conventional self-organizing map (SOM) for genome informatics and established the batch-learning SOM (BLSOM) for oligonucleotide composition, which can analyze more than ten million sequences simultaneously and is thus suitable for big data analyses. Oligonucleotides often represent motif sequences responsible for sequence-specific binding of proteins such as transcription factors. The distribution of such functionally important oligonucleotides is probably biased in genomic sequences, and may differ among genomic regions. In this study, we constructed BLSOMs to analyze pentanucleotide composition in 50-kb sequences derived from the human genome. The BLSOMs did not classify human sequences according to chromosome but revealed several specific zones enriched for a class of CG-containing pentanucleotides; these zones are composed primarily of sequences derived from pericentric regions. The biological significance of the enrichment of these pentanucleotides in pericentric regions is discussed in connection with cell type- and stage-dependent formation of condensed heterochromatin in the chromocenter, which is formed through association of pericentric regions of multiple chromosomes.
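To illustrate what treating pentanucleotide composition as high-dimensional data can look like in practice, the following is a minimal sketch that turns a sequence into one 1024-dimensional frequency vector per 50-kb window. It is not the authors' pipeline and does not implement BLSOM itself. The function names, the use of non-overlapping windows, the dropping of a trailing partial window, the skipping of pentanucleotides that contain ambiguous bases, and the normalization to relative frequencies are all assumptions made for illustration; the abstract does not specify these details, nor whether complementary pentanucleotides are merged.

```python
from itertools import product

K = 5  # pentanucleotides, as in the abstract
# All 4**5 = 1024 pentanucleotides in a fixed order (one vector dimension each).
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def pentanucleotide_frequencies(seq):
    """Return relative frequencies of the 1024 pentanucleotides in seq.

    Assumption: pentanucleotides containing N or other ambiguity codes
    are skipped rather than counted.
    """
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


def window_vectors(seq, window=50_000):
    """Split seq into non-overlapping windows and vectorize each one.

    Assumption: a trailing fragment shorter than `window` is dropped.
    """
    starts = range(0, len(seq) - window + 1, window)
    return [pentanucleotide_frequencies(seq[s:s + window]) for s in starts]


def cg_containing_pentanucleotides():
    """List the pentanucleotides that contain the dinucleotide CG."""
    return [kmer for kmer in KMERS if "CG" in kmer]
```

Each vector returned by `window_vectors` would be one input item for a clustering method such as a SOM, and the dimensions named by `cg_containing_pentanucleotides` are the ones to inspect when looking for enrichment of CG-containing pentanucleotides in a given map zone.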