MicroRNAs are a class of short non-coding RNAs that contain approximately 22 nucleotides and play a regulatory role in RNA silencing and translational repression. miR-92 belongs to the miR-17-92 family and has a regulatory effect on cell proliferation, apoptosis, and expression of proto-oncogenes and tumor suppressor genes. However, its function in flatfish is unclear. In this study, we used farmed Japanese flounder, Paralichthys olivaceus, and showed that gata5 is a target gene of miR-92. Overexpression of miR-92 downregulated gata5 and sox17, whereas the transcription level of ntl increased. By contrast, depletion of miR-92 resulted in increased gata5 and sox17 levels and a reduced ntl level. Moreover, thiourea treatment indicated that miR-92 may inhibit the metamorphic development of Japanese flounder. Our study suggests that miR-92 regulates the fate of endoderm and mesoderm by controlling gata5.
As a result of the extensive decoding of a massive amount of genomic and metagenomic sequence data, a large number of genes whose functions cannot be predicted by sequence similarity searches are accumulating, and such genes are of little use to science or industry. Current genome and metagenome sequencing largely depends on high-throughput and low-cost methods. In the case of genome sequencing for a single species, high-density sequencing can reduce sequencing errors. For metagenome sequences, however, high-density sequencing does not necessarily increase the sequence quality because multiple and unknown genomes, including those of closely related species, are likely to exist in the sample. Therefore, a function prediction method that is robust against sequence errors is increasingly needed. Here, we present a method for predicting protein gene function that does not depend on sequence similarity searches. Using an unsupervised machine learning method called BLSOM (batch-learning self-organizing map) for short oligopeptide frequencies, we previously developed a sequence alignment-free method for clustering bacterial protein genes according to clusters of orthologous groups of proteins (COGs), without using information from COGs during machine learning. This method allows function-unknown proteins to cluster with function-known proteins based solely on similarity in oligopeptide frequency, but it required high-performance supercomputers (HPCs). Based on a wide range of knowledge obtained with HPCs, we have now developed a strategy to correlate function-unknown proteins with COG categories using only oligopeptide frequency distances (OPDs), which can be conducted with PC-level computers. The OPD strategy is suitable for predicting the functions of proteins with low sequence similarity and is applied here to predict the functions of a large number of gene candidates discovered using metagenome sequencing.
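The idea of assigning a function-unknown protein to the category of its nearest function-known protein by oligopeptide frequency distance can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the oligopeptide length or the distance measure, so dipeptides and Euclidean distance are assumptions, and the function names (`oligopeptide_freqs`, `opd`, `nearest_category`) are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def oligopeptide_freqs(seq, k=2):
    """Return normalized k-mer frequencies of a protein sequence as a dict.

    k=2 (dipeptides) is an assumption; k-mers containing non-standard
    residues are skipped.
    """
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(m for m in kmers if set(m) <= AMINO_ACIDS)
    total = sum(counts.values())
    return {m: n / total for m, n in counts.items()} if total else {}


def opd(freq_a, freq_b):
    """Euclidean distance between two sparse frequency vectors (assumed metric)."""
    keys = set(freq_a) | set(freq_b)
    return math.sqrt(
        sum((freq_a.get(m, 0.0) - freq_b.get(m, 0.0)) ** 2 for m in keys)
    )


def nearest_category(query_seq, references, k=2):
    """Return (category, distance) of the reference closest to the query.

    references: list of (sequence, category) pairs for function-known proteins.
    """
    query = oligopeptide_freqs(query_seq, k)
    scored = ((cat, opd(query, oligopeptide_freqs(seq, k))) for seq, cat in references)
    return min(scored, key=lambda pair: pair[1])


# Toy sequences and category letters, made up for illustration only.
references = [("MKKLLKKLL", "J"), ("GGSGGSGGS", "M")]
category, distance = nearest_category("MKKLLKKLL", references)
```

Because the comparison uses only frequency vectors, no alignment is needed, which is what makes such a scheme feasible on PC-level computers.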
The genus Ficus is striking for its species diversity, ecological significance, and its often species-specific relationship with coevolved pollinating fig wasps, which has long fascinated biologists. The three closely related and generally co-distributed dioecious species Ficus hispida, F. heterostyla and F. squamosa provide an ideal system for the study of speciation, hybridization (caused by pollinator sharing) and comparative phylogeography to infer historical biogeography. We aimed to develop microsatellite markers for these allied species to facilitate such studies. A DNA library was constructed from one F. heterostyla sample, and 19 microsatellite loci were developed based on high-throughput sequencing. These markers showed relatively high polymorphism in all three fig species. The mean number of alleles per locus ranged from 3.594 to 5.286, and the mean observed and expected heterozygosity ranged from 0.469 to 0.546 and from 0.467 to 0.528, respectively. Principal coordinate, STRUCTURE and AMOVA analyses revealed different degrees of genetic differentiation within species and, despite some observed genetic admixture, indicated the presence of clear boundaries between species. In summary, we successfully developed universal microsatellite markers for three closely related Ficus species. These markers will be of great value for investigating patterns of biodiversity among the species in this model system for coevolutionary studies.
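For readers unfamiliar with the summary statistics reported above, the number of alleles and the observed and expected heterozygosity at one locus can be computed as in the sketch below. It assumes diploid genotypes and uses the simple estimator 1 - Σp² for expected heterozygosity without a small-sample correction; the abstract does not say which estimator or software was used, and the function name and allele sizes are hypothetical.

```python
from collections import Counter


def locus_summary(genotypes):
    """Summarize one microsatellite locus.

    genotypes: list of (allele1, allele2) tuples, one per diploid individual.
    Returns (number_of_alleles, observed_heterozygosity, expected_heterozygosity).
    """
    n_individuals = len(genotypes)
    allele_counts = Counter(allele for pair in genotypes for allele in pair)
    n_gene_copies = 2 * n_individuals

    # Observed heterozygosity: fraction of individuals carrying two different alleles.
    h_obs = sum(1 for a, b in genotypes if a != b) / n_individuals

    # Expected heterozygosity under Hardy-Weinberg proportions: 1 - sum(p_i^2).
    h_exp = 1.0 - sum((c / n_gene_copies) ** 2 for c in allele_counts.values())

    return len(allele_counts), h_obs, h_exp


# Made-up allele sizes (in base pairs) for four individuals at one locus.
example = [(100, 102), (100, 100), (102, 102), (100, 102)]
n_alleles, h_obs, h_exp = locus_summary(example)
```

Averaging these per-locus values over all loci within a species gives mean values of the kind quoted in the abstract.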
Unsupervised machine learning that can discover novel knowledge from big sequence data without prior knowledge or particular models is highly desirable for current genome studies. We previously established a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions, which can reveal various novel genome characteristics from big sequence data. Using this method, we found that transcription factor binding sequences (TFBSs) and CpG-containing oligonucleotides are enriched in human centromeric and pericentromeric regions; these regions support centromere clustering and form the condensed heterochromatin “chromocenter” in interphase nuclei. The number and size of chromocenters, as well as the type of centromeres gathered in individual chromocenters, vary depending on cell type. To study molecular mechanisms of cell type-dependent chromocenter formation, we analyzed distribution patterns of occurrence per Mb of hexa- and heptanucleotide TFBSs, which have been compiled by the SwissRegulon Portal, and of CpG-containing oligonucleotides. We found Mb-level islands enriched for TFBSs and CpG-containing oligonucleotides in centromeric and pericentromeric regions on all human chromosomes except chrY. Considering molecular mechanisms for cell type-dependent centromere clustering, the chromosome-dependent enrichment of a set of TFBSs and CpG-containing oligonucleotides is of particular interest, since the cellular content of TFs and methyl-CpG-binding proteins exhibits cell type-dependent regulation. A newly introduced BLSOM, which analyzed occurrences of a total of 3,946 octanucleotide TFBSs compiled by the SwissRegulon Portal, has self-organized (separated) the sequences that are characteristically enriched in TFBSs and shown that these sequences are derived primarily from centromeric and pericentromeric constitutive heterochromatin regions. Furthermore, the BLSOM identified and visualized characteristic TFBSs that are enriched in these regions.
By analyzing Hi-C data for interchromosomal interactions, the present study showed that the chromatin segments supporting the interchromosomal interactions are located primarily in Mb-level TFBS and CpG islands and are thus enriched for a wide variety of TFBSs and CpG-containing oligonucleotides.
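The "occurrence per Mb" profile described above can be illustrated with a short sketch that counts a set of oligonucleotides in consecutive fixed-size windows. This is an illustration under stated assumptions rather than the authors' pipeline: it scans one strand only, counts overlapping occurrences, ignores occurrences that span a window boundary, and the function name and toy sequence are hypothetical.

```python
def occurrences_per_window(sequence, motifs, window=1_000_000):
    """Count motif occurrences in consecutive non-overlapping windows.

    sequence: DNA string for one chromosome (single strand, an assumption).
    motifs:   iterable of oligonucleotides, e.g. TFBS hexamers or "CG".
    window:   window size in bases (1 Mb by default).
    Returns a list with one total count per window.
    """
    sequence = sequence.upper()
    motifs = [m.upper() for m in motifs]
    counts = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        total = 0
        for motif in motifs:
            pos = chunk.find(motif)
            while pos != -1:
                total += 1
                # Advance by one base so overlapping occurrences are counted.
                pos = chunk.find(motif, pos + 1)
        counts.append(total)
    return counts


# Toy sequence with a tiny window so the result is easy to inspect.
profile = occurrences_per_window("ACGCGTTTTT", ["CG"], window=5)
```

Windows whose counts stand out against the rest of the chromosome would correspond to the Mb-level islands discussed in the text; how to threshold them is left open here.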
Recently, the prospect of applying machine learning tools to automate annotation analysis of large-scale sequences from next-generation sequencers has attracted the interest of researchers. However, finding research collaborators with knowledge of machine learning techniques is difficult for many experimental life scientists. One solution to this problem is to utilise the power of crowdsourcing. In this report, we describe how we investigated the potential of crowdsourced modelling for a life science task by conducting a machine learning competition, the DNA Data Bank of Japan (DDBJ) Data Analysis Challenge. In the challenge, participants competed to build models that predict chromatin feature annotations from DNA sequences. The challenge engaged 38 participants, with a cumulative total of 360 model submissions. The top model achieved an area under the curve (AUC) score of 0.95. Over the course of the competition, the overall performance of the submitted models improved by 0.30 in AUC score relative to the first submitted model. Furthermore, the 1st- and 2nd-ranking models utilised external data such as genomic location and gene annotation information with specific domain knowledge. Incorporating this domain knowledge led to improvements of approximately 5%–9%, as measured by the AUC scores. This report suggests that machine learning competitions will lead to the development of highly accurate machine learning models for use by experimental scientists unfamiliar with the complexities of data science.
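As a reminder of what the AUC scores above measure, the sketch below computes the area under the ROC curve for a binary prediction task as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. It assumes that the reported AUC is the ROC AUC and that labels are coded 0/1; the function name and the toy data are hypothetical, and the challenge's actual scoring code is not described in the abstract.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 ground-truth annotations.
    scores: iterable of predicted scores (higher means more likely positive).
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy predictions: one of the two positives is ranked below a negative.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported top score of 0.95 in context. The pairwise form is quadratic in the number of examples and is meant for clarity, not for large test sets.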