Data mining has played a crucial role since the early stages of cheminformatics research. This review traces its historical development with reference to various studies by the author, including work on conceptual clustering, discrimination nets, and graph mining. Examples of SAR studies using the cascade model and Bayesian nets are shown. Some issues remaining for future study are also discussed.
MassBank is a public repository of mass spectral data of organic chemical compounds analyzed by various methods of mass spectrometry. Mass spectra of organic chemical compounds with a wide variety of chemical structures, acquired by high-mass-resolution electrospray ionization tandem mass spectrometry (ESI-MS/MS), have been deposited in MassBank since 2006. ESI-MS/MS analyzes the fragment ions produced by collision-induced dissociation of the molecular ion. The present study reports chemical annotations of ESI-MS/MS spectral data (i) by identifying the molecular formulas of fragment ions, (ii) by specifying the chemical bonds of the molecular ion that dissociate to give fragment ions, and (iii) by showing the dissociation sequences from the molecular ion to fragment ions by Ojimatrix. These three chemical annotations gave chemically consistent interpretations of the dissociation of the molecular ion into fragment ions. Based on the chemical annotations, the relationships between fragment ions and chemical substructures were analyzed, and the specificity of these relationships is discussed. The chemical annotations and relationships from the present study were summarized in the database Metabolomics.jp, on the MediaWiki system, in order to invite discussion of the annotations by other researchers studying mass spectrometry. The relationships were implemented in MassBank to provide its users with the “Metabolite prediction” service, which predicts chemical substructures embedded in the structures of unknown chemical compounds from the user’s ESI-MS/MS spectral data.
Partial annotation of metabolite structures based on a product ion spectrum acquired by tandem mass spectrometry (MS/MS) has been a technical bottleneck in metabolomics. Describing MS/MS data with regular expressions over a text representation of an MS/MS spectrum is one approach to searching for structurally similar metabolites. Regular expressions were also applied to describe spectral motifs for partial annotation and characterization of metabolite structures using the corresponding ontology codes, such as those provided by ChEBI.
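As a rough illustration of the regular-expression idea in the abstract above, the following Python sketch encodes a spectrum as a line of text and searches it for a motif. The token scheme (`P<nominal m/z>` for product ions, `N<nominal mass>` for neutral losses), the intensity cutoff, the function name, and the peak values are all assumptions made for illustration; the abstract does not specify the actual text representation.

```python
import re

def spectrum_to_text(precursor_mz, peaks, min_rel_intensity=0.05):
    """Encode a spectrum as text (hypothetical scheme, not the paper's format).

    Product ions become tokens P<nominal m/z> in ascending order, followed by
    neutral losses N<nominal mass> in ascending order.
    """
    base = max(intensity for _, intensity in peaks)
    kept = sorted(round(mz) for mz, intensity in peaks
                  if intensity / base >= min_rel_intensity)
    nominal_precursor = round(precursor_mz)
    products = [f"P{mz}" for mz in kept]
    losses = [f"N{nominal_precursor - mz}" for mz in reversed(kept)
              if nominal_precursor - mz > 0]
    return " ".join(products + losses)

# Toy spectrum with made-up values: (m/z, intensity) pairs.
toy_peaks = [(303.05, 100.0), (285.04, 12.0), (153.02, 2.0)]
text = spectrum_to_text(465.10, toy_peaks)

# A spectral motif written as a regular expression: a product ion at
# nominal m/z 303 together with a neutral loss of 162 (illustrative only).
motif = re.compile(r"\bP303\b.*\bN162\b")
has_motif = motif.search(text) is not None
print(text, has_motif)
```

Because each spectrum becomes a single string, a library of spectra can be scanned for a motif with ordinary text tools, which is the appeal of this kind of representation.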
Data mining techniques such as machine learning have greatly advanced the chemical and biological sciences. In particular, technological advances in data mining are anticipated for analyzing big data derived from biological and environmental systems. From this perspective, we analyze the complex metabolic and microbial responses of human skin, and the relations among these responses, using advanced data mining techniques. To this end, metabolic profiles of human sweat were characterized via multiple NMR spectra, followed by an advanced analytical strategy based on data-driven and machine learning approaches. These methods extracted the important metabolite variables associated with variations in the microbial community. Moreover, the relation between the sweat metabolites and the skin microbes was successfully visualized by correlation-based networks. This analytical strategy promises to be a versatile and useful approach for big data analyses in various fields of science.
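A correlation-based network of the kind mentioned above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the use of Pearson correlation, the 0.8 threshold, the function names, and the toy profiles are all hypothetical, since the abstract does not say which correlation measure or cutoff was used.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_edges(metabolites, microbes, threshold=0.8):
    """Return (metabolite, microbe, r) edges with |r| >= threshold."""
    edges = []
    for m_name, m_values in metabolites.items():
        for t_name, t_values in microbes.items():
            r = pearson(m_values, t_values)
            if abs(r) >= threshold:
                edges.append((m_name, t_name, round(r, 3)))
    return edges

# Made-up profiles across four samples (illustrative values only).
metabolites = {"met_A": [1, 2, 3, 4], "met_B": [4, 1, 3, 2]}
microbes = {"taxon_X": [2, 4, 6, 8], "taxon_Y": [5, 5, 1, 1]}
edges = correlation_edges(metabolites, microbes)
print(edges)
```

The resulting edge list is bipartite (metabolite to microbe) and can be passed to any graph drawing tool for visualization.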
In recent years, consumers’ interest in health foods has increased significantly. Among these health foods, fermented foods have traditionally been part of Japanese food culture and have contributed to the maintenance of people's health. Recently, the biological effects of brown rice and rice bran fermented with Aspergillus oryzae (FBRA) have been comprehensively studied, and inhibitory effects on carcinogenesis have been reported. Regarding the bioactive chemical constituents of FBRA, the involvement of ferulic acid in the biological activity has been reported. In this study, we quantitatively investigated how the production of ferulic acid and related compounds in FBRA depends on fermentation time. In addition, we analyzed the generation of aroma-active compounds by fermentation.
In a previous paper, we analyzed the amounts of ferulic acid and its derivatives produced in the fermentation of brown rice and rice bran by Aspergillus oryzae (FBRA). Ferulic acid and its derivatives are considered to be biologically active constituents of FBRA, and the amounts of these compounds increase remarkably with fermentation time. Another benefit of fermentation is that it is considered to increase the nutritional value of the food. In this study, we examined changes in the nutritional components of FBRA, such as dipeptides and the free forms of water-soluble vitamins, using LC-MS analysis.
Lipidomics is an important research field that studies lipid species in various experimental materials. The chromatographic separation and detection of lipid species are often carried out by liquid chromatography-mass spectrometry, but the identification and structural estimation of lipid species are not easy because the available authentic standard compounds and reference mass spectra are limited. This constitutes a major bottleneck in gaining insights into the roles of each of the lipid species involved in biological processes. In order to overcome this problem, artificial (in silico) tandem mass spectral libraries containing a broad range of lipid species have been constructed with the aid of the fragmentation and rearrangement rules of individual lipid classes. Here, we introduce lipid identification tools powered by in silico tandem mass spectral libraries for lipid research.
A systematic representation of alkaloid biosynthetic pathways based on ring skeletons has been proposed, because the skeleton nucleus of an alkaloid is the main criterion for determining its biosynthetic pathway. We therefore extended the idea of ring skeletons to classify and systematize alkaloid compounds, and examined how well this approach predicts biosynthetic pathways based on module elements. We constructed a two-dimensional binary matrix corresponding to 2546 SRS and 478 pathway-known alkaloid compounds: if the ith substring skeleton is present in a target compound, the ith element is set to 1; otherwise, it is set to 0. The relationships between alkaloid compounds and biosynthetic pathways were examined based on the dendrogram produced by applying Ward's clustering method to the matrix. Of the 12,243 alkaloid compounds accumulated in KNApSAcK Core DB (http://kanaya.naist.jp/knapsack_jsp/top.html), 3,124 compounds (25.5%) correspond to the pathway-known ring skeletons (187 ring skeletons), but the remaining 9,119 (74.5%) do not. Examining the sub-ring skeleton similarity of the remaining compounds might provide clues to pathway information and to the systematization of all alkaloid compounds. The present work therefore focuses on the comprehensive systematization of alkaloid compounds and on the construction principles of ring skeletons in alkaloids, based on subring skeleton profiling.
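The matrix construction and clustering step described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the compound and skeleton names are hypothetical, the data are made up, and the naive agglomerative loop simply merges, at each step, the pair of clusters whose union gives the smallest increase in within-cluster sum of squares (Ward's criterion).

```python
from itertools import combinations

def binary_matrix(compound_skeletons, skeleton_ids):
    """Row per compound; element i is 1 if skeleton i is present, else 0."""
    return {name: [1 if s in present else 0 for s in skeleton_ids]
            for name, present in compound_skeletons.items()}

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = len(a["members"]), len(b["members"])
    d2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return na * nb / (na + nb) * d2

def ward_merges(matrix):
    """Naive Ward agglomeration; returns the merge order (the dendrogram)."""
    clusters = [{"members": [name], "centroid": [float(v) for v in row]}
                for name, row in matrix.items()]
    merges = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: ward_cost(clusters[p[0]], clusters[p[1]]))
        a, b = clusters[i], clusters[j]
        na, nb = len(a["members"]), len(b["members"])
        merged = {
            "members": a["members"] + b["members"],
            "centroid": [(na * x + nb * y) / (na + nb)
                         for x, y in zip(a["centroid"], b["centroid"])],
        }
        merges.append((sorted(a["members"]), sorted(b["members"])))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical compounds and the skeletons found in each (toy data).
skeleton_ids = ["srs_1", "srs_2", "srs_3", "srs_4", "srs_5"]
compound_skeletons = {
    "alk_1": {"srs_1", "srs_2"},
    "alk_2": {"srs_1", "srs_2", "srs_3"},
    "alk_3": {"srs_4"},
    "alk_4": {"srs_4", "srs_5"},
}
matrix = binary_matrix(compound_skeletons, skeleton_ids)
merges = ward_merges(matrix)
print(merges)
```

For a matrix of the size reported in the abstract, a library routine such as `scipy.cluster.hierarchy.linkage` with `method="ward"` would be a more practical choice than this quadratic-per-step loop.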
The modern world generates highly connected, heterogeneous data as a result of information sharing through computer and communication technology. These data form complex relations, and drilling down and mining are needed to understand their actual meaning. Graph clustering is now widely used as a sophisticated technique for data analysis. In this paper we implement a generalized graph clustering algorithm, DPClusO, with an easy operating procedure and clear visualization techniques. DPClusO is an enhanced version of the DPClus algorithm in which the overlapping property of clusters is taken into consideration along with density and periphery tracking. Users can select different parameters and visualization attributes to render a cluster set, a single cluster, a hierarchical graph, etc., and save the results in image and text formats. This paper describes the step-by-step operation of the proposed software tool using an example network of metabolites collected from the KNApSAcK database. The tool successfully generated cohesive groups of structurally similar metabolites. It can be used to analyze network data from any field of study.
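To make the notions of cluster density and overlap in the abstract above concrete, here is a small Python sketch. It is not the DPClusO algorithm: it only computes the usual edge density of a node set (internal edges divided by the maximum possible number of edges) and a simple shared-node fraction between two clusters. Both definitions, the function names, and the toy network are assumptions made for illustration.

```python
def cluster_density(nodes, edges):
    """Edge density of a node set: 2|E| / (|N| (|N| - 1))."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Count each undirected edge once, ignoring self-loops.
    internal = {frozenset((u, v)) for u, v in edges
                if u in nodes and v in nodes and u != v}
    return 2 * len(internal) / (n * (n - 1))

def shared_fraction(a, b):
    """Fraction of the smaller cluster's nodes that also occur in the other."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b))

# Toy undirected network: two triangles sharing the node "m3".
edges = [("m1", "m2"), ("m1", "m3"), ("m2", "m3"),
         ("m3", "m4"), ("m4", "m5"), ("m3", "m5")]
cluster_1 = {"m1", "m2", "m3"}
cluster_2 = {"m3", "m4", "m5"}

density_1 = cluster_density(cluster_1, edges)
density_all = cluster_density({"m1", "m2", "m3", "m4", "m5"}, edges)
overlap = shared_fraction(cluster_1, cluster_2)
print(density_1, density_all, overlap)
```

In this toy network each triangle is fully dense while the whole graph is not, and the two dense groups share one node, which is the situation an overlapping clustering method is meant to capture.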
It has long been understood that the centrality of proteins in protein-protein interaction (PPI) networks is related to their essentiality. In the present work, we validate the relations between the essentiality of yeast proteins and their centrality measures in a PPI network by a different approach, using the concept of the receiver operating characteristic (ROC) curve. We found that all centrality measures are related to essentiality; however, degree centrality performed best for the data we used. By closely examining the different centrality values of yeast proteins, we find that they are not highly correlated, which has led us to hypothesize that centralities might be related to gene/protein functions. Indeed, we found that many of the clusters generated from the patterns of centrality values are rich in proteins of similar function. Different types of centrality imply different types of importance of a node in a network, and the functions of genes are of various types. We therefore hypothesized that important genes with different functions may tend to show different patterns of centralities, and here we show some preliminary links between groups of genes with similar functions and profiles of centrality values. The concepts of network biology discussed in this paper are applicable to other networks, including networks of chemical compounds.
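The ROC-based evaluation described above can be illustrated with a short Python sketch that scores nodes by degree centrality and computes the area under the ROC curve (AUC) against essentiality labels. The AUC is computed here through its rank interpretation: the probability that a randomly chosen essential protein has a higher score than a randomly chosen non-essential one, with ties counted as one half. The network, the labels, and the function names are made up for illustration and are not taken from the study.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Number of interaction partners per node in an undirected edge list."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)

def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative), ties = 0.5."""
    pos = [scores[n] for n in scores if labels[n]]
    neg = [scores[n] for n in scores if not labels[n]]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy PPI network and hypothetical essentiality labels.
edges = [("p1", "p2"), ("p1", "p3"), ("p1", "p4"), ("p2", "p3"), ("p4", "p5")]
essential = {"p1": True, "p2": True, "p3": False, "p4": False, "p5": False}

degree = degree_centrality(edges)
auc = roc_auc(degree, essential)
print(degree, auc)
```

An AUC of 0.5 would mean the centrality carries no information about essentiality, and values closer to 1 mean essential proteins tend to rank higher. Other centrality measures can be compared by swapping in a different score dictionary.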
For the early detection and prognosis of temporal lobe epilepsy, the hippocampus is targeted in diagnosis by positron emission tomography using fluorodeoxyglucose (FDG-PET). PET images are well suited to capturing functional information; however, small structures are hard to distinguish in them. One method for analyzing PET images structurally is statistical analysis with anatomical standardization. In this standardization, a non-linear transformation is used to fit each brain structure to a template. Because of the non-linear transformation, displacement errors may occur, especially for relatively small structures. In this study, the hippocampal region was extracted using magnetic resonance imaging (MRI), the MR and PET images were registered using a rigid transformation, and the relationships between glucose metabolism and the subject’s age and sex were investigated. As a result, a decrease in hippocampal volume due to normal aging was observed. In addition, the age-related changes differed between the sexes: women showed a gradual downward trend in both volume and metabolism. Comparison of the normal subject group with epilepsy patients revealed differences in both volume and metabolism.
The identification of new compound-protein interactions has long been a fundamental quest in the field of medicinal chemistry. With increasing amounts of biochemical data, advanced machine learning techniques such as active learning have proven beneficial for building high-performance prediction models from subsets of such complex data. In a recently published paper, chemogenomic active learning was applied to the interaction spaces of kinases and G protein-coupled receptors, featuring over 150,000 compound-protein interactions. Prediction models were actively trained based on random forest classification using 500 decision trees per experiment. In a new direction for chemogenomic active learning, we address the question of how forest size influences model evolution and performance. In addition to the original chemogenomic active learning finding that highly predictive models could be constructed from a small fraction of the available data, we find here that model complexity, as measured by forest size, can be reduced to one-fourth or one-fifth of the previously investigated forest size while still maintaining reliable prediction performance. Thus, chemogenomic active learning can yield predictive models with reduced complexity based on only a fraction of the data available for model construction.
Recently, the fragment molecular orbital (FMO) method has attracted considerable attention as an electronic structure calculation scheme applicable to macromolecular systems. A major advantage is that a list of inter-fragment interaction energies (IFIEs) is straightforwardly obtained from FMO calculations. It is well recognized that IFIE-based analyses are useful in practical applications for grasping the nature of the interactions in a given target system. However, there is a severe limitation: the IFIE between covalently bonded fragments takes an abnormally large value (about -15 hartree), which can degrade the usability of FMO calculations in several cases. In this paper, we examined a correction method to solve this problem, based on fictitious dissociation processes.
Solvent dipole ordering virtual screening (SDO-VS) is a virtual screening method that focuses on the shape of the SDO region at the binding site of the protein. In SDO-VS, pseudo molecules (PMs) are generated to reproduce the shape of the SDO region. Compounds that have shapes (or volumes) similar to those of the PMs are then screened from a 3D structure database. The original implementation of SDO-VS involved PMs with only sp3-hybridized carbon atoms. However, utilization of sp2- and sp-hybridized atoms and/or small molecular fragments, in addition to sp3-hybridized atoms, is expected to provide more efficient screening. To this end, this study investigated the effect of sp3-, sp2-, and sp-hybridized atoms and phenyl rings as fragments for PM generation in the SDO-VS method. The screening efficiencies were compared with the original method for several drug target proteins. Overall, this new method improved screening efficiencies, as measured by the area under the curve of the corresponding receiver operating characteristic plots.