We describe a novel computer system designed to evaluate protein complex formation in a liquid environment. The key feature of the system is a potential function expressing the main thermodynamic and kinetic factors leading to protein interaction in solution. The protein interaction model expresses the interaction energy as composed of three forces: electrostatic (hydrogen bond), van der Waals, and hydrophobic. The latter is defined as a function of the forces that the solvent molecules exert on the surface of the complex and the van der Waals forces between the monomers and the solvent. The interaction model implemented in the system has demonstrated a high ability to discriminate between different protein dockings, scoring highly those close to the observed crystal structures. These results have helped establish the basic principles underlying protein interaction, which is the main means by which these macromolecules express their biological function.
In this paper, we evaluated the complexity and accuracy of the dicodon model for gene finding using a Hidden Markov Model with Self-Identification Learning. We used five different models with smaller parameter spaces than the dicodon model as competitors. Our evaluation shows that the dicodon model outperforms the other competitors in terms of both sensitivity and specificity. This result indicates that the dicodon model cannot be represented by a combination of amino-acid pair usage, codon usage, and G+C content.
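As an illustration of the feature space behind such a model, here is a minimal sketch (not the paper's actual implementation) that tabulates in-frame dicodon frequencies from a coding sequence; the example sequence is hypothetical:

```python
from collections import Counter

def dicodon_frequencies(seq):
    """Count in-frame dicodons (adjacent codon pairs, i.e. hexamers
    stepped by 3 nt) in a coding sequence and normalize to frequencies.
    A toy illustration of the dicodon feature space, not the HMM itself."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 6] for i in range(0, len(seq) - 5, 3))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# 5 codons yield 4 overlapping dicodons, each with frequency 0.25 here
freqs = dicodon_frequencies("ATGGCTGCTGAATAA")
```

In a real gene finder these frequencies would be estimated separately for coding and non-coding training sequences and turned into emission probabilities.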
Protein threading, a method employed in protein three-dimensional (3D) structure prediction, was only proposed in the early 1990s, although predicting protein 3D structure from a given amino acid sequence has been studied since the 1970s. Here we describe a protein threading method/system that we have developed based on multiple protein structure alignment. To compute multiple structure alignments, we developed a similar-structure search program for massively parallel computers and a program for constructing a multiple structure alignment from pairwise structure alignments, where the latter is based on the center star method for sequence alignment. We also developed a simple dynamic-programming-based algorithm that uses a profile matrix obtained from the multiple structure alignment to compute a threading (i.e., an alignment between a target sequence and a known structure). Using this system, we participated in the threading category (category AL) of CASP3 (Third Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction). The results are discussed.
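The center star method mentioned above starts by picking the "center" element. A minimal sketch of that selection step, assuming a precomputed pairwise distance matrix standing in for the actual pairwise structure-alignment scores:

```python
def choose_center(dist):
    """Pick the center for center-star alignment: the element whose
    sum of pairwise distances to all others is minimal. `dist[i][j]`
    is a symmetric pairwise distance matrix (hypothetical values
    stand in for structure-alignment scores)."""
    n = len(dist)
    sums = [sum(dist[i][j] for j in range(n) if j != i) for i in range(n)]
    return min(range(n), key=lambda i: sums[i])

d = [[0, 2, 4],
     [2, 0, 1],
     [4, 1, 0]]
center = choose_center(d)  # element 1 has the smallest distance sum (3)
```

The remaining elements are then aligned pairwise to the chosen center, and those pairwise alignments are merged into one multiple alignment.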
Logistic regression (LR), discriminant analysis (DA), and neural networks (NN) were used to predict ordered and disordered regions in proteins. Training data came from a set of non-redundant X-ray crystal structures, partitioned into N-terminal, C-terminal, and internal (I) regions. The DA and LR methods gave almost identical 5-fold cross-validation accuracies, averaging 75.9±3.1% (N-regions), 70.7±1.5% (I-regions), and 74.6±4.4% (C-regions). NN predictions gave slightly higher scores: 78.8±1.2% (N-regions), 72.5±1.2% (I-regions), and 75.3±3.3% (C-regions). Predictions improved with the length of the disordered regions. Averaged over the three methods, accuracies ranged from 52% to 78% for lengths of 9-14 to ≥21, respectively, for I-regions; from 72% to 81% for lengths of 5 to 12-15, respectively, for N-regions; and from 70% to 80% for lengths of 5 to 12-15, respectively, for C-regions. These data support the hypothesis that disorder is encoded by the amino acid sequence.
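As a hedged illustration of how an LR predictor of this kind scores a sequence window, here is a toy logistic model; the features, weights, and bias are hypothetical stand-ins for coefficients that would be fit on the X-ray training data:

```python
import math

def predict_disorder(features, weights, bias):
    """Score a window of sequence-derived features (e.g. amino-acid
    composition fractions) with a logistic model. Weights and bias
    are illustrative, not fitted values from the paper."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))  # probability of disorder

# toy example: two features, e.g. fractions of two residue types in a window
p = predict_disorder([0.2, 0.1], weights=[3.0, 2.0], bias=-1.0)
# z = -1.0 + 0.6 + 0.2 = -0.2, so p = 1/(1 + e^0.2) ≈ 0.45
```

A window would be called disordered when p exceeds a chosen threshold (e.g. 0.5).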
Disordered regions are sequences within proteins that fail to fold into a fixed tertiary structure and have been shown to be involved in a variety of biological functions. We recently applied neural network predictors of disorder developed from X-ray data to several protein sequences characterized as disordered by NMR (Garner, Cannon, Romero, Obradovic and Dunker, Genome Informatics, 9: 201-213, 1998). A few predictions on the NMR-characterized disordered regions were noted to contain “false” negative indications of order that correlated with regions of function. These and additional examples are examined in more detail here. Overall, 8 of 9 functional segments in 5 disordered proteins were identified or partially identified by this approach. The functions of these regions appear to involve binding to DNA, RNA, and proteins. These regions are known to undergo disorder-to-order transitions upon binding. This apparent ability of the predictors to identify functional regions in disordered proteins could be due to the existence of different flavors, or sub-classes of disorder, originating from the sequence of the disordered regions and perhaps owing to local inclinations toward order. These different flavors may be a characteristic that could be used to identify binding regions within proteins that are difficult to characterize structurally.
We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences. These significantly better compression results show that approximate repeats are one of the main hidden regularities in DNA sequences. We then describe a theory for measuring the relatedness between two DNA sequences. Using our algorithm, we present strong experimental support for this theory, and demonstrate its application in comparing genomes and constructing evolutionary trees.
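The relatedness measure can be sketched with a compression-based distance of the following flavor; zlib stands in for GenCompress here, so the absolute values are only illustrative:

```python
import zlib

def csize(data: bytes) -> int:
    """Compressed size; zlib is a stand-in for GenCompress."""
    return len(zlib.compress(data, 9))

def relatedness(x: bytes, y: bytes) -> float:
    """Compression-based distance: how much knowing one sequence
    helps compress the other. Smaller means more shared regularity."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"ACGTACGTACGT" * 20
b_ = b"ACGTACGTACGT" * 20
c = b"TTTGGGCCCAAA" * 20
d_same = relatedness(a, b_)       # near zero: b_ is fully predictable from a
d_diff = relatedness(a, c)        # larger: c shares little structure with a
```

Such pairwise distances can then feed a standard tree-building method to construct evolutionary trees.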
Constructing restriction maps is one of the important steps towards the determination of DNA sequences. Recently, single-molecule approaches to constructing restriction maps, such as Optical Mapping by D. Schwartz et al., have been developed. In practice, with a single-molecule approach like Optical Mapping, the identification of restriction sites is complicated by several error factors arising from the limited resolving power of biological experiments. The ordered restriction map alignment problem is the problem of estimating the actual restriction sites from many imprecise copies of the map obtained from single molecules. In this paper, we formulate the problem on the basis of statistical maximum likelihood estimation, and propose a new efficient local search algorithm for it, applying the Expectation-Maximization (EM) algorithm along with the concept of two-clustering. Our algorithm works well on many sets of simulated data, some of which we believe to be more difficult than actual cases.
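The EM step can be illustrated with a much-simplified toy: estimating two restriction-site positions from pooled noisy measurements, assuming equal mixture weights and a known common standard deviation (the paper's actual likelihood model is richer than this):

```python
import math

def em_two_sites(obs, mu1, mu2, sigma=1.0, iters=50):
    """Toy EM for two restriction-site positions from noisy 1-D
    measurements pooled over many molecule copies. Equal mixture
    weights and a known sigma are simplifying assumptions."""
    for _ in range(iters):
        # E-step: responsibility of site 1 for each observation
        r = []
        for x in obs:
            p1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
            p2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
            r.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted means update the estimates
        mu1 = sum(ri * x for ri, x in zip(r, obs)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, obs)) / sum(1 - ri for ri in r)
    return mu1, mu2

# noisy observations of two true sites near positions 10 and 20
obs = [9.8, 10.1, 10.3, 19.7, 20.2, 20.1]
m1, m2 = em_two_sites(obs, mu1=8.0, mu2=22.0)
```

The two-clustering idea enters through the E-step, which softly assigns each observed cut to one of the candidate sites.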
The selection of a suitable set of primers is very important for polymerase chain reaction (PCR) experiments. Most existing algorithms for primer selection are concerned with producing a primer pair for each DNA sequence. However, when all the DNA sequences of the target objects are already known, as for the approximately 6,000 yeast ORFs, we may want to design a small set of primers to PCR-amplify all the targets, which can then be resolved electrophoretically in a series of experiments. This would be quite useful, because decreasing the number of primers greatly reduces the cost of an experiment. This paper extends the problem of primer selection for a single experiment, presented by Doi and Imai, to primer selection for multiple PCR experiments, and proposes algorithms for the extended problem. The algorithms design primer sets one at a time. We extend their greedy algorithm for one PCR experiment by handling amplified segments in DNA sequences that have been identified by previously selected primer pairs and by changing the priorities in the greedy algorithm. This algorithm was applied to real yeast data. The number of primers equaled 85% of the number of identified DNA sequences, which represented more than 90% of all the target DNA sequences. This is 42% of the number of primers needed for multiplex PCR. Furthermore, since the length of each primer is less than half that of multiplex PCR primers, the cost of producing the primers is reduced to 20% of that in the multiplex PCR case.
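The one-set-at-a-time greedy strategy resembles greedy set cover. A minimal sketch under that reading, with hypothetical primer-to-target coverage data (the paper's real priorities also account for segments already amplified by earlier selections):

```python
def greedy_primer_cover(candidates, targets):
    """Greedy selection sketch: repeatedly pick the candidate primer
    that covers the most still-unamplified targets. `candidates`
    maps a primer name to the set of targets it can amplify; all
    names and coverage sets here are hypothetical."""
    uncovered = set(targets)
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda p: len(candidates[p] & uncovered))
        if not candidates[best] & uncovered:
            break  # remaining targets cannot be covered by any candidate
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

cands = {"p1": {"orf1", "orf2"}, "p2": {"orf2", "orf3"}, "p3": {"orf4"}}
picked = greedy_primer_cover(cands, ["orf1", "orf2", "orf3", "orf4"])
```

Greedy set cover gives a logarithmic approximation guarantee, which is why it is a natural starting point for this kind of primer minimization.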
We propose a model of the doubling of a bacterial genome followed by gene order rearrangement to explain present-day patterns of duplicated genes. On the hypothesis that inversion (reversal) is the predominant mechanism of rearrangement, we ask how to reconstruct the ancestral genome at the moment of genome duplication. We present a polynomial algorithm for finding such a genome that minimizes (within 2 reversals) the Hannenhalli-Pevzner formula for reversal distance from the modern genome. We illustrate by applying the algorithm to a set of duplicate genes in the Marchantia polymorpha mitochondrial genome.
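One ingredient of reversal distance can be sketched concretely: counting the breakpoints of a permutation relative to the identity, which lower-bounds the number of reversals needed (the full Hannenhalli-Pevzner formula also involves cycles and hurdles of the breakpoint graph, which this toy omits):

```python
def breakpoints(perm):
    """Count breakpoints of a permutation of 1..n relative to the
    identity: positions where consecutive elements are not
    consecutive integers. Each reversal removes at most two
    breakpoints, so breakpoints/2 lower-bounds reversal distance."""
    ext = [0] + list(perm) + [len(perm) + 1]  # frame with 0 and n+1
    return sum(1 for a, b in zip(ext, ext[1:]) if b - a != 1)

b = breakpoints([3, 1, 2, 4])  # adjacencies (1,2) and (4,5) survive: 3 breakpoints
```

The algorithm in the paper works with the exact Hannenhalli-Pevzner distance rather than this bound, but the breakpoint structure is the common starting point.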
The massively parallel hybridization technologies of DNA chips and microarrays make it possible to monitor the expression patterns of the whole set of genes in a genome under various conditions. The vast amount of data generated by such technologies necessitates the development of a new database management system that integrates expression data with other molecular biology databases and various analysis tools. We report here an extension of our KEGG (Kyoto Encyclopedia of Genes and Genomes) and DBGET/LinkDB systems for analyzing gene expression data in conjunction with pathway and genomic information. It is now possible to make use of expression data for the reconstruction of pathways from complete genome sequences.
We are entering a new era of research where the latest scientific discoveries are often first reported online and are readily accessible by scientists worldwide. This rapid electronic dissemination of research breakthroughs has greatly accelerated the current pace of genomics and proteomics research. The race to the discovery of a gene or a drug has now become increasingly dependent on how quickly a scientist can scan through the voluminous amount of information available online to construct the relevant picture (such as protein-protein interaction pathways) as it takes shape amongst the rapidly expanding pool of globally accessible biological data (e.g., GENBANK) and scientific literature (e.g., MEDLINE). We describe a prototype system for automatic pathway discovery from online text abstracts, combining technologies that (1) retrieve research abstracts from online sources, (2) extract relevant information from the free texts, and (3) present the extracted information graphically and intuitively. Our work demonstrates that this framework allows us to routinely scan online scientific literature for automatic discovery of knowledge, giving modern scientists the necessary competitive edge in managing the information explosion in this electronic age.
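Step (2), information extraction, can be caricatured with a naive pattern matcher; the verb list and the capitalized-token convention for protein names are illustrative assumptions, far simpler than a production extraction pipeline:

```python
import re

def extract_interactions(text):
    """Naive pattern-based extraction of protein-protein relations
    from abstract text. The interaction verbs and the assumption
    that protein names are capitalized tokens are both illustrative."""
    pattern = re.compile(
        r"\b([A-Z][A-Za-z0-9]+)\s+"
        r"(interacts with|binds|activates|inhibits)\s+"
        r"([A-Z][A-Za-z0-9]+)\b"
    )
    return pattern.findall(text)  # list of (protein, verb, protein) triples

abstract = "We report that Ras1 activates Raf1, while Gal80 inhibits Gal4."
edges = extract_interactions(abstract)
```

The extracted triples form directed edges that can then be rendered as a pathway graph in step (3).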
A precursor is a compound that is transformed into a class of functional molecules within a few steps. Deciding whether a given compound is a precursor is an important step in the production of natural drugs. We present two strategies for selecting precursor compounds in the secondary metabolism of terpenoids: one finds the packing of basic molecules in a given cyclic structure, and the other finds the synthetic map of a given set of compounds. Both strategies play important roles in reproducing tracer experiments on a computer.
We have developed automated processing algorithms for 2-dimensional (2-D) electrophoretograms of genomic DNA based on the RLGS (Restriction Landmark Genomic Scanning) method, which scans restriction enzyme recognition sites as landmarks and maps them onto a 2-D electrophoresis gel. Our processing algorithms realize automated spot recognition in RLGS electrophoretograms and automated comparison of huge numbers of such images. In the final stage of the automated processing, a master spot pattern, onto which all the spots in the RLGS images are mapped at once, can be obtained. Spot pattern variations that appear specific to pathogenic DNA molecular changes can be easily detected by simply looking over the master spot pattern. When we applied our algorithms to the analysis of 33 RLGS images derived from human colon tissues, we successfully detected several colon-tumor-specific spot pattern changes.
The way biological systems are expressed is a key element of usability. The expressions used in the biology community and those in the computer science community each have their own merits, but they are too different for one community to utilize the expressions of the other. In this paper, we design the bio-calculus, which attempts to bridge this gap. We provide a syntax that is similar to conventional expressions in biology and at the same time specifies the information needed for simulation analysis. This information and the mathematical background of the bio-calculus are what the field of computer science requires. We show the practicality of the bio-calculus by describing and simulating some molecular interactions with it.
This study aims at the automatic construction of a cell lineage from 4D (multi-focal, time-lapse) images taken with a Nomarski DIC (differential interference contrast) microscope. A system with such abilities would be a powerful tool for studying embryogenesis and gene function based on mutants, whose cell lineages may differ from those of wild types. We have designed and implemented a system for this purpose and examined its ability through computational experiments. The procedure of our system consists of two parts: (1) image processing that detects the positions of the nuclei in each 2D microscope image, and (2) construction of a hypothetical cell lineage based on the information obtained in (1). We have also developed a tool that allows a human expert to easily filter out erroneous nucleus candidates generated in (1). We present computational results and also discuss other ideas that may improve the performance of our system.
The synergetic effects of multiple marker loci on quantitative traits such as blood glucose level have attracted interest. In the OLETF model rat of non-insulin-dependent diabetes mellitus (NIDDM), our previous study focusing on the effects of multiple genetic factors found significant marker combinations with respect to oral glucose tolerance (OGT) at 60 minutes after oral administration. Besides the interaction among markers at a particular time point, their correlated behavior over a time series is of further interest. Building on the previous results, in this paper we report the behavior of markers over a time series, using a series of OGT measurements.
We have constructed a general framework for integrating application programs with control through a local Web browser. This method is based on a simple inter-process message function from an external process to application programs. Commands to a target program are prepared in a script file, which is parsed by a message dispatcher program. When the dispatcher is used as a helper application to a Web browser, these messages can be sent from the browser by clicking a hyperlink in a Web document. Our framework also supports pluggable extension modules for application programs by means of dynamic linking. A prototype system was implemented on our molecular structure-viewer program, MOSBY. It successfully provided a function to load an extension module required for the docking study of molecular fragments from a Web page. Our simple framework facilitates the concise configuration of Web software without requiring complicated knowledge of network computation and security issues. It is also applicable to a wide range of network computations that process private data through a Web browser.
Complete DNA sequences (genomes) and associated data are being made available worldwide at an astonishing rate. Through computer analysis of such data, molecular biologists hope to gain an overall understanding of the genome, such as by predicting large-scale gene networks. However, this is difficult because diverse genome data are scattered across many highly heterogeneous databases, and because existing database systems lack the facilities to expose and analyze functional relationships among the data. To address these problems, we propose a new type of genome database system. Since a genome can be thought of intuitively as a kind of ‘document’, our system uses a structured document language based on XML to effectively represent genomes and associated data. The information-rich structures of the genome documents help cope with data diversity and heterogeneity. A powerful query language is introduced that exposes important biological relationships among the genome data. We have obtained favorable results from several experiments, demonstrating the usefulness of our method in building a top-down view of genome functionality.
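The genome-as-document idea can be sketched with a toy XML fragment and a simple query over it; the element names and attribute values below are illustrative, not the system's actual schema:

```python
import xml.etree.ElementTree as ET

# A hypothetical genome 'document': genes with positions and functions.
doc = """
<genome organism="E. coli">
  <gene id="b0001" name="thrL">
    <position start="190" end="255" strand="+"/>
    <function>thr operon leader peptide</function>
  </gene>
  <gene id="b0002" name="thrA">
    <position start="337" end="2799" strand="+"/>
    <function>aspartokinase I</function>
  </gene>
</genome>
"""

root = ET.fromstring(doc)
# A simple 'query': all genes on the + strand, paired with their functions
plus_strand = [
    (g.get("name"), g.findtext("function"))
    for g in root.findall("gene")
    if g.find("position").get("strand") == "+"
]
```

Because the document structure carries the biological relationships explicitly, richer queries (e.g. joining positional and functional criteria) reduce to traversals over the tree.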
We are developing a system that finds a genetic network from data obtained by multiple gene disruptions and overexpressions. We treat a genetic network as a weighted graph, where each weight represents the strength of activation from one gene to another. In this paper, we give an overview of our system and our strategy for visualizing the weighted network. We also study the computational complexity related to the visualization.
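A weighted-graph representation of this kind can be sketched as a nested mapping, with a trivial first visualization step that groups edges by sign (all gene names and weights below are hypothetical):

```python
def split_edges_by_sign(network):
    """Represent a genetic network as a weighted digraph:
    network[a][b] is the activation strength from gene a to gene b,
    with negative weights read as repression. Grouping edges by sign
    is a trivial first step toward drawing the network."""
    activating, repressing = [], []
    for src, targets in network.items():
        for dst, w in targets.items():
            (activating if w > 0 else repressing).append((src, dst, w))
    return activating, repressing

net = {"geneA": {"geneB": 0.8, "geneC": -0.5}, "geneB": {"geneC": 1.2}}
act, rep = split_edges_by_sign(net)
```

A layout algorithm would then place the genes and render activating and repressing edges with distinct arrowheads, which is where the visualization complexity questions arise.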