Genome Informatics
Online ISSN : 2185-842X
Print ISSN : 0919-9454
ISSN-L : 0919-9454
Volume 5
Displaying 1-50 of 67 articles from this issue
  • Tatsuya Akutsu, Kentaro Onizuka, Masato Ishikawa
    1994 Volume 5 Pages 1-10
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    This paper describes new methods to evaluate the structural similarity of proteins. In each method, a hash vector is associated with each fixed-length fragment of protein structure, where the following desirable property is theoretically proved: if the root mean square deviation between two fragments is small, then the distance between the hash vectors is small. Using the hash vectors, searching for similar protein structures can be done quickly. The methods were compared with the previous methods using PDB data, and were shown to be much faster.
    Download PDF (855K)
  • Yo Matsuo, Ken Nishikawa
    1994 Volume 5 Pages 11-18
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    The 3D-1D compatibility method is a new approach to protein structure prediction. It evaluates the compatibility of a one-dimensional (1D) amino acid sequence with known three-dimensional (3D) structures, and selects the most likely structure. We have developed a method which evaluates the 3D-1D compatibility using the following functions: side-chain packing, solvation, hydrogen-bonding, and local conformation functions. The method has been applied to a large number of sequences in databases. Here, the predictions of the structural similarities between the following pairs are described in detail: spermidine/putrescine-binding protein and maltose-binding protein, shikimate kinase and adenylate kinase, and mannose permease hydrophilic subunit (II AB Man) and galactose/glucose-binding protein. Functional and evolutionary implications of the predictions are discussed. Through these examples of predictions, the present work demonstrates the promise of the 3D-1D method.
    Download PDF (643K)
  • Hiroshi Mamitsuka, Naoki Abe
    1994 Volume 5 Pages 19-28
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We empirically demonstrate the effectiveness of a method of predicting protein secondary structures, β-sheet regions in particular, using a class of stochastic tree grammars as representational language for their amino acid sequence patterns. The family of stochastic tree grammars we use, the Stochastic Ranked Node Rewriting Grammars (SRNRG), is one of the rare families of stochastic grammars that are expressive enough to capture the kind of long-distance dependencies exhibited by the sequences of β-sheet regions, and at the same time enjoy relatively efficient processing. We applied our method to real data obtained from the HSSP database and the results obtained are encouraging: using an SRNRG trained on data of a particular protein, our method was actually able to predict the location and structure of β-sheet regions in a number of different proteins, whose sequences are less than 25 per cent homologous to the training sequences. The learning algorithm we use is an extension of the ‘Inside-Outside’ algorithm for stochastic context free grammars, but with a number of significant modifications. First, we restricted the grammars used to be members of the ‘linear’ subclass of SRNRG, and devised simpler and faster algorithms for this subclass. Secondly, we reduced the alphabet size (i.e. the number of amino acids) by clustering them using their physico-chemical properties, gradually through the iterations of the learning algorithm. Our experiments indicate that our prediction method not only goes beyond what is possible by alignment alone, but also that the grammar acquired by our learning algorithm captures the type of long-distance dependencies that could not be succinctly expressed by an HMM. We also stress that our method can predict the structure as well as the location of β-sheet regions, which was not possible with previous inverse protein folding methods.
    Download PDF (1083K)
  • Satoshi KOBAYASHI, Takashi YOKOMORI
    1994 Volume 5 Pages 29-38
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    This paper proposes a grammatical tool, called tree adjunct grammar with tag for RNA (denoted by TAG2RNA), for representing secondary structures of RNAs, and shows some example TAG2RNA grammars for fairly complicated RNA secondary structures. We then demonstrate the appropriateness of the grammars for modeling RNA secondary structures by discussing their formal language and/or graph theoretic properties, including closure properties of TAG2RNA and graph planarity of secondary structures generated by TAG2RNA, the latter of which would provide a biologically reasonable constraint.
    Download PDF (972K)
  • Hiroki Arimura, Ryoichi Fujino, Takeshi Shinohara, Setsuo Arikawa
    1994 Volume 5 Pages 39-48
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Recently, several attempts have been made at applying machine learning methods to protein motif discovery, but most of these methods require negative examples in addition to positive examples. This paper proposes an efficient method for learning protein motifs from positive examples. A regular pattern is a string consisting of constant symbols and mutually distinct variables, and represents the set of the constant strings obtained by substituting nonempty constant strings for variables. Regular patterns and their languages are called extended if empty substitutions are allowed. Our learning algorithm, called k-minimal multiple generalization (k-mmg), finds a minimally general collection of at most k regular patterns that explains all the positive examples. We have implemented this algorithm for subclasses of regular patterns and extended regular patterns where the number of variables is bounded by a small constant, and run experiments on protein data taken from the GenBank and PIR databases. We incorporate three heuristics into these algorithms for controlling nondeterministic choices. The experiments show that the k-mmg algorithm can very quickly find a hypothesis in practice, and that the results of our system are comparable with the results of learning methods that use both positive and negative data.
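The definition of regular patterns above can be illustrated with a small sketch. It is not the k-mmg learning algorithm itself; it only checks whether a given collection of patterns explains a set of positive examples. The encoding is an assumption made for illustration: `*` stands for a variable, every other character is a constant, and the function names `pattern_to_regex` and `explains` are hypothetical. Because the variables are mutually distinct, each one can be matched independently by a regular-expression wildcard.

```python
import re

def pattern_to_regex(pattern, extended=False):
    """Compile a regular pattern into a regex.

    Assumed encoding: '*' is a variable, any other character is a constant.
    Variables take nonempty substitutions unless extended=True.
    """
    gap = ".*" if extended else ".+"
    parts = [gap if tok == "*" else re.escape(tok) for tok in pattern]
    return re.compile("".join(parts))

def explains(patterns, examples, extended=False):
    """True if every example is matched in full by at least one pattern."""
    regexes = [pattern_to_regex(p, extended) for p in patterns]
    return all(any(r.fullmatch(e) for r in regexes) for e in examples)

if __name__ == "__main__":
    # A single pattern with two variables between constant residues.
    print(explains(["C*C*H"], ["CAACGGH"]))
    # The same pattern fails on "CCH" unless empty substitutions are allowed.
    print(explains(["C*C*H"], ["CCH"]), explains(["C*C*H"], ["CCH"], extended=True))
```

A collection of at most k patterns corresponds to passing a list of length at most k to `explains`; finding a minimally general such collection is the part the paper's algorithm addresses and is not attempted here.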
    Download PDF (1076K)
  • H. Ripoche, E. Mephu Nguifo, J. Sallantin
    1994 Volume 5 Pages 49-58
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    This paper concerns the use of an object-oriented database for the analysis of protein sequences. We describe proteins either by bibliographic information or by prediction functions such as Prosite patterns [2, 5]. We propose to use concept lattices, a tool used in information retrieval to build thesauruses, to classify protein sequences. This classification of proteins may help in finding sequence alignments or in discussing them. Conversely, sequence alignments can be used to criticize the structuring of sequences.
    Download PDF (888K)
  • S. Tsumoto, H. Tanaka, K. Tsumoto, I. Kumagai
    1994 Volume 5 Pages 59-69
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Protein structure analysis from DNA sequences is an important and fast growing area in both computer science and biochemistry. Although interesting approaches have been studied, it is very difficult to capture the characteristics of a protein, since even a simple protein has a complex combinatorial structure, which makes it very difficult to detect functional components by biochemical experiments. For this reason, almost all the problems in this field are left unsolved, and it is very important to develop a system which helps researchers in molecular biology overcome the difficulties caused by combinatorial explosion. In this paper, we propose a system based on a combination of a probabilistic rule induction method with domain knowledge, which we call MOLA-MOLA (Molecular biological data-analyzer and Molecular biological knowledge acquisition tool), in order to relieve molecular biologists of the hassles of their experimental environments. We apply this method to a comparative analysis of lysozyme and α-lactalbumin, and the results show that we obtain some interesting results from amino-acid sequences which have not been reported before.
    Download PDF (1208K)
  • Makoto Hirosawa, Reiko Tanaka, Hidetoshi Tanaka, Masayuki Akahoshi, Ma ...
    1994 Volume 5 Pages 70-79
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Major advances in biotechnology enable us to describe various phenomena occurring in the body using the language of genes and proteins. It is important to represent these phenomena in a knowledge base and to visualize them properly. The visualization of the phenomena with reference to related databases facilitates research on genes.
    As the first step in realizing a database like the one stated above, we have studied the representation of biological knowledge needed to describe biological phenomena and have developed a prototype knowledge base. The knowledge base is described in micro-Quixote, an object-oriented database language executable on Unix. The knowledge base can cover the knowledge related to signal transduction within a cell and that related to transcription of genes.
    In our prototype system, a kind of simulation can be performed. With the arrival of a signaling ligand at the surface of a cell, proteins along suitable pathways are activated in our simulated cell. As a consequence of a series of activations (a chain of inferences), some biological responses are deduced and shown to users.
    Download PDF (1007K)
  • An Environment for Simulating Protein Interaction
    Masanori Arita, Masami Hagiya, Tomoki Shiratori
    1994 Volume 5 Pages 80-89
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Biological analysis of Drosophila embryogenesis has provided a model of protein interaction in segment formation. In this paper we introduce the GEISHA system, which verifies and revises the rules of pattern formation in embryogenesis. The system consists of three parts: a rule-based simulator, an evaluator, and a user interface. The simulator tests all the possible rule patterns, and the evaluator qualitatively evaluates the results of the simulator; it searches for the desired pattern of protein expression. The user interface enables us to input or save data using a GUI.
    Download PDF (819K)
  • Takahiro Ikeda, Hiroshi Imai
    1994 Volume 5 Pages 90-99
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    The multiple alignment of DNA and protein sequences is applicable to various important fields in molecular biology. Although the approach based on dynamic programming is well known for this problem, it requires enormous time and space to obtain the optimal alignment. On the other hand, this problem corresponds to the shortest path problem, so the A* algorithm, which can efficiently find the shortest path with an estimator, is usable.
    This paper directly applies the A* algorithm to the multiple sequence alignment problem with a more powerful estimator in the case of more than two dimensions, and discusses the improvement of this approach utilizing an upper bound of the shortest path length. An algorithm to provide the upper bound is also proposed in this paper.
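The correspondence between alignment and shortest paths can be sketched for the simplest case of two sequences. This is a minimal illustration, not the estimator or the multi-dimensional search described in the paper: nodes are index pairs (i, j) in the edit graph, edges are match/mismatch or gap steps, and the estimator is an assumed simple lower bound (the remaining length difference times the gap cost). The function name and the unit costs are hypothetical choices.

```python
import heapq

def astar_alignment_cost(a, b, mismatch=1, gap=1):
    """Cost of an optimal pairwise alignment found by A* on the edit graph.

    Assumes non-negative costs. The estimator h is admissible because the
    remaining suffixes differ in length by at least that many gap steps.
    """
    n, m = len(a), len(b)

    def h(i, j):
        return gap * abs((n - i) - (m - j))

    start, goal = (0, 0), (n, m)
    best = {start: 0}
    frontier = [(h(0, 0), 0, start)]
    while frontier:
        _, g, (i, j) = heapq.heappop(frontier)
        if (i, j) == goal:
            return g
        if g > best.get((i, j), float("inf")):
            continue  # stale queue entry
        steps = []
        if i < n and j < m:
            steps.append((i + 1, j + 1, 0 if a[i] == b[j] else mismatch))
        if i < n:
            steps.append((i + 1, j, gap))  # gap in b
        if j < m:
            steps.append((i, j + 1, gap))  # gap in a
        for ni, nj, cost in steps:
            ng = g + cost
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(frontier, (ng + h(ni, nj), ng, (ni, nj)))
    return None

if __name__ == "__main__":
    print(astar_alignment_cost("ACGTTGCA", "ACGTGCA"))
```

With unit costs the result is the edit distance between the two strings. For more than two sequences the node becomes a tuple of indices, and a stronger estimator is needed to keep the search tractable.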
    Download PDF (1112K)
  • J. Gracy, J. Sallantin
    1994 Volume 5 Pages 100-109
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    A generalization of the dynamic programming algorithm applied to the multiple alignment of protein sequences is proposed. The algorithm has two main procedures: (i) local correspondences between sequences, hereafter called anchor points, are selected according to a criterion that combines local and global similarity values; (ii) the alignment is constructed recursively by choosing and linking together the optimal anchor points. This multiple sequence alignment algorithm achieves a good compromise between the O(L^N) complexity of the exhaustive dynamic programming approach applied to N sequences of length L and the poor quality of the alignments obtained with methods based on a hierarchical clustering of the sequences.
    Download PDF (968K)
  • M. Ishikawa, T. Toya, Y. Totoki, R. Tanaka
    1994 Volume 5 Pages 110-119
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We have developed a multiple sequence alignment system which aligns RNA sequences while estimating their stem regions. The system consists of two parts: initial and stem aligners. The initial aligner roughly aligns given RNA sequences using a parallel iterative algorithm based on dynamic programming. The stem aligner refines the rough alignment using a parallel simulated annealing algorithm taking into account connected base pairs in stem regions. In testing with tRNA sequences, the system could generate alignments which identified well-known stem sets of clover shape. We have also developed a stem specifier which monitors such stem regions using a circular representation.
    Download PDF (950K)
  • Hideo Matsuda, Hiroshi Yamashita, Yukio Kaneda
    1994 Volume 5 Pages 120-129
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Phylogenetic analysis of DNA sequences has played an important role in the study of the evolution of life. However, recent research suggests that in some cases phylogenetic analysis of protein sequences is more important than that of DNA sequences. Thus we developed a system for phylogenetic analysis of protein sequence data. Since this system is based on our previously developed system for the analysis of DNA sequence data, one can obtain both protein-based and DNA-based trees and compare them. In the two systems, we adopted the same tree-construction algorithm (the so-called maximum likelihood method). Although this method has concrete models of the evolutionary process, it incurs huge computational costs, especially in the analysis of protein sequence data. Therefore we parallelized the tree-construction steps in our method on a massively parallel machine.
    Download PDF (988K)
  • Masahiko Mizuno, Minoru Kanehisa
    1994 Volume 5 Pages 130-137
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We have analyzed the distribution of base composition around the 5' and 3' splice sites in genomic DNA sequences of different species. A set of sequences belonging to one species is aligned at the 5' and 3' splice sites, respectively, and the average base composition is calculated for 10 base windows over the range of 100 bases each for upstream and downstream regions. Consistent with previous observations that coding regions are more guanine-cytosine (GC) rich than noncoding regions, we observe a jump in the GC content at the splice sites, except for vertebrate sequences. In addition, introns are Uracil (U) rich rather than Adenine-Uracil (AU) rich, especially in plants and invertebrates. It is also found that the pyrimidine rich regions preceding the 3' splice site in mammals extend upstream over the consensus sequences, while the polypyrimidine tracts in plants and invertebrates are much shorter than in mammals. Furthermore, the increase in pyrimidine content at the 3' splice site is more striking in mammals, but smaller in plants and invertebrates. Thus, we consider that a broad and intensive polypyrimidine tract is required for the recognition of the 3' splice site in the higher eucaryotes, where introns are GC rich, and that a more AU rich intron is important in the lower eucaryotes.
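The windowed averaging described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: each sequence comes with the index of its splice site, windows that run off the end of a sequence are skipped, and the function name `gc_profile` and its parameters are hypothetical.

```python
def gc_profile(sequences, sites, window=10, span=100):
    """Average GC fraction per window, with sequences aligned at their sites.

    Returns a list of (offset, gc_fraction) pairs, where offset is the start
    of the window relative to the splice site. The fraction is None when no
    sequence covers that window.
    """
    profile = []
    for off in range(-span, span, window):
        gc = total = 0
        for seq, site in zip(sequences, sites):
            lo, hi = site + off, site + off + window
            if lo < 0 or hi > len(seq):
                continue  # window falls outside this sequence
            chunk = seq[lo:hi].upper()
            gc += chunk.count("G") + chunk.count("C")
            total += window
        profile.append((off, gc / total if total else None))
    return profile

if __name__ == "__main__":
    # Toy input: an AT-only "intron" followed by a GC-only "exon".
    toy = "AT" * 10 + "GC" * 10
    for offset, frac in gc_profile([toy], [20], window=10, span=20):
        print(offset, frac)
```

A jump in GC content at the splice site shows up as a step between the windows with negative and non-negative offsets.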
    Download PDF (767K)
  • Takeshi Itoh, Minoru Yano, Keiko Takemoto, Yutaka Akiyama, Hirotada Mo ...
    1994 Volume 5 Pages 138-139
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    It is now possible to elucidate whole genome structure with current techniques. Genome projects for some species (C. elegans, yeast, Escherichia coli, Bacillus subtilis, Arabidopsis, rice and human) are now running. In Escherichia coli, two lines of large scale sequencing have emerged: one by the Wisconsin group in the U.S.A. and the other by the collaborative research group in Japan. A non-redundant sequence database is essential not only for effective promotion of the sequencing project but also for whole genome analysis and reference by biologists. We determine sequences as one of the research groups in Japan and maintain a non-redundant DNA sequence database for effective promotion of the genome project and analysis of genome structure. At the Genome Workshop meeting in 1993, we reported the construction of an Escherichia coli genome database on the Genomatica system. We have updated our E. coli genome database by incorporating new E. coli entries from GenBank and from genome project research groups. The contiguous sequence data were then used to predict possible open reading frames (ORFs). The translated amino acid sequences from these ORFs were subjected to homology analysis against the PIR and SWISS-PROT protein databases. The whole set of plausible ORFs was further classified by similarities between ORFs and between gene organizations. Such analyses may make it possible to detect rearrangements of the chromosome during its evolution.
    Download PDF (235K)
  • Hajime Kitakami, Yoshio Tateno, Takashi Gojobori
    1994 Volume 5 Pages 140-141
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We have developed a new repair system to effectively remove inconsistencies within each data bank and mismatches among data banks over international computer networks. This paper describes search functions that are useful for effectively removing both inconsistencies and mismatches from the databases. These functions are implemented in a relational database management system, SYBASE.
    Download PDF (227K)
  • H. Mizushima, K. Hayashi
    1994 Volume 5 Pages 142-143
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We started two databases, ‘TFDB’ and ‘HMDB’. The Transcription Factor DataBase (TFD) was originally maintained by D. Ghosh at the National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health. As NCBI stopped maintaining it last year, we started a new database, TFDB, to maintain some parts of the database, mainly focusing on the DNA binding sequence data. HMDB (Human Mutation DataBase) is a new database collecting information about mutations in the human genome. As both databases were started very recently, they are still at a preliminary stage. We will continue to add more information in the future.
    Download PDF (169K)
  • K. Suzuki, S. Goto, Y. Akiyama, M. Kanehisa
    1994 Volume 5 Pages 144-145
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We are developing a signal transduction database which represents molecular interactions involved in the signaling pathways in a cell, from the activation of cell surface receptors by external signals to the activation of transcription factors in the nucleus. The database is linked to the Medline literature, the SWISS-PROT and PIR protein sequence databases, the PDB protein 3-D structural database, the LIGAND chemical database for enzyme reactions, and the OMIM database on genetic diseases. We provide a graphical user interface on the World Wide Web (WWW) to access this database.
    Download PDF (179K)
  • Nobuyuki Miyajima, Shinobu Nakayama, Mitsuyo Kohara, Satoko Hayashi
    1994 Volume 5 Pages 146-147
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We have developed a sophisticated method called “The Gene Network” for elucidating the relationships existing among all genes. This was accomplished by taking the 24222 gene symbols contained in the MEDLINE database of Entrez rel. 12.0 and determining their inter-relationships by examining how frequently they appear together. This new method enables the construction of gene maps which graphically show these relationships, thereby enhancing our understanding of them. We expect that The Gene Network will in the future allow navigation through the “world of genes.”
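One way to read "frequency of appearance among one another" is as a count of how often two gene symbols occur in the same record. The sketch below illustrates that reading only; the abstract does not specify the actual scoring, and the whitespace tokenization, the function name `cooccurrence`, and the toy records are all assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(records, symbols):
    """Count, for each pair of gene symbols, the records mentioning both.

    Assumes symbols appear as whitespace-delimited tokens; real text would
    need proper tokenization and synonym handling.
    """
    symbols = set(symbols)
    counts = Counter()
    for text in records:
        found = sorted(symbols & set(text.split()))
        counts.update(combinations(found, 2))
    return counts

if __name__ == "__main__":
    records = ["TP53 binds MDM2", "MDM2 and TP53 interact", "BRCA1 alone"]
    for pair, n in cooccurrence(records, ["TP53", "MDM2", "BRCA1"]).items():
        print(pair, n)
```

The resulting pair counts can serve as edge weights of a graph whose nodes are gene symbols, which is one plausible basis for the gene maps mentioned above.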
    Download PDF (156K)
  • Toshio Shimizu, Kenta Nakai
    1994 Volume 5 Pages 148-149
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    How reliable and useful are predictions of transmembrane segments (TMSs) of membrane proteins from their amino acid sequences? This question is still under debate. Kyte and Doolittle proposed a simple scheme for the prediction of TMSs [1]. It is based on the hydropathy plot and is widely accepted as a basic and standard method. Since then, a large number of more sophisticated predictive algorithms have been proposed, which are improved variants of the Kyte-Doolittle approach. Although these methods have been considered to give rather good results, they are still not able to predict the number and positions of TMSs precisely; in particular, they often give totally different predictions for proteins having many TMSs [2, 3]. One reason for this situation may be the low quality of the information on TMSs described in general amino acid sequence databases. The information included in the SWISS-PROT database, for example, is mostly based not on experimental evidence but on predicted models; databases often give no explicit description of whether the data come from experiments or calculations. Higher-quality information on TMSs, drawn from experimental evidence only, is essential to evaluate existing prediction methods more precisely and to develop an algorithm that overcomes their problems.
    We have collected 128 references reporting the membrane topology of proteins, and are continuing our efforts to triple this number. From them, we selected 54 topology models based, at least partially, on experimental evidence. Combining these data with the sequence information from the SWISS-PROT database, we have constructed a membrane protein database in the form of a relational database. The current version includes 54 proteins, which are classified into 3 groups (eukaryotic proteins, prokaryotic proteins, and proteins with non-helical segments) as shown in Figure 1. Using this database we evaluated the predictive ability of the algorithms of the following authors: Eisenberg [4]; Klein, Kanehisa and DeLisi (KKD method) [5]; von Heijne (TopPred method) [6]; and Persson and Argos [7]. The KKD method and the TopPred method predicted the exact number of TMSs for 59% and 67% of the proteins in our database, respectively. These values could be increased to 63% and 74% by optimizing the respective parameter values. The KKD method tends to predict fewer TMSs than the correct number, while the TopPred method shows the opposite tendency. We are now testing our previous idea of using different cut-off parameters for single-TMS proteins and multiple-TMS proteins in the KKD method, and are also trying to develop a new predictive algorithm by taking more precise position-dependent information on TMSs into account.
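The hydropathy-plot approach referred to above can be sketched in a few lines. This is an illustration of the basic sliding-window idea only, not any of the evaluated methods (KKD, TopPred, and the others use their own scales and rules). The scale is the Kyte-Doolittle hydropathy index; the 19-residue window and the 1.6 cutoff are commonly quoted defaults and should be treated as tunable assumptions, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

def predict_segments(seq, window=19, cutoff=1.6):
    """Merge windows whose mean hydropathy reaches the cutoff.

    Returns half-open (start, end) residue index ranges.
    """
    segments = []
    for i, avg in enumerate(hydropathy(seq, window)):
        if avg < cutoff:
            continue
        if segments and i <= segments[-1][1]:
            segments[-1][1] = i + window  # extend the current segment
        else:
            segments.append([i, i + window])
    return [tuple(s) for s in segments]

if __name__ == "__main__":
    toy = "D" * 20 + "L" * 20 + "D" * 20
    print(predict_segments(toy))
```

Comparing the number of predicted segments with experimentally supported topologies, protein by protein, is the kind of evaluation the abstract reports.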
    Download PDF (205K)
  • Makiko Suwa, Takatsugu Hirokawa, Shigeki Mitaku
    1994 Volume 5 Pages 150-151
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (198K)
  • Akira Shimada, Hideki Takehara, Kazunori Toma
    1994 Volume 5 Pages 152-153
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We defined the protein inverse folding problem with an object oriented database and an empirical hydrophobic penalty function, which was derived from the number of residues around each residue in a protein three dimensional structure. Under the database management system, we compiled the known structures of proteins and the evaluation function into one functional database. In order to compare our approach with the methods proposed by other groups, the functional database was applied to the problem of globin family recognition. Although the penalty function itself is simple and non-optimized, it gave considerably good results.
    Download PDF (189K)
  • Kenji Satou, Emiko Furuichi, Shin'ichi Hashimoto, Yukiko Tsukamoto, Sa ...
    1994 Volume 5 Pages 154-155
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We have developed a deductive database system, PACADE, for analyzing three-dimensional and secondary structures of proteins. A function newly introduced to PACADE is described here. It makes it possible to compute the closure of indirect similarity relationships among protein structures.
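If similarity is treated as a symmetric relation, the closure of indirect similarity groups together all structures connected by a chain of direct similarities. The sketch below computes such groups with a union-find structure. It is only an illustration of the closure idea; PACADE is a deductive database and its actual mechanism is not described in the abstract. The function name and the protein identifiers are hypothetical.

```python
def similarity_closure(pairs):
    """Group items connected by chains of direct similarity (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())

if __name__ == "__main__":
    direct = [("protA", "protB"), ("protB", "protC"), ("protD", "protE")]
    print(similarity_closure(direct))
```

Here protA and protC end up in the same group even though only their similarities to protB were given directly.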
    Download PDF (190K)
  • Kiyoshi Asai, Katunobu Itou, Kentaro Onizuka, Masayuki Akahoshi, Hidet ...
    1994 Volume 5 Pages 156-157
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (192K)
  • Hiroshi Mamitsuka
    1994 Volume 5 Pages 158-159
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We propose a new method for predicting long-range interactions between amino acid residues based on what we term ‘homological correlations.’ Here two amino acid residues in a given sequence are said to be homologically correlated, if the substitution patterns of those positions in sequences homologous to the given sequence are correlated. Our method picks out a pair of amino acids of interest at a time, in general in distant positions, and predicts if they are homologically correlated.
    An important characteristic of our method is that we enhance the input sequence(s) by obtaining sequences homologous to it, not only in training but also in testing. In particular, our method constructs a stochastic rule which takes as input the changes in the pair of positions of interest, and predicts whether or not there exists a long-range interaction between those positions, or more precisely it gives the likelihood for the pair to comprise a long-range interaction. On the basis of the likelihood calculated for each pair, our method finally predicts the pairs of positions comprising a strong long-range interaction, using a two-stage prediction method which consists of a type of heuristic-search algorithm and the Boltzmann annealing technique [1].
    In this paper, as a preliminary experiment for demonstrating the effectiveness of our method, we focus on the problem of predicting the locations of disulfide bonds, which are a good example of long-range interactions. Disulfide bonds are covalent bonds which form between the side chains of two cysteine residues that are adjacent in the three-dimensional structure but located in distant positions in the primary sequence (e.g. [2]). Thus the problem of predicting the locations of disulfide bonds here is to determine, in a given sequence with unknown disulfide bonds, the pairs of cysteines that each form a disulfide bond.
    In our experiments, we extracted four proteins from the PDB_LIST 35% LIST [3] to meet the condition that at least 50 additional sequences homologous to each of the four proteins are available from the HSSP (Homology derived secondary structure of proteins) database [4] Ver 1.0. The four proteins, each of which has less than 35% homology to the other three, are shown in Table 1. Our experimental result shows that, even when only one of the four proteins is used as training data, our method was able to predict all of the locations of disulfide bonds in all four proteins.
    This result indicates that there exists a clear correlation between the substitutions of amino acids at any two positions which comprise a long-range interaction such as a disulfide bond. This result also suggests that our homological correlation based method is potentially useful in identifying various types of long-range interactions, such as helix-helix or helix-sheet contacts, all of which are thought to be crucial keys to predicting protein three-dimensional structures (e.g. [2]). At present, the biggest disadvantage of our homological correlation based method is that it requires a number of homologous sequences for a given input in both learning and prediction. This difficulty should be overcome in the future by the immense increase in determined sequences brought by various genome sequencing projects, and homological correlation will then be of great use in predicting the various long-range interactions described above.
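The notion of correlated substitution patterns at two positions can be made concrete with a simple statistic. The sketch below uses mutual information between two alignment columns, which is a swapped-in, generic measure chosen for illustration; it is not the stochastic rule or the two-stage prediction method of the paper. The function name and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    The alignment is a list of equal-length strings, one per homologous
    sequence. High values mean the residues at i and j tend to change together.
    """
    n = len(alignment)
    ci = Counter(s[i] for s in alignment)
    cj = Counter(s[j] for s in alignment)
    cij = Counter((s[i], s[j]) for s in alignment)
    return sum((c / n) * log2((c / n) / ((ci[a] / n) * (cj[b] / n)))
               for (a, b), c in cij.items())

if __name__ == "__main__":
    # Columns 0 and 2 switch between C and S together; column 1 never changes.
    toy = ["CAC", "SAS", "CAC", "SAS"]
    print(mutual_information(toy, 0, 2), mutual_information(toy, 0, 1))
```

In the toy alignment, columns 0 and 2 are perfectly correlated while column 1 is invariant and therefore carries no information about column 0.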
    Download PDF (191K)
  • Haretsugu HISHIGAKI, Tamio YASUKAWA
    1994 Volume 5 Pages 160-161
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Molecular evolution processes of cysteine and serine proteases were followed in a reversed way by the combined use of the inverted Dayhoff matrix and the 3D-1D method.
    Download PDF (201K)
  • Hiroaki KATO, Yoshimasa TAKAHASHI
    1994 Volume 5 Pages 162-163
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    This paper describes an approach to three-dimensional (3-D) substructure search using graph-theoretical algorithms, and its application to the analysis of 3-D structural features of proteins. An abstract representation of protein 3-D structures is also devised from this analysis. The details of the approach will be discussed with a couple of illustrative examples that involve motif search using Protein Data Bank (PDB) files.
    Download PDF (191K)
  • Motokazu KAMIMURA, Yoshimasa TAKAHASHI
    1994 Volume 5 Pages 164-165
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    This paper describes the discrimination of φ-ψ conformational pattern classes for protein amino acid residues, which were defined in our previous works. A statistical discriminant analysis technique was employed for the present analysis. Each residue was characterized by its peripheral physicochemical environment. The environment was described by a vector whose components involve the van der Waals volume, the hydrophobic parameter π, and the partial charges of a carbon atom, the hydrogen atom of NH and the oxygen atom of C'O of ten neighbor residues (five neighbors on each terminal side of the target residue). The discriminant functions obtained with 67 proteins taken from the PDB file correctly discriminated 58.3% of the residues for their conformational pattern classes.
    Download PDF (201K)
  • K. Nakata
    1994 Volume 5 Pages 166-167
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (111K)
  • Takashi Ishikawa, Shigeki Mitaku, Takao Terano, Takatsugu Hirokawa, Ma ...
    1994 Volume 5 Pages 168-169
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Protein function prediction from amino-acid sequences is one of the major tasks in genome informatics. To predict the functions of a given amino-acid sequence, we can use similarities among functions and structural features of amino-acid sequences, i.e., motifs and homology. The difficulties of previous function prediction methods arise from the facts that few motifs are known and that proteins of similar sequence may not have similar functions. A main objective of our research is to facilitate the discovery of functional features of proteins using machine learning techniques.
    Our hypothesis for protein function prediction is that a protein's function arises from its physical structure. Since protein structures are built by physico-chemical interactions among amino acids, there might exist some features of amino-acid sequences that reflect these physico-chemical interactions. We call these features ‘functional features’. We know from the tertiary structure of bacteriorhodopsin, and from the localization of polar amino acids in that structure, that electric interactions exist among its alpha-helices. If the amino-acid localization of bacteriorhodopsin is closely related to the function of the protein, we can use this functional feature to predict protein function.
    To create rules to predict protein functions, we use three machine learning techniques (Fig. 1). The first technique is analogical reasoning, used to make assumptions about functional features. For example, if polar amino acids are localized in some proteins, then, by analogy with the fact about bacteriorhodopsin, the localization might imply a relation between the functional features and the functions of those proteins. The second technique is inductive reasoning, used to generalize the hypothesis made by analogical reasoning. The goal of inductive reasoning for protein function prediction is to decide which localization pattern is most useful for classifying protein functions. The third technique is deductive reasoning, used to refine the localization pattern into classification rules. In deductive reasoning, knowledge about protein functions and structures is used to build logical descriptions of the classification rules.
    We have carried out some experiments to implement our idea of finding functional features of proteins using machine learning techniques. First, we simulated the analogical reasoning process to create a hypothesis about the functional features of bacteriorhodopsin using the ABA framework proposed by the authors [1]. At the current stage of our research, this analogical reasoning process is executed by hand simulation, but it will be executed on a computer in the next stage. Next, we analyzed the relation between the functional features and the protein functions of seven-helix membrane proteins using a cluster analysis method. From this analysis, we found that the amino-acid interval frequencies for polar amino acids are closely related to some function classes of the classified proteins. The amino-acid interval frequency feature is thought to be a representation of the abstract functional feature ‘localization of amino acids’. Based on the result of this cluster analysis, we can use the functional features for inductive reasoning in the next step.
    In the preliminary experiments described above, we found new functional features for classifying protein functions from amino-acid sequences. Specifically, these features can discriminate between different functions of proteins that have similar amino-acid sequences in homology analysis. Furthermore, the features can recognize proteins with the same function that do not have similar sequences. From these results we conclude that our idea is useful for predicting protein functions. In the next stage of the research, we plan to refine the classification rules and to integrate the three machine learning techniques.
    Download PDF (197K)
  • Kenta Nakai, Ayumi Shinohara, Satoru Miyano
    1994 Volume 5 Pages 170-171
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    In this age of large-scale sequencing, we have many “potentially expressed” amino acid sequences of unknown function. Characterization of such sequences by computer is undoubtedly useful for further experimental analyses. We have developed a knowledge-based system, PSORT, for characterizing various sorting signals potentially encoded in amino acid sequences and for predicting their final localization sites in cells [1, 2]. The system calculates the probability (certainty factor) that an input protein is localized at each candidate site. One difficulty with our system is that, since it has many adjustable parameters, optimizing them for given training data is difficult. Therefore, incorporating recent knowledge into the system has not been easy. We present here a simple scheme for assigning certainty-factor parameters with a given reasoning tree.
    Since the size of the training data, i.e., sequences of known localization sites, is not large in most cases, we must keep the number of parameters as small as possible. In this case, using our knowledge of the reasoning flow is favorable. Such a flow can be organized into a reasoning tree, in which an input flux is divided into thinner flows step by step according to some characteristic values calculated from the input sequence (Fig. 1). Its final outputs are flows corresponding to candidate localization sites. At this stage, the amount of each flow can be interpreted as the corresponding certainty factor. Thus, the problem is how to find appropriate functions that transform the characteristic value at each step so as to optimize classification performance on the training data. We used the following formula for that function:
    $$F_p(x_p(i)) = \frac{1}{1 + \exp\left(-10\,(x_p(i) - b_p)\right)}$$
    where $x_p(i)$ represents a characteristic value of sequence $i$ at step $p$, e.g., the propensity that the input sequence $i$ encodes a membrane protein, and $b_p$ is a threshold value chosen so that the training data at step $p$ are classified with the fewest mistakes. The certainty factor for localization at a candidate site is then calculated as the probability of choosing the corresponding path; e.g., the certainty factor for a protein $i$ to localize at site #3 is $F_1(i) \times F_2(i) \times (1 - F_4(i))$ in Fig. 1.
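    The scheme can be illustrated with a minimal Python sketch. It is not the PSORT implementation: the function names, the toy training values, and the step outputs are hypothetical, and the threshold search (trying midpoints between sorted training values) is only one plausible reading of "classify the training data with least mistakes".

```python
import math

def step_function(x, b, steepness=10.0):
    """F_p(x) = 1 / (1 + exp(-steepness * (x - b))), written to avoid overflow."""
    z = steepness * (x - b)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def best_threshold(values, labels):
    """Return the threshold b with the fewest training mistakes.

    Assumption: a sequence is sent down the positive branch when x > b,
    and candidate thresholds are midpoints between sorted distinct values.
    """
    distinct = sorted(set(values))
    candidates = [(lo + hi) / 2.0 for lo, hi in zip(distinct, distinct[1:])]
    if not candidates:  # only one distinct training value
        candidates = distinct

    def mistakes(b):
        return sum((x > b) != bool(y) for x, y in zip(values, labels))

    return min(candidates, key=mistakes)

def path_certainty(steps):
    """Multiply F_p along a path; a negative branch contributes (1 - F_p)."""
    certainty = 1.0
    for f_value, positive_branch in steps:
        certainty *= f_value if positive_branch else (1.0 - f_value)
    return certainty

# Hypothetical training data for one step of the reasoning tree.
values = [0.1, 0.2, 0.35, 0.6, 0.7, 0.9]
labels = [0, 0, 0, 1, 1, 1]
b = best_threshold(values, labels)

# Hypothetical step outputs with the same shape as the site #3 example,
# F1 * F2 * (1 - F4).
certainty_site3 = path_certainty([(0.9, True), (0.8, True), (0.25, False)])
print(b, step_function(0.6, b), certainty_site3)
```

    Because each step contributes only one fitted parameter, the number of parameters grows with the number of steps in the tree rather than with the number of candidate sites.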
    To test the validity of our model, we prepared 156 sequences of Bacillus subtilis whose localization sites are the prediction results of PSORT. The cross-validation test showed rather good results. Thus, although there is no theoretical proof that our model always gives good results, we hope it will be used for future improvement of PSORT. Moreover, because of its simplicity, this method may be generally applicable to interpreting unknown sequence data with the latest knowledge of molecular cell biology.
    Download PDF (174K)
  • Hiroshi Nakashima, Ken Nishikawa
    1994 Volume 5 Pages 172-173
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (184K)
  • Yukihiro Eguchi, Yuzo Ueda
    1994 Volume 5 Pages 174-175
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (175K)
  • A Method for Knowledge Acquisition from Amino Acid Sequences
    Hideaki Nakakuni, Takeo Okazaki, Satoru Miyano
    1994 Volume 5 Pages 176-177
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (210K)
  • Tetsushi Yada, Masato Ishikawa, Hidetoshi Tanaka, Kiyoshi Asai
    1994 Volume 5 Pages 178-179
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (242K)
  • Mikita Suyama, Takaaki Nishioka, Jun'ichi Oda
    1994 Volume 5 Pages 180-181
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We have developed a program GAPE (Gap Allowing Pattern Explorer) to extract amino acid sequence motifs conserved among distantly related proteins. The GAPE program is designed to allow a gap in the sequences. When the program is applied to some ligand-related consensus sequences, motifs extracted with low expectation of occurrence contain some of the amino acid residues chemically proved to be involved in the ligand recognition.
    Download PDF (201K)
  • O. Gotoh
    1994 Volume 5 Pages 182-183
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (202K)
  • Hiroshi Tanaka, Fengrong Ren, Norio Fukuda
    1994 Volume 5 Pages 184-185
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    The maximum likelihood method and the minimum evolution method are the main methods used to reconstruct molecular phylogenetic trees. From a statistical viewpoint, however, each has its own problems. To address this, we combine the maximum likelihood method with the minimum evolution method within the framework of Bayesian maximum a posteriori (MAP) estimation.
    Download PDF (202K)
  • Shigehiko Kanaya, Toshimichi Ikemura, Yoshihiro Kudo
    1994 Volume 5 Pages 186-187
    Published: 1994
    Released on J-STAGE: November 16, 2011
    JOURNAL FREE ACCESS
    Rapid advances in experimental and theoretical techniques in genetics have yielded abundant and useful information, but also difficulty in data processing and interpretation. Indeed, with the increase in nucleotide sequence data accumulated by pioneers in the field, some relations between gene function and codon usage have been clarified. As the data increase, it becomes more difficult to characterize these relations systematically, because we must examine the frequencies of 64 kinds of codons over a great number of genes. Multivariate analysis is expected to make it possible to overcome this difficulty and to characterize genes in terms of codon usage.
    In order to investigate the factors involved in the diversity of Escherichia coli genes in terms of codon usage, and to clarify some relations between codon usage and gene function, we have constructed a data set consisting of about two thousand genes with the following information: gene name, gene function and codon usage. In the present paper, we identify some relations between codon usage and gene function by means of a principal component analysis of this data set.
    To exclude the effect of amino acid composition on codon usage, the frequencies of the codons in each synonymous group were first normalized to unity, and all the data were represented as a matrix $X_{ij}$, where $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, M$ ($N$ and $M$ denote the number of genes and the number of codons used, respectively). For the $i$th gene ($i = 1, 2, \ldots, N$), the vector of normalized codon frequencies, $(x_{i1}, \ldots, x_{ij}, \ldots, x_{iM})$, is transformed into the vector of principal components, $(z_{i1}, \ldots, z_{ij}, \ldots, z_{iM})$, under the following conditions: (1) the correlation between any two distinct principal components $Z_k$ and $Z_{k'}$ is zero, and (2) the first principal component, $Z_1$, is the linear combination of the variables $X_j$ with the largest variance, the second principal component, $Z_2$, is the linear combination with the second largest variance, and so on:
    $$Z_k = b_{k1} X_1 + \cdots + b_{kM} X_M \quad (k = 1, 2, \ldots, M)$$
    where [expression missing in the source]
    By plotting the genes on a map spanned by the first k principal components, we can grasp the proximities among them in terms of the structure of their codon usage.
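    The normalization and projection steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' program: only two synonymous groups are listed, the four toy coding sequences are invented, and the principal axes are found by power iteration with deflation on the covariance matrix, a technique chosen here to keep the example free of external libraries.

```python
import math

# Two synonymous groups from the standard genetic code, kept small for
# illustration; a real analysis would list every degenerate amino acid.
SYNONYMOUS_GROUPS = {"Phe": ["TTT", "TTC"], "Lys": ["AAA", "AAG"]}
CODONS = [c for group in SYNONYMOUS_GROUPS.values() for c in group]

def normalized_codon_usage(cds):
    """Codon frequencies scaled so that each synonymous group sums to 1."""
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in counts:
            counts[codon] += 1
    row = []
    for group in SYNONYMOUS_GROUPS.values():
        total = sum(counts[c] for c in group)
        row.extend(counts[c] / total if total else 0.0 for c in group)
    return row

def center(rows):
    """Subtract the column mean from every entry."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already centered rows."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ca, cb)) / (n - 1) for cb in cols]
            for ca in cols]

def mat_vec(matrix, v):
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

def principal_axes(cov, k, iterations=1000):
    """Top-k (variance, axis) pairs by power iteration with deflation."""
    m = len(cov)
    work = [row[:] for row in cov]
    axes = []
    for _ in range(k):
        start = [1.0 / (j + 1) for j in range(m)]  # arbitrary fixed start
        scale = math.sqrt(sum(x * x for x in start))
        v = [x / scale for x in start]
        for _ in range(iterations):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-15:  # no variance left in this direction
                break
            v = [x / norm for x in w]
        variance = sum(a * b for a, b in zip(v, mat_vec(work, v)))
        axes.append((variance, v))
        for a in range(m):  # remove the component just found
            for b in range(m):
                work[a][b] -= variance * v[a] * v[b]
    return axes

# Invented toy coding sequences, one per gene.
genes = {
    "g1": "TTT" * 3 + "TTC" + "AAA" * 3 + "AAG",
    "g2": "TTT" + "TTC" * 3 + "AAA" + "AAG" * 3,
    "g3": "TTT" * 2 + "TTC" * 2 + "AAA" * 3 + "AAG",
    "g4": "TTT" * 2 + "TTC" * 2 + "AAA" + "AAG" * 3,
}
X = [normalized_codon_usage(seq) for seq in genes.values()]
centered = center(X)
cov = covariance(centered)
axes = principal_axes(cov, k=2)
total_variance = sum(cov[j][j] for j in range(len(cov)))
explained = [variance / total_variance for variance, _ in axes]
# Coordinates of each gene on the (PC1, PC2) map.
scores = [[sum(x * b for x, b in zip(row, axis)) for _, axis in axes]
          for row in centered]
print(explained)
```

    Plotting the rows of `scores` gives the two-dimensional gene map described above; `explained` holds the fraction of the total variance carried by each axis.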
    We have assembled a data set with the following information: gene name, category name [M. Riley, Microbiol Rev., 54, 862-952, 1992] of the cellular function of the gene product, genomic map position, and complete coding sequence extracted from DDBJ (Release 18, 1994). The data set consists of 1528 genomic coding genes, 26 transposon-related genes, 106 plasmid genes, and 574 genomic open reading frames of unknown function (simply called ORFs). The first two components (PC1 and PC2) each account for more than 10% of the original variance (24.8% and 14.5%, respectively). The loadings of PC1 are negatively correlated with the preference codons reported previously [T. Ikemura, In Hatfield, D. L., Lee, B. J. and Pirtle, R. M.(Ed.), Transfer RNA in protein synthesis., pp.87-111, CRC Press, London.]. This suggests that the largest diversity among genomic genes is mainly explained by preference codon usage.
    Download PDF (222K)
  • Y. Nakamura, T. Fukagawa, K. Sugaya, T. Ikemura
    1994 Volume 5 Pages 188-189
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (223K)
  • Motoe Sasanuma, Zhong-qing Wang, Kazuhiro Shibata, Syun-ichi Sasanuma, ...
    1994 Volume 5 Pages 190-191
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (169K)
  • Zhong-qing Wang, Yasufumi Murakami, Toshihiko Eki, Akira Oyama, Yukihi ...
    1994 Volume 5 Pages 192-193
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (276K)
  • T. Nishikawa, S. Hiraoka, N. Kasahara, K. Nagai
    1994 Volume 5 Pages 194-195
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We developed a program that determines whether or not a query sequence is included in a database within a permitted matching error rate. It consists of two steps: bit-table filtration and dynamic programming matching. The bit-table filtration quickly excludes many sequences that have no relation to the query sequence, without missing any sequence that matches the query within the given error rate. When the program was applied to large-scale grouping of human cDNA, it took only one tenth of the time required by FASTA to group all human cDNA.
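    The abstract does not give the details of either step, so the following Python sketch shows only one way such a two-step search could work. Everything in it is an assumption: the bit table records which k-mers occur in a database sequence, the filter applies the q-gram lemma (a match with at most e edit errors preserves at least (m - k + 1) - k·e of the query's k-mer positions, where m is the query length), and the verification step is Sellers' approximate substring dynamic programming. Sequences are assumed to contain only A, C, G and T.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_codes(seq, k):
    """Yield an integer code (base-4 number) for every k-mer of seq."""
    for i in range(len(seq) - k + 1):
        code = 0
        for base in seq[i:i + k]:
            code = code * 4 + BASE_CODE[base]
        yield code

def bit_table(seq, k):
    """Bit i is set when the k-mer with code i occurs somewhere in seq."""
    table = 0
    for code in kmer_codes(seq, k):
        table |= 1 << code
    return table

def passes_filter(query, table, k, max_errors):
    """Lossless filter based on the q-gram lemma."""
    hits = sum((table >> code) & 1 for code in kmer_codes(query, k))
    needed = (len(query) - k + 1) - k * max_errors
    return hits >= needed

def best_substring_distance(query, target):
    """Minimum edit distance between query and any substring of target."""
    previous = list(range(len(query) + 1))  # empty target prefix
    best = previous[-1]
    for t in target:
        current = [0]  # a match may start at any target position
        for i, q in enumerate(query, start=1):
            current.append(min(previous[i] + 1,             # skip target base
                               current[i - 1] + 1,          # skip query base
                               previous[i - 1] + (q != t)))  # (mis)match
        best = min(best, current[-1])
        previous = current
    return best

def search(query, database, k=4, error_rate=0.1):
    """Return names of database sequences containing query within error_rate."""
    max_errors = int(error_rate * len(query))
    matches = []
    for name, seq in database.items():
        # In practice the tables would be built once, ahead of any query.
        if not passes_filter(query, bit_table(seq, k), k, max_errors):
            continue
        if best_substring_distance(query, seq) <= max_errors:
            matches.append(name)
    return matches

# Invented toy data.
query = "ACGTACGTTAGC"
database = {
    "hit": "GGGACGTACGTTAGCGGG",        # exact occurrence
    "near": "CCCACGTACCTTAGCCCC",       # one substitution
    "unrelated": "TTTTTTTTTTTTTTTTTT",  # rejected by the filter alone
}
result = search(query, database)
print(result)
```

    The filter touches one bit per query k-mer, so most unrelated sequences are discarded before the quadratic-time dynamic programming step is run.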
    Download PDF (204K)
  • Computer software for entry and analysis of the human genome data
    Shinsei Minoshima, Nobuyoshi Shimizu
    1994 Volume 5 Pages 196-197
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (172K)
  • Software Tools for Genome Mapping and Sequencing
    Akira Suyama, Akira Ohyama, Masami Hagiya, Yoshiaki Furuhata, Takashi ...
    1994 Volume 5 Pages 198-199
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (207K)
  • T. Niiyama, A. Takeuchi, K. Kotani, I. Uchiyama, A. Ogiwara, K. Nakai
    1994 Volume 5 Pages 200-201
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (169K)
  • Yutaka Akiyama, Takahiro Yakoh, Hirotada Mori, Naotake Ogasawara
    1994 Volume 5 Pages 202-203
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We have been developing a new server-client version of “Genomatica”, an integrated data management and analysis tool for supporting genome sequencing projects. Now the client browser can work even with no local data file, retrieving the latest genome information maintained at the Genomatica server sites. The server-client communication is based on HTTP (HyperText Transfer Protocol).
    The previous version of the Genomatica system [1] was designed for use by expert information managers of a genome sequencing project. The old system always required a large amount of disk space on the local machine for storing several files of whole-genome data.
    The new server-client version is upward-compatible with the previous system and allows general users to access genome information files maintained at remote server sites. The location of each local or remote file can be customized in the system configuration menu using URL (Uniform Resource Locator) notation.
    To improve response time, users may keep local copies of some basic data files downloaded from the Genomatica servers. Users can also overlay their private genome sequences and/or comments onto the public genome information supplied by the server.
    The system runs on Unix workstations with the X11-Motif window environment. The client program and several E. coli and B. subtilis data files are available on the GenomeNet ftp server (ftp.genome.ad.jp).
    Download PDF (173K)
  • Susumu Goto, Satoru Kuhara, Minoru Kanehisa, Toshihisa Takagi
    1994 Volume 5 Pages 204-205
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (186K)
  • Sequence Motif Analysis and Retrieval Tool
    A. Ogiwara, T. Takagi, I. Uchiyama, M. Kanehisa
    1994 Volume 5 Pages 206-207
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We present further improvements to the computer system SMART (sequence motif analysis and retrieval tool), which assists the biological interpretation of sequence data by searching for sequence motifs in a query sequence and annotating the functional features associated with the motifs found. The new version of the system fully utilizes network communication based on a client-server model, so that users run only the client program on their workstations without any local database resources. In the previous version, SMART could treat either PROSITE or MotifDic as a motif dictionary, but in the new release, a new motif dictionary characterizing structural groups derived from PDB is also available. SMART runs on Sun workstations using the XView graphical user interface.
    Download PDF (287K)
  • Kagehiko Kitano, Atsushi Ogiwara, Toshihisa Takagi
    1994 Volume 5 Pages 208-209
    Published: 1994
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    This paper presents an overview of Gidre: Genome Integrated Database Retrieval Environment. Gidre provides biological researchers with facilities to access information of interest. Through its graphical user interface, Gidre allows users to refer easily to various genome databases and to execute many useful genome applications with a pointing device. We adopted a ‘client/server’ mechanism as Gidre's model. With this flexible structure, we can easily expand Gidre's functions and components. It currently runs on Sun workstations.
    Download PDF (375K)