-
Tatsuya Akutsu
1993 Volume 4 Pages 1-9
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
In this paper, we consider pattern matching problems for three-dimensional protein structures. In particular, we consider the problems of substructure search and common substructure search. First, we show that the common substructure search problem among multiple protein structures is very difficult from the theoretical viewpoint of computational complexity. Next, we present two practical algorithms: a least-squares hashing method and a dynamic matching method. In the least-squares hashing method, the hashing technique, which is well known in computer science, is combined with a least-squares fitting technique. In the dynamic matching method, the dynamic programming technique, which is widely used for pattern matching of DNA and amino acid sequences, is combined with a least-squares fitting technique. These two methods have been applied to PDB (Protein Data Bank) data and shown to be effective.
-
MAKOTO HIROSAWA, REIKO TANAKA, MASATO ISHIKAWA
1993 Volume 4 Pages 10-16
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
The representation of biological concepts in a knowledge base is important for a machine or a non-specialist in biology to understand and analyze genetic information. In our previous study, we examined the representation of biological knowledge, in particular knowledge related to protein motifs, with the goal of discovering new motifs.
In this paper, we first list the requirements for the representation of biological knowledge and then state solutions to these requirements. Finally, we show a representation of biological knowledge on motifs in the Deductive Object-Oriented Language QUIXOTΣ. The knowledge base is built on Prosite, a representative motif database.
-
Hideo Matsuda
1993 Volume 4 Pages 17-24
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
This paper proposes a prototype system for querying genomic databases based on data-parallel logic programming. Efficient access to genomic databases is crucial, given the enormous increase in sequence data. By using a logic programming language, the system allows a user to perform adaptable data retrieval on integrated data objects in a single declarative framework. In addition, by utilizing data-parallel processing, it provides efficient access to a large amount of genomic data in a distributed computing environment. We present its design principle and discuss the implementation of the database system.
-
Kenji Satou, Emiko Furuichi, Satoru Kuhara, Toshihisa Takagi
1993 Volume 4 Pages 25-35
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
We developed a deductive database system, PACADE, for analyzing the three-dimensional and secondary structures of proteins. PACADE is equipped with a function to search for similar structures in proteins. Unlike other approaches based on calculation of the inter-atomic root mean square distance, this function is based on logic programming and source-level rule rewriting techniques.
We describe herein the results of searches for topologically similar structures and three-dimensionally similar ones. A user of PACADE can select between these two levels of similarity by adding or deleting prefixes.
-
Applications to Modeling RNA
Yasubumi Sakakibara, Michael Brown, Rebecca C. Underwood, Saira I. Mia ...
1993 Volume 4 Pages 36-45
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
Stochastic context-free grammars (SCFGs) are applied to the problems of folding, aligning and modeling families of homologous RNA sequences. These models capture the common primary and secondary structure of the sequences with a context-free grammar, much like those used to define the syntax of programming languages. SCFGs generalize the hidden Markov models used in related work on protein and DNA sequences. The novel aspect of this work is that the SCFGs developed here are learned automatically from initially unaligned and unfolded training sequences. To do this, a new generalization of the forward-backward algorithm, commonly used to train hidden Markov models, is introduced. This algorithm is based on tree grammars, and is more efficient than the inside-outside algorithm, which was previously proposed to train SCFGs. This method is tested on the family of transfer RNA (tRNA) sequences. The results show that the model is able to reliably discriminate tRNA sequences from other RNA sequences of similar length, that it can reliably determine the secondary structure of new tRNA sequences, and that it can produce accurate multiple alignments of large collections of tRNA sequences. The model is also extended to handle introns present in tRNA genes.
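As an illustration of how an SCFG assigns a probability to a sequence, the following minimal sketch implements the inside pass of the inside-outside algorithm that the abstract compares against. It is not the authors' tree-grammar training algorithm. The function name `inside_probability`, the rule dictionaries, and the toy grammar (nested a-u pairs, in Chomsky normal form) are all assumptions made for this example.

```python
from collections import defaultdict

def inside_probability(seq, start, unary, binary):
    """Probability that a CNF stochastic grammar generates seq (inside algorithm).

    unary:  {(A, terminal): prob} for rules A -> terminal
    binary: {(A, B, C): prob}     for rules A -> B C
    """
    n = len(seq)
    # inside[i][j][A] = probability that nonterminal A derives seq[i..j]
    inside = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(seq):
        for (a, terminal), p in unary.items():
            if terminal == symbol:
                inside[i][i][a] += p
    # Longer spans are built from every split into two shorter spans.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (a, b, c), p in binary.items():
                for k in range(i, j):
                    inside[i][j][a] += p * inside[i][k][b] * inside[k + 1][j][c]
    return inside[0][n - 1][start] if n else 0.0

# Hypothetical toy grammar generating nested a-u pairs: a^n u^n
TOY_UNARY = {("A", "a"): 1.0, ("U", "u"): 1.0}
TOY_BINARY = {
    ("S", "A", "X"): 0.4,  # S -> a S u, split as S -> A X, X -> S U
    ("X", "S", "U"): 1.0,
    ("S", "A", "U"): 0.6,  # innermost pair
}
```

With this toy grammar, a sequence of properly nested pairs gets a non-zero probability, while a sequence that cannot be derived (for example "aua") gets zero.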
-
Hiroshi Mamitsuka
1993 Volume 4 Pages 46-55
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
We propose a new method for representing a local region of a protein sequence as a probabilistic network. The method produces, from a large number of examples of a local region, a network which describes the dependency relationships that exist among amino acid residues in the region. The network is constructed using a greedy search algorithm based on the minimum description length (MDL) principle. In our experiments, we construct two probabilistic networks for two α-helix regions in globin family proteins. Experimental results show that our method provides a visual aid to understanding inter-residue dependencies of those regions with probabilistic networks, and that the networks capture several important features which are peculiar to those regions.
-
Yukiko Fujiwara, Akihiko Konagaya
1993 Volume 4 Pages 56-64
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
In this paper, we study the application of HMMs to the problem of representing protein sequences by a stochastic motif. A stochastic (protein) motif represents the portions of protein sequences that have a certain function or structure, where conditional probabilities are used to deal with the stochastic nature of the motif. We propose the iterative duplication method for HMM network learning. HMMs are much more expressive than symbolic patterns and are better suited to represent the variety of protein sequences. As an experiment, we constructed HMMs for the leucine zipper motif using 112 protein sequences as a training set, and obtained an accuracy of 79.3 percent in the prediction of protein sequences, compared with an accuracy of 14.8 percent when using a symbolic representation. Our approach can also be used for the validation of protein databases; the automatically constructed HMM has indicated that one protein sequence annotated as “leucine-zipper like sequence” in the database is quite different from other leucine-zipper sequences in terms of likelihood.
-
Masami Hagiya
1993 Volume 4 Pages 65-73
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
The problem of constructing contigs by the STS strategy is a simple combinatorial problem if the given hit information is correct and complete. However, hit information is often incorrect or incomplete due to failure or inability of experiments. Moreover, in addition to hit information, various sources of information are also available, such as known landmarks, other clone libraries, etc. In order to cope with incompleteness, incorrectness and additional information, we developed a deductive method for constructing contigs. Contigs are constructed by deducing an equivalence relation of clone directions and a partial order among STS markers on each equivalence class of directions. In the paper, a practical algorithm based on the method is presented and its completeness is proved. The method is also axiomatized by a set of inference rules for deducing the equivalence relation and the partial orders. We finally discuss the problem of visualizing contigs based on the information deduced by our method.
-
Ayumi Shinohara, Satoru Miyano, Setsuo Arikawa, Shinichi Shimozono, To ...
1993 Volume 4 Pages 74-83
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
We have developed a machine learning system, BONSAI, which takes positive and negative examples as inputs and produces a pair consisting of a decision tree over regular patterns and an alphabet indexing as a hypothesis. This paper proposes two applications of BONSAI for the case where multiple BONSAI systems can be run in parallel.
The first is to classify given examples which come from several different unknown classes. The process of solving the problem consists of multiply spawned BONSAI systems, each of which tries to find a decision tree, an alphabet indexing and a group of examples. It will finally partition a hodgepodge of sequences into a small number of disjoint classes together with hypotheses explaining these classes accurately.
The second is to find a good sample of a concept. Though the main interest in applying the BONSAI system is to discover good hypotheses, it is equally interesting to find a small set of examples from which a good hypothesis is made. We present a method for solving this problem by combining a strategy from genetic algorithms with multiply running BONSAI systems.
-
MASATO ISHIKAWA, TOMOYUKI TOYA, YASUSHI TOTOKI, AKIHIKO KONAGAYA
1993 Volume 4 Pages 84-93
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
This paper proposes a new methodology to improve the performance of multiple sequence alignment by combining a genetic algorithm and an iterative alignment algorithm. Iterative alignment algorithms usually achieve better alignments than other alignment algorithms, such as tournament-based multiple alignment. They can also incorporate parallelism to improve execution performance. However, they sometimes become trapped in local optima and produce relatively low-quality alignments due to their rapid convergence. A genetic algorithm can solve this problem by exchanging partial alignment sequences between “individuals”. Our experiments show that the combination of a genetic algorithm and an iterative alignment algorithm produces better results than iterative aligners which employ hill-climbing search strategies.
-
Shiho ARAKI, Masahiro GOSHIMA, Shin-ichiro MORI, Hiroshi NAKASHIMA, Sh ...
1993 Volume 4 Pages 94-102
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
This paper makes two proposals to speed up the Parallel Iterative Method, which is based on the iterative strategy of the Berger-Munson algorithm.
The first proposal is to exploit finer-grained parallelism in the DP (Dynamic Programming) procedure itself. This proposal makes the processing speed proportional to the number of processors.
The second proposal is to apply the A* algorithm, a well-known heuristic search algorithm, instead of DP. A* reduces the search space using heuristics, while DP traverses the whole space blindly.
We have implemented these two proposals on a parallel computer, the AP1000. In a test of parallelizing DP, ten 1000-character sequences are aligned using 10 processors per DP procedure at a speed 8.11 times faster than sequential processing. By applying the A* algorithm to 30 sets of test problems, we obtain optimal alignments while reducing the search space by 95%.
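The finer-grained parallelism available inside a DP procedure can be illustrated with a small sketch. The code below is not the authors' AP1000 implementation; it is a sequential, unit-cost edit distance (an assumption chosen for brevity, with a hypothetical function name) that fills the table one anti-diagonal at a time. Every cell on the same anti-diagonal depends only on the two previous anti-diagonals, so the inner loop is the part that separate processors could share.

```python
def edit_distance_by_antidiagonals(a: str, b: str) -> int:
    """Unit-cost edit distance, filling the DP table one anti-diagonal at a time."""
    n, m = len(a), len(b)
    # d[i][j] = edit distance between a[:i] and b[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    # Cells with i + j == k depend only on anti-diagonals k-1 and k-2,
    # so all cells of one anti-diagonal are independent of each other.
    for k in range(2, n + m + 1):
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[n][m]
```

A real alignment procedure would use a similarity score with gap penalties instead of unit costs, but the dependency pattern between cells is the same.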
-
Naoto Ukiyama, Hiroshi Imai
1993 Volume 4 Pages 103-108
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
This paper addresses several issues in parallel multiple alignment and reports some preliminary computational results of an implementation on the CM5. Emphasis is placed on the use of parallelism in the diagonal direction, which is especially useful when aligning similar strings. Some connections with the parallel approximate string matching algorithm by Landau and Vishkin [1] are also touched upon.
-
Osamu Gotoh
1993 Volume 4 Pages 109-113
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
Given a multiple sequence alignment of a family of protein or nucleotide sequences, conserved or highly variable regions are valuable landmarks to get insight into the functional and structural roles of individual regions. Conserved regions can also act as anchor points in the process of further improvement of the given alignment. Two different approaches were undertaken to extract conserved regions based on the principle of either consistency or high scores. The latter approach is easily modified to extract highly variable regions by reversing the scoring scheme. Examinations on a few protein families are discussed.
-
Y. Seto, Y. Ikeuchi, M. Isoyama
1993 Volume 4 Pages 114-119
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
Motifs are essential sites and are therefore usually conserved in proteins. Motifs play a crucial role not only in the protein world but also in genome projects. Information about them is usually obtained by experiments and laborious multiple sequence alignment. Based on the fact that motifs are conserved short sequences, we developed a method for extracting motifs automatically from pairwise sequence alignments. Proteins moderately similar to a probe protein are searched for among all entries in a sequence database. Motifs of the probe are then extracted from each pairwise alignment under the specified restrictions. We applied the method to 389 probe proteins from 89 superfamilies in the PIR database and evaluated the extracted motifs.
-
Gen Shibayama, Hiroshi Imai
1993 Volume 4 Pages 120-129
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
Detecting similarities of multiple genome sequences is one of the most important topics in genome informatics. For the purpose of finding such similarities, an alignment with the highest score with respect to some similarity criterion is provided as an output. However, the alignment with the best score is not necessarily the most significant alignment of the sequences from the viewpoint of biology. In this respect, providing suboptimal alignments is very useful.
Since finding an alignment of sequences corresponds to finding a path in some directed acyclic graph, we propose a simple algorithm to enumerate all K-best alignments in order, where K may not necessarily be specified beforehand, by finding the K longest paths in the graph. We further consider finding the subgraph formed by such K longest paths. Several useful approaches to find the optimal paths in a graph are also mentioned.
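The correspondence between suboptimal alignments and the K longest paths of a directed acyclic graph can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes K is fixed in advance (the abstract notes that K need not be), that nodes are already numbered in topological order, and the function name and edge format are hypothetical.

```python
import heapq

def k_longest_paths(num_nodes, edges, source, sink, k):
    """Return up to k highest-scoring source->sink paths in a DAG.

    Nodes are 0..num_nodes-1 and are assumed to be numbered in topological
    order (every edge goes from a lower to a higher number).
    edges: list of (u, v, weight) tuples.
    """
    incoming = [[] for _ in range(num_nodes)]
    for u, v, w in edges:
        incoming[v].append((u, w))
    # best[v] = up to k (score, path) pairs for source->v paths, best first
    best = [[] for _ in range(num_nodes)]
    best[source] = [(0, [source])]
    for v in range(source + 1, num_nodes):
        # Any of the k best paths into v extends one of the k best paths
        # into a predecessor of v, so only those need to be considered.
        candidates = [
            (score + w, path + [v])
            for u, w in incoming[v]
            for score, path in best[u]
        ]
        best[v] = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return best[sink]
```

In an alignment setting, the nodes would be the cells of the DP matrix and the edge weights the match, mismatch, and gap scores.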
-
Kiyoshi Asai, Hidetoshi Tanaka, Katunobu Itou, Kentaro Onizuka
1993 Volume 4 Pages 130-139
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
The Hidden Markov Model (HMM), a type of stochastic model (signal source), is now becoming popular in molecular biology. HMMs consist of ‘hidden’ states, state transition probabilities and output distributions. Because there are known algorithms to train HMMs as stochastic representations of the training data, they are widely used for pattern recognition, especially for speech recognition.
In the field of protein research, HMMs have been used to represent stochastic motifs of protein sequences, to model the structural patterns of proteins, to predict secondary structures and upper-level structures, to make multiple sequence alignments, and to classify protein sequences.
In each case, HMM techniques are closely related to the conventional methods. An important merit of HMMs is their flexibility as a model of protein sequences. A serious problem of HMMs is that they need a large amount of training data. In this paper, we give a brief introduction to HMMs, review HMM-related protein research, compare this research with other methods and discuss the usefulness and further possibilities of HMMs.
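To make the three ingredients of an HMM (hidden states, state transition probabilities, output distributions) concrete, here is a minimal sketch of the standard forward algorithm, which computes the likelihood of an observed sequence. The function name, the dictionary layout, and the two-state toy model (states "H" and "C" emitting residues "L" and "G") are assumptions made for this example and do not come from the paper. The sequence is assumed to be non-empty.

```python
def forward_likelihood(obs, states, start_p, trans_p, emit_p):
    """Probability that the HMM emits the non-empty sequence obs (forward algorithm)."""
    # alpha[s] = probability of the symbols seen so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical toy model: "H" (helix-like) and "C" (coil-like) hidden states
TOY_STATES = ["H", "C"]
TOY_START = {"H": 0.5, "C": 0.5}
TOY_TRANS = {"H": {"H": 0.8, "C": 0.2}, "C": {"H": 0.3, "C": 0.7}}
TOY_EMIT = {"H": {"L": 0.7, "G": 0.3}, "C": {"L": 0.2, "G": 0.8}}
```

Likelihoods computed this way are what allow sequences to be compared or classified under competing models, for example a motif model against a background model.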
-
KENTARO ONIZUKA, KIYOSHI ASAI, MASATO ISHIKAWA
1993 Volume 4 Pages 140-151
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
We propose a novel scheme for protein 3D structure prediction using the Multi-level Description scheme (MLD). In this prediction scheme, a local conformation is not only determined by the primary structure at that region (i.e., primary constraints) but is also constrained by the neighboring or surrounding local conformations (i.e., geometric constraints).
The MLD describes a protein conformation with multiple levels of different scales and degrees of abstraction. This scheme facilitates modeling the geometric constraints between neighboring local conformations by analyzing the frequency of overlapping patterns of the local conformations. The primary constraints are modeled by analyzing the relationship between the primary structure and the local conformation at that region.
The MLD representing a real protein conformation must satisfy most of the constraints above. Thus, a protein conformation can be predicted by searching for the optimal MLD that best satisfies the constraints. This problem is formulated as a combinatorial optimization problem.
-
II. Tertiary Structure
Nobuhiko Saitô, Motonori Ota
1993 Volume 4 Pages 152-156
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
The packing mechanism of the secondary structures has been revealed. The driving force is the hydrophobic interaction between hydrophobic residues which are located at the nearest distance along the chain. They are chosen because they can be bound most quickly. In this way, local structures of the protein are determined and thus grow into the whole structure. This process is usually done manually, but here we attempt to carry it out automatically. This formulation is applied to crambin.
-
Makiko SUWA, Takatsugu HIROKAWA, Shigeki MITAKU
1993 Volume 4 Pages 157-166
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
A theoretical method for predicting the structure of membrane proteins was developed based upon physicochemical calculations. It comprises three steps. In the first step, the polar interaction field of a transmembrane helix was characterized by a probe helix method in which the interaction energy between a transmembrane helix and a probe helix was calculated. In the second step, a jigsaw puzzle problem was solved by using binding maps of pairs of helices. The binding energy obtained from the polar interaction field was plotted in a binding map as a function of the orientation angles of the two helices. Finally, the helix configuration determined by the analysis of binding maps was refined by minimizing the binding function of the whole system.
In order to deal with the jigsaw puzzle problem, several principles of the folding of membrane proteins have been assumed: (1) The molecular structure is formed according to some folding pathway. (2) The dominant interaction in the hydrophobic region of the membrane is the polar interaction. (3) A transmembrane helix can be regarded as a stable rod with a charge distribution on it. The comparison of the predicted structure of bacteriorhodopsin with the experimental one revealed that the reconstruction of the relative position and the orientation of transmembrane helices is possible by this method. Applying this method to rhodopsin, the configuration of transmembrane helices was determined, which was quite similar to the experimental configuration of transmembrane helices. The mechanism of the structural change of rhodopsin by cis-trans isomerization of retinal was suggested from the predicted structure.
-
Zenmei OHKUBO, Minoru KANEHISA
1993 Volume 4 Pages 167-174
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
In order to predict protein structures from their primary sequences, the understanding of long-range interactions is one of the most critical points. We are dealing with this problem by focusing on the pairs of peptide segments which are separated in the primary sequence but are close in the three-dimensional structure. The method is applied to a set of structure-resolved proteins to see if there are any significant features for the association of local structures, such as secondary structure segments. The dataset consists of 88 nonhomologous proteins selected from the Brookhaven Protein Data Bank (PDB) using the superfamily classification of the Protein Information Resource (PIR). In the method, given the definition of the distance between two segments, spatially close segment pairs are extracted for Cα segments 4 or 7 residues long. The result shows that there are no preferred distances for the association of two helical segments but that a minimum of twenty intervening residues is required for parallel helical segments.
-
Yôichi IIDA, Takeshi MASUDA
1993 Volume 4 Pages 175-182
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
In vertebrate mRNAs, not only the ATG initiation codon but also the sequences flanking it are required to direct the position of translation initiation. A consensus sequence for the signal, (GCC) GCCGCCATGG, has been proposed by Kozak, but actual initiation sequences differ from it to a greater or lesser degree. In the present report, the translation initiation signal sequences of human β-globin and β-thalassemia mRNAs were analyzed using a quantification method proposed previously. In this method, each 16-nucleotide sequence in the mRNA was characterized by its sample score, which shows the intensity of the signal. Scoring of signal sequences could explain not only the authentic initiation site but also the experimental results of various mutations which took place around the initiation site. Further analysis demonstrated that, in addition to the signal intensity, the sequence nearest the cap site was preferred. This supported Kozak's scanning hypothesis, in which the eukaryotic small ribosomal subunit binds initially at the 5'-end of mRNA and subsequently migrates to the signal sequence.
-
Koji Tajima
1993 Volume 4 Pages 183-187
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
We propose a more sensitive algorithm for multiple sequence alignment using parallel genetic algorithms. With less computation than that needed for multi-dimensional dynamic programming approaches, we can obtain multiple alignments which have better similarity than that obtained by repeating two-dimensional dynamic programming. The parallel processing of genetic algorithms was performed on a Fujitsu parallel computer AP1000.
-
Tsuyoshi Yoshizawa, Masaki Fumoto, Tamio Yasukawa
1993 Volume 4 Pages 188-196
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
A spin glass model for polypeptide chains consisting of 4 states, a, b, c1 and c2, was introduced for the energy minimal conformation search by an extended Hopfield algorithm, in which the energy dissipation rate was gradually reduced to simulate annealing processes. Inter-residue interaction energies were estimated by the molecular mechanics program AMBER using model oligopeptide chains and crystal structure data. Preliminary results obtained with BPTI are not yet satisfactory, and several measures to improve the prediction accuracy are discussed.
-
Fumiyoshi Sasagawa, Koji Tajima
1993 Volume 4 Pages 197-204
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
Usually, the prediction of protein secondary structure by a neural network is based on three states (α-helix, β-sheet and coil). However, recent reports of proteins whose structures have been determined present more detailed secondary structures, such as the 3₁₀-helix, and it is expected that these more detailed secondary structures should also be predicted. Some problematic points in applying neural networks to the prediction of multi-state secondary structures are discussed. The prediction of globular protein secondary structures is studied with a neural network. The application of a neural network with a modular architecture to the prediction of protein secondary structures (α-helix, β-sheet and coil) is presented. Each module is a three-layer neural network. The results from the neural network with a modular architecture and from one with a simple three-layer structure are compared. The overlearning effect is investigated in ordinary and modular neural networks. The prediction accuracy of the neural network with a modular architecture is higher than that of the ordinary neural network. The 3-, 4- and 8-state classification schemes of secondary structures are considered in the ordinary three-layer neural network. The percentage of correct predictions depends on the state classification scheme. Furthermore, for the 3- and 4-state classification schemes, the consistency of the outputs of the modules in the modular neural network is investigated.
-
Kotoko Nakata
1993 Volume 4 Pages 205-210
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
Using a neural network algorithm with the back-propagation training procedure, we analysed zinc finger DNA binding protein sequences. The patterns used in the neural network are the amino acid sequence pattern, electric charge and polarity, amino acid group properties, amino acid ancestral group, hydrophobicity, hydrophilicity and secondary structure. For comparison, discriminant analysis was also tried. For the TFIIIA type (Cys-X(2-4)-Cys-X(12-15)-His-X(3-5)-His; X is any amino acid) zinc finger DNA binding motifs, both the neural network algorithm and the discriminant analysis achieved high discrimination. For the estrogen type (Cys-X(2-4)-Cys-X(12-15)-Cys-X(2-4)-Cys) zinc finger, the result of each single perceptron algorithm is not always good, but the combination of the attributes achieved high discrimination.
-
Motif Evaluation on a 3-D Structure
Kazuhiro Iida, Hiroshi Mamitsuka
1993 Volume 4 Pages 211-218
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
A probabilistic logic neural network, mSDN, reveals multiple biochemical rules hidden in a protein amino-acid sequence. Two motifs are extracted from a 16-residue hemoglobin α-helix region. The motifs, each containing only 3 amino-acid residues, correctly classify new data with 96% accuracy. Evaluating the motifs on a hemoglobin 3-D structure suggests that one motif represents a local α-helix determiner, and the other explains long-range interactions which are important for hemoglobin tertiary structure. The findings indicate that the mSDN extracts region-specific and biochemically significant motifs from an amino-acid sequence, and suggest that the network separates heterogeneous biochemical rules in a sequence into corresponding motifs. Motifs extracted by the mSDN will help us to analyze and predict protein conformations and their functions.
-
Koichi NIIJIMA, Shinichi SHIMOZONO
1993 Volume 4 Pages 219-223
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
Domains for classifying positive and negative patterns are derived by imposing some heteroassociative output conditions on the network. Using the shape of the domain, a functional to be minimized is introduced to determine the connection weights and threshold values of the network. Minimization techniques for the functional, which give learning algorithms for the network, are also discussed. Finally, remarks on numerical experiments are given.
-
Hidetoshi Tanaka, Kentaro Onizuka, Kiyoshi Asai
1993 Volume 4 Pages 224-230
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
The Hidden Markov Model (HMM) introduces a stochastic approach to protein representation and motif abstraction. We need a stochastic classification method that is seamless with HMM representation and abstraction.
Successive State Splitting (SSS) classifies proteins represented by HMMs. It uses no previous knowledge of the proteins. The SSS algorithm was originally developed for allophone modeling. It is based on a continuous distribution of phoneme data. It makes it possible to obtain an appropriate Hidden Markov Network automatically, and an HMM simultaneously. We map amino acids onto a continuous space according to a quantification based on PAM-250.
-
Shigehiko Kanaya, Yoshihiro Kudo
1993 Volume 4 Pages 231-238
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
In order to systematically examine differences in the preferential usage of synonymous codons among species, principal component analysis is applied to a matrix consisting of relative frequencies of synonymous codons. The first two principal components (PC1 and PC2) account for 66% and 8%, respectively. From the PC projection by the first two components, the following conclusions can be drawn: (1) The base preference for A and U (G and C) at the third position in synonymous codons contributes negatively (positively) to PC1: vertebrates and chloroplasts are clustered in narrow regions with positive and the most negative PC1, respectively. (2) PC2 is important for distinguishing between prokaryotes and eukaryotes: eukaryotes (prokaryotes) prefer the di-nucleotides GA, AG, CU and CA (CG, GC, and AA) at the second and third positions in codons.
-
Jun Kusuda, Makoto Hirata, Atushi Toyoda, Ichiro Takahashi, Katsuyuki ...
1993 Volume 4 Pages 239-244
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
To estimate how frequently CpG islands are associated with genes distributed in the human genome, we have screened sequenced human DNAs compiled in a DNA database for statistically expected CpG islands. The survey of 2605 genomic sequences (>300 bp) coding for 833 genes mapped on human chromosomes identified 1030 CpG island-linked sequences classified into 324 genes, indicating that at least 39% of human genes coincide with CpG islands. Furthermore, it is found that 19%, 36% and 45% of CpG islands mapped on single chromosomal bands are located on G-, R- and T-bands, respectively. This result suggests that the occurrence of CpG island genes increases with the global G+C% level of chromosomal bands.
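The abstract does not state the statistical criteria used to call a CpG island. As a hedged illustration only, the sketch below computes two quantities that are commonly used for this purpose: the G+C fraction and the ratio of observed to expected CpG dinucleotides. The function names and the default thresholds (`min_len`, `min_gc`, `min_ratio`) are assumptions for this example, not values taken from the paper.

```python
def cpg_stats(seq: str):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G occurred independently: (#C * #G) / length
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Illustrative test with assumed thresholds (not the paper's criteria)."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_ratio
```

In practice such a test would be applied to sliding windows along each genomic sequence rather than to the whole sequence at once.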
-
Mikita Suyama, Takaaki Nishioka, Jun'ichi Oda
1993 Volume 4 Pages 245-254
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
To extract ligand-related motifs from the sequences of enzymes, we have constructed the Ligand Chemical Database for Enzyme Reaction, which links a chemical compound to amino acid sequences. Among the 1,966 ligands registered, 519 chemical compounds were related to 1,488 ligand-linked sequences. Sequence fragments 10 residues long, commonly found among the ligand-linked sequences for each chemical compound, were defined as ligand-related motifs. Motifs extracted for pyridoxal phosphate were tested against the crystal structures of aspartate aminotransferase complexed with pyridoxal phosphate. Twenty-four of the 93 motifs extracted from the enzyme include residues that make chemical interactions with the bound pyridoxal phosphate. One of the motifs, K-x-x-G-L-x-x-x-R-V, actually participates in the recognition of pyridoxal phosphate in another enzyme, 1-aminocyclopropane-1-carboxylate synthase. The present approach provides ligand-related motifs and shows great potential for characterizing the unknown genes sequenced by the genome project.
-
Ikuo Uchiyama, Atsushi Ogiwara, Zenmei Ohkubo, Minoru Kanehisa
1993 Volume 4 Pages 255-263
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
A method is described for extracting signature pentapeptides that are conserved and exclusively found in a group of homologous proteins. The BLAST algorithm is used to count the frequency of occurrences of pentapeptide patterns allowing limited substitutions, as well as to perform homology search. For those pentapeptides that appear in a given sequence we examine the frequency of occurrences of these pentapeptides and related ones in homologous sequences which are ordered according to the homology score. By comparing against the frequency in the entire database, we can extract uniquely conserved pentapeptides and at the same time perform a grouping of homologous sequences. Thus, our procedure can automatically identify, if any, pentapeptides that are strongly tied with the group. Possibility of using our pentapeptide word dictionary to infer protein function is discussed.
-
Keiichi Nagai, Tetsuo Nishikawa, Hideki Kambara, Toshihisa Takagi
1993Volume 4 Pages
264-269
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
Local alignments reported by conventional database search programs for protein and DNA sequences, such as those based on the Smith-Waterman algorithm, FASTA, and BLAST, can contain subregions having high similarity, low similarity, and even no similarity. We propose a simple method for finding significant local sequence similarity regions, in which the alignment results of two sequences are graphed as integrated scores calculated along the aligned sequences using the match, mismatch, and gap penalty scores. This method has been used to find local similarity subregions in alignment results obtained by BLAST or the Smith-Waterman algorithm. Potential applications for finding domain structures and characteristic sequence patterns are also shown.
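A minimal sketch of an integrated (running) score along an alignment is given below. The scoring values, the use of `-` as the gap symbol, and the helper that picks the steepest rising stretch of the curve are assumptions made for illustration; the abstract does not specify them.

```python
def integrated_scores(aligned_a: str, aligned_b: str,
                      match: int = 1, mismatch: int = -1, gap: int = -2) -> list[int]:
    """Running sum of per-column scores along a pairwise alignment ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    profile = []
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
        profile.append(total)
    return profile

def best_rising_segment(profile: list[int]) -> tuple[int, int]:
    """Half-open column range (start, end) with the largest gain in integrated score."""
    best_gain, best = 0, (0, 0)
    low, low_index = 0, 0  # lowest running score seen so far and the column after it
    for i, value in enumerate(profile, start=1):
        if value - low > best_gain:
            best_gain, best = value - low, (low_index, i)
        if value < low:
            low, low_index = value, i
    return best

# Made-up toy alignment for illustration only.
profile = integrated_scores("GATTACAGGG", "GATTTT-GGG")
segment = best_rising_segment(profile)
```

Plotting `profile` against the column index gives the kind of graph described above: rising stretches correspond to well-conserved subregions and falling stretches to poorly conserved ones.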
-
Hiroyuki Ogata, Yutaka Akiyama, Minoru Kanehisa
1993Volume 4 Pages
270-274
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
We are developing a computational method for automatically organizing collections of structural knowledge of RNA into a three-dimensional (3-D) form. The goal of our method for modeling RNA structure is to find, as far as possible, conformations of RNA that satisfy the constraints from experiments and sequence analysis and, at the same time, whose local conformations are close to some representative conformations. For efficient conformational search, we used a genetic algorithm as a trial. We applied our method to modeling a single-stranded region of an RNA in order to estimate the efficiency of our method.
-
Wataru Fujibuchi, Minoru Kanehisa
1993Volume 4 Pages
275-282
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
We constructed a dictionary of sequence motifs for transcription regulation with a heuristic method from a set of DNA sequences upstream of the transcription initiation site. The method first identifies weakly conserved blocks within a given region relative to the initiation site by the search and merge of six-base patterns. Then the most conserved portions of these blocks are extracted by calculating the information content after similar blocks are multiply aligned. The procedure was applied to primate promoters and the result was evaluated with the Transcription Factor Database (TFD). The result will give us new biological insights into the DNA signals.
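The information-content step can be illustrated with the short sketch below. It assumes an ungapped alignment of equal-length DNA blocks, a uniform background over four bases, and no small-sample correction; the function names and the toy block are hypothetical, and the paper's exact formula may differ.

```python
import math
from collections import Counter

def column_information(column: str, alphabet_size: int = 4) -> float:
    """Information content (bits) of one column: log2(alphabet size) - Shannon entropy."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy

def most_conserved_window(block: list[str], width: int) -> tuple[int, float]:
    """Start column and total information of the highest-information window."""
    columns = ["".join(col) for col in zip(*block)]
    info = [column_information(col) for col in columns]
    starts = range(len(info) - width + 1)
    best_start = max(starts, key=lambda s: sum(info[s:s + width]))
    return best_start, sum(info[best_start:best_start + width])

# Made-up aligned block for illustration only.
block = ["ACTATAAG", "GGTATAAC", "CTTATAAT", "TATATAAA"]
start, bits = most_conserved_window(block, width=5)
```

A fully conserved DNA column contributes 2 bits and a column with all four bases equally represented contributes 0 bits, so the window with the largest total marks the most conserved portion of the block.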
-
Takashi Yokomori, Satoshi Kobayashi
1993Volume 4 Pages
283-292
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
We propose a simple string similarity measure and apply it to the problem of DNA sequence analysis, more specifically, to the problem of analysing molecular evolution. This measure is based on a “local feature” that was motivated by a theoretical characterization of DNA splicing sequences.
We demonstrate the usefulness of the proposed measure by presenting an experimental result concerning evolutionary molecular analysis. This sheds new light on other types of sequence analysis such as protein classification and motif identification.
-
Yukio Kobayashi, Nobuhiko Saitô
1993Volume 4 Pages
293-299
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
A statistical mechanical method is proposed to predict the secondary structures of globular proteins. Three-state prediction, which simultaneously provides the probabilities of α-helix, β-strand and coil, is performed with a recurrence method. The probabilities of the ith residue being in α-helix or in β-strand are calculated with statistical weights for amino acid pairs in α-helix or in β-strand. We determine the statistical weights to yield the correct predictions for proteins with known structures instead of directly calculating the interaction energies between residues. To do this, we introduce an objective function and estimate the weights so as to minimize this function by referring to the proteins used for optimization. This method yields a prediction accuracy of 67% for the 13 proteins used for accuracy estimation. This value does not exceed the best values obtained by methods based on homology. However, we hope to improve the accuracy, since, in contrast to other methods, we can analyze the reasons for poor accuracy.
-
Khawaja Sirajuddin, Tomomasa Nagashima, Koichi Ono
1993Volume 4 Pages
300-305
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
The consensus sequence for the 5'-splice site has been proposed as CAG/GTGAGT, but actual splice site sequences differ from it to some extent. In this paper we analyze various mammalian globin genes using decision tree induction. We have found that the prediction rate for discriminating unknown sequences increases as the proportion of false splice site sequences with the dinucleotide GT at the 4th and 5th positions increases in the learning data set.
-
Hiroshi FURUTANI
1993Volume 4 Pages
306-314
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
We have developed neural networks with a back-propagation learning algorithm for the prediction of splice sites in mRNA precursors. We used these networks to predict the effects of mutations on the splicing of protein-coding genes. We applied the networks to β-thalassemia genes (mutant β-globin genes), a hemophilia B gene (mutant blood coagulation factor IX gene) and a mutant c-Ha-ras oncogene. We demonstrate that these networks predict abnormal splicing patterns in these genes that are consistent with experiments.
-
Motokazu KAMIMURA, Yoshimasa TAKAHASHI
1993Volume 4 Pages
315-324
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
In this paper we aim to examine in detail the data distribution within each conformational pattern class and to identify some local common structural features among the fragments in a particular cluster (or subcluster). Backbone conformational pattern clustering was carried out for the three-dimensional peptide fragments where the Φ-ψ conformational pattern of the TA (target amino acid) belongs to class A (α-helix dominant class) or B (β-sheet dominant class) as defined in our previous work. The analysis for the fragments of class A suggested that these fragments involve four representative local backbone conformational patterns, not only for typical α-helix fragments but also for fragments closely related to type I turn or the starting moieties of α-helices. On the other hand, the analysis for class B fragments showed that these have much more diversity than class A fragments with respect to their local backbone structures. The details of the methods and results of the analyses are discussed here.
-
Koji Ohnishi
1993Volume 4 Pages
325-331
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
The Bacillus subtilis trrnD operon has a structure of 5'[16S rRNA-23S rRNA-5S rRNA-(tRNA)16] 3'. The tRNA cluster in this operon includes 16 tandemly repeated tRNA genes (denoted by “poly-tRNA structure”), in which the ordering of the amino acid (aa) specificities of these tRNAs is “NSEVMD FT YWHQ GCLL”. An ancient “trrnD-peptide” possessing this aa sequence was hypothesized, and protein sequence regions similar to the trrnD-peptide were searched for in the PIR Protein Sequence Database. The aa's 139-156 in the E. coli Gly-tRNA synthetase (GlyRS) α subunit were found to be most similar to this peptide.
Further analysis revealed that the GlyRS gene encoding GlyRS α and the a gene of Synechococcus 6301 encoding the F0-ATPase a subunit are both true homologues of the BSU trrnD poly-tRNA region. These findings strongly support the recently proposed “poly-tRNA theory” (Ohnishi, 1993) on the origin of mRNA and genetic codes. Thus it has now been concluded that the trrnD poly-tRNA region is a relic of a most primitive RNA molecule capable of synthesizing a trrnD-peptide-like primitive peptide in early life. The most paradoxical problem on the origin of genetic codes seems to have been basically solved from the aspect of the poly-tRNA theory.
-
Tsukasa Sakai
1993Volume 4 Pages
332-338
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
Substitution odds r(i, j) for amino acid residues can be transformed to similarities s(i, j) by normalizing with the geometric average of the conservative odds r(i, i) and r(j, j). Similarities thus derived for all twenty natural amino acid residues in proteins conform to the range 0 to 1 and have complementary dissimilarities. An empirical test has confirmed that the dissimilarity satisfies all metric requirements for a distance between residues. Relative certainty, as an identity index calculated from both similarity and dissimilarity, can be used as a matching score, consistent with both of them, in protein sequence comparison.
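The normalization described above can be written as s(i, j) = r(i, j) / sqrt(r(i, i) r(j, j)). The sketch below applies it to a made-up three-residue odds table and checks the triangle inequality empirically. Taking the complementary dissimilarity as 1 - s(i, j) is an assumption, as are all names and numbers; the sketch does not reproduce the paper's data or its relative certainty index.

```python
import math
from itertools import permutations

def make_symmetric(odds: dict) -> dict:
    """Expand {(i, j): value} so that (j, i) is also present."""
    full = dict(odds)
    for (i, j), value in odds.items():
        full[(j, i)] = value
    return full

def similarity(r: dict, i: str, j: str) -> float:
    """s(i, j) = r(i, j) / sqrt(r(i, i) * r(j, j))."""
    return r[(i, j)] / math.sqrt(r[(i, i)] * r[(j, j)])

def dissimilarity(r: dict, i: str, j: str) -> float:
    # Assumption: the complementary dissimilarity is 1 - s(i, j).
    return 1.0 - similarity(r, i, j)

def satisfies_triangle_inequality(r: dict, residues: list[str], tol: float = 1e-12) -> bool:
    """Empirically check d(i, k) <= d(i, j) + d(j, k) for all distinct triples."""
    return all(
        dissimilarity(r, i, k) <= dissimilarity(r, i, j) + dissimilarity(r, j, k) + tol
        for i, j, k in permutations(residues, 3)
    )

# Made-up odds for three placeholder residues, for illustration only.
odds = make_symmetric({
    ("A", "A"): 4.0, ("B", "B"): 9.0, ("C", "C"): 16.0,
    ("A", "B"): 3.0, ("A", "C"): 2.0, ("B", "C"): 6.0,
})
is_metric_like = satisfies_triangle_inequality(odds, ["A", "B", "C"])
```

By construction s(i, i) = 1, so the dissimilarity of a residue to itself is 0; whether off-diagonal values stay within 0 to 1 depends on the odds table used.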
-
Takashi Ishikawa, Takao Terano
1993Volume 4 Pages
339-346
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
This paper describes a computational method to predict a protein structure by analogical reasoning from known protein structures. The proposed method, Analogy by Abstraction, uses heuristics to reduce the search complexity of finding appropriate transformations that create a structure of the unknown protein from a known protein structure. We implement an algorithm of the method in the Prolog programming language, and exemplify its effectiveness by re-predicting the structure of ‘Zinc fingers’ from its amino-acid sequence.
-
K. Wada, Y. Wada, S. Tanaka, H. Doi, Y. Nakamura, K. Sugaya, T. Fukaga ...
1993Volume 4 Pages
347-351
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
-
Susumu Goto, Toshihisa Takagi, Norihiro Sakamoto
1993Volume 4 Pages
352-361
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
Recently, many genome databanks have been developed as a result of growing genome project activities. Each of them consists of a large amount and variety of data, and they were developed independently. Therefore, their integration and efficient management of the data are required. It is also necessary to develop a framework for easily building and testing biological hypotheses with the integrated database. We developed a deductive object-oriented database for searching an integrated database, acquiring new knowledge from it, and storing the knowledge in the database. It consists of an object-oriented database that integrates conventional genome databases such as GenBank, and a deductive language interface for genome analysis. In this paper, we present an overview of the system and examples of analyses using the database.
-
Takahiko SUZUKI, Susumu NAKASHIMA, Toshihisa TAKAGI, Satoru KUHARA, Mi ...
1993Volume 4 Pages
362-369
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
An integrated database system “HyperGenome” for genome maps and DNA sequences was developed. The system can handle two different types of data, each of which has a unique complex structure. A graphical user interface (GUI) enables ready retrieval of information obtained from genome mapping data and data on DNA sequences. Data on mapping are derived from the Genome Data Base (GDB) and sequence data are from GenBank.
The following information was added to the system. 1. Mendelian Inheritance in Man (MIM) entries can be linked to a locus in our system. 2. Amino acid sequences from Protein Identification Resources (PIR) can be displayed in conjunction with the nucleotide sequence.
-
Shinsei Minoshima, Nobuyoshi Shimizu
1993Volume 4 Pages
370-375
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
We developed a new database system, Locus-in, to enter raw mapping data and construct integrated maps. This system works on Sun workstations with the X Window System and the Motif graphics library. The system supports a full graphical user interface. It has the following unique functions: (1) to zoom in on a specific region of interest; (2) to generate a number of sub-windows associated with a specific region for entry and display of data (each sub-window accepts either ordered or unordered and either raw or published data); and (3) to create new breakpoints. The current version of Locus-in will be demonstrated at the workshop.
-
Akira Suyama, Masami Hagiya, Takashi Ito, Asao Fujiyama, Akira Ohyama, ...
1993Volume 4 Pages
376-384
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
ContigMaker is a software tool to aid contig map construction. It is a Motif application running on UNIX workstations with the X Window System. ContigMaker is composed of five major components: map data manager, map analyzer, map viewer, map aid, and project manager. Contig-mapping data obtained by experiments are stored in a database of the map data manager. The stored data are then subjected to analysis by the map analyzer to generate contigs. ContigMaker supports the two strategies for contig construction: the STS (sequence-tagged sites) strategy and the MOF (mapping by oligonucleotide fingerprinting) strategy. The generated contigs are assembled into a contig map according to positions of landmarks falling on the contigs. ContigMaker allows a user to extract landmark information from a public genome database such as the GDB. The contig maps constructed are graphically drawn by the map viewer. The map aid provides miscellaneous small useful tools to finish a contig-mapping task. A repeated task ContigMaker performs can be automated by a macro created by the project manager. The macro will save time and effort for contig map construction.
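The abstract does not describe how the map analyzer builds contigs. As a rough illustration of the STS idea only, the sketch below groups clones into contigs whenever they share at least one STS marker, using a union-find structure. This is not ContigMaker's algorithm, it does not order clones within a contig, and the clone names, marker names, and function names are all hypothetical.

```python
from collections import defaultdict

def build_contigs(clone_sts: dict[str, set[str]]) -> list[list[str]]:
    """Group clones that are linked, directly or transitively, by shared STS markers."""
    parent = {clone: clone for clone in clone_sts}

    def find(clone: str) -> str:
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]  # path halving
            clone = parent[clone]
        return clone

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Assumption: two clones containing the same STS overlap on the genome.
    first_clone_with = {}
    for clone, markers in clone_sts.items():
        for sts in markers:
            if sts in first_clone_with:
                union(first_clone_with[sts], clone)
            else:
                first_clone_with[sts] = clone

    groups = defaultdict(list)
    for clone in clone_sts:
        groups[find(clone)].append(clone)
    return sorted(sorted(members) for members in groups.values())

# Made-up clone-to-STS hits for illustration only.
hits = {
    "c1": {"s1", "s2"},
    "c2": {"s2", "s3"},
    "c3": {"s4"},
    "c4": {"s3"},
    "c5": {"s4", "s5"},
}
contigs = build_contigs(hits)
```

In this toy data, clones c1, c2 and c4 are chained through markers s2 and s3, while c3 and c5 are joined by s4, giving two contigs.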
-
Toshiyuki Niiyama, Takeo Tokimori, Atsushi Ogiwara, Ikuo Uchiyama, Ken ...
1993Volume 4 Pages
385-393
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
GNOME is a sequence data management tool through which users can efficiently access e-mail servers for various molecular biological analyses on the Internet, including GenomeNet. It supports BLAST/FASTA servers for homology searches, PROSITE/MotifDic servers for motif searches, and bget/bfind servers for DB entry retrievals. One of its most notable features is that it can not only send e-mails for queries but also receive and manage e-mails for replies. In addition, its interface is very user-friendly. Therefore, it should considerably enhance efficient and thorough analyses of newly determined sequence data in both individual biological research and large-scale genome projects.
-
Yutaka Akiyama, Hirotada Mori, Satoru Kuhara, Naoki Ogasawara, Nobuyuk ...
1993Volume 4 Pages
394-401
Published: 1993
Released on J-STAGE: July 11, 2011
JOURNAL
FREE ACCESS
Genomatica is an integrated software tool designed to help with the systematic management of a large number of DNA sequence fragments obtained through a genome sequencing project.
Its graphical user interface also allows users to look, at any magnification, into any position of the specified chromosome and to browse various kinds of collected information together (including: DNA sequence itself, related gene descriptions, bibliographic references, corresponding GenBank entries, confirmed or putative coding regions, results from homology analysis for the expected protein, RNA genes, clone information, enzyme restriction maps, comments from administrator, private memorandums by user).
We are planning to use Genomatica in E. coli (local data compilation mainly managed by Mori), B. subtilis (by Ogasawara), and S. cerevisiae (by Murakami) genome sequencing projects.
The Genomatica project was started in 1992 as one of the advanced genome database projects sponsored by Human Genome Center, University of Tokyo. In June 1993, ver. 2.0, which was fully re-designed with the NCBI Vibrant library, was released. A further augmented version, Genomatica 2.1 (with several sequence analysis functions and network communication modules), will be released in Nov. 1993 and will be distributed through anonymous ftp services. The Genomatica system is currently available for the X11 window system on Unix workstations, but Macintosh and IBM-PC versions will also be announced soon.