Genome Informatics
Online ISSN : 2185-842X
Print ISSN : 0919-9454
ISSN-L : 0919-9454
Volume 17, Issue 2
Displaying 1-30 of 30 articles from this issue
  • Kazuhito Shida
    2006 Volume 17 Issue 2 Pages 3-13
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    The difficulties of computational discovery of transcription factor binding sites (TFBS) are well represented by the (l, d) planted motif challenge problems. Large-d problems are difficult, particularly for profile-based motif discovery algorithms, whose local search in profile space is apparently incompatible with subtle motifs and large mutational distances between motif occurrences.
    Herein, an improved profile-based method called GibbsDST is described and tested on the (15, 4), (12, 3), and (18, 6) challenge problems. For the first time for a profile-based method, its performance on motif challenge problems is comparable to that of Random Projection. It is noteworthy that GibbsDST outperforms a pattern-based algorithm, WINNOWER, in some cases. The effectiveness of GibbsDST on a biological dataset and its possible extension to more realistic evolutionary models are also described.
    Download PDF (1352K)
  • Vipin Narang, Wing-Kin Sung, Ankush Mittal
    2006 Volume 17 Issue 2 Pages 14-24
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Drosophila melanogaster is one of the most important organisms for studying the genetics of development. The precise regulation of genes during early development is enacted through the control of transcription. The control circuitry is hardwired in the genome as clusters of multiple transcription factor binding sites (TFBS) known as cis-regulatory modules (CRMs). A number of TFBS and CRMs have been experimentally annotated in the Drosophila genome. Currently about 661 CRM sequences are known, of which 155 have been annotated with 778 TFBS. This work attempts computational annotation of TFBS in the remaining 506 uncharacterized Drosophila CRMs. The difficulty of this task lies in the fact that experimental data is insufficient for constructing reliable positional weight matrices (PWM) to predict the TFBS. Thus a novel feature extraction and classification method for TFBS detection has been implemented in this work. The method achieves both high sensitivity and low false positive rate in cross-validation studies. As a result of this work, a new database has been compiled which aggregates all the CRM and TFBS annotation information for Drosophila available to date, and appends new TFBS annotations.
    Download PDF (3495K)
  • Tetsuji Kuboyama, Kouichi Hirata, Kiyoko F. Aoki(Kinoshita), Hisashi K ...
    2006 Volume 17 Issue 2 Pages 25-34
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We propose a novel general-purpose tree kernel and apply it to glycan structure analysis. Our kernel measures the similarity between two labeled trees by counting the number of common q-length substrings (tree q-grams) embedded in the trees for all possible lengths q. We apply our tree kernel with a support vector machine (SVM) to classification and specific feature extraction from glycan structure data. Our results show that our kernel outperforms the layered trimer kernel of Hizukuri et al. [9], which is well tailored to glycan data, even though we do not tune our kernel to glycan-specific properties. In addition, we extract specific features from various types of glycan data using our trained SVM. The results show that our kernel is more flexible and capable of finding a wider variety of substructures in glycan data.
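The q-gram counting idea can be illustrated with a toy sketch that restricts q-grams to vertical root-to-leaf label paths (the paper's tree q-grams are more general embedded substructures); the tree encoding and function names here are hypothetical:

```python
from collections import Counter

def tree_qgrams(tree, q):
    """Count label q-grams along root-to-leaf paths.

    `tree` is a (label, [children]) tuple -- a hypothetical toy encoding,
    not the paper's glycan data structure."""
    grams = Counter()

    def walk(node, path):
        label, children = node
        path = (path + [label])[-q:]          # keep the last q labels only
        if len(path) == q:
            grams[tuple(path)] += 1           # q-gram ending at this node
        for child in children:
            walk(child, path)
    walk(tree, [])
    return grams

def qgram_kernel(t1, t2, max_q=3):
    """K(t1, t2) = sum over q of the dot product of common q-gram counts."""
    total = 0
    for q in range(1, max_q + 1):
        g1, g2 = tree_qgrams(t1, q), tree_qgrams(t2, q)
        total += sum(c * g2[g] for g, c in g1.items())
    return total
```

Such a kernel is positive semi-definite because it is an inner product of count vectors, which is what lets an SVM use it directly.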
    Download PDF (1166K)
  • Thanh Phuong Nguyen, Tu Bao Ho
    2006 Volume 17 Issue 2 Pages 35-45
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    The objective of this paper is twofold: to present a method of predicting signaling domain-domain interactions (signaling DDI) using inductive logic programming (ILP), and to present a method of discovering signal transduction networks (STN) using signaling DDI.
    Research on computational methods for discovering signal transduction networks (STN) has received much attention because of the importance of STN in transmitting inter- and intra-cellular signals. Unlike previous STN work at the protein/gene level, our method works at the protein domain level, on signaling domain interactions, which allows more reliable and stable STN to be discovered. We can largely reconstruct the STN of the yeast MAPK pathways from the inferred signaling domain interactions, with coverage of 85%. For the prediction of signaling DDI, we constructed a database of more than twenty-four thousand ground facts from five popular genomic and proteomic databases. We also showed the advantage of ILP in signaling DDI prediction from the constructed database, with high sensitivity (88%) and accuracy (83%). Studying the yeast MAPK STN, we found some new signaling domain interactions that do not exist in the well-known InterDom database. Supplementary materials are available from http://www.jaist.ac.jp/s0560205/STP_DDI/.
    Download PDF (1356K)
  • José C. Clemente, Kenji Satou, Gabriel Valiente
    2006 Volume 17 Issue 2 Pages 46-56
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Using a metabolic pathway alignment method we developed, we studied highly conserved reactions in different groups of organisms and found that biological functions vital to each group are effectively expressed in the set of conserved reactions. We also studied the metabolic alignment of different strains of three bacteria and found several non-conserved reactions. We suggest that these reactions could be either misannotations or reactions with a relevant but as yet unspecified biological role, and should therefore be investigated further.
    Download PDF (1298K)
  • Pablo Minguez, Fátima Al-Shahrour, Joaquín Dopazo
    2006 Volume 17 Issue 2 Pages 57-66
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    The interpretation of microarray experiments is commonly addressed by means of a two-step approach: the relevant genes are first selected solely on the basis of their experimental values (ignoring their coordinated behavior), and in a second step their functional properties are studied to hypothesize about the biological roles they fulfill in the cell. Recently, different methods (e.g. GSEA or FatiScan) have been proposed to study the coordinated behavior of blocks of functionally related genes. These methods study the distribution of functional information across lists of genes ranked according to their experimental values in a static situation, such as the comparison between two classes (e.g. healthy controls versus diseased cases). Nevertheless, there is no equivalent way of studying a dynamic situation from a functional point of view.
    We present a method for the functional analysis of microarray series in which the experiments display autocorrelation between successive points (e.g. time series, dose-response experiments, etc.). The method recovers the dynamics of the molecular roles fulfilled by the genes along the series, providing a novel approach to the functional interpretation of such experiments. It finds blocks of functionally related genes that are significantly and coordinately overexpressed at different points of the series. The method draws inspiration from systems biology in that the analysis does not focus on individual properties of genes but on collectively behaving blocks of functionally related genes.
    The FatiScan algorithm used in the method proposed is available at: http://fatiscan.bioinfo.cipf.es, or within the Babelomics suite: http://www.babelomics.org. Additional material is available at: http://bioinfo.cipf.es/data/plasmodium
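The partition-and-test idea behind this family of methods can be sketched as follows: cut a ranked gene list at successive thresholds and test each top block for enrichment of a functional label with a hypergeometric tail. This is only a crude illustration; FatiScan's actual statistics and partitioning scheme differ, and the function names are invented:

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): chance of seeing at least
    k annotated genes among the top n, out of N genes of which K are
    annotated."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

def partition_enrichment(ranked_genes, annotated, n_partitions=5):
    """Cut the ranked list at successive thresholds and test each top block
    for enrichment of `annotated` genes. Returns (block size, hits, p)."""
    N = len(ranked_genes)
    K = sum(g in annotated for g in ranked_genes)
    results = []
    for p in range(1, n_partitions):
        n = N * p // n_partitions
        k = sum(g in annotated for g in ranked_genes[:n])
        results.append((n, k, hypergeom_tail(k, K, n, N)))
    return results
```

A functional block concentrated at the top of the ranking yields a small tail probability at some cut, even if no single gene passes a per-gene significance filter.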
    Download PDF (3620K)
  • Paul B. Horton, Larisa Kiseleva, Wataru Fujibuchi
    2006 Volume 17 Issue 2 Pages 67-76
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    In this paper we present a fast algorithm and implementation for computing the Spearman rank correlation (SRC) between a query expression profile and each expression profile in a database of profiles. The algorithm is linear in the size of the profile database with a very small constant factor. It is designed to efficiently handle multiple profile platforms and missing values. We show that our specialized algorithm and C++ implementation can achieve an approximately 100-fold speed-up over a reasonable baseline implementation using Perl hash tables.
    RaPiDS is designed for general similarity search rather than classification, but to assess the usefulness of SRC as a similarity measure we also investigate the program as a classifier of normal human cell types based on gene expression. Specifically, we use the k-nearest-neighbor classifier with a t statistic derived from SRC as the similarity measure for profile pairs. We estimate accuracy using a jackknife test on microarray data with manually checked cell type annotation. Preliminary results suggest the measure is useful (64% accuracy on 1,685 profiles vs. 17.5% for the majority-class classifier) for profiles measured under similar conditions (same laboratory and chip platform), but requires improvement when comparing profiles from different experimental series.
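The core computation, Spearman correlation of a query against every profile, reduces to ranking followed by a Pearson correlation on the ranks. A plain-Python sketch of that reduction (RaPiDS's missing-value handling, multi-platform support, and t statistic are omitted, and these function names are illustrative):

```python
def average_ranks(values):
    """1-based ranks, with ties given the average rank, as Spearman requires."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                            # extend the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def search(query, database):
    """One linear scan of the profile database -- the overall shape of the
    search; ranking the query once and reusing it is the obvious speed-up."""
    return [spearman(query, profile) for profile in database]
```

Because the scan touches each profile once, total work is linear in the database size, which is the property the abstract highlights.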
    Download PDF (1030K)
  • Xudong Dai, Yudong D. He, Hongyue Dai, Pek Y. Lum, Christopher J. Rob ...
    2006 Volume 17 Issue 2 Pages 77-88
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Toxicity is a major cause of failure in drug development. A toxicogenomic approach may provide a powerful tool for better assessing the potential toxicity of drug candidates. Several approaches have been reported for predicting hepatotoxicity based on reference compounds with well-studied toxicity mechanisms. We developed a new approach for assessing compound-induced liver injury without prior knowledge of a compound's mechanism of toxicity. Using samples from rodents treated with 49 known liver toxins and 10 compounds without known liver toxicity, we derived a hepatotoxicity score as a single quantitative measurement for assessing the degree of induced liver damage. Combining the sensitivity of the hepatotoxicity score and the power of a machine learning algorithm, we then built a model to predict compound-induced liver injury based on 212 expression profiles. As estimated in an independent data set of 54 expression profiles, the built model predicted compound-induced liver damage with 90.9% sensitivity and 88.4% specificity. Our findings illustrate the feasibility of ab initio estimation of liver toxicity based on transcriptional profiles.
    Download PDF (6233K)
  • Kang Ning, Hon Wai Leong
    2006 Volume 17 Issue 2 Pages 89-99
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    As the scale of microarray experiments increases, a single oligonucleotide array is no longer large enough, and the use of multiple oligo arrays for one experiment becomes more important. The design and synthesis of multiple arrays to minimize the overall synthesis cost is an interesting and important problem. We formulate the multiple array synthesis problem (MASP), which deals with the distribution of the probes (or oligos) to different arrays and then the deposition of the probes onto each array. We propose a cost function to capture the synthesis cost and a performance ratio for analyzing the quality of multiple arrays produced by different algorithms. We propose a Distribution and Deposition Algorithm (DDA) for solving the MASP. In this algorithm, the probes are first distributed onto multiple arrays according to characteristics such as their GC content; the probes on each array are then deposited using a good deposition algorithm. Two other algorithms were also proposed and used for comparison. Experiments show that our algorithm efficiently produces short synthesis sequences for multiple arrays.
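A minimal sketch of the two phases, assuming the standard periodic-flow synthesis cost model (the GC-based grouping and the fixed ACGT cycle are illustrative assumptions; the paper's cost function and deposition algorithm are more refined):

```python
def deposition_steps(probes, cycle="ACGT"):
    """Synthesis cycles needed to deposit all probes on one array when
    nucleotides flow in a fixed periodic order (the standard synchronous
    synthesis cost model, assumed here for illustration)."""
    pos = [0] * len(probes)
    steps = 0
    while any(p < len(pr) for p, pr in zip(pos, probes)):
        base = cycle[steps % len(cycle)]
        for k, pr in enumerate(probes):
            if pos[k] < len(pr) and pr[pos[k]] == base:
                pos[k] += 1                   # this probe grows by one base
        steps += 1
    return steps

def distribute_by_gc(probes, n_arrays):
    """Distribution phase: sort probes by GC fraction and cut the sorted
    list into contiguous blocks, so each array holds probes of similar
    composition."""
    ranked = sorted(probes,
                    key=lambda p: (p.count("G") + p.count("C")) / len(p))
    size = -(-len(ranked) // n_arrays)        # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

Grouping compositionally similar probes on the same array lets their bases line up with the same flow cycles, which is why the distribution phase can shorten the deposition phase.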
    Download PDF (1445K)
  • Tomokazu Konishi
    2006 Volume 17 Issue 2 Pages 100-109
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Gene expression microarray data often include problems caused by uneven hybridization and dust contamination. Such defects should be removed prior to analysis to prevent degradation of analytical accuracy and false positive results. This paper presents a parameter-scanning algorithm to detect such defects on the basis of the character of the data distributions. The cell data are thoroughly scanned using a window algorithm, and windows with an index value greater than a threshold are recognized as defects and removed from the array data. The index is computed from the differences between the target and an ideal standard of hybridization, obtained as a trimmed mean across experiments representing the statistical center of differences in each section. The threshold is a screening level designated by the operator, but it has only a limited effect on the outcome of data cancellation. The validity of the algorithm and the effects of data cancellation are tested using GeneChip data obtained from a series of experiments. The algorithm is demonstrated to greatly improve the reproducibility of measurements while removing only a small amount of faultless data.
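The window-scanning idea can be sketched in one dimension: build a trimmed-mean "ideal standard" across replicate arrays, slide a window over the target's deviations from it, and flag windows whose index exceeds an operator threshold. This is a toy 1-D version under assumed parameter choices; the paper scans 2-D chip coordinates and derives its index differently:

```python
import statistics

def trimmed_mean(xs, trim=0.1):
    """Mean after dropping the top and bottom `trim` fraction of values."""
    xs = sorted(xs)
    k = int(len(xs) * trim)
    core = xs[k:len(xs) - k] or xs
    return sum(core) / len(core)

def flag_defects(arrays, target_idx, win=3, threshold=2.0):
    """Slide a window over one array's probe intensities and flag windows
    whose mean deviation from the trimmed-mean standard (built across
    replicate arrays) exceeds `threshold` spread units."""
    n = len(arrays[0])
    standard = [trimmed_mean([a[i] for a in arrays]) for i in range(n)]
    diffs = [arrays[target_idx][i] - standard[i] for i in range(n)]
    center = statistics.median(diffs)
    spread = statistics.stdev(diffs) or 1.0   # guard against zero spread
    flagged = []
    for start in range(n - win + 1):
        index = abs(sum(diffs[start:start + win]) / win - center) / spread
        if index > threshold:
            flagged.append(start)             # window start of a defect
    return flagged
```

The trimmed mean makes the standard robust to the very defects being hunted, which is the reason a plain mean would not work here.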
    Download PDF (1258K)
  • Wen-Juan Hou, Kevin Hsin-Yih Lin, Hsin-Hsi Chen
    2006 Volume 17 Issue 2 Pages 110-120
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Gene Ontology (GO) was developed to provide standard vocabularies for gene products across different databases. The process of annotating genes with GO terms requires curators to read through lengthy articles, so methods for speeding up or automating the annotation process are of great importance. We propose a GO annotation approach that uses full-text biomedical documents to direct more relevant papers to curators. The system explores word density and gravitation relationships between genes and GO terms. Different density and gravitation models are built, and several evaluation criteria are employed to assess the effects of the proposed methods.
    Download PDF (1261K)
  • The Construction and Use of Protein Description Sentences
    Martin Krallinger, Rainer Malik, Alfonso Valencia
    2006 Volume 17 Issue 2 Pages 121-130
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Existing biological knowledge stored as structured database records has been extracted manually by database curators analyzing the scientific literature. Most of this information was derived from sentences describing biologically relevant aspects of genes and gene products. We introduce the Protein Description Sentence (Prodisen) corpus, a resource for the automatic identification and construction of text-based protein and gene description records using information extraction and text classification techniques. Basic guidelines and criteria for constructing a text corpus of functional descriptions of genes and proteins are proposed, and the steps used for corpus construction and its features are presented. Moreover, some potential applications of the Prodisen corpus for biomedical text mining are explored and the results obtained are presented.
    Download PDF (1288K)
  • Gabriel Valiente
    2006 Volume 17 Issue 2 Pages 131-140
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    The comparative analysis of phylogenies obtained using different phylogenetic methods or different gene sequences for a given set of species is usually done by computing some quantitative measure of similarity between the phylogenetic trees. Such a quantitative approach provides little insight into the actual similarities and differences between the alternative phylogenies.
    In this paper, we present a method for the qualitative assessment of a phylogenetic tree against a reference taxonomy, based on highlighting their common clusters. Our algorithms build a reference taxonomy for the taxa present in a given phylogenetic tree and produce a dendrogram for the input tree, with branches in those clusters common to the reference taxonomy highlighted. Our implementation of the algorithms produces publication-quality graphics.
    For unrooted phylogenies, the method produces a radial cladogram for the input phylogenetic tree, again with branches in clusters common to the reference taxonomy highlighted.
    Download PDF (1581K)
  • Le Sy Vinh, Andrés Varón, Ward C. Wheeler
    2006 Volume 17 Issue 2 Pages 141-151
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    The increasing number of available genomes poses new optimization problems in genome comparison. A genome can be considered a sequence of characters (loci) which are genes or segments of nucleotides. Genomes are subject to both nucleotide transformation and character-order rearrangement processes. In this context, we define the problem of pairwise alignment with rearrangements (PAR) between two genomes. PAR generalizes ordinary pairwise alignment by allowing the rearrangement of character order. The objective is to find the optimal PAR minimizing a total cost composed of three factors: the edit cost between characters, the deletion/insertion cost of characters, and the rearrangement cost between character orders. To this end, we propose simple and effective heuristic methods: character moving and simultaneous character swapping. The efficiency of the methods is tested on Metazoa mitochondrial genomes. Experiments show that pairwise alignments with rearrangements outperform ordinary pairwise alignments without rearrangements. The best proposed method, simultaneous character swapping, is implemented as an essential subroutine in our software POY version 4.0 to reconstruct genome-based phylogenies.
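The cost trade-off can be sketched with a toy "character moving" heuristic: relocate one character whenever the drop in alignment cost exceeds the rearrangement penalty. The unit costs and greedy loop below are illustrative assumptions, not the paper's algorithm:

```python
def edit_cost(a, b, indel=1, sub=1):
    """Standard alignment (edit distance) cost by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + indel,                 # delete ca
                           cur[-1] + indel,                 # insert cb
                           prev[j - 1] + (0 if ca == cb else sub)))
        prev = cur
    return prev[-1]

def par_cost(a, b, rearr=1):
    """Greedy character moving: relocate one character of `a` whenever the
    drop in alignment cost beats the rearrangement penalty `rearr`; return
    the final alignment cost plus the total rearrangement cost."""
    a, moves = list(a), 0
    improved = True
    while improved:
        improved = False
        base = edit_cost(a, b)
        for i in range(len(a)):
            for j in range(len(a)):
                if i == j:
                    continue
                trial = a[:i] + a[i + 1:]     # remove character i ...
                trial.insert(j, a[i])         # ... and reinsert at j
                if edit_cost(trial, b) + rearr < base:
                    a, moves, improved = trial, moves + 1, True
                    base = edit_cost(a, b)
    return edit_cost(a, b) + moves * rearr
```

With rearrangement disallowed, "ba" vs "ab" costs two edits; one paid move reduces the total to a single rearrangement charge, which is the gain PAR formalizes.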
    Download PDF (1076K)
  • Arjun Bhutkar, Susan Russo, Temple F. Smith, William M. Gelbart
    2006 Volume 17 Issue 2 Pages 152-161
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Genome-scale synteny analysis, the analysis of relative gene-order conservation between species, can provide key insights into evolutionary chromosomal dynamics, rearrangement rates between species, and speciation. With the rapid availability of multiple genomes, efficient solutions are needed to aid comparative syntenic analysis. Current methods rely on homology assessment and multiple-alignment-based solutions to determine homologs of genetic markers between species and to infer syntenic relationships. One of the primary challenges facing multi-genome syntenic analysis is the uncertainty posed by genome assembly data with unsequenced gaps and possible assembly errors; currently, manual intervention is necessary to tune and correct the results of homology assessment and synteny inference. This paper presents a novel automated approach to overcome some of these limitations. It uses a graph-based algorithm to infer sub-graphs denoting synteny chains, choosing the best locations for homologous elements in the presence of paralogs so as to maximize synteny. These synteny chains are expanded by merging sub-graphs based on user-defined thresholds for micro-syntenic scrambling. The approach accommodates contig and scaffold gaps in the assembly to place homologous genetic elements that may fall in unsequenced assembly gaps, lie on the edges of sequenced segments, or sit on small fragments. Furthermore, it provides an automated solution for breakpoint analysis and a comparative study of chromosomal rearrangements between species. The approach was applied to a comparative study of the Drosophila melanogaster and Drosophila pseudoobscura genomes and has been useful in analyzing inter-species syntenic relationships.
    Download PDF (1037K)
  • Rui-Sheng Wang, Ling-Yun Wu, Xiang-Sun Zhang, Luonan Chen
    2006 Volume 17 Issue 2 Pages 162-171
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Single nucleotide polymorphism (SNP) is the most frequent form of human genetic variation and is important for medical diagnosis and tracking disease genes. A haplotype is a sequence of SNPs from a single copy of a chromosome, and haplotype assembly from SNP fragments is based on DNA fragments containing SNPs and the methodology of shotgun sequence assembly. In contrast to conventional combinatorial models, which target specific error types in SNP fragments, in this paper we propose a new statistical model: a Markov chain model for haplotype assembly based on the information in SNP fragments. The main advantage of this model over combinatorial ones is that it requires no prior information on error types in the data. In addition, unlike exact algorithms with exponential-time complexity for most combinatorial models, the proposed model can be solved in polynomial time and thus is efficient for large-scale problems. Experimental results on several data sets illustrate the effectiveness of the new method.
    Download PDF (1084K)
  • David Venet, Hugues Bersini, Hitoshi Iba
    2006 Volume 17 Issue 2 Pages 172-183
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Clustering of samples is a standard procedure in the analysis of gene expression data, for instance to discover cancer subtypes. However, more than one biologically meaningful clustering can exist, depending on the genes chosen. We propose to group genes according to the clustering of the samples they support, which directly determines the different sample clusterings present in the data. Because a clustering is a structure, genes belonging to the same group are functions of the same structure; hence, finding groups of genes that support the same clustering can also be viewed as detecting non-linearly linked genes. MetaClustering was applied successfully to simulated data. It also recovered the known clustering of real cancer data, which was impossible using the complete set of genes. Finally, it clustered cell-cycle genes together, showing its ability to group genes related in a non-linear way.
    Download PDF (5844K)
  • Pritha Mahata
    2006 Volume 17 Issue 2 Pages 184-193
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Epithelial carcinoma of the ovary is one of the most common gynecological malignancies and the fifth most frequent cause of cancer death in women. Currently, advanced epithelial tumors are detected by a blood test for elevated levels of the CA 125 antigen; however, CA 125 is not a good marker for early-stage tumors and may yield false positives. Clearly, a better understanding of the molecular pathogenesis of epithelial ovarian cancer is needed so that new drug targets or biomarkers that facilitate early detection can be identified. This work concentrates on finding genetic markers for three epithelial ovarian tumor types using a simple computational method.
    We give small sets of genetic markers (13 and 26 genes, respectively) that distinguish clear cell and mucinous ovarian cancers from other epithelial ovarian tumors with 100% accuracy. We obtain the genes HNF1-beta (TCF2) and GGT1 as the best markers for clear cell tumors and CEACAM6 (NCA) as the best marker for mucinous ovarian tumors. We employ a feature selection technique based on minimum probability of error for this purpose. We give a ranking of the important genes responsible for these tumors and validate the results using leave-one-out cross-validation.
    Using this method, we also confirm the common notion that WT1 is one of the best genes for separating serous ovarian tumors from other epithelial ovarian tumors.
    Download PDF (1229K)
  • Kang Ning, Hoong Kee Ng, Hon Wai Leong
    2006 Volume 17 Issue 2 Pages 194-205
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Peptide identification by tandem mass spectrometry is an important and challenging problem in proteomics. At present, huge amounts of spectral data are generated by high-throughput mass spectrometers at a very fast pace, but algorithms to analyze these spectra are either too slow, not accurate enough, or give only partial sequences or sequence tags. In this paper, we emphasize the balance between identification completeness and efficiency, with reasonable accuracy, for peptide identification by tandem mass spectrometry. Our method converts spectra to vectors in a high-dimensional space and then uses a self-organizing map (SOM) and a multi-point range query (MPRQ) algorithm as a coarse filter to reduce the number of candidates, achieving efficient and accurate database search. Experiments show that our algorithm is both fast and accurate in peptide identification.
    Download PDF (6042K)
  • Yung-Chiang Chen, Heng-Chu Chen, Jinn-Moon Yang
    2006 Volume 17 Issue 2 Pages 206-215
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    DAPID is a database of domain-annotated protein interactions inferred from three-dimensional (3D) interacting domains of protein complexes in the Protein Data Bank (PDB). The DAPID data model allows users to visualize 3D interacting domains, contact residues, and molecular details of any predicted protein-protein interaction. Our model derives these interactions using a new concept, called “3D-domain interologs”, which is similar to “interologs”. In S. cerevisiae, there is an 18.6% overlap between our predicted protein-protein interactions and those in the DIP database. The mean correlation coefficient of the gene expression profiles of our predicted interactions is significantly higher than that of random pairs in S. cerevisiae. In addition, we find several novel interactions that are consistent with the functions of the proteins. DAPID currently holds 1,008 3D-interacting domain pairs and 101,511 predicted 3D-domain-annotated protein-protein interactions. It is available online at http://gemdock.life.nctu.edu.tw/dapid.
    Download PDF (6038K)
  • Keunwan Park, Dongsup Kim
    2006 Volume 17 Issue 2 Pages 216-225
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    As the number of protein sequences with unknown function increases, assigning accurate functions to unknown proteins becomes an increasingly important issue. Protein function is often encoded in a small number of residues located in the binding pocket, and there have been many attempts to predict function from the binding site. Here, we developed a binding site comparison method that can easily identify spatially matched residues between binding sites. Using a clique detection algorithm, the method finds the matched residue set of maximum size, and these matched residues are then scored in a way similar to sequence alignment scoring. In addition, the significance of the match score is estimated from an empirical random score distribution. Results of benchmark tests suggest that the method successfully detects functionally related binding sites. Furthermore, conserved residues and subfamily-specific residues in a functional family can be identified. We also used the method to investigate the systematic relationship between binding sites and functions; the results showed that proteins with similar binding sites largely perform similar functions.
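The correspondence-graph formulation behind clique-based site comparison can be sketched as follows: nodes pair residues of the same type, edges require intra-site distances to agree within a tolerance, and a maximum clique is the largest geometrically consistent match. This brute-force sketch and its site encoding are assumptions for illustration; the paper's clique algorithm, scoring, and significance estimation are not reproduced:

```python
from itertools import combinations

def match_sites(site_a, site_b, tol=1.0):
    """Largest set of residue pairs (i, j) with matching residue types and
    mutually consistent intra-site distances. Sites are lists of
    (residue_type, (x, y, z)) tuples -- a hypothetical encoding. Brute-force
    clique search, fine for a handful of pocket residues."""
    def dist(p, q):
        return sum((u - v) ** 2 for u, v in zip(p, q)) ** 0.5

    # Correspondence-graph nodes: same-type residue pairs across sites.
    nodes = [(i, j) for i, (ra, _) in enumerate(site_a)
                    for j, (rb, _) in enumerate(site_b) if ra == rb]

    def compatible(u, v):
        (i1, j1), (i2, j2) = u, v
        if i1 == i2 or j1 == j2:              # enforce one-to-one matching
            return False
        return abs(dist(site_a[i1][1], site_a[i2][1])
                   - dist(site_b[j1][1], site_b[j2][1])) <= tol

    for size in range(len(nodes), 0, -1):     # try largest sets first
        for cand in combinations(nodes, size):
            if all(compatible(u, v) for u, v in combinations(cand, 2)):
                return list(cand)
    return []
```

Real implementations replace the exhaustive loop with a dedicated clique algorithm (e.g. Bron-Kerbosch), since the correspondence graph grows with the product of the site sizes.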
    Download PDF (6188K)
  • Shinya Tasaki, Masao Nagasaki, Masaaki Oyama, Hiroko Hata, Kazuko Ueno ...
    2006 Volume 17 Issue 2 Pages 226-238
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Cell Illustrator is a model building tool based on the Hybrid Functional Petri net with extension (HFPNe). Using Cell Illustrator, we have succeeded in modeling biological pathways, e.g., metabolic pathways, gene regulatory networks, microRNA regulatory networks, cell signaling networks, and cell-cell interactions. The recent development of tandem mass spectrometry coupled with liquid chromatography (LC/MS/MS) has enabled researchers to quantify the dynamic profiles of a wide range of proteins within the cell, and the resulting proteomic data have been considerably useful for introducing dynamics into HFPNe models. Here, we report the first introduction of time-series proteomic data into our HFPNe modeling. We constructed an epidermal growth factor receptor signal transduction pathway model (EGFR model) using the biological data available in the literature. The kinetic parameters were then determined in a data assimilation (DA) framework, with some manual tuning, so as to fit the proteomic data published by Blagoev et al. (Nat. Biotechnol., 22:1139-1145, 2004). This in silico model was further refined by adding or removing regulation loops using biological background knowledge, and the DA framework was used to select the most plausible model from among the refined models. Using the proteomic data, we thus semi-automatically constructed a well-tuned EGFR HFPNe model with Cell Illustrator coupled to the DA framework.
    Download PDF (4349K)
  • Yvonne Y. Li, Jianghong An, Steven J. M. Jones
    2006 Volume 17 Issue 2 Pages 239-247
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We have developed a computational pipeline for the prediction of protein-small molecule interactions and have applied it to the drug repositioning problem through a large-scale analysis of known drug targets and small molecule drugs. Our pipeline combines forward and inverse docking, the latter of which is a twist on the conventional docking procedure used in drug discovery: instead of docking many compounds against a specific target to look for potential inhibitors, one compound is docked against many proteins to search for potential targets. We collected an extensive set of 1,055 approved small molecule drugs and 1,548 drug target binding pockets (representing 78 unique human protein therapeutic targets) and performed a large-scale docking using the ICM software both to validate our method and to predict novel protein-drug interactions. For the 37 known protein-drug interactions in our data set with a solved complex structure, all docked conformations were within 2.0 Å of the solved conformation, and 30 had a docking score passing the typical ICM score threshold. Of the 237 known protein-drug interactions annotated by DrugBank, 74 passed the score threshold, and 52 showed the drug docking to another protein with a better score than to its known target. These protein targets are implicated in human diseases, so the novel protein-drug interactions discovered represent potential new indications for the drugs. Our results highlight the promise of inverse docking for identifying potential novel therapeutic uses of existing drugs.
    Download PDF (8411K)
  • Kyle Ellrott, Jun-tao Guo, Victor Olman, Ying Xu
    2006 Volume 17 Issue 2 Pages 248-258
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Integer programming is a combinatorial optimization method that has been successfully applied to the protein threading problem. We seek to expand the model optimized by this technique to allow a more accurate description of protein threading. We have developed and implemented an expanded integer programming model capable of representing secondary structure element deletion, which was not possible in previous versions of integer-programming-based optimization.
    Download PDF (1140K)
  • Jayavardhana Gubbi, Alistair Shilton, Michael Parker, Marimuthu Palani ...
    2006 Volume 17 Issue 2 Pages 259-269
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    The determination of the first 3-D model of a protein from its sequence alone is a non-trivial problem. The first 3-D model is the key to the molecular replacement method of solving the phase problem in X-ray crystallography. If the sequence identity is more than 30%, homology modelling can be used to determine the correct topology (as defined by CATH) or fold (as defined by SCOP). If the sequence identity is less than 25%, however, the task is very challenging. In this paper we address the topology classification of proteins with sequence identity of less than 25%. The input to the system consists of the amino acid sequence, the predicted secondary structure, and the predicted real-valued relative solvent accessibility. A two-stage support vector machine (SVM) approach is proposed that classifies sequences into three structural classes (α, β, α+β) in the first stage and into 39 topologies in the second stage. The method is evaluated on a newly curated dataset from CATH with maximum pairwise sequence identity below 25%. Overall accuracies of 87.44% and 83.15% are achieved for class and topology prediction, respectively. In the class prediction stage, a sensitivity of 0.77 and a specificity of 0.91 are obtained. The data file, SVM implementation (SVMHEAVY) and result files can be downloaded from http://www.ee.unimelb.edu.au/ISSNIP/downloads/.
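    The two-stage cascade structure can be sketched independently of the SVMs themselves: a stage-1 classifier assigns the structural class, which selects a class-specific stage-2 classifier that assigns the topology. The stub rules and topology names below are hypothetical stand-ins for the trained SVMs and CATH labels used in the paper.

```python
# Sketch of a two-stage classification cascade: stage 1 picks a
# structural class, stage 2 dispatches to a class-specific topology
# classifier.  The decision rules and labels are invented stand-ins
# for the paper's trained SVMs.

def stage1_class(features):
    """Stand-in for the stage-1 SVM: pick a class from toy features."""
    helix, strand = features["helix_frac"], features["strand_frac"]
    if helix > 0.5 and strand < 0.1:
        return "alpha"
    if strand > 0.3 and helix < 0.1:
        return "beta"
    return "alpha+beta"

# One stand-in stage-2 classifier per structural class.
stage2 = {
    "alpha":      lambda f: "orthogonal bundle",
    "beta":       lambda f: "beta barrel",
    "alpha+beta": lambda f: "alpha-beta roll",
}

def predict_topology(features):
    return stage2[stage1_class(features)](features)

print(predict_topology({"helix_frac": 0.7, "strand_frac": 0.05}))
```

    Splitting the problem this way keeps each second-stage classifier small: rather than one 39-way decision, each class-specific model only discriminates among the topologies within its class.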
    Download PDF (1208K)
  • Carlos A. Del Carpio, Pei Qiang, Eiichiro Ichiishi, Hideyuki Tsuboi, M ...
    2006 Volume 17 Issue 2 Pages 270-278
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    A novel algorithm is introduced to deal with the intra-molecular motions of loops and domains that proteins undergo upon interaction with other proteins. The methodology is based on complex energy landscape sampling and robotic motion planning, and is underpinned by mapping high-flexibility regions of the protein. This is the first time this type of research has been reported. Application of the methodology to several protein complexes in which remarkable backbone rearrangement is observed shows that the new algorithm can handle changes of backbone conformation upon protein interaction. We have implemented the module within the system MIAX (Macromolecular interaction assessment computer system); together with our previously reported soft and flexible docking algorithms, it constitutes a powerful tool for protein function analysis as part of genome-wide function evaluation.
    Download PDF (5909K)
  • Ivo L Hofacker
    2006 Volume 17 Issue 2 Pages 281-282
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (181K)
  • Yoshihide Hayashizaki
    2006 Volume 17 Issue 2 Pages 283
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    We have established a large-scale system named CAGE (CAP-based analysis of gene expression) for identifying 5' transcription start sites (TSS) and promoter regions. With this system we have obtained over 10,000,000 CAGE tags from human and mouse. We have also determined the sequences of more than 100,000 full-length cDNAs from mouse, which were subsequently used to study the transcriptional landscape in mammals. From this large data set, the 5' and 3' boundaries of 181,047 transcripts were identified, with extensive variation arising from alternative promoter usage, splicing and polyadenylation. Genomic mapping of the transcriptome reveals transcriptional forests, with overlapping transcription on both strands, separated by deserts in which few transcripts are observed. Additional complex transcriptional genomic regions, named “chains”, possessing alternative forms and overlapping transcripts, were also observed. In summary, there are 16,247 new mouse protein-coding transcripts, including 5,154 encoding novel proteins, as well as new sense-antisense transcripts: 36,372 cis- and trans-antisense events in full-length cDNAs, 1,457 chains, 1,499 “gene fusions” and non-coding RNAs.
    Our CAGE tag method allows us to quantitatively analyze promoter usage in different tissues, revealing that differentially regulated alternative TSSs are a common feature of genes. The data permit genome-scale identification of tissue-specific promoters and analysis of associated cis-acting elements.
    These data provide a comprehensive platform for comparative analyses of mammalian transcriptional regulation in differentiation and development.
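    The quantitative promoter-usage analysis described above reduces, at its simplest, to tallying the CAGE tags mapped to each alternative TSS of a gene per tissue and comparing the resulting usage fractions. A sketch under that simplification follows; the gene, tissues, and tag counts are invented for illustration.

```python
# Sketch of per-tissue promoter usage from CAGE tag counts: tags
# mapped to alternative TSSs of one gene are tallied per tissue,
# giving the relative usage of each promoter in each tissue.
# The tag counts below are invented for illustration.
from collections import Counter

# (tissue, TSS id) for each mapped CAGE tag of a hypothetical gene
tags = [("liver", "TSS1")] * 90 + [("liver", "TSS2")] * 10 \
     + [("brain", "TSS1")] * 20 + [("brain", "TSS2")] * 80

def promoter_usage(tags):
    """Fraction of each tissue's tags attributed to each TSS."""
    counts = Counter(tags)
    totals = Counter(tissue for tissue, _ in tags)
    return {(tissue, tss): n / totals[tissue]
            for (tissue, tss), n in counts.items()}

usage = promoter_usage(tags)
print(usage[("liver", "TSS1")], usage[("brain", "TSS2")])  # 0.9 0.8
```

    A large shift in these fractions between tissues, as in this toy example, is the signature of a differentially regulated alternative TSS.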
    Download PDF (79K)
  • Jin Chen, Hon Nian Chua, Wynne Hsu, Mong-Li Lee, See-Kiong Ng, Rintaro ...
    2006 Volume 17 Issue 2 Pages 284-297
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    High-throughput experimental methods, such as yeast two-hybrid and phage display, have fairly high levels of false positives (and false negatives). The list of protein-protein interactions detected by such experiments therefore needs additional wet-laboratory validation, and it would be useful if the list could be prioritized in some way. This paper reviews advances in computational techniques for assessing the reliability of protein-protein interactions detected by such high-throughput methods, focusing on techniques that rely only on the topology of the protein interaction network derived from the experiments. In particular, we discuss indices that are abstract mathematical characterizations of networks of reliable protein-protein interactions, e.g., “interaction generality” (IG), “interaction reliability by alternative pathways” (IRAP), and “functional similarity weighting” (FSWeight). We also present indices based on explicit motifs associated with true-positive protein interactions, e.g., “new interaction generality” (IG2) and “meso-scale motifs” (NeMoFinder).
    Download PDF (1792K)
  • Yasubumi Sakakibara, Temple Smith
    2006 Volume 17 Issue 2 Pages v
    Published: 2006
    Released on J-STAGE: July 11, 2011
    JOURNAL FREE ACCESS
    Download PDF (94K)