Researchers in agriculture, the life sciences, and drug design need to combine information from two or more life science databases to solve their problems, and Semantic Web technologies are now essential for integrating such databases. This study introduces a technique that uses RDF (Resource Description Framework) and OWL (Web Ontology Language) data as a data set for developing a machine learning predictor in interactomics. We also sketch how an interactomics LOD (Linked Open Data) collection can be implemented in a graph database and queried with SPARQL (SPARQL Protocol and RDF Query Language). Since 2013 the interactomics LOD has included protein--protein interaction pairs of tyrosine kinases, amino acid residue pairs of sugar (carbohydrate) binding proteins, and cross-references of protein chains among entries of major bioscience databases. We designed three RDF schema models and made the data accessible through AllegroGraph 4.11 and Virtuoso 7. These databases contain 1,824,859,745 triples in total, and they can be combined and searched together with 28,529,064,366 triples of public LOD in the life science domain. This work shows that large-scale LOD can realistically be handled on a comparatively small budget; using LOD reduced not only expense but also development time. In particular, RDF-SIFTS (Structure Integration with Function, Taxonomy and Sequence), an aggregate of 10 small LOD sets, was built in about one week during BioHackathon 2013. We conclude that LOD allows a data set required for machine learning in interactomics to be obtained quickly. The interactomics LOD is available as a database for application development, and the SPARQL endpoints of these databases are published on the portal site UTProt (The University of Tokyo Protein, http://utprot.net).
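To illustrate the idea of turning interaction triples into a machine-learning data set, here is a minimal sketch in plain Python; the predicate URI, the `protein:` identifiers, and the helper function are invented for illustration and are not the actual UTProt schema.

```python
# Sketch: extracting positive examples for a machine learning predictor
# from interaction triples. All URIs and identifiers below are hypothetical.
INTERACTS = "http://example.org/interactsWith"

triples = [
    ("protein:P00533", INTERACTS, "protein:P04626"),
    ("protein:P00533", INTERACTS, "protein:P21860"),
    ("protein:P06213", INTERACTS, "protein:P05019"),
]

def interaction_pairs(triples, predicate):
    """Collect (subject, object) pairs for one predicate as positive examples."""
    return [(s, o) for s, p, o in triples if p == predicate]

pairs = interaction_pairs(triples, INTERACTS)
print(pairs)  # three positive protein-protein pairs
```

In practice such pairs would be retrieved from the SPARQL endpoints rather than from an in-memory list, but the resulting (subject, object) pair list is the shape a predictor's training set would take.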
As one of the core technologies of the Semantic Web, the RDF data model enables us to represent machine-readable metadata about Web resources, and SPARQL is the standard language for querying RDF data. To support SPARQL in practice, it is important to access such metadata quickly from large-scale RDF data stores such as Gene Ontology and DBpedia. In this paper, we present an algorithm that decides the evaluation order of an input query. We formalize the search-space reduction achieved through the connected variables occurring in a query and establish the costs of different query patterns. In particular, we select, among the possible orders of the elements in each query, one that requires less computation in the searching and reasoning steps, i.e., in matching a subgraph of a complex RDF graph. Using the algorithm, we implement an efficient query system on an RDF store (called NodeStore) for large-scale RDF data. In the store, an indexed data structure over RDF graphs is constructed to optimize RDF data processing, e.g., finding the set of RDF triples that include a common resource. To evaluate the ordering algorithm, we report experimental results for our query system NodeStore and the Jena framework on a LUBM dataset (a benchmarking framework for semantic repositories).
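The idea of ordering triple patterns by estimated cost can be sketched as follows; the per-position index and the `min`-of-counts cost estimate are illustrative assumptions, not NodeStore's actual data structures.

```python
from collections import defaultdict

# Toy selectivity-based ordering of triple patterns: evaluate the most
# selective pattern first so fewer candidate bindings are carried forward.
triples = [
    ("a", "type", "Prof"), ("b", "type", "Prof"), ("c", "type", "Student"),
    ("c", "advisor", "a"), ("c", "takes", "course1"), ("b", "teaches", "course1"),
]

# Index triples by each bound position so candidate counts are cheap to get.
index = {"s": defaultdict(list), "p": defaultdict(list), "o": defaultdict(list)}
for t in triples:
    for pos, term in zip("spo", t):
        index[pos][term].append(t)

def estimate(pattern):
    """Estimated number of matching triples; variables start with '?'."""
    counts = [len(index[pos][term])
              for pos, term in zip("spo", pattern) if not term.startswith("?")]
    return min(counts) if counts else len(triples)

query = [("?x", "type", "Student"), ("?x", "advisor", "?y"), ("?y", "type", "Prof")]
ordered = sorted(query, key=estimate)  # most selective pattern first
print(ordered[0])  # only one Student triple exists, so this pattern goes first
```

A real optimizer would also account for the variables shared between patterns (the paper's connected-variable reduction); this sketch shows only the per-pattern cost component.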
Open data has been drawing attention in recent years as a basis for innovative services. To promote a greater number of consumer services on the Web, a search function that reveals what kinds of data are available would be helpful. However, if open data comes to be described in a triple language such as the Resource Description Framework (RDF), full-text search is not suitable for such data fragments, and a formal query language is difficult for ordinary users. We therefore propose a question answering service based on Linked Open Data. Among the problems of using open data as a knowledge source, we focus on mapping question sentences to the data schema and on data acquisition. We propose improving accuracy through user feedback and acquiring new data from user context information. We also present `Flower Voice', an application of the service that assists with fieldwork, and confirm its effectiveness.
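One way to picture the question-to-schema mapping with feedback is a weighted keyword-to-property table; every keyword, property name, and score below is invented for illustration and is not the Flower Voice schema.

```python
# Hypothetical mapping from question keywords to schema properties,
# with user feedback raising the weight of confirmed mappings.
mapping = {
    "bloom": {"floweringSeason": 0.6, "habitat": 0.2},
    "where": {"habitat": 0.7, "location": 0.5},
}

def best_property(keywords):
    """Pick the schema property with the highest summed keyword score."""
    scores = {}
    for kw in keywords:
        for prop, w in mapping.get(kw, {}).items():
            scores[prop] = scores.get(prop, 0.0) + w
    return max(scores, key=scores.get) if scores else None

def feedback(keyword, prop, delta=0.1):
    """Raise a mapping's score when the user confirms an answer was correct."""
    mapping.setdefault(keyword, {})[prop] = mapping[keyword].get(prop, 0.0) + delta

print(best_property(["where", "bloom"]))  # habitat wins: 0.7 + 0.2 = 0.9
```

The point of the sketch is the feedback loop: each confirmed answer nudges the keyword-property weights, which is one simple reading of the paper's accuracy improvement by user feedback.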
This paper discusses how to build a Japanese vocabulary for Japanese Linked Open Data. The vocabulary is constructed by mapping properties of the Japanese Wikipedia Ontology to the Linked Open Vocabularies. The Japanese Wikipedia Ontology is a large-scale ontology learned from the Japanese Wikipedia; it includes many properties and property relations (property domains and property ranges). The Linked Open Vocabularies is a large cloud of vocabularies for Linked Open Data. We construct a Japanese vocabulary semi-automatically by mapping the ontology's properties to those vocabularies. Experimental case studies show that the resulting Japanese vocabulary can serve as a general vocabulary for building Japanese Linked Open Data.
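A semi-automatic property mapping of this kind could, for example, rank candidate vocabulary terms by string similarity against a (translated) property label; the candidate terms, the threshold, and the use of `difflib` here are illustrative assumptions, not the paper's method.

```python
from difflib import SequenceMatcher

# Sketch: match an ontology property label against candidate Linked Open
# Vocabularies terms by string similarity. Candidate terms are made up.
lov_terms = ["foaf:birthday", "dbo:birthDate", "schema:birthPlace"]

def map_property(label, candidates, threshold=0.5):
    """Return the best-matching vocabulary term, or None if all score too low."""
    def score(term):
        local = term.split(":", 1)[1].lower()  # drop the namespace prefix
        return SequenceMatcher(None, label.lower(), local).ratio()
    best = max(candidates, key=score)
    return best if score(best) >= threshold else None

print(map_property("birth date", lov_terms))  # dbo:birthDate
```

A human reviewer would then accept or reject each suggested mapping, which is what makes the overall process semi-automatic.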
Publishing open data as linked data is a significant trend not only in the Semantic Web community but also in other domains such as life science, government, media, geographic research, and publication. One feature of linked data is its instance-centric approach, which assumes that a large number of linked instances can yield valuable knowledge. In the context of linked data, ontologies offer a common vocabulary and schema for RDF graphs. From an ontological engineering viewpoint, however, some ontologies offer systematized knowledge developed through close cooperation between domain experts and ontology engineers, and such ontologies can serve as valuable knowledge bases for advanced information systems. Although ontologies in RDF formats using OWL or RDF(S) can be published as linked data, their complicated graph structures make them inconvenient for other applications to consume. This paper therefore discusses RDF data models for publishing ontologies as linked data. As a case study, we focus on a disease ontology in which diseases are defined as causal chains.
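To make the causal-chain idea concrete, here is one possible triple shape for a disease defined by a chain of abnormal states, traversed in plain Python; all node and property names are invented and are not the ontology's actual terms.

```python
# Hypothetical triples: a disease points to a causal chain of abnormal
# states linked by an (invented) ex:causes property.
EX = "ex:"
triples = [
    (EX + "Diabetes", EX + "hasCausalChain", EX + "insulinDeficiency"),
    (EX + "insulinDeficiency", EX + "causes", EX + "hyperglycemia"),
    (EX + "hyperglycemia", EX + "causes", EX + "vascularDamage"),
]

def follow_chain(start, triples, pred):
    """Walk 'causes' links from a starting abnormality node to the end."""
    chain, node = [start], start
    while True:
        nxt = [o for s, p, o in triples if s == node and p == pred]
        if not nxt:
            return chain
        node = nxt[0]
        chain.append(node)

print(follow_chain(EX + "insulinDeficiency", triples, EX + "causes"))
```

A linear chain like this is easy for applications to traverse, which is the kind of consumption-friendly RDF shape the paper argues for, in contrast to the deeply nested structures produced by a direct OWL serialization.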
Negative association rules represent relationships between the presence and absence of itemsets. In general, the number of negative association rules is enormous even compared with that of positive association rules, so an efficient mining method is essential. In this paper, we propose a novel top-down mining method for negative association rules of the forms X ⇒ ￢Y and ￢X ⇒ Y. The proposed method searches a suffix tree over frequent itemsets in a top-down manner and efficiently extracts all valid negative rules of these two types, step by step. The suffix tree plays a crucial role in pruning a large number of redundant searches, such as those that would produce non-minimal valid negative rules. We also present experimental results evaluating the proposed method. Because the method performs simple negative rule mining based on the support and confidence measures, it provides a fundamental framework into which additional measures can easily be introduced if necessary.
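The support and confidence measures for a negative rule X ⇒ ￢Y can be sketched as follows; the toy transactions are invented, and this shows only the rule-evaluation step, not the paper's suffix-tree search.

```python
# Evaluating a negative rule X => not-Y over toy transaction data:
# confidence is P(X present and Y absent) / P(X present).
transactions = [
    {"a", "b"}, {"a", "c"}, {"a"}, {"b", "c"}, {"a", "c"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def neg_confidence(x, y):
    """Confidence of the negative rule X => not-Y."""
    sup_x = support(x)
    sup_x_not_y = sum(1 for t in transactions
                      if x <= t and not y <= t) / len(transactions)
    return sup_x_not_y / sup_x if sup_x else 0.0

print(neg_confidence({"a"}, {"b"}))  # 3 of the 4 'a'-transactions lack 'b': 0.75
```

The top-down suffix-tree search in the paper exists precisely to avoid evaluating these measures for every candidate pair, pruning branches that can only yield non-minimal or invalid rules.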
This paper proposes a surface-similarity based method for recognizing textual entailment (RTE) in Japanese. First, we experimentally show that there is a positive correlation between semantic similarity (textual entailment) and surface similarity between sentences. The most effective surface-similarity measure for RTE is the character overlap ratio, which achieves a classification accuracy of 78.3%. Based on this result, we design a two-step RTE system for binary classification. The first step classifies a given text pair as positive or negative entailment based on the character overlap ratio. If the pair is classified as positive, the second step examines whether the assigned class should be flipped, using heuristic rules that detect mismatches of named entities and numbers. In addition to the RTE system, we also implement the MC system, which classifies a given text pair into one of four classes (forward entailment, bidirectional entailment, contradiction, and the others) by combining a contradiction detector with the RTE system. In the RITE-2 formal run, the RTE system was ranked 7th among 42 systems at the RTE task, and the MC system was ranked first among 21 systems at the MC task. These results show that the surface-similarity based method achieves high performance in RTE.
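The first-step decision can be sketched as a threshold on a character overlap ratio; the multiset-overlap formula and the threshold value below are one plausible reading of the measure, not the authors' exact implementation.

```python
from collections import Counter

# Sketch of the surface-similarity first step: fraction of hypothesis
# characters that also appear in the text, thresholded for a yes/no call.
def char_overlap_ratio(text, hypothesis):
    """Character multiset overlap, normalized by the hypothesis length."""
    t, h = Counter(text), Counter(hypothesis)
    overlap = sum((t & h).values())
    return overlap / sum(h.values()) if hypothesis else 0.0

def entails(text, hypothesis, threshold=0.6):
    """First-step binary decision; threshold 0.6 is an assumed value."""
    return char_overlap_ratio(text, hypothesis) >= threshold

print(char_overlap_ratio("abcde", "abcf"))  # 3 shared of 4 chars -> 0.75
```

The paper's second step would then apply named-entity and number mismatch rules to flip false positives produced by this purely surface-level test.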