Once premature convergence occurs, evolutionary algorithms for function optimization can no longer explore new areas of the search space and fail to find the optimum, so this notorious drawback must be addressed. This paper proposes two novel approaches to overcoming premature convergence in real-coded genetic algorithms (RCGAs). The first is to control the sampling region of crossover by adapting an expansion rate. The second is to accelerate the movement of the population by descending the mean of crossover. Finally, we propose a crossover operator, called AREX (adaptive real-coded ensemble crossover), that combines the expansion-rate adaptation technique with the crossover mean descent technique. The performance of the real-coded GA using AREX is evaluated on several benchmark functions, including functions whose landscapes form ridge or multi-peak structures, both of which are likely to induce premature convergence. The experimental results show not only that the proposed method can locate the global optima of functions on which existing GAs fail to find them, but also that it outperforms the existing approach in the number of function evaluations on all functions. Our approach enlarges the class of functions that real-coded GAs can solve.
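The two ideas can be sketched as one multi-parent crossover. The following is a minimal illustrative simplification under assumed formulas, not the published AREX definition: offspring are sampled around a mean shifted toward the better-ranked parents (mean descent), and the sampling spread is scaled by an expansion rate that an outer loop would adapt from offspring statistics.

```python
import random

def mean_descent_crossover(parents, fitness, expansion_rate=1.0,
                           n_children=4, rng=None):
    """Hedged sketch of mean-descent crossover (assumed weights, not AREX).

    parents:  list of real-valued vectors (lists of floats)
    fitness:  one value per parent; smaller is better (minimization)
    """
    rng = rng or random.Random()
    mu = len(parents)
    dim = len(parents[0])
    order = sorted(range(mu), key=lambda i: fitness[i])  # best parent first
    # Rank-based weights summing to 1: better parents pull the mean harder.
    weights = [2.0 * (mu - r) / (mu * (mu + 1.0)) for r in range(mu)]
    descended_mean = [sum(w * parents[i][d] for w, i in zip(weights, order))
                      for d in range(dim)]
    centroid = [sum(p[d] for p in parents) / mu for d in range(dim)]
    children = []
    for _ in range(n_children):
        # Recombine random parent deviations from the centroid,
        # scaled by the (externally adapted) expansion rate.
        xi = [rng.gauss(0.0, 1.0 / mu ** 0.5) for _ in range(mu)]
        child = [descended_mean[d] +
                 expansion_rate * sum(x * (parents[i][d] - centroid[d])
                                      for x, i in zip(xi, range(mu)))
                 for d in range(dim)]
        children.append(child)
    return children
```

With a larger expansion rate the offspring cloud widens, which is the knob the adaptation technique would tune; the descended mean biases sampling toward the better region of the parent population.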
This paper proposes two frameworks for engineering tree kernels. One ensures that the resulting tree kernels are positive semidefinite, while the other provides efficient algorithms to compute the kernels based on dynamic programming. The first framework gives a method for constructing tree kernels from primitive kernels for simpler structures (e.g., labels and strings) as building blocks, together with an easy-to-check sufficient condition for the resulting tree kernels to be positive semidefinite. The second framework provides a set of algorithm templates that calculate a wide range of tree kernels in O(|X|³) or O(|X|²) time, where |X| denotes the number of vertices of the trees.
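As one concrete instance in this spirit, the sketch below computes a simple co-rooted subtree-counting kernel by dynamic programming over node pairs, with a primitive label kernel as the building block. The class and function names, the restriction to ordered trees, and the equal-child-count matching rule are assumptions of this sketch; the paper's templates are more general.

```python
class Node:
    """An ordered, labeled tree node."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def tree_kernel(t1, t2, node_kernel=lambda a, b: 1.0 if a == b else 0.0,
                decay=1.0):
    """Dynamic program over node pairs (one memoized entry per pair).

    C(u, v) scores matching co-rooted subtrees built from a primitive
    kernel on labels; the total kernel sums C over all node pairs.
    """
    def nodes(t):
        out = [t]
        for c in t.children:
            out.extend(nodes(c))
        return out

    memo = {}
    def C(u, v):
        key = (id(u), id(v))
        if key not in memo:
            val = 0.0
            k = node_kernel(u.label, v.label)
            if k > 0.0 and len(u.children) == len(v.children):
                val = k * decay
                for cu, cv in zip(u.children, v.children):
                    val *= 1.0 + C(cu, cv)
            memo[key] = val
        return memo[key]

    return sum(C(u, v) for u in nodes(t1) for v in nodes(t2))
```

Each node pair is filled in once, which gives the quadratic-in-vertices running time the abstract refers to for the simpler template class.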
This paper presents a method for boosting the performance of organization name recognition, a part of named entity recognition (NER). Although gazetteers (lists of NEs) are known to be effective features for supervised machine learning approaches to NER, previous methods applied them in a very simple way: the gazetteers were used only to search for exact matches between the input text and the NEs they contain. The proposed method generates regular expression rules from gazetteers, and with these rules it realizes high-coverage searches based on looser matches between the input text and NEs. To generate these rules, we focus on two well-known characteristics of NE expressions: 1) most NE expressions can be divided into two parts, a class-reference part and an instance-reference part, and 2) in most NE expressions the class-reference part is located at the suffix position. A pattern mining algorithm runs on the set of NEs in the gazetteers and finds frequent word sequences from which NEs are constructed. We then employ only the word sequences that have the class-reference part at the suffix position as suffix rules. Experimental results showed that the proposed method improved the performance of organization name recognition, achieving an F-value of 84.58 on the evaluation data.
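The rule generation step can be illustrated as follows. This is a hedged sketch, not the paper's mining algorithm: frequent word-sequence suffixes (candidate class-reference parts such as "University") are counted across the gazetteer, and each frequent suffix is turned into a regular expression that also matches unseen instance-reference parts; the frequency threshold, the capitalized-word pattern, and the helper names are assumptions.

```python
import re
from collections import Counter

def build_suffix_rules(gazetteer, min_freq=2, max_len=2):
    """Mine frequent suffixes from gazetteer entries and compile loose rules.

    A suffix qualifies only if it leaves at least one word for the
    instance-reference part (illustrative criterion, not the paper's).
    """
    counts = Counter()
    for name in gazetteer:
        words = name.split()
        for n in range(1, max_len + 1):
            if n < len(words):              # keep an instance part
                counts[tuple(words[-n:])] += 1
    rules = []
    for suffix, freq in counts.items():
        if freq >= min_freq:
            # Looser match: any run of capitalized words before the
            # class-reference suffix.
            pattern = (r'(?:[A-Z][\w.&-]*\s+)+'
                       + r'\s+'.join(map(re.escape, suffix)))
            rules.append(re.compile(pattern))
    return rules
```

A rule mined from "Kyoto University" and "Tokyo University" then also matches "Hokkaido University", even though that name never appears in the gazetteer, which is the high-coverage behavior the abstract describes.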
The Web offers countless opportunities to transmit text information. Since texts on the Web are not always written by professional writers, they may be incoherent or hard to comprehend, and readers must spend considerable time and effort to grasp the topic relevance of a text. This paper describes the HINATA system, which visualizes texts using light and shadow based on topic relevance. A topic is defined as a set of words, such as the nouns contained in the title of a text. Light marks sentences related to the topic, and shadow marks sentences unrelated to it. This visualization method efficiently supports users in finding the parts related to a topic and in grasping the relations between the sentences of a text and a topic. Experimental results showed that the proposed system could help users understand how a text was related to a topic.
This paper presents a new method of extracting important words from newspaper articles based on time-sequence information. This word extraction method plays an important role in event sequence mining. TF-IDF is a well-known method for ranking the importance of words in a document. However, TF-IDF does not consider the time information embedded in sequential textual data, which is peculiar to newspapers. In this research, we propose a new word-extraction method, called TF-IDayF, which considers time-sequence information and can extract important and characteristic words expressing sequential events. The TF-IDayF method does not rely on the so-called burst phenomenon of topic word occurrences, which has been studied by many researchers. The method is quite simple, yet effective and easy to compute in sequential text mining. We evaluate the proposed method from three points of view, i.e., a semantic viewpoint, a statistical viewpoint, and a data mining viewpoint, through several experiments.
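The abstract does not give the TF-IDayF formula, so the following is only one plausible reading of the name, stated as an assumption: term frequency weighted by the inverse of the number of distinct days on which the word occurs, by analogy with inverse document frequency. The authors' actual definition may differ.

```python
import math
from collections import Counter, defaultdict

def tf_idayf(docs_by_day):
    """Assumed TF-IDayF-style score (hypothetical, not the published one).

    docs_by_day: mapping day -> list of tokenized articles for that day.
    Words that appear on every day score zero; words concentrated on a
    few days score higher, without any explicit burst detection.
    """
    tf = Counter()
    day_freq = defaultdict(set)
    for day, articles in docs_by_day.items():
        for tokens in articles:
            for w in tokens:
                tf[w] += 1
                day_freq[w].add(day)
    n_days = len(docs_by_day)
    return {w: tf[w] * math.log(n_days / len(day_freq[w])) for w in tf}
```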
Geographic information retrieval (GIR) aims at the retrieval of geographically related documents based not only on keyword relevance but also on the geographic relationships between the query and the geographic information in the texts. However, how to present search results in GIR has not been studied well, especially with regard to generating snippets that reflect the geographic part of the query. This paper proposes a novel snippet generation method. Our method first converts geographic phrases in the target text into geographic coordinates and then scores each of them according to its distance from the query, computed with the coordinates. Next, it extracts fragments of the target text based on the distribution of query keyword scores and geographic scores, and presents the combined fragments as a snippet. Evaluations are conducted with regard to two different aspects, and both confirm the effectiveness of our method.
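The scoring and fragment selection steps might look like the sketch below. It is a hedged simplification: the equirectangular distance approximation, the score-combination rule, and the `geocode` lookup are assumptions standing in for the paper's geocoding and scoring components.

```python
import math

def geo_score(doc_coord, query_coord, scale_km=100.0):
    """Score a geographic mention by its distance to the query location."""
    lat1, lon1 = doc_coord
    lat2, lon2 = query_coord
    # Equirectangular approximation (assumed; adequate for ranking).
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    dist_km = 6371.0 * math.hypot(x, y)
    return 1.0 / (1.0 + dist_km / scale_km)

def select_snippet(sentences, query_terms, query_coord, geocode, top_n=2):
    """sentences: list of (text, [place names]); geocode: name -> (lat, lon).

    Each fragment gets a keyword score plus the best geographic score of
    the places it mentions; the top fragments are joined into a snippet.
    """
    scored = []
    for text, places in sentences:
        kw = sum(text.lower().count(t.lower()) for t in query_terms)
        geo = max((geo_score(geocode[p], query_coord)
                   for p in places if p in geocode), default=0.0)
        scored.append((kw + geo, text))
    scored.sort(key=lambda s: -s[0])
    return " ... ".join(text for _, text in scored[:top_n])
```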
We propose an open-ended dialog system that generates an appropriate response to a user's utterance using the abundant documents on the World Wide Web as sources. Existing knowledge-based dialog systems give meaningful information to a user, but they are unsuitable for open-ended input. The Eliza system can handle open-ended input, but it gives no meaningful information. Our system lies between these two kinds of dialog systems; it converses on various topics and gives meaningful information related to the user's utterances. The system selects an appropriate sentence as a response from documents gathered from the Web, on the basis of surface cohesion and shallow semantic coherence. Surface cohesion follows centering theory, and semantic coherence is calculated from the conditional distribution and inverse document frequency of content words (nouns, verbs, and adjectives). We developed a trial system that converses about movies and experimentally found that the proposed method generated appropriate responses 66% of the time.
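As a minimal sketch of the coherence component only, a candidate response can be scored by the IDF-weighted overlap of content words with the user's utterance. The scoring formula here is an assumption; the conditional-distribution term and the centering-theory cohesion check are omitted.

```python
import math
from collections import Counter

def idf_table(corpus_sentences):
    """corpus_sentences: list of sets of content words from the corpus."""
    n = len(corpus_sentences)
    df = Counter()
    for words in corpus_sentences:
        df.update(set(words))
    return {w: math.log(n / df[w]) for w in df}

def select_response(utterance_words, candidates, idf):
    """Pick the candidate sharing the most IDF-weighted content words
    with the user's utterance (a stand-in for shallow semantic
    coherence; rare shared words count more than common ones)."""
    def score(cand):
        return sum(idf.get(w, 0.0) for w in set(cand) & set(utterance_words))
    return max(candidates, key=score)
```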
Recent work on temporal relation identification has focused on three types of relations between events: temporal relations between an event and a time expression, between a pair of events, and between an event and the document creation time. These types of relations have mostly been identified in isolation by pairwise comparison of events. However, this approach neglects logical constraints between temporal relations of different types that we believe to be helpful. We therefore propose a Markov Logic model that jointly identifies relations of all three types. By evaluating our model on the TempEval data, we show that this approach leads to about 2% higher accuracy for all three types of relations, and to the best results for the task when compared with other machine learning based systems.
We propose a method for extracting information on technical effects from patent documents. The information extracted by our method is useful for automatically generating patent maps (see, e.g., Figure 1) or for analyzing technical trends in patent documents. Our method extracts expressions containing information on technical effects by using frequent expressions and clue expressions that are effective for extracting them; the frequent expressions and clue expressions are in turn extracted automatically using statistical information and initial clue expressions. Our method extracts expressions containing information on technical effects without hand-crafted patterns, and is expected to be applicable to other tasks of acquiring expressions with a particular meaning (e.g., information on the means for solving a problem), not limited to technical effects. By acquiring such clue expressions automatically from patent documents, our method achieves not only high precision (78.0%) but also high recall (77.6%).
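The bootstrapping idea can be sketched as follows. This is a hypothetical simplification, not the paper's statistical criteria: sentences containing a known clue are treated as effect sentences, words frequent in them but rare elsewhere become new clue candidates, and the loop repeats.

```python
from collections import Counter

def bootstrap_clues(sentences, seed_clues, top_k=5, rounds=2):
    """Hypothetical clue-expression bootstrapping sketch.

    sentences: plain-text sentences; seed_clues: initial clue strings.
    The contrastive score (in-frequency minus out-frequency) is an
    assumed stand-in for the paper's statistical measures.
    """
    clues = set(seed_clues)
    for _ in range(rounds):
        effect = [s for s in sentences if any(c in s for c in clues)]
        other = [s for s in sentences if s not in effect]
        freq_in = Counter(w for s in effect for w in s.split())
        freq_out = Counter(w for s in other for w in s.split())
        scored = {w: c - freq_out.get(w, 0) for w, c in freq_in.items()}
        for w, _ in sorted(scored.items(), key=lambda x: -x[1])[:top_k]:
            clues.add(w)
    return clues

def extract_effects(sentences, clues):
    """Return sentences that contain any clue expression."""
    return [s for s in sentences if any(c in s for c in clues)]
```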
It is important for R&D managers, consultants, and other people seeking broad knowledge of technology fields to survey technical literature such as research papers, white papers, and technology news articles. One important kind of information for such people concerns the effectiveness of new technologies in their own businesses. General search engines are good at selecting documents that reveal the details of a specific technology or technology field, but it is hard to obtain from such search results useful information about how a technology will apply to individual business cases. There is a need for a technology survey assistance tool that helps users find technologies with suitable capabilities. In this paper, two technical tasks were tackled to develop a prototype of this assistance tool: extraction of advantage phrases, and scoring of those phrases to find novel applications in the target technology field. We describe a new method for identifying advantage phrases in technical documents and a scoring function that gives higher scores to novel applications of the technology. The evaluations showed that our phrase identification method, using only a few phrasal patterns, performs almost as well as human annotators, and that the proposed scoring conforms better to the decisions made by professionals than a random ordering.
Wikipedia, a collaborative Wiki-based encyclopedia, has become a huge phenomenon among Internet users. It covers a huge number of concepts in various fields such as arts, geography, history, science, sports, and games. As a corpus for knowledge extraction, Wikipedia's impressive characteristics are not limited to its scale; they also include its dense link structure, URL-based word sense disambiguation, and brief anchor texts. Because of these characteristics, Wikipedia has become a promising corpus and a new frontier for research. In the past few years, a considerable number of studies have been conducted in areas such as semantic relatedness measurement, bilingual dictionary construction, and ontology construction. Extracting machine-understandable knowledge from Wikipedia to enhance the intelligence of computational systems is the main goal of "Wikipedia Mining," a project on CREP (Challenge for Realizing Early Profits) in JSAI. In this paper, we take a comprehensive, panoramic view of Wikipedia Mining research and the current status of our challenge. We then discuss the future vision of this challenge.
This paper proposes a new technology, a "bodygraphic injury surveillance system" (BISS), that not only accumulates accident situation data but also represents injury data based on a human body coordinate system in a standardized and multilayered way. Standardized, multilayered representation of injuries enables the accumulation, retrieval, sharing, statistical analysis, and causal modeling of injuries across different fields such as medicine, engineering, and industry. To confirm the effectiveness of the developed system, the authors collected 3,685 records of children's injuries in cooperation with a hospital. As new analyses based on the developed BISS, this paper presents bodygraphic statistical analysis and childhood injury modeling using BISS together with Bayesian network technology.
The purpose of this study is to explore a sustainable service in which continuous validation is possible, through the development of a support service for the prevention of and recovery from dementia, toward a science of lethe. We designed and implemented a conversation support service based on the coimagination method and the multiscale service design method, both proposed by the author. Interactive conversation supported by the coimagination method generates social interaction that helps prevent the progression of dementia. The multiscale service model consists of tool, event, human, network, style, and rule. Service elements at the tool, event, and human scales were developed according to the model. First, we developed a conversation interactivity measuring method to measure the intensity of cognitive activities for the prevention of dementia (event). Second, an education program for learning the coimagination method was designed and provided to bring out the social intelligence of participants and instructors (human). Third, the relationship between social intelligence and the prevention of dementia is discussed based on the experimental data (tool).