Nowadays, increasing traffic causes numerous serious traffic jams, and traffic signals are expected to adapt to dynamic traffic flows. In this paper, we view traffic signal control as a multi-agent problem in which each signal has a controlling agent, and we aim to make the agents cooperate according to the traffic status. To build such agent programs automatically, we introduce genetic programming (GP), an evolutionary method for program construction. In GP, it is known to be important to encapsulate the substructures of a program that lead to higher fitness to the environment, and we propose a new encapsulation method based on an efficient technique for discovering frequent substructures that was recently proposed in the data mining field. We conducted a simulation with real traffic data and confirmed that GP with our encapsulation method outperforms standard GP. We also observed that the best individual has a communication part that chooses an appropriate communication area and adapts to the traffic status.
Learning from tree-structured data has received increasing interest with the rapid growth of tree-encodable data in the World Wide Web, in biology, and in other areas. Our kernel function measures the similarity between two trees by counting the number of shared sub-patterns called tree q-grams, and runs, in effect, in linear time with respect to the number of tree nodes. We apply our kernel function with a support vector machine (SVM) to classify biological data, the glycans of several blood components. The experimental results show that our kernel function performs as well as a kernel exclusively tailored to glycan properties.
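For intuition, the q-gram counting idea can be sketched on label sequences along root-to-node paths. This is a simplified path-based variant for illustration, not the exact tree q-gram patterns of the paper, and the encoding of a tree as nested `(label, children)` tuples is our own assumption:

```python
from collections import Counter

def path_qgrams(tree, q, prefix=()):
    """Collect label q-grams along root-to-node paths of a nested-tuple tree.
    A tree is (label, [children]). Simplified variant for illustration."""
    label, children = tree
    path = prefix + (label,)
    grams = Counter()
    if len(path) >= q:
        grams[path[-q:]] += 1      # the last q labels on this path
    for child in children:
        grams += path_qgrams(child, q, path)
    return grams

def qgram_kernel(t1, t2, q=2):
    """Kernel value = number of shared q-grams (sum of minimum counts)."""
    g1, g2 = path_qgrams(t1, q), path_qgrams(t2, q)
    return sum(min(n, g2[g]) for g, n in g1.items())
```

A pair of trees sharing many labeled paths thus scores high even when their overall shapes differ, which is the intuition behind substructure-counting kernels.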
In this paper, we propose a new method for selecting "robust" exposure variables. We call a variable "robust" if it is selected both from the original data and from perturbed data. There have been few studies of effective selection methods of this kind. Selecting exposure variables is almost the same problem as extracting correlation rules without robustness. [Brin 97] suggested that correlation rules can be extracted efficiently by exploiting a monotone property of the chi-squared statistic of contingency tables over binary data. However, the chi-squared value itself does not have the monotone property: as the dimension increases, a variable set may easily be judged dependent even when it is completely independent, so that method is not usable for selecting robust exposure variables. To select robust independent variables, we instead assume an anti-monotone property for independence and apply the Apriori algorithm. Apriori is an algorithm for finding association rules in market basket data; it exploits the anti-monotone property of the support measure defined for association rules. Independence does not strictly have the anti-monotone property with respect to the AIC of the independence probability model, but the tendency is strong, so variables selected under the anti-monotone assumption on the AIC are robust. Our method judges whether a given variable is an exposure variable for an independent variable by comparing the previously computed AIC values. Our numerical experiments show that our method can select robust exposure variables efficiently and precisely.
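The Apriori-style levelwise search with anti-monotone pruning that the method adapts can be sketched with support as the pruned measure; this is a minimal illustration of the pruning principle, not the proposed AIC-based algorithm:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Levelwise search: a set can pass the threshold only if all of its
    subsets do (anti-monotone property), so each level is generated only
    from the survivors of the previous one."""
    items = sorted({i for t in transactions for i in t})

    def support(s):
        return sum(1 for t in transactions if frozenset(s) <= t)

    level = [frozenset([i]) for i in items if support([i]) >= minsup]
    frequent = list(level)
    k = 2
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune: every (k-1)-subset must already have survived
        survivors = [c for c in candidates
                     if all(frozenset(s) in set(level)
                            for s in combinations(c, k - 1))
                     and support(c) >= minsup]
        frequent += survivors
        level = survivors
        k += 1
    return frequent
```

Replacing `support` with an AIC-based independence score, as the abstract proposes, keeps the same search skeleton as long as the score behaves (at least tendentially) anti-monotonically.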
In this paper, we present a method for finding symmetric items in a combinatorial item set database. Techniques for finding symmetric variables in Boolean functions have been studied for a long time in the area of VLSI logic design, and BDD (Binary Decision Diagram)-based methods have been presented to solve this problem. Recently, we developed an efficient method for handling databases using ZBDDs (Zero-suppressed BDDs), a particular type of BDD. In our ZBDD-based data structure, symmetric item sets can be found as efficiently as for Boolean functions. We implemented a program for symmetric item set mining and applied it to actual biological data on the amino acid sequences of influenza viruses. We found a number of symmetric items in the database, some of which indicate interesting relationships in the amino acid mutation patterns. The results show that our method is helpful for extracting hidden interesting information from real-life databases.
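For intuition, the symmetry condition itself can be checked naively: two items are symmetric when exchanging them maps the family of itemsets onto itself. The ZBDD-based method computes this efficiently on a compressed representation; the plain set-of-sets encoding below is an assumption for illustration:

```python
def symmetric(db, i, j):
    """Items i and j are symmetric in db if swapping them everywhere
    leaves the family of itemsets unchanged (naive check)."""
    def swap(s):
        return frozenset(j if x == i else i if x == j else x for x in s)
    family = {frozenset(s) for s in db}
    return {swap(s) for s in family} == family
```

Running this pairwise over all items is quadratic in the item count and linear in database size per pair, which is exactly the cost the ZBDD structure is designed to avoid.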
Frequent item set mining is one of the fundamental techniques for knowledge discovery and data mining. In the last decade, a number of efficient algorithms for frequent item set mining have been presented, but most of them focused on just enumerating the item set patterns that satisfy the given conditions; how to store and index the resulting patterns for efficient data analysis was treated as a separate matter. Recently, we proposed a fast algorithm for extracting all frequent item set patterns from transaction databases while simultaneously indexing the resulting huge set of patterns using Zero-suppressed BDDs (ZBDDs). That method, ZBDD-growth, not only enumerates and lists the patterns efficiently but also indexes the output data compactly in memory so that it can be analyzed with various algebraic operations. In this paper, we present a variation of the ZBDD-growth algorithm that generates frequent closed item sets. It is a quite simple modification of ZBDD-growth, and the additional computation cost is relatively small compared with the original algorithm for generating all patterns. Our method can conveniently be utilized in a ZBDD-based pattern-indexing environment.
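The closedness condition can be illustrated with a naive enumeration: an itemset is closed if no proper superset has the same support. This brute-force sketch materializes every candidate, which is precisely what ZBDD-growth avoids by sharing structure:

```python
from itertools import combinations

def closed_itemsets(transactions, minsup):
    """Naive closed frequent itemset enumeration, for illustration only."""
    items = sorted({i for t in transactions for i in t})

    def support(s):
        return sum(1 for t in transactions if set(s) <= t)

    frequent = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if support(c) >= minsup]
    # closed = no proper superset with equal support
    return [f for f in frequent
            if not any(f < g and support(f) == support(g) for g in frequent)]
```

Because every non-closed frequent itemset is implied by some closed one with the same support, the closed sets form a lossless, usually much smaller, summary of the full result.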
Recently, mining structured data has been studied actively. However, since most structured data mining techniques so far specialize in mining a single kind of structured data, they have difficulty handling more realistic data that involve various types of attributes and consist of multiple kinds of structured data. Since such data are expected to increase rapidly, we need a flexible and highly accurate technique that can treat them comprehensively. In this paper, as one such technique, we propose data mining algorithms for mining classification rules in multidimensional structured data. First, an algorithm for mining correlated patterns with two pruning capabilities is introduced. Then, the top-k multidimensional correlated patterns are discovered by applying this algorithm repeatedly in a beam-search-like fashion. We also show algorithms for constructing classifiers based on the discovered patterns. Experiments with real-world data were conducted to assess the effectiveness of the proposed algorithms. The results show that the proposed algorithms can construct comprehensible and accurate classifiers within a reasonable running time.
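The beam-search-like repetition can be sketched generically; here `expand` and `score` stand in for the paper's pattern-extension step and correlation measure, and are assumptions for illustration:

```python
def beam_search(start, expand, score, width, steps):
    """Generic beam search: at each level keep only the top-`width`
    candidates by score, then expand only those survivors."""
    beam = [start]
    for _ in range(steps):
        candidates = [c for s in beam for c in expand(s)]
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:width]
    return beam
```

Narrowing to a fixed-width beam at every level trades completeness for tractability, which is why the abstract applies the pruned correlated-pattern miner repeatedly rather than exhaustively.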
We study a regression tree algorithm tailored to casualty insurance pure premium estimation. A casualty insurance premium is mainly determined by the expected amount that the insurance company has to pay for the contract; casualty insurance companies therefore have to estimate the expected insurance amount on the basis of insurance risk factors. This is formulated as a regression problem, i.e., estimation of the conditional mean E[Y|x], where Y is the insurance amount and x is the vector of risk factors. In this paper, we address this regression problem in the regression tree framework. The difficulty lies in the fact that the distribution of the insurance amount P(Y|x) is highly skewed and exhibits a long tail in the positive direction. The conventional least-squares-error regression tree algorithm is notoriously unstable under such long-tailed error distributions. On the other hand, several types of robust regression trees, such as the least-absolute-error regression tree, are not appropriate in this situation either, because they yield a significant bias in the conditional mean E[Y|x]. We therefore propose a two-stage tree fitting algorithm. In the first stage, the algorithm constructs a quantile tree, a kind of robust regression tree, which is stable but biased with respect to the conditional mean E[Y|x]. In the second stage, the algorithm corrects the bias using a least-squares-error regression tree. We discuss the theoretical background of the algorithm and empirically investigate its performance. We applied the proposed algorithm to a car insurance data set of 318,564 records provided by a North American insurance company and obtained significantly better results than the conventional regression tree algorithm.
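A minimal sketch of the two-stage idea on a single split (a "stump"), with the quantile fixed at the median; the real algorithm grows full trees, and this reduction is our own simplification:

```python
import statistics

def fit_two_stage_stump(x, y):
    """Stage 1: choose the split robustly, minimizing absolute error
    around leaf medians (stable under long-tailed y).
    Stage 2: replace the biased median predictions with leaf means,
    the least-squares-optimal values for estimating E[Y|x]."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        loss = (sum(abs(v - statistics.median(left)) for v in left)
                + sum(abs(v - statistics.median(right)) for v in right))
        if best is None or loss < best[0]:
            best = (loss, t)
    t = best[1]
    left = [yi for xi, yi in zip(x, y) if xi <= t]
    right = [yi for xi, yi in zip(x, y) if xi > t]
    # stage 2: least-squares leaf values (means) correct the median's bias
    return t, statistics.mean(left), statistics.mean(right)
```

The split location is chosen under the robust criterion, so a few extreme claims cannot move it, while the final leaf values remain unbiased estimates of the conditional mean.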
We propose an efficient algorithm for deciding reachability between any two nodes in XML data represented as a connected directed graph. We develop a technique to reduce the size of the reference table used for the reachability test. Using this small table and the standard range labeling method for rooted ordered trees, we show that our algorithm answers almost all queries in constant time while preserving space efficiency and a reasonable preprocessing time.
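The standard range labeling for rooted trees that the method builds on can be sketched as follows: each node gets an interval from a DFS, and the ancestor (reachability-within-tree) test reduces to one interval containment check:

```python
def label_tree(tree):
    """Range labeling: each node of a (name, [children]) tree gets an
    interval [enter, last_descendant] from a preorder DFS."""
    intervals = {}
    counter = [0]

    def dfs(node):
        name, children = node
        start = counter[0]
        counter[0] += 1
        for child in children:
            dfs(child)
        intervals[name] = (start, counter[0] - 1)

    dfs(tree)
    return intervals

def reaches(intervals, u, v):
    """u reaches v along tree edges iff u's interval contains v's entry
    time -- an O(1) test."""
    start, end = intervals[u]
    return start <= intervals[v][0] <= end
```

Non-tree edges of the general directed graph are what the abstract's reduced reference table handles; the interval test alone covers the tree part.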
The number of competing brands changes when a new product enters the market. New product introduction is endemic among consumer packaged goods firms and is an integral component of their marketing strategy. As a new product's entry affects markets, there is a pressing need to develop market response models that can adapt to such changes. In this paper, we develop a dynamic model that captures the underlying evolution of the buying behavior associated with the new product. It extends the dynamic linear model, which is used in many time series analyses, by allowing the observed dimension to change at some point in time. Our model copes with two problems that dynamic environments entail: changes in parameters over time and changes in the observed dimension. We formulate the model within the framework of a state space model and estimate it using a modified Kalman filter/fixed-interval smoother. We find that a new product's entry (1) decreases brand differentiation for existing brands, as indicated by a decreasing difference between cross-price elasticities; (2) decreases commodity power for existing brands, as indicated by a decreasing trend; and (3) decreases the effect of discounts for existing brands, as indicated by a decrease in the magnitude of own-brand price elasticities. The proposed framework is directly applicable to other fields in which the observed dimension may change, such as economics, bioinformatics, and so forth.
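The filtering recursion underlying the estimation can be sketched for the simplest state space model, a scalar local-level model; the paper's model additionally handles a time-varying observed dimension with a modified filter/smoother, which this sketch does not attempt:

```python
def local_level_filter(ys, q=1.0, r=1.0, m0=0.0, p0=10.0):
    """Kalman filter for x_t = x_{t-1} + w_t (var q),
    y_t = x_t + v_t (var r): the predict/update recursion that
    dynamic linear models are estimated with."""
    m, p = m0, p0
    means = []
    for y in ys:
        p = p + q                 # predict: state variance grows
        k = p / (p + r)           # Kalman gain
        m = m + k * (y - m)       # update toward the observation
        p = (1 - k) * p           # posterior variance shrinks
        means.append(m)
    return means
```

With a diffuse prior, the filtered mean moves quickly toward the data and then stabilizes, which is the mechanism that lets the model track parameters drifting over time.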
We introduce a new approach to the problem of link prediction in network-structured domains, such as the Web, social networks, and biological networks. Our approach is based on the topological features of network structures, not on node features. We present a novel parameterized probabilistic model of network evolution and derive an efficient incremental learning algorithm for such models, which is then used to predict links among the nodes. We show some promising experimental results using biological network data sets.
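As a purely topological baseline for contrast, link prediction from shared neighbours alone can be sketched as follows; the paper instead learns a parameterized probabilistic model of network evolution, so this is only the simplest instance of "topology, not node features":

```python
from collections import defaultdict

def common_neighbor_scores(edges):
    """Score every non-adjacent node pair by the number of shared
    neighbours -- a classic topological link-prediction baseline."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = sorted(adj)
    scores = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v not in adj[u]:
                scores[(u, v)] = len(adj[u] & adj[v])
    return scores
```

Pairs with the highest scores are the predicted links; no attribute of any node is consulted, only the graph structure.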
Clustering word co-occurrences has been studied as a way to discover clusters that represent latent concepts. Previous work applied the semantic aggregate model (SAM) and reported that the discovered clusters seem semantically significant. The SAM assumes that a co-occurrence arises from a single latent concept. This assumption seems moderately natural; however, for deeper analysis of latent concepts it may be too restrictive. We propose to build clusters for each part of speech from co-occurrence data. For example, we build adjective clusters and noun clusters from adjective--noun co-occurrences, whereas the SAM builds clusters of ``co-occurrences.'' The proposed approach allows us to analyze adjectives and nouns independently.
Opinion extraction and structurization is a key component of opinion mining, which allows Web users to retrieve and summarize people's opinions scattered over the Internet. Our aim is to develop a method for extracting opinions that represent evaluations of consumer products in a structured form. To achieve this goal, we need to consider several issues relevant to the extraction task: how the task of opinion extraction and structurization should be designed, and how to extract the opinions so defined. We define an opinion unit as a quadruple consisting of the opinion holder, the subject being evaluated, the part or attribute in which it is evaluated, and the evaluation that expresses a positive or negative assessment. Within this task, we focus on two subtasks: (a) extracting subject/aspect--evaluation relations, and (b) extracting subject/aspect--aspect relations. We approach each extraction task with a machine-learning-based method. In this paper, we discuss how customer reviews in Web documents can best be structured. We also report on the results of our experiments and discuss future directions.
The purpose of this research is to develop a framework for analyzing the content and process of persuasion, and to apply it to communication in the debt-collection process. The framework lets us understand how skilled workers use keyword groups concerning the motivation to pay, payment methods, and payment confirmation in their conversations, and thereby model the persuading process. No previous research or method deals with a large amount of conversation logs to discover useful knowledge about persuasion. In this paper, applying our method to communication data from a Japanese telecommunications company, we succeeded in discovering some of the distinctive features of skilled workers' conversations for overdue payment collection.