Graphical Gaussian models have received considerable attention in various fields of research such as bioinformatics. We propose a new method for parameter estimation and model selection in graphical Gaussian models based on L1-regularization. In the proposed method, structural learning for graphical Gaussian models is equivalent to the selection of the regularization parameters in the L1-regularization. We investigate this problem from a Bayesian approach and derive an empirical Bayesian information criterion for choosing them. We analyze Arabidopsis thaliana microarray data and estimate gene networks.
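For reference only, a minimal sketch of L1-regularized precision-matrix estimation for a graphical Gaussian model, using scikit-learn's GraphicalLasso with a fixed regularization parameter; the random matrix X is a hypothetical stand-in for the microarray data, and the empirical Bayesian criterion for choosing the regularization parameters (the paper's contribution) is not shown.

    # Minimal sketch: L1-regularized graphical Gaussian model (graphical lasso).
    # alpha is fixed here; the paper's method chooses such parameters by an
    # empirical Bayesian information criterion.
    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))      # hypothetical samples-by-genes matrix

    model = GraphicalLasso(alpha=0.1).fit(X)
    precision = model.precision_            # sparse inverse covariance matrix
    # Nonzero off-diagonal entries correspond to edges of the estimated gene network.
    edges = np.argwhere(np.triu(np.abs(precision) > 1e-8, k=1))
    print(len(edges), "edges")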
We designed a new data assimilation approach for a tsunami simulation model and tide gauge records, and obtained a corrected bottom topography of the Japan Sea. In this data assimilation problem, the number of variables in the simulation model is very large, whereas the observed data set is limited, which results in an ill-posed inversion problem. Therefore, some techniques are required in order to analyze the system, such as introducing appropriate prior information into the system model. In this paper, we discuss the new approach for constructing the state space model from the tsunami simulation model and the tide gauge records, and also discuss it from the viewpoint of graphical models. We show the result of a numerical experiment that validates the method, as well as the result of an analysis using the tide gauge records of the Okushiri tsunami.
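As a generic illustration only, a minimal sketch of one Kalman-filter step for a linear-Gaussian state-space model, mimicking the setting of a high-dimensional simulation state observed through a few gauges; the matrices and observation values are toy assumptions and do not reproduce the tsunami model or the paper's prior construction.

    import numpy as np

    # One predict/update step for x_t = F x_{t-1} + w_t, y_t = H x_t + v_t.
    # The state is high-dimensional while only a few components are observed,
    # echoing "many simulation variables, few tide gauges" (all values are toy).
    n, m = 50, 3                       # state dimension, number of observations
    F = np.eye(n)                      # toy system matrix
    H = np.zeros((m, n)); H[np.arange(m), [5, 20, 40]] = 1.0   # observe 3 components
    Q, R = 0.01 * np.eye(n), 0.1 * np.eye(m)

    x, P = np.zeros(n), np.eye(n)      # prior mean and covariance
    y = np.array([0.3, -0.1, 0.2])     # toy gauge observation

    # predict
    x, P = F @ x, F @ P @ F.T + Q
    # update
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (y - H @ x)
    P = (np.eye(n) - K @ H) @ P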
The need for efficient image queries is rapidly increasing with the development of broadband networks and multimedia communication. Recent image query techniques extract thousands of points of interest (POI) from each image and represent the feature of each POI by a Scale Invariant Feature Transform (SIFT) vector. The similarity among images is evaluated by matching the SIFT vectors and is used for the query. Crucial issues in the similarity evaluation are an appropriate formulation of the distance measure for evaluating the similarity between images and the establishment of an efficient computation method for that distance measure. In this report, we formulate the distance measure based on extreme value analysis and propose an efficient computation method for it based on a k-NN query technique named "SASH".
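For orientation, a minimal sketch of SIFT keypoint extraction and matching with OpenCV; it illustrates only the POI/SIFT matching step, not the extreme-value-based distance measure or the SASH-based k-NN search proposed here, and the image file names are hypothetical.

    # SIFT keypoints and descriptor matching with a ratio test (OpenCV >= 4.4).
    import cv2

    img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)    # hypothetical files
    img2 = cv2.imread("target.png", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    matcher = cv2.BFMatcher()
    matches = matcher.knnMatch(des1, des2, k=2)
    # Lowe's ratio test keeps only distinctive matches.
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    print(len(good), "matched keypoints")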
Kohonen's Self-Organizing Map (SOM) is a useful technique for capturing an overall picture of large-scale data, because SOM forms clusters and visualizes them at the same time. We have proposed the Sequence-based SOM, which captures the dynamics of the clusters while preserving the properties of SOM. In this paper, we study the basic properties of the Sequence-based SOM using simulated two-dimensional sequence data and real-world news articles.
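As background only, a minimal sketch of the standard SOM update rule (not the proposed Sequence-based SOM): each input is assigned to its best-matching unit, and nearby units' weights are pulled toward the input; grid size, learning rate, and data are toy assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    grid_h, grid_w, dim = 10, 10, 2
    weights = rng.random((grid_h, grid_w, dim))
    coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

    def som_step(x, weights, lr=0.1, sigma=2.0):
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)      # best-matching unit
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
        h = np.exp(-grid_dist**2 / (2 * sigma**2))                 # neighborhood function
        return weights + lr * h[..., None] * (x - weights)

    for x in rng.random((1000, dim)):                              # toy 2-D data
        weights = som_step(x, weights)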
Dimension reduction of multidimensional data is a fundamental technique in data analysis. Well-known linear dimension reduction methods include factor analysis, principal component analysis, projection pursuit, and independent component analysis, all of which are based on linear projections. For nonlinear dimension reduction, many possibilities can be considered. In this report, we present a theoretical examination and an improvement of the generalized principal component analysis of Gnanadesikan [3], and describe how its limitations led us to develop algebraic curve fitting.
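As a small illustration of algebraic curve fitting (not Gnanadesikan's generalized principal component analysis itself), a sketch that fits an implicit conic a x^2 + b xy + c y^2 + d x + e y + f = 0 to 2-D points by taking the right singular vector of the monomial design matrix with the smallest singular value; the data are toy points on a noisy ellipse.

    import numpy as np

    rng = np.random.default_rng(0)
    t = rng.uniform(0, 2 * np.pi, 200)
    x, y = 3 * np.cos(t), 2 * np.sin(t)          # noisy points on an ellipse
    x += 0.05 * rng.standard_normal(200)
    y += 0.05 * rng.standard_normal(200)

    # Design matrix of second-degree monomials; its smallest right singular
    # vector gives a least-squares algebraic fit of the implicit conic.
    D = np.column_stack([x**2, x * y, y**2, x, y, np.ones_like(x)])
    _, _, Vt = np.linalg.svd(D)
    coef = Vt[-1]                                # coefficients (a, b, c, d, e, f)
    print(coef / np.abs(coef).max())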
With the development of ubiquitous sensing, electronic documents, and multimedia technologies, data sets consisting of massive numbers of high-dimensional instances have become available in various practical fields. Efficient evaluation of similarity measures among such instances, e.g., correlations and kernels, is one of the most important tasks required by major data mining techniques such as instance queries and clustering. However, the computational complexity of the direct computation for n objects is O(n^2), which is practically intractable for high-dimensional and/or massive data and for complex similarity measures. Moreover, some scientific similarity measurements among objects take much time and cost, as in the case of gene expression experiments. The objective of this paper is to provide an efficient remedy to this problem. We propose a fast approach to estimating the similarity measures among n instances based on a subset of actually computed and/or observed similarity measures together with a mathematical constraint, positive semi-definiteness (PSD), that governs the similarity measures. The superior performance of our approach in both efficiency and accuracy of the estimation is demonstrated through evaluations based on artificial and real-world data sets.
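To illustrate only the PSD constraint (not the paper's estimator), a minimal sketch that fills the missing entries of a partially observed similarity matrix and alternately projects onto the PSD cone (by clipping negative eigenvalues) and back onto the observed entries; the similarity matrix and observation mask are toy assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((30, 5))
    K_true = X @ X.T                              # a true PSD similarity matrix
    mask = rng.random(K_true.shape) < 0.6         # entries we actually computed
    mask = np.triu(mask) | np.triu(mask).T        # keep the mask symmetric
    np.fill_diagonal(mask, True)

    K = np.where(mask, K_true, 0.0)               # naive initial fill of missing entries
    for _ in range(50):                           # alternating projections
        w, V = np.linalg.eigh(K)
        K = (V * np.clip(w, 0, None)) @ V.T       # project onto PSD matrices
        K[mask] = K_true[mask]                    # restore the observed entries
    print(np.abs(K - K_true).max())               # remaining estimation error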
Mining frequently appearing patterns in a database is a basic problem in informatics, especially in data mining. In particular, when the input database is a collection of subsets of an itemset, the problem is called the frequent itemset mining problem and has been studied extensively. In real-world use, a common difficulty of frequent itemset mining is that the data are often incorrect or partially missing. This causes some records that should or may include a pattern to no longer include it. Thus, in real-world problems, it is valuable to use an ambiguous inclusion relation and find patterns that are "almost" included in many records. However, because of the computational difficulty, this kind of problem has not been actively studied. In this paper, we use an alternative inclusion relation in which an itemset P is considered to be included in an itemset T if at most k items of P are not included in T, i.e., |P \ T| ≤ k. We address the problem of enumerating frequent itemsets under this inclusion and propose an efficient polynomial-delay, polynomial-space algorithm. To skip the many small, non-valuable frequent itemsets, we also propose an algorithm for directly enumerating frequent itemsets of a given size.
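To make the relaxed inclusion concrete, a naive sketch of the definition and the corresponding support count; this only illustrates |P \ T| ≤ k on a toy database, while the paper's contribution is the efficient polynomial-delay, polynomial-space enumeration algorithm.

    # P is "almost included" in a record T when at most k items of P are missing from T.
    def almost_included(P, T, k):
        return len(set(P) - set(T)) <= k

    def support(P, database, k):
        return sum(almost_included(P, T, k) for T in database)

    database = [{1, 2, 3}, {2, 3, 4}, {1, 3}, {2, 4}]   # toy transactions
    print(support({1, 2, 3}, database, k=1))            # -> 3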
This paper presents a pattern mining method with constraints on the mean value of a numeric attribute, based on constraint satisfaction probability. From the standpoint of usability, it is sometimes required that a method quickly mine patterns satisfying the constraint, even if it cannot enumerate them completely. Constraint satisfaction probability is the ratio of constraint-satisfying patterns to the possible super-patterns. We use constraint satisfaction probability to decide the search priority in sequential pattern mining. Our experimental evaluation shows that pattern mining by constraint satisfaction probability is somewhat more effective than simple heuristics.
Inductive logic programming (ILP) is a learning approach that incorporates first-order logic into the knowledge representation of inductive learning. As a result of inductive inference, it can generate expressive rules including variables and relations, and it is useful as a data mining technique. In this paper, we apply ILP to the problem of learning models of the music structures that evoke specific feelings in subjects, using their evaluations of existing tunes and the music structures of those tunes.
This paper proposes a method for measuring the effects of TV advertisements on Internet bulletin boards. Two kinds of time series data are generated by the proposed method. The first represents the time-series fluctuation of interest in the TV advertisements. The second represents the time-series fluctuation of the images of the products. By applying the autocorrelation function to the former time series, we try to identify the duration over which the impact of a TV advertisement lasts. The time-series components within this duration represent the effects of the TV advertisements. Experiments show several results indicating the effectiveness of the method.
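As a small illustration of the autocorrelation step, a sketch that computes the sample autocorrelation of a daily count series (a toy stand-in for posts mentioning an advertisement) and reads off a rough persistence length; the series and the 0.1 cutoff are assumptions for illustration only.

    import numpy as np

    def autocorr(x, max_lag):
        x = np.asarray(x, dtype=float) - np.mean(x)
        denom = np.dot(x, x)
        return np.array([np.dot(x[:-lag or None], x[lag:]) / denom
                         for lag in range(max_lag + 1)])

    rng = np.random.default_rng(0)
    # Toy series: baseline chatter plus a decaying burst after a broadcast.
    series = rng.poisson(5, 200) + np.r_[np.linspace(20, 0, 30), np.zeros(170)]
    acf = autocorr(series, max_lag=30)
    # The first lag where the autocorrelation drops near zero is one simple
    # indicator of how long the advertisement's impact lasts.
    print(int(np.argmax(acf < 0.1)))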
In this paper, we propose a new method for discovering hidden knowledge from large-scale transaction databases by considering a property called cofactor implication. Cofactor implication is an extension, or generalization, of symmetric itemsets, which were presented recently. We discuss the meaning of cofactor implication for data mining applications and present an efficient algorithm for extracting all non-trivial item pairs with cofactor implication using Zero-suppressed Binary Decision Diagrams (ZBDDs). We show an experimental result examining how many itemsets can be extracted using cofactor implication compared with symmetric itemset mining. Finally, we present case study results on practical benchmark datasets to examine the actual meaning of cofactor implication and why it is interesting. Our results show that the use of cofactor implication has the potential to discover new aspects of structural information hidden in databases.
Daily enterprise activity is electronically accumulated in the form of documents, including newspaper articles. By retrieving documents about an enterprise of interest, one can in principle learn about that enterprise. In practice, however, it is difficult to grasp enterprise activity because of the huge number of documents, and enterprise evaluations such as growth characteristics cannot help relying on the fragmentary memory of each analyst. In this paper, we propose a method for quantifying enterprise characteristics using documents. The proposed method quantifies the degree of an arbitrary characteristic of an enterprise and thereby supports comprehensive enterprise evaluation. In addition, the method extracts the documents relevant to each characteristic by a common principle. In this method, a document is represented as a vector based on its terms, and both an enterprise and a characteristic are represented as subspaces generated from the vector sets of their corresponding documents. The degree of relation between them is quantified by the angle between the two subspaces. The effectiveness of the proposed method is shown through experimental results on newspaper articles for the case of an enterprise's intellectual potential.
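To illustrate the subspace-angle idea in isolation, a minimal sketch that spans two subspaces from (toy, random) document term vectors and measures their relation through the principal angles computed by scipy.linalg.subspace_angles; vocabulary size, document counts, and the cosine-of-smallest-angle score are assumptions, not the paper's exact quantification.

    import numpy as np
    from scipy.linalg import subspace_angles

    rng = np.random.default_rng(0)
    docs_enterprise = rng.random((100, 5))       # columns: 5 documents over 100 terms
    docs_characteristic = rng.random((100, 4))   # columns: 4 documents over 100 terms

    # Principal angles between the column spaces of the two document sets.
    angles = subspace_angles(docs_enterprise, docs_characteristic)
    relatedness = np.cos(np.min(angles))         # cosine of the smallest principal angle
    print(relatedness)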
This paper focuses on Optimization Methods based on Probability Models (OMPM), which statistically estimate the distribution of promising solutions from obtained samples and draw new samples from the estimated distribution, and proposes a novel method for OMPM that improves the accuracy of the statistical estimation by maintaining previously generated samples more precisely than conventional methods such as Estimation of Distribution Algorithms. The key idea of the proposed method is to update the population (i.e., the set of samples) so that it follows the target distribution by weighting the generated samples and resampling them according to importance sampling. Experimental comparisons between the proposed method and a conventional method reveal two advantages of the proposed method: (1) it finds better solutions than the conventional method; and (2) it can control the convergence speed.
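As a generic illustration of the importance-sampling resampling idea (not the paper's algorithm), a sketch that reweights samples drawn from an old proposal distribution by the ratio of the current target density to the proposal density and resamples in proportion to those weights; both densities are toy Gaussians chosen for the example.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    samples = rng.normal(loc=0.0, scale=2.0, size=1000)    # drawn under the old model
    proposal = norm(loc=0.0, scale=2.0)                    # density the samples came from
    target = norm(loc=1.0, scale=1.0)                      # current estimated distribution

    w = target.pdf(samples) / proposal.pdf(samples)        # importance weights
    w /= w.sum()
    # Resample so the maintained population follows the current target distribution.
    population = rng.choice(samples, size=1000, replace=True, p=w)
    print(population.mean(), population.std())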
We present the intentional kernel, a new class of kernel functions for structured data. The convolution kernel, a typical class of kernel functions for structured data, is based on sub-structures; in contrast, the intentional kernel is based on derivations. We show applications of the intentional kernel to Boolean functions, first-order terms, and RNA sequences, present its properties, and discuss the difference between the intentional kernel and the convolution kernel.
Kernel classifiers that realize nonlinear classification, such as the Support Vector Machine (SVM), have been successfully applied in a number of fields. In kernel methods, the appropriate selection or design of the kernel function is important for constructing a classifier with high performance. The present paper describes a method for classifying normalized frequency spectra using an SVM with the Kullback-Leibler (KL) kernel. We introduce the KL kernel to normalized spectrum classification and study how the similarity computed by the KL kernel and other common kernels behaves with respect to changes in the positions of spectrum peaks.
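For illustration, a minimal sketch of an SVM with a KL-divergence-based kernel of the common form K(p, q) = exp(-a * (KL(p||q) + KL(q||p))), supplied to scikit-learn as a precomputed Gram matrix; the spectra, labels, and the parameter a are toy assumptions and need not match the kernel variant studied in the paper.

    import numpy as np
    from sklearn.svm import SVC

    def symmetric_kl(p, q, eps=1e-12):
        p, q = p + eps, q + eps
        return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

    def kl_gram(A, B, a=0.5):
        # Gram matrix of the KL kernel between rows of A and rows of B.
        return np.exp(-a * np.array([[symmetric_kl(p, q) for q in B] for p in A]))

    rng = np.random.default_rng(0)
    spectra = rng.random((40, 16))
    spectra /= spectra.sum(axis=1, keepdims=True)          # normalize each spectrum
    labels = (spectra[:, :8].sum(axis=1) > 0.5).astype(int)

    clf = SVC(kernel="precomputed").fit(kl_gram(spectra, spectra), labels)
    pred = clf.predict(kl_gram(spectra, spectra))
    print((pred == labels).mean())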
Suppose we have a set of categorical data, some of which might not be exactly categorical but can be regarded as continuously distributed. We want to classify each record as either positive or negative using the other observations as explanatory variables. In such a case, it is natural to want to use two-factor interactions as well as the original variables. The problem then arises: how efficiently and thoroughly can we find 'promising' interaction terms? This article presents a statistical point of view for accomplishing this task. We make use of multinomial model fitting by AIC. The interaction terms are chosen so that they give distributions of the response variable that differ significantly from the marginal distribution of the whole data. The cross terms are then put into a logit/probit-type model. An application to medical data is also shown.
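As a simplified illustration of AIC-guided screening of a cross term before it enters a logit-type model, a sketch that compares the AIC of logit models with and without a two-factor interaction; the data, variable names, and the pairwise (rather than multinomial) screening are assumptions for the example, not the article's full procedure.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x1 = rng.integers(0, 2, 500)                   # toy binary explanatory variables
    x2 = rng.integers(0, 2, 500)
    logit_p = -0.5 + 0.3 * x1 + 0.2 * x2 + 1.0 * x1 * x2
    y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))  # toy positive/negative response

    X_main = sm.add_constant(np.column_stack([x1, x2]))
    X_int = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))

    aic_main = sm.Logit(y, X_main).fit(disp=0).aic
    aic_int = sm.Logit(y, X_int).fit(disp=0).aic
    keep_interaction = aic_int < aic_main          # keep the cross term if it lowers AIC
    print(aic_main, aic_int, keep_interaction)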