In recent years, the mining of frequent subgraphs from labeled graph data has been extensively studied. However, to the best of our knowledge, almost no methods have been proposed for finding frequent subsequences of graphs from a set of graph sequences in which the numbers of vertices and edges increase or decrease. In this paper, we define a novel class of graph subsequences by introducing axiomatic rules of graph transformation, their admissibility constraints, and a union graph. We then propose an efficient approach named "GTRACE" to enumerate frequent transformation subsequences (FTSs) of graphs from a given set of graph sequences. Its fundamental performance has been evaluated using artificial datasets, and its practicality has been confirmed through experiments using real-world datasets.
Two critical bottlenecks in mining frequent tree patterns from tree databases are the exponential number of mined trees and the lack of user focus on the mining process. In this paper, we propose an algorithm that addresses both problems by mining only a compact representation of tree patterns, i.e., maximal tree patterns, and by allowing users to mine only the trees of interest to them by specifying subtree constraints. The experimental results show the efficiency of our algorithm.
We analytically and computationally assessed the performance of graph-spectrum-based isomorphism identification, and applied the results to pruning redundant candidate enumeration in the graph mining algorithm gSpan. This study clarifies the conditions under which the approach yields a significant performance gain in graph mining.
In this paper, as one of the pattern mining problems in dynamic graphs, we focus on finding frequent, closed, and correlated patterns of graph sequences in a long sequence of graphs. To solve this problem efficiently, we propose an algorithm named CHPSS, which effectively utilizes the generality ordering and the properties of correlation and closedness. Preliminary experiments with synthetic and real datasets confirm the effectiveness of CHPSS.
The goal of motif discovery algorithms is to efficiently find unknown recurring patterns in time series. Most available algorithms cannot utilize domain knowledge in any way, which results in quadratic or at best sub-quadratic time and space complexity. For large time series datasets for which domain knowledge is available, this is a severe limitation. In this paper, we define the Constrained Motif Discovery problem, which enables domain knowledge to be used in the motif discovery process. We also show that most unconstrained motif discovery problems can be converted into constrained motif discovery problems using a change point detection algorithm. We provide two algorithms for solving this problem and compare their performance to state-of-the-art motif discovery algorithms on a large set of synthetic time series. The proposed algorithms can achieve linear time and constant space complexity. They provided a four- to ten-fold increase in speed compared to two state-of-the-art motif discovery algorithms without loss of accuracy, and showed better robustness at high noise levels.
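As a rough illustration of how a constraint can shrink the search, the sketch below restricts candidate motif starts to a given list of change points and returns the closest pair of non-overlapping windows. It is not either of the two proposed algorithms: the function name `constrained_motif`, the Euclidean distance, and the fixed window length are assumptions made for illustration, and this brute-force version is quadratic in the number of candidates rather than linear in the length of the series.

```python
import math


def constrained_motif(series, change_points, window):
    """Return (distance, i, j) for the closest pair of non-overlapping
    windows whose start indices come only from `change_points`.

    Hypothetical helper for illustration; returns None if fewer than
    two usable, non-overlapping candidates exist.
    """
    # Keep only candidate starts that leave room for a full window.
    starts = sorted({c for c in change_points if 0 <= c <= len(series) - window})
    best = None
    for a in range(len(starts)):
        for b in range(a + 1, len(starts)):
            i, j = starts[a], starts[b]
            if j - i < window:
                # Overlapping windows would only give trivial matches.
                continue
            dist = math.sqrt(
                sum((series[i + k] - series[j + k]) ** 2 for k in range(window))
            )
            if best is None or dist < best[0]:
                best = (dist, i, j)
    return best
```

With three candidate starts, only the pairs among those three windows are compared, instead of every pair of positions in the series.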
Transductive inference on graphs, such as label propagation, is receiving a lot of attention. In this paper, we address a label propagation problem on multiple networks and present a new algorithm that automatically integrates the structural information provided by multiple networks. The proposed method is robust in that irrelevant networks are automatically deemphasized, which is an advantage over Tsuda et al.'s approach. We also show that the proposed algorithm can be interpreted as an EM algorithm with a Student-t prior. Finally, we demonstrate the usefulness of our method in protein function prediction and digit classification, and show experimentally that our algorithm is much more efficient than existing algorithms.
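For readers unfamiliar with the setting, the following minimal sketch shows plain label propagation over a weighted combination of several networks. It is only a baseline illustration: the network weights are supplied by hand here, whereas the proposed method determines them automatically, and the function name `propagate` and the clamped averaging update are assumptions rather than the algorithm described above.

```python
def propagate(networks, weights, labels, iters=200):
    """Label propagation on a weighted sum of adjacency matrices.

    networks: list of n x n adjacency matrices (lists of lists).
    weights:  one non-negative weight per network (hand-set here).
    labels:   +1 / -1 for labeled nodes, None for unlabeled nodes.
    Returns a list of scores; labeled nodes keep their given value.
    """
    n = len(labels)
    # Combine the networks into a single weighted adjacency matrix.
    combined = [
        [sum(w * net[i][j] for w, net in zip(weights, networks)) for j in range(n)]
        for i in range(n)
    ]
    scores = [0.0 if y is None else float(y) for y in labels]
    for _ in range(iters):
        updated = scores[:]
        for i in range(n):
            if labels[i] is not None:
                continue  # labeled nodes stay clamped
            degree = sum(combined[i])
            if degree > 0:
                # Unlabeled node takes the weighted mean of its neighbours.
                updated[i] = sum(combined[i][j] * scores[j] for j in range(n)) / degree
        scores = updated
    return scores
```

Setting a network's weight to zero removes its influence entirely, which is the hand-tuned analogue of deemphasizing an irrelevant network.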
In machine learning of patterns, most conventional feature extraction methods do not pay much attention to the geometric properties of data, even when the data have spatial features. In this study, we introduce geometric algebras to systematically extract invariant geometric features from spatial data given in a vector space. A geometric algebra is a multidimensional generalization of complex numbers and quaternions, and is able to accurately describe oriented spatial objects and the relations between them. We further propose a kernel, based on Hidden Markov Models, to measure the similarity between two series of spatial vectors. As an application, we demonstrate our new method on the semi-supervised learning of online handwritten digits. The results show that feature extraction with geometric algebra improved the recognition rate in one-to-one semi-supervised learning problems of online handwritten digits.
We propose geometrical models of the features of a learning problem under the assumption that features are not independent. The key idea is to model feature similarity by means of the non-orthogonality of a basis in the space. We show that there are two alternative ways to interpret similarities between features within the kernel method framework: one follows a projection model, while the other follows a reconstruction model. This sheds light on the relations among previous feature similarity methods. We also discuss the use of label information, which has been missing in previous work, for classification within the feature similarity learning step. It turns out that previous feature similarity methods have discriminative counterparts.
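One way to picture the non-orthogonal basis idea: if the basis vectors have pairwise inner products collected in a matrix S, then two coordinate vectors x and y have inner product x^T S y. The sketch below computes exactly that quantity. The name `similarity_kernel` and the example matrix are assumptions for illustration, and the sketch does not distinguish between the projection and reconstruction models mentioned above.

```python
def similarity_kernel(x, y, S):
    """Inner product x^T S y under a basis whose Gram matrix is S.

    S[i][j] is the assumed similarity between feature i and feature j;
    with S equal to the identity this reduces to the usual dot product.
    """
    return sum(
        x[i] * S[i][j] * y[j] for i in range(len(x)) for j in range(len(y))
    )
```

With an identity S, two vectors that use disjoint features have kernel value zero; a non-zero off-diagonal entry in S lets similar features contribute to the match.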
Here we propose a method for specifying the characteristics of offenders from a body of records on suspicious persons. The method comprises two steps: generating term-document matrices by analyzing records of the offenders' characteristics, and classifying the records on the basis of these characteristics. Since the descriptions consist of Japanese free text, we adopt ChaSen, a morphological analysis system, as a preprocessor for generating the term-document matrices. We use a k-means clustering program supported by "MUSASHI", a set of data processing and mining commands. After clustering, we use TF-IDF to assign distinguishable labels to the resulting groups. Our method--the combination of morphological analysis and clustering--automatically produces descriptions of repeat offenders and may be useful in the fight against crime.
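The pipeline above can be sketched in a few lines, under several assumptions: the records are already tokenized (standing in for the output of ChaSen, which is not called here), the cluster assignment is supplied directly (standing in for the k-means step run with MUSASHI), and TF-IDF is computed by treating each cluster as one merged document. The function names `term_document_matrix` and `label_clusters` are hypothetical.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Rows are documents, columns are vocabulary terms, cells are counts."""
    vocab = sorted({term for doc in docs for term in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def label_clusters(docs, assignment, top=2):
    """Pick the `top` highest TF-IDF terms per cluster, treating each
    cluster as a single merged document."""
    merged = {}
    for doc, cluster in zip(docs, assignment):
        merged.setdefault(cluster, Counter()).update(doc)
    n_clusters = len(merged)
    # Number of clusters in which each term occurs.
    cluster_freq = Counter(term for counts in merged.values() for term in counts)
    labels = {}
    for cluster, counts in merged.items():
        total = sum(counts.values())
        score = {
            term: (count / total) * math.log(n_clusters / cluster_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        labels[cluster] = sorted(score, key=lambda t: (-score[t], t))[:top]
    return labels
```

Terms that occur in every cluster receive a score of zero, so the labels are drawn from the terms that distinguish one group from the others.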
In this paper, we propose a novel semi-supervised speaker identification method that can alleviate the influence of nonstationarity, such as time-dependent variation in voice quality, changes in the recording environment, and the speaker's emotional state. We assume that the voice quality variants follow the covariate shift model, where only the voice feature distribution changes between the training and test phases. Our method consists of weighted versions of kernel logistic regression and cross-validation, and is theoretically shown to be capable of alleviating the influence of covariate shift. We show experimentally, through text-independent speaker identification simulations, that the proposed method is promising in dealing with variations in voice quality.
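A common way to handle covariate shift is to weight each training sample by an estimate of the ratio of test to training feature densities. The sketch below shows that idea on the simplest possible model: a one-dimensional, linear (not kernel) logistic regression fitted by gradient descent on a weighted log loss. The weights are passed in as given, and the function names, learning rate, and step count are assumptions for illustration; the abstract does not spell out how the proposed method computes its weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_weighted_logreg(xs, ys, ws, lr=0.5, steps=2000):
    """Fit p(y=1|x) = sigmoid(a*x + b) by minimising the weighted log loss.

    xs: 1-D features, ys: 0/1 labels, ws: non-negative sample weights
    (assumed to be importance weights estimated elsewhere).
    """
    a = b = 0.0
    total = sum(ws)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y, w in zip(xs, ys, ws):
            err = sigmoid(a * x + b) - y  # gradient of the log loss w.r.t. the logit
            grad_a += w * err * x
            grad_b += w * err
        # Normalise by the total weight so the step size is scale-free.
        a -= lr * grad_a / total
        b -= lr * grad_b / total
    return a, b
```

A sample with weight zero has no effect on the fit, which is how training data from regions that do not occur at test time gets discounted.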
We proposed BaggTaming to boost prediction accuracy by exploiting additional data whose class labels are less reliable. This algorithm has been successfully applied to personalized tag prediction for data collected from delicious. To check whether our method is generally effective, we test it on data crawled from Hatena Bookmark.