Submodular functions are set functions that can be regarded as a discrete analogue of convex functions, and they appear frequently across mathematical engineering, including combinatorial optimization, information theory, queueing theory, and game theory. Cut capacity functions of networks, rank functions of matroids, and entropy functions of multiple information sources all possess submodularity. This talk surveys optimization problems involving submodular functions, from the fundamentals to the latest results.
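For reference, a set function f on a ground set V is submodular when it satisfies the standard exchange inequality (stated here for context, not taken from the talk):

\[
f(X) + f(Y) \ge f(X \cup Y) + f(X \cap Y) \qquad \text{for all } X, Y \subseteq V .
\]

Equivalently, f has diminishing returns: the marginal gain of adding an element to a set never increases as the set grows.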
We address a novel and realistic Label Reliability Problem in supervised learning, where the confidence of the labeling differs across training sets. Our main idea is to build more precise classifiers by treating reliably and unreliably labeled sets separately. We focus on a novel boosting method that exploits reliably labeled data. A theoretical investigation of the method clarifies its relation to the soft-margin approach, cost-sensitive learning, and semi-supervised learning. We perform detailed experiments covering the boosting method and eight related methods. The results suggest the superiority of our approach, which explicitly accounts for unreliable labels.
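As one concrete reading of the idea, the sketch below (illustrative only, not the authors' algorithm) modifies AdaBoost so that each example's initial weight is scaled by a per-example reliability coefficient, making unreliably labeled points exert less influence:

```python
# A minimal sketch, assuming a reliability coefficient r_i in (0, 1] per
# example that scales its initial boosting weight; the weighting scheme
# and all names are assumptions, not the authors' method.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def reliability_adaboost(X, y, r, n_rounds=50):
    """y in {-1, +1}; r[i] is the labeling reliability of example i."""
    w = r / r.sum()                        # reliability-scaled initial weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err <= 0 or err >= 0.5:         # stop on perfect or too-weak stump
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * pred)  # usual multiplicative reweighting
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)
```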
We extend the BDD-EM algorithm, an expectation-maximization (EM) algorithm that works on binary decision diagrams (BDDs), to shared BDDs (SBDDs) with negative edges. BDDs are a compact representation of Boolean formulas, and the use of SBDDs with negative edges is expected to further reduce time and space when the formulas contain similar partial structures. We show that the proposed algorithm, which applies SBDDs with negative edges to bipartite noisy-OR networks, reduces the time and space needed to execute the EM algorithm.
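For context, the sketch below shows textbook shared-BDD machinery with complemented ("negative") edges; it is generic background, not the BDD-EM implementation. Nodes are hash-consed in a single table so that equal subformulas are shared, and a negative node id denotes the complement of the function the node represents:

```python
# Generic SBDD node table with complemented edges (background sketch only).
class SBDD:
    TRUE = 1                      # terminal; -1 represents FALSE

    def __init__(self):
        self.table = {}           # (var, low, high) -> node id (sharing)
        self.nodes = [None, 'T']  # id 1 is the TRUE terminal

    def mk(self, var, low, high):
        if low == high:           # both branches agree: node is redundant
            return low
        neg = high < 0            # normalize: keep the high edge regular
        if neg:
            low, high = -low, -high
        key = (var, low, high)
        if key not in self.table:
            self.nodes.append(key)
            self.table[key] = len(self.nodes) - 1
        nid = self.table[key]
        return -nid if neg else nid   # negative id = complemented edge
```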
We consider anomaly detection from noisy sensor data, where detecting anomalies that appear in the dependencies between sensors is a practically important and difficult problem. The difficulty has roughly two sources. First, inter-sensor correlations are extremely fragile against noise, so it is hard to separate signs of an anomaly from noise. Second, even when some anomaly is observed over multiple variable pairs, it is not straightforward to attribute that information to anomaly scores of individual sensors. As a solution to the former, this paper proposes the use of sparse structure learning; for the latter, we propose a correlation anomaly score that is derived information-theoretically from the Gaussian graphical model.
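A minimal sketch along these lines, assuming graphical-lasso structure learning and a simplified per-sensor change score; the paper's information-theoretic score derived from the Gaussian graphical model is more refined than the heuristic used here:

```python
# Fit sparse Gaussian graphical models to a reference window and a test
# window, then score each sensor by the change in its sparse neighborhood.
import numpy as np
from sklearn.covariance import GraphicalLasso

def correlation_anomaly_scores(X_ref, X_test, alpha=0.1):
    """X_ref, X_test: (samples, sensors) arrays from reference/test windows."""
    P_ref = GraphicalLasso(alpha=alpha).fit(X_ref).precision_
    P_test = GraphicalLasso(alpha=alpha).fit(X_test).precision_
    scores = np.empty(P_ref.shape[0])
    for i in range(len(scores)):
        # a large change in row i of the precision matrix (sensor i's
        # neighborhood) suggests a correlation anomaly at that sensor
        scores[i] = np.abs(P_ref[i] - P_test[i]).sum()
    return scores
```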
The demand for learning machines that can adapt to concept change, that is, change over time in the statistical properties of the target variable, has become more urgent. We propose a system in which multiple online and offline classifiers are used to learn changing concepts. Experiments with synthetic concept-drifting and concept-shifting datasets show that clustering the classifiers enables the proposed system to capture the sequence and similarity of past concepts.
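One way to realize the clustering step, sketched here under assumed details (the distance measure and clustering method are illustrative, not necessarily those of the proposed system), is to group previously trained classifiers by the similarity of their predictions on a reference batch, so that recurring past concepts cluster together:

```python
# Cluster classifiers by pairwise disagreement on a reference batch.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_classifiers(classifiers, X_ref, n_groups=3):
    preds = np.array([clf.predict(X_ref) for clf in classifiers])
    n = len(classifiers)
    # pairwise disagreement rate as a distance between classifiers
    dist = [(preds[i] != preds[j]).mean()
            for i in range(n) for j in range(i + 1, n)]
    return fcluster(linkage(dist, method='average'),
                    t=n_groups, criterion='maxclust')
```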
Research on consumer behavior models aims to explain common regularities of human behavior in consumption. Recently, large-scale records of daily-life and shopping behavior, such as POS data, have become observable thanks to the development of sensor networks and ubiquitous systems. However, the affinity between consumer behavior models and such large-scale data remains poor. This paper surveys the evolution of consumer behavior models and outlines prospects for the effective use of large-scale data together with them.
In this paper, we propose the CF-Suffix Trie for mining frequent moving patterns from spatiotemporal data, together with an online algorithm for constructing the trie. Our method can discover patterns and their related spatial regions automatically with only a single scan of the data. We evaluate the method experimentally using datasets of artificial object trajectories. The performance experiments show that our method is more than 1000 times faster than naive methods and achieves a precision above 95%.
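For intuition, the sketch below shows the single-scan suffix counting that such a trie performs over a stream of discretized region symbols; a flat dictionary stands in for the trie, and the CF-Suffix Trie's clustering-feature statistics are omitted (all details here are illustrative):

```python
# Online counting of length-bounded suffix patterns over a symbol stream.
from collections import defaultdict

class SuffixCounter:
    def __init__(self, max_len=5):
        self.counts = defaultdict(int)
        self.max_len = max_len
        self.recent = []

    def update(self, region):
        # append the newest region id and keep a bounded window
        self.recent.append(region)
        self.recent = self.recent[-self.max_len:]
        # every suffix of the window ends at the new symbol: count each once
        for i in range(len(self.recent)):
            self.counts[tuple(self.recent[i:])] += 1

    def frequent(self, min_count):
        return {p: c for p, c in self.counts.items() if c >= min_count}
```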
In this paper, we propose an efficient algorithm for mining frequent right-closed sequences from a single long sequence without candidate generation. The purpose is to compress the huge set of frequent sequences extracted during the mining process. To measure the frequency of each sequence, we adopt the head frequency, which counts multiple occurrences of each subsequence without irrational duplication. The search space can be reduced substantially by exploiting the right-anti-monotonicity of the head frequency.
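A minimal sketch of the head frequency, assuming its common definition: the number of positions at which the pattern's first symbol occurs and from which the whole pattern can be completed as a subsequence:

```python
# Head frequency of a pattern in a single long sequence (assumed definition).
def head_frequency(seq, pattern):
    def occurs_from(start):
        j = 0
        for x in seq[start:]:          # greedily match the rest of the pattern
            if x == pattern[j]:
                j += 1
                if j == len(pattern):
                    return True
        return False
    return sum(1 for i, x in enumerate(seq)
               if x == pattern[0] and occurs_from(i))

# e.g. head_frequency("abcab", "ab") == 2: heads at positions 0 and 3
```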
We propose a fast online approximation algorithm for extracting frequent subsequences from a data stream. In an online algorithm, suppressing memory consumption is very important, so online algorithms often take the form of approximation algorithms in which the error ratio is guaranteed to stay below a user-specified threshold. Our algorithm is based on the well-known Lossy Counting algorithm [1, 4], which extracts frequent items from a data stream. We extend Lossy Counting to extract frequent subsequences from a data stream using the head frequency, a measure of subsequence frequency. We analyze the approximation accuracy and the space complexity of the proposed algorithm. The memory consumption is of order (M/ε) log N, where M is the maximum number of subsequences obtained in each window, ε is a user-specified error ratio, and N is the length of the data stream. Experiments show that the proposed algorithm scales well with the length of the data stream and keeps memory consumption below the estimated bound.
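For context, the sketch below shows the classic Lossy Counting algorithm for frequent items (the basis cited as [1, 4]); the proposed method extends this idea to subsequences scored by the head frequency:

```python
# Classic Lossy Counting for frequent items in a stream (Manku-Motwani style).
import math

def lossy_counting(stream, eps):
    counts, deltas = {}, {}
    width = math.ceil(1 / eps)           # bucket width
    for n, item in enumerate(stream, 1):
        bucket = math.ceil(n / width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item], deltas[item] = 1, bucket - 1
        if n % width == 0:               # end of bucket: prune rare items
            for key in [k for k in counts
                        if counts[k] + deltas[k] <= bucket]:
                del counts[key], deltas[key]
    return counts                        # undercounts by at most eps * n
```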
This paper presents a new method for extracting important words from a newspaper corpus based on the temporal dependency between word occurrences. Such word extraction plays an important role in event-sequence mining. TF·IDF is a well-known method for ranking the importance of a word in a document. We previously proposed TF·IDayF, an improvement of TF·IDF that considers temporal information about word occurrences and can extract important or characteristic words describing sequential events. However, TF·IDayF does not consider the temporal dependency between word occurrences, which can be regarded as reflecting causal relationships. In this paper, we propose a novel method for extracting important words using temporal co-occurrence information of words in a newspaper corpus.
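For reference, the standard TF·IDF weight that TF·IDayF builds on is (the usual textbook definition, not a formula from the paper):

\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)},
\]

where tf(t, d) is the frequency of term t in document d, N is the number of documents, and df(t) is the number of documents containing t.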
Incremental and decremental algorithms for the Support Vector Machine (SVM) [1, 2, 3] efficiently update trained SVM parameters whenever a data point is added to or removed from the training set. When many data points must be added or removed, the computational cost of these methods becomes prohibitive because the update has to be applied repeatedly, once per data point. In this paper, we generalize the existing decremental algorithm for Support Vector Regression (SVR) [2, 3] so that several data points can be removed more efficiently. Our approach, which we call generalized decremental SVR (GDSVR), formulates the update as a path-following problem in a multi-dimensional parameter space. The experimental results show that GDSVR can reduce the computational cost of leave-m-out cross-validation (m > 1). In particular, we observed that the number of breakpoints, which dominates the cost of the path following, was reduced from O(m) to O(√m).
We propose an approximate calculation of the Generalized Information Criterion (GIC) in which the influence function is computed approximately via cross-validation. With this method, we can estimate the GIC for the L1-regularized log-linear model and the Support Vector Machine, models for which the GIC has never been computed exactly. Experiments show that the proposed approximate GIC is effective for choosing a valid regularization parameter for these models.
The Bethe approximation, or loopy belief propagation algorithm, is a successful method for approximating partition functions of probabilistic models associated with a graph. Chertkov and Chernyak derived an interesting formula called the "loop series expansion", an expansion of the partition function whose main term is the Bethe approximation and whose other terms are labeled by subgraphs called generalized loops. In a recent paper, we derived the loop series expansion in the form of a polynomial with positive integer coefficients and extended the result to an expansion of marginals. In this paper, we give a clearer derivation of these results and discuss the properties of the polynomial introduced in the earlier paper.
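For reference, the Chertkov-Chernyak result has the schematic form (a standard statement; the weights depend on the BP beliefs):

\[
Z = Z_{\mathrm{Bethe}} \Bigl( 1 + \sum_{C} r(C) \Bigr),
\]

where Z is the partition function, Z_Bethe is its Bethe (loopy BP) approximation, and the sum runs over the generalized loops C of the graph with weights r(C) computed from the BP fixed point.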
We consider the problem of minimizing the spread of undesirable things, such as computer viruses or malicious rumors, by blocking a limited number of links in a network; this is a converse of the influence maximization problem, in which the most influential nodes for information diffusion are sought in a social network. This minimization problem offers another approach to preventing the spread of contamination, complementing methods that remove nodes from a network. We propose a method for efficiently finding a good approximate solution to this problem based on a natural greedy strategy. Using large real networks, we demonstrate experimentally that the proposed method significantly outperforms conventional link-removal methods. We also show that, unlike the case of blocking a limited number of nodes, the strategy based on removing high-out-degree nodes is not necessarily effective for our problem.
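A minimal sketch of the natural greedy strategy: repeatedly block the link whose removal most reduces an estimated contamination spread. The spread estimator here is a simple Monte Carlo independent-cascade simulation from random start nodes, assumed for illustration; the paper's estimation method may differ:

```python
import random
import networkx as nx

def estimate_spread(G, p=0.1, trials=100):
    total = 0
    for _ in range(trials):
        seed = random.choice(list(G.nodes))
        active, frontier = {seed}, [seed]
        while frontier:
            u = frontier.pop()
            for v in G.successors(u):       # G is a networkx DiGraph
                if v not in active and random.random() < p:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / trials

def greedy_block_links(G, k, p=0.1, trials=100):
    blocked = []
    for _ in range(k):
        best_edge, best_spread = None, float('inf')
        for e in list(G.edges):             # try blocking each remaining link
            G.remove_edge(*e)
            s = estimate_spread(G, p, trials)
            G.add_edge(*e)
            if s < best_spread:
                best_edge, best_spread = e, s
        G.remove_edge(*best_edge)           # commit the best block
        blocked.append(best_edge)
    return blocked
```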
Characteristic rule induction usually produces a large number of rules, and it is difficult for a user to inspect them all. This paper describes a method that assigns a priority index to rules based on their supporting instances and guides the user to inspect the most useful rule at each step. The priority index is recalculated dynamically at each step of rule-set inspection, using the instances already covered by the adopted rules, and the resulting rule group gives a concise understanding of the data of the target class.
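One plausible reading of the dynamic priority index, sketched here as a greedy set-cover style heuristic (assumed for illustration; the paper's exact index may differ): at each step, rank rules by how many still-uncovered instances of the target class they support:

```python
def inspect_rules(rule_support, n_steps):
    """rule_support: dict mapping rule -> set of supporting instance ids."""
    covered, order = set(), []
    for _ in range(n_steps):
        rule = max(rule_support,
                   key=lambda rr: len(rule_support[rr] - covered))
        order.append(rule)                 # present this rule to the user
        covered |= rule_support[rule]      # its instances no longer count
    return order
```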
We propose several simple numerical indices for representing the structural difference between molecular graphs, based on vertex differences and edge differences. In this work, we consider only molecular graphs whose simple graph representations are all isomorphic. The vertex difference describes the difference in atom types on the same simple-graph molecular framework, and the edge difference describes the difference in bond types. In addition, we define a chemical structure difference that captures differences in both atom type and bond type. We employed these indices for similar-structure searching. The results show that difference-based searching retrieves similar structures in a manner considerably different from conventional methods.
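A minimal sketch of vertex and edge difference counts for two molecules sharing the same simple-graph framework, with atoms aligned by the isomorphism; treating the combined chemical difference as a simple sum is an assumption made here for illustration:

```python
def structure_difference(atoms_a, atoms_b, bonds_a, bonds_b):
    """atoms_*: aligned lists of atom-type labels;
       bonds_*: dicts (i, j) -> bond-type label over the same key set."""
    vdiff = sum(a != b for a, b in zip(atoms_a, atoms_b))   # atom-type changes
    ediff = sum(bonds_a[e] != bonds_b[e] for e in bonds_a)  # bond-type changes
    return vdiff, ediff, vdiff + ediff
```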
Solving the subgraph isomorphism problem has emerged as a major part of graph mining research, although its computational complexity is known to be NP-complete. To address this complexity issue, a subgraph isomorphism checking approach between two graphs has been assessed by applying Cauchy's interlace theorem to symmetric matrices, such as the adjacency matrices representing the graphs, because of its low computational complexity of O(n^3). However, the accuracy of this approach is known to be low when edge label IDs in the graphs are simply assigned to the elements of their adjacency matrices. In this paper, we propose a novel approach called OPTSPEC (OPTimized SPECtra for subgraph isomorphism checking), which optimizes the mapping from substructures of the graphs to elements of their adjacency matrices so as to maximize the effectiveness of the interlace theorem. We experimentally evaluated our approach using artificial graph data and confirmed the high accuracy of its subgraph isomorphism checking.
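A minimal sketch of the spectral filter underlying this approach: if a graph H with m nodes is an induced subgraph of G with n nodes, H's adjacency matrix is a principal submatrix of G's, so by Cauchy's interlace theorem the descending eigenvalues must satisfy lambda_i(G) >= lambda_i(H) >= lambda_{i+n-m}(G). A violation proves non-containment, while passing the test is only a necessary condition:

```python
import numpy as np

def interlace_possible(A_g, A_h, tol=1e-9):
    n, m = len(A_g), len(A_h)
    eg = np.sort(np.linalg.eigvalsh(A_g))[::-1]   # descending eigenvalues
    eh = np.sort(np.linalg.eigvalsh(A_h))[::-1]
    return all(eg[i] + tol >= eh[i] and eh[i] + tol >= eg[i + n - m]
               for i in range(m))
```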