This paper proposes a method for recommending Web content (novels, comics) that matches a user's tastes, based on the similarity of reviews of content the user has already evaluated. In recent years there have been many studies that recommend Web content by learning the user's tastes; these studies present personalized information so as to recommend content matched to the tastes of each user. Our method infers the user's tastes from the reviews of content obtained from the user. Beforehand, the sentences of content reviews are classified into "sentences about the content itself" and "sentences expressing the impressions of the reviewer", and stored in the system. The method then recommends content matching the user's tastes by comparing, for each sentence class, the reviews stored in the system with the reviews obtained from the user.
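As an illustration of the per-class comparison step, the following sketch computes a TF-IDF cosine similarity separately for the two sentence classes and averages them; the sentence classifier itself is not shown, and all names are assumptions rather than the authors' implementation.

```python
# Minimal sketch: compare a user's review with a stored review per
# sentence class, assuming sentences are already split into the two
# classes ("content" vs. "impression"). Names are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def review_similarity(user_review, stored_review):
    """Each review is a dict: {"content": [...sentences], "impression": [...]}."""
    score = 0.0
    for part in ("content", "impression"):
        vec = TfidfVectorizer()
        docs = [" ".join(user_review[part]), " ".join(stored_review[part])]
        tfidf = vec.fit_transform(docs)
        score += cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    return score / 2  # average over the two sentence classes

# Recommend stored items whose reviews most resemble the user's review:
# ranked = sorted(catalog, key=lambda r: review_similarity(user_review, r), reverse=True)
```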
In this report, we describe a knowledge discovery method based on a probabilistic structure model constructed by large-scale data fusion concerning buying behavior in daily life. A latent class model is proposed in order to segment customers and items into categories, estimated from ID-POS data and questionnaire data on customers' lifestyles and personalities. The variables, which include these category labels and the features of customers and items, are then modeled as a Bayesian network for knowledge discovery.
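A minimal sketch of the latent-class segmentation step, written as a Bernoulli mixture over binary customer-item purchase indicators fitted by EM. This is a generic latent class model; the binary encoding, the class count K, and all names are assumptions, not the paper's exact formulation.

```python
# Latent class model as a Bernoulli mixture over a customers-x-items
# binary matrix X, fitted by EM. Returns class priors, per-class item
# probabilities, and a category label for each customer.
import numpy as np

def latent_class_em(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                 # class priors
    theta = rng.uniform(0.25, 0.75, (K, d))  # P(item bought | class)
    for _ in range(iters):
        # E-step: responsibilities from per-class log-likelihoods
        log_p = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_p += np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update priors and item probabilities
        nk = r.sum(axis=0)
        pi = nk / n
        theta = np.clip((r.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta, r.argmax(axis=1)  # category label per customer
```

The resulting category labels could then enter a Bayesian network alongside customer and item features, as the abstract describes.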
In regression analysis, one commonly assumes a (probabilistic) model between the response variable and the explanatory variables and gives a statistical interpretation of the observed data in terms of that model. However, when there is one response variable and several explanatory variables, a linear regression model that assumes linearity (additivity) in the parameters makes it difficult to build a model that captures the real phenomenon. One remedy is to use tree-structured approaches, which can incorporate nonlinear structure and interaction structure into the model. Since the classification and regression trees (CART) method was proposed by Breiman et al. (1984), a variety of tree-structured methods have been proposed in statistical science and data mining. In recent years, methods for overcoming the low predictive accuracy of tree-structured approaches, namely ensemble learning, have attracted attention. Ensemble learning achieves high predictive accuracy by combining tree models (weak learners); one representative method is the random forest (RF; Breiman, 2001). Friedman & Popescu (2004) pointed out that better estimators can be obtained by incorporating shrinkage estimation into the tree-building process. In this talk, we propose the Lasso-adjusted random forest (Lasso-RF) method, which incorporates the Lasso (Tibshirani, 1996), one of the shrinkage estimators, into the RF method.
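One plausible reading of such a combination, sketched below: fit a random forest, then re-weight its individual trees by a Lasso regression on their predictions, in the spirit of Friedman & Popescu's post-processing. The authors' Lasso-RF may differ in detail; function names and parameters are illustrative.

```python
# Sketch: Lasso shrinkage applied to the trees of a fitted random forest.
# The Lasso can shrink the weight of unhelpful trees to zero.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

def lasso_rf(X_train, y_train, n_trees=500):
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    rf.fit(X_train, y_train)
    # columns = per-tree predictions on the training data
    P = np.column_stack([t.predict(X_train) for t in rf.estimators_])
    lasso = LassoCV(cv=5).fit(P, y_train)

    def predict(X):
        P_new = np.column_stack([t.predict(X) for t in rf.estimators_])
        return lasso.predict(P_new)

    return predict
```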
Owing to the volume of data generated in recent computations and experiments, it is quite difficult to extract useful information from these data even with scientific/information visualization techniques. A method or methodology for extracting useful information from such data should therefore be considered. Several concepts for very-large-scale visualization have been proposed in this situation, most of them based on high-performance computing techniques or highly efficient computer-graphics devices. Although such studies have succeeded in visualizing ultra-scale data, several issues remain unsolved. In this paper, a flexible visualization methodology based on a "post-visualization process", which includes a human recognition process and quantitative evaluation of visualized results, is introduced. Finally, the possibility that a visualization agent designed from a process model can help reduce the difficulty of handling huge data is described.
In this paper, we present a method to characterize given datasets based on objective rule evaluation indices and classification rule learning algorithms. In transfer learning, most methods for detecting its limitations use performance indices of sets of classifiers, such as the accuracies of classifier sets; however, the indices of each individual classifier are also useful. With this in mind, we performed a case study to identify the similarity of datasets even when the datasets have totally different attribute sets, comparing our method with a conventional data characterization technique.
Stochastic gradient boosting is one of the boosting methods invented by Jerome H. Friedman, and it is known to be a very powerful method for building predictive models in some cases. In fact, FEG won second prize in KDD Cup 2009 using this method. We survey the methodology of stochastic gradient boosting and introduce our analytical procedure in KDD Cup 2009, which is a good example of the effectiveness of stochastic gradient boosting.
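For concreteness, the stochastic variant is obtained from ordinary gradient boosting by fitting each stage on a random subsample of the training data. A minimal sketch with scikit-learn follows; the hyperparameter values are illustrative, not those used in the competition.

```python
# Stochastic gradient boosting (Friedman, 2002): subsample < 1.0 makes
# each boosting stage fit on a random fraction of the training data.
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=500,    # number of boosting stages
    learning_rate=0.05,  # shrinkage per stage
    max_depth=3,         # weak learners: shallow trees
    subsample=0.5,       # the "stochastic" part: half the data per stage
    random_state=0,
)
# model.fit(X_train, y_train); model.predict_proba(X_test)
```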
Detecting genes that are differentially expressed between distinct conditions is an important task in bioinformatics. Recently, epigenetic markers have turned out to have a more direct relationship with phenotypes than gene expression. In this talk, we demonstrate how well epigenetic markers can be used to detect differences between conditions. In particular, we show that using PCA is an efficient way to achieve this task.
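A minimal sketch of the PCA step, assuming a samples-by-markers matrix and binary condition labels; the separation score is one simple way to quantify how well the leading components distinguish the conditions, not the authors' exact criterion.

```python
# Project samples (e.g., methylation profiles) onto principal components
# and measure how far apart the two condition means fall. Illustrative only.
import numpy as np
from sklearn.decomposition import PCA

def pca_condition_separation(X, labels, n_components=2):
    """X: samples x markers matrix; labels: 0/1 condition per sample."""
    Z = PCA(n_components=n_components).fit_transform(X)
    # distance between condition means along the leading components
    return np.linalg.norm(Z[labels == 0].mean(axis=0) - Z[labels == 1].mean(axis=0))
```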
We propose a learning algorithm for nonparametric estimation and online prediction for general stationary ergodic sources. We divide the real line R into a set A of finitely many subsets, transform a given sequence in R into a sequence in A, and encode the latter using universal coding for finite sequences with distortion. We prepare infinitely many such partitions A and mix the estimated measures to obtain a measure on sequences in R that may be either discrete or continuous. If the sequence is emitted by a stationary ergodic source, then the Kullback-Leibler divergence divided by the sequence length n converges to zero as n goes to infinity. In particular, for continuous sources the method does not require the existence of a probability density function. In this sense, this paper extends Ryabko's universal measure. The measure can be used in online prediction to estimate the next datum given the past sequence.
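In our own notation (which may differ from the paper's), the construction and the stated convergence can be summarized as follows, where A_k is the k-th finite partition of R, nu_k the measure estimated through that partition, and w_k positive weights summing to one:

```latex
% Sketch only; notation ours, details differ in the paper.
% Mixture of the measures estimated over all partitions A_k:
\nu(x^n) \;=\; \sum_{k=1}^{\infty} w_k \,\nu_k(x^n),
\qquad w_k > 0,\quad \sum_{k=1}^{\infty} w_k = 1,
% and for any stationary ergodic source P, the stated convergence is
\frac{1}{n}\, D\!\left(P^n \,\middle\|\, \nu\right) \;\longrightarrow\; 0
\qquad (n \to \infty).
```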
We propose online prediction algorithms for data streams whose characteristics may change over time. Our algorithms are applications of online learning with experts. In particular, our algorithms combine base predictors over sliding windows of different lengths as experts. As a result, our algorithms are guaranteed to be competitive with the base predictor using the best fixed-length sliding window in hindsight.
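A minimal sketch of the construction, assuming squared loss and an exponentially weighted (Hedge-style) combination; the window lengths, the learning rate eta, and the base predictor (a sliding-window mean) are illustrative choices, not the paper's.

```python
# Experts = sliding-window means of different lengths, combined online
# with exponential weights. Illustrative parameters throughout.
import numpy as np

def online_predict(stream, windows=(1, 2, 4, 8, 16, 32), eta=0.5):
    w = np.ones(len(windows))
    history, preds = [], []
    for x in stream:
        experts = np.array([np.mean(history[-k:]) if history else 0.0
                            for k in windows])
        preds.append(np.dot(w, experts) / w.sum())  # weighted combination
        losses = (experts - x) ** 2                 # squared loss per expert
        w *= np.exp(-eta * losses)                  # exponential weights update
        history.append(x)
    return preds
```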
Density ratio estimation has gathered a great deal of attention recently since it can be used for various data processing tasks. In this paper, we consider three methods of density ratio estimation: (A) the numerator and denominator densities are separately estimated and then the ratio of the estimated densities is computed, (B) a logistic regression classifier discriminating denominator samples from numerator samples is learned and then the ratio of the posterior probabilities is computed, and (C) the density ratio function is directly modeled and learned by minimizing the empirical Kullback-Leibler divergence. We first prove that when the numerator and denominator densities are known to be members of the exponential family, (A) is better than (B) and (B) is better than (C). Then we show that once the model assumption is violated, (C) is better than (A) and (B). Thus in practical situations where no exact model is available, (C) would be the most promising approach to density ratio estimation.
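For concreteness, a minimal sketch of method (B): a logistic regression classifier is trained to discriminate numerator samples from denominator samples, and the density ratio is recovered from the posterior odds corrected by the class-size ratio. Names are illustrative.

```python
# Method (B): density ratio from a probabilistic classifier.
# r(x) = (n_den / n_num) * P(y=1|x) / P(y=0|x), where y=1 marks
# numerator samples and y=0 denominator samples.
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_logreg(X_num, X_den):
    X = np.vstack([X_num, X_den])
    y = np.r_[np.ones(len(X_num)), np.zeros(len(X_den))]
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    def ratio(X_test):
        p = clf.predict_proba(X_test)[:, 1]
        return (len(X_den) / len(X_num)) * p / (1 - p)

    return ratio
```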
This paper studies a technique to improve regression with unlabeled data. The key idea of our proposal is that semi-supervised learning can be recast as a regression problem under covariate shift. The weighted likelihood approach is a natural choice for estimating regression parameters under covariate shift. The literature [9] showed that the optimal choice of weight function is the ratio of the labeled data density to the unlabeled data density. Applying this idea to our setting, the optimal weight function trivially takes the value one everywhere. However, our proposal is to discard this optimal weight function and to estimate the weight instead. This is deeply related to the work of [5]. The resulting algorithm is shown to perform well in several experiments.
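A minimal sketch of the weighted-likelihood step, assuming the weight function has already been estimated (the weight estimator, which is the heart of the proposal, is not shown); the linear model is a stand-in for whatever regression model is used.

```python
# Importance-weighted regression: fit parameters on the labeled data,
# weighting each example by its estimated weight w(x_i).
from sklearn.linear_model import LinearRegression

def weighted_regression(X_labeled, y_labeled, weights):
    """weights[i] ~ estimated value of the weight function at x_i."""
    model = LinearRegression()
    model.fit(X_labeled, y_labeled, sample_weight=weights)
    return model
```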
We study the problem of mining closed frequent tree patterns from tree databases that are updated regularly over time. Frequent tree mining, like frequent itemset mining, is often a very time-consuming process, and thus it is undesirable to mine from scratch when the change to the database is small. The set of previously mined patterns, which can also be considered a description of the database, should be reused as much as possible to compute newly emerging patterns. In this paper, we propose a novel and efficient incremental mining algorithm for closed frequent labeled ordered trees. We adopt a divide-and-conquer strategy and apply different mining techniques in different parts of the mining process. No additional scan of the whole database is needed, and only a relatively small amount of information from the previous mining iteration has to be maintained. Our experimental study on real-life datasets demonstrates the efficiency and scalability of our algorithm.
This paper addresses the problem of estimating disease risks with a large health checkup database. The proposed method uses a naive Bayes classifier extended with a two-dimensional kernel density estimation technique. The framework is tested by estimating examinees' risks for three diseases: hypertension, diabetes, and dyslipidemia. The combination of attribute interactions and the naive Bayes method shows considerable improvement in the estimation experiments.
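A minimal sketch of a kernel-density naive Bayes classifier; only the one-dimensional base case is shown, whereas the paper's extension applies two-dimensional kernel density estimates to interacting attribute pairs. The class and method names are assumptions.

```python
# Naive Bayes with class-conditional densities estimated by Gaussian KDE
# instead of parametric distributions (1-D per attribute; the 2-D pair
# extension would use gaussian_kde on attribute pairs).
import numpy as np
from scipy.stats import gaussian_kde

class KDENaiveBayes:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}
        # one KDE per (class, attribute)
        self.kdes_ = {c: [gaussian_kde(X[y == c, j]) for j in range(X.shape[1])]
                      for c in self.classes_}
        return self

    def predict(self, X):
        scores = np.array([
            np.log(self.priors_[c]) +
            sum(np.log(kde(X[:, j]) + 1e-300) for j, kde in enumerate(self.kdes_[c]))
            for c in self.classes_])
        return self.classes_[scores.argmax(axis=0)]
```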
In this paper, we propose a method for estimating egograms from weblog text data. An egogram is a personality model that illustrates the ego states of a user. In our method, features appropriate for egogram estimation are selected using the information gain of each word contained in the weblog text, and estimation is performed by multinomial naive Bayes classifiers. We evaluate our method in several classification scenarios and show its effectiveness.
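A minimal sketch of the pipeline, with scikit-learn's mutual_info_classif standing in for the paper's information-gain score; the vocabulary size k and all other settings are illustrative.

```python
# Word counts -> information-gain-style feature selection -> multinomial
# naive Bayes. Illustrative stand-in for the paper's pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB

pipeline = make_pipeline(
    CountVectorizer(),                         # word counts from weblog text
    SelectKBest(mutual_info_classif, k=1000),  # keep the informative words
    MultinomialNB(),                           # egogram class per user
)
# pipeline.fit(train_texts, train_egogram_labels)
# predicted = pipeline.predict(test_texts)
```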
We study the empirical spectral distribution of so-called large-dimensional random matrices. Using empirical process theory and measure concentration inequalities, we provide a sufficient condition for the sum of the largest eigenvalues of the sample covariance matrix to be consistent, in the limit where the sample size n grows to infinity and the dimension d of the data varies along with n.
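For concreteness, the statistic under study can be simulated as follows: the sum of the k largest eigenvalues of a sample covariance matrix, with the dimension d growing along with n. The Gaussian data are purely illustrative.

```python
# Sum of the k largest eigenvalues of a sample covariance matrix,
# in the regime where d grows with n. Illustrative simulation only.
import numpy as np

def top_eigsum(n, d, k, rng):
    X = rng.standard_normal((n, d))  # n samples in dimension d
    S = (X.T @ X) / n                # sample covariance (mean-zero data)
    return np.sort(np.linalg.eigvalsh(S))[-k:].sum()

# e.g., track top_eigsum(n, d=n // 5, k=5, rng=np.random.default_rng(0)) as n grows
```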
We study tensor-based Bayesian probabilistic modeling of heterogeneously attributed multi-dimensional arrays, each of which assumes a different exponential-family distribution. Simulation experiments show that our method outperforms other methods, such as PARAFAC and Tucker decomposition, in missing-value prediction for cross-national statistics. We further show that the method can be applied to discover anomalies in heterogeneous office-logging data.
When we apply machine learning or data mining techniques to sequential data, it is often necessary to take a summation over all possible sequences. In practice, such a summation cannot be calculated directly from its definition. The ordinary forward-backward algorithm provides an efficient way to do this, but it is applicable only to quite limited types of summations. In this paper, we propose general algebraic frameworks that generalize the forward-backward algorithm. We show some examples that fall within these frameworks and discuss their importance.
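A minimal sketch of the algebraic viewpoint: the forward recursion written over an abstract semiring, so that instantiating (plus, times) with (+, x) yields the ordinary sum over all state sequences, while (max, x) yields the Viterbi score. The HMM parameterization and all names are illustrative; the paper's framework is more general.

```python
# Forward algorithm with the "sum" and "product" abstracted to a semiring.
def reduce_plus(xs, plus, zero):
    acc = zero
    for x in xs:
        acc = plus(acc, x)
    return acc

def forward(init, trans, emit, obs, plus, times, zero):
    """Generic forward recursion over a semiring.
    init[s], trans[s][t], emit[s][o] are the usual HMM quantities."""
    states = range(len(init))
    alpha = [times(init[s], emit[s][obs[0]]) for s in states]
    for o in obs[1:]:
        alpha = [times(emit[t][o],
                       # semiring "sum" over predecessor states
                       reduce_plus([times(alpha[s], trans[s][t]) for s in states],
                                   plus, zero))
                 for t in states]
    return reduce_plus(alpha, plus, zero)

# total probability: forward(..., plus=lambda a, b: a + b, times=lambda a, b: a * b, zero=0.0)
# Viterbi score:     forward(..., plus=max,                times=lambda a, b: a * b, zero=0.0)
```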
A method is presented to discover the network topology and transmission parameters behind an infectious disease outbreak from a given time-sequence dataset. A likelihood function is derived analytically from the equations that describe the stochastic process of reaction and diffusion in a metapopulation network. The method is potentially applicable to discovering the networks that mediate the diffusion of rumors, information, new ideas, or influence.
The mining of a complete set of frequent subgraphs from labeled graph data has been studied extensively. Furthermore, much attention has recently been paid to frequent pattern mining from graph sequences (dynamic graphs or evolving graphs). In this paper, we define a novel class of subgraph subsequences called "induced subgraph subsequences" to enable efficient mining of a complete set of frequent patterns from graph sequences containing large graphs and long sequences. We also propose an efficient method to mine frequent patterns, called FRISSs (Frequent, Relevant, and Induced Subgraph Subsequences), from graph sequences. The fundamental performance of the method was evaluated using artificial datasets, and its practicality was confirmed through experiments on a real-world dataset.