-
Kenta SUZUKI, Rei HAMAKAWA
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 01-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
This paper proposes a method that recommends Web content (e.g., novels and comics) matching the taste of a user who has already evaluated Web content, based on the similarity of reviews. In recent years there have been many studies that recommend Web content by acquiring the user's taste; these studies present personalized information so as to recommend content matching the tastes of various users. Our method infers a user's taste from the reviews of content acquired from that user. Beforehand, the sentences of the reviews of content are classified into "sentences about the content itself" and "sentences expressing the reviewer's impression" and accumulated in the system. The method then recommends content that matches the user's taste by comparing, for each sentence class, the reviews accumulated in the system with the reviews acquired from the user.
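A minimal sketch of the comparison step is given below, assuming reviews have already been split into the two sentence classes and using TF-IDF cosine similarity as the review-similarity measure; the class and function names are illustrative, not taken from the paper.

```python
# Sketch: score stored reviews by per-class similarity to the user's reviews.
from dataclasses import dataclass
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

@dataclass
class Review:
    item: str
    content_sents: str      # sentences about the content itself
    impression_sents: str   # sentences expressing the reviewer's impression

def recommend(user_reviews, system_reviews, top_k=3):
    """Rank stored items by per-class cosine similarity to the user's reviews."""
    corpus = [r.content_sents + " " + r.impression_sents
              for r in user_reviews + system_reviews]
    vec = TfidfVectorizer().fit(corpus)

    def sim(a, b):  # mean cosine similarity between two groups of texts
        return cosine_similarity(vec.transform(a), vec.transform(b)).mean()

    scores = []
    for r in system_reviews:
        s = (sim([u.content_sents for u in user_reviews], [r.content_sents]) +
             sim([u.impression_sents for u in user_reviews], [r.impression_sents]))
        scores.append((s, r.item))
    return [item for _, item in sorted(scores, reverse=True)[:top_k]]
```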
-
Tsukasa ISHIGAKI, Takeshi TAKENAKA, Yoichi MOTOMURA
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 02-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
In this report, we describe a knowledge discovery method based on a probabilistic structure model constructed by large-scale data fusion concerning buying behavior in daily life. A latent class model is proposed to segment customers and items into categories, estimated from ID-POS data and questionnaire data on customers' lifestyles and personalities. The variables, which include these category labels and the features of customers and items, are then modeled as a Bayesian network for knowledge discovery.
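As a rough illustration of the segmentation step only, the sketch below clusters customers from a binary purchase matrix with a mixture-of-Bernoullis EM; the data, the number of classes, and the model itself are illustrative stand-ins, not the authors' exact latent class model.

```python
import numpy as np

def latent_class_em(X, n_classes=4, n_iter=50, rng=np.random.default_rng(0)):
    """X: (customers x items) 0/1 purchase matrix. Returns soft class assignments."""
    n, d = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)                # class priors
    theta = rng.uniform(0.25, 0.75, size=(n_classes, d))    # per-class purchase probs
    for _ in range(n_iter):
        # E-step: log P(class) + sum_j log P(x_j | class)
        log_r = (np.log(pi)
                 + X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step
        nk = r.sum(axis=0)
        pi = nk / n
        theta = np.clip((r.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return r  # customer-category responsibilities, usable as Bayesian-network nodes

X = (np.random.default_rng(1).random((200, 30)) < 0.3).astype(float)
resp = latent_class_em(X)
print("customers per class:", resp.sum(axis=0).round(1))
```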
-
Masatoshi NAKAMURA, Toshio SHIMOKAWA, Masashi GOTO
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 03-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
-
Susumu SHIRAYAMA
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 04-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
Owing to the volume of data generated in recent computations and experiments, it is quite difficult to extract useful information from these data even with scientific/information visualization techniques. Methods or methodologies for extracting useful information from such data must therefore be considered. Several concepts of very-large-scale visualization have been proposed in this situation, most of them based on high-performance computing techniques or highly efficient devices for computer graphics. Although such studies have succeeded in visualizing ultra-scale data, several issues remain unsolved. In this paper, a flexible visualization methodology based on a "post-visualization process", which includes a human recognition process and quantitative evaluations of visualized results, is introduced. Finally, the possibility that a visualization agent designed from a process model helps to reduce the difficulty of handling huge data is described.
-
Hidenao ABE, Shusaku TSUMOTO
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 05-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
In this paper, we present a method to characterize given datasets based on objective rule evaluation indices and classification rule learning algorithms. In transfer learning approaches, most methods for detecting the limits of transfer use performance indices of classifier sets, such as their accuracies; however, the indices of each individual classifier are also useful. With this in mind, we performed a case study to identify the similarity of datasets even when the datasets have totally different attribute sets, comparing our method with a conventional data characterization technique.
-
Junichi KOBAYASHI, Kazuaki KOMOTO
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 06-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
Stochastic gradient boosting is a boosting method invented by Jerome H. Friedman, and it is known to be a very powerful method for building predictive models in some cases. In fact, FEG won the second prize in KDD Cup 2009 by using this method. We survey the methodology of stochastic gradient boosting and introduce our analytical procedure in KDD Cup 2009, which is a good example of stochastic gradient boosting showing its effectiveness.
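A minimal, illustrative sketch of stochastic gradient boosting follows: setting subsample < 1.0 in scikit-learn's GradientBoostingClassifier fits each tree on a random fraction of the training data, as in Friedman's method. The dataset and parameter values are placeholders, not the KDD Cup 2009 setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.05,
    subsample=0.5,        # the "stochastic" part: each tree sees half the data
    max_depth=3, random_state=0)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```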
-
Takanori AYANO, Joe SUZUKI
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 07-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
-
Y-h. TAGUCHI
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 08-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
Detection of genes that are differentially expressed between distinct conditions is an important task in bioinformatics. Recently, epigenetic markers have turned out to have a more direct relationship with phenotypes than gene expression. In this talk, we demonstrate how well epigenetic markers can be used to detect differences between conditions. In particular, using PCA is more efficient for achieving this task.
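The sketch below only illustrates the general idea of using PCA to find markers that separate two conditions: project samples onto principal components, pick the component most correlated with the condition labels, and rank markers by their loadings. The synthetic data and this particular procedure are assumptions for illustration, not the talk's exact method.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))          # 40 samples x 500 epigenetic markers
labels = np.array([0] * 20 + [1] * 20)  # two conditions
X[labels == 1, :10] += 1.5              # markers 0-9 differ between conditions

pca = PCA(n_components=5)
scores = pca.fit_transform(X)           # sample scores (40 x 5)
# component whose scores correlate most strongly with the condition labels
corr = [abs(np.corrcoef(scores[:, k], labels)[0, 1]) for k in range(5)]
k = int(np.argmax(corr))
ranked = np.argsort(-np.abs(pca.components_[k]))  # markers by loading magnitude
print("top candidate markers:", ranked[:10])
```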
-
[in Japanese]
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 09-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
We propose a learning algorithm for nonparametric estimation and online prediction for general stationary ergodic sources. We divide the real line R into a finite set A of subsets, transform a given sequence over R into a sequence over A, and encode the latter using universal coding for finite sequences with distortion. We prepare infinitely many such sets A and mix the estimated measures to obtain a measure on sequences over R, which may be either discrete or continuous. If the sequence is emitted by a stationary ergodic source, then the Kullback-Leibler information divided by the sequence length n converges to zero as n goes to infinity. In particular, for continuous sources, the method does not require the existence of a probability density function. In this sense, this paper extends Ryabko's universal measure. The measure can be used for online prediction, estimating the next datum given the past sequence.
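The code below is only a toy illustration of the quantize-and-mix idea on [0, 1): each resolution level runs a Laplace (add-one) sequential estimator on the quantized sequence, and the levels are combined as a Bayesian mixture with prior weights 2^{-k}. The interval, the estimator, and the weights are simplifying assumptions, not the paper's exact construction.

```python
import numpy as np

def mixture_predictive(past, x, levels=6):
    """Bayesian mixture over quantization levels: predictive density at x in [0, 1)."""
    past = np.asarray(past, dtype=float)
    log_joint, preds = [], []
    for k in range(1, levels + 1):
        bins = 2 ** k
        cells = np.minimum((past * bins).astype(int), bins - 1)
        counts = np.zeros(bins)
        loglik = 0.0                     # sequential Laplace log-probability of the past
        for c in cells:
            loglik += np.log((counts[c] + 1) / (counts.sum() + bins))
            counts[c] += 1
        log_joint.append(-k * np.log(2) + loglik)   # log prior weight + log-likelihood
        cell = min(int(x * bins), bins - 1)
        preds.append((counts[cell] + 1) / (counts.sum() + bins) * bins)
    log_joint = np.array(log_joint)
    post = np.exp(log_joint - np.logaddexp.reduce(log_joint))  # posterior over levels
    return float(post @ np.array(preds))

data = np.random.default_rng(1).beta(2, 5, size=300)
print(mixture_predictive(data[:-1], 0.2))
```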
-
Shinichi YOSHIDA, Kohei HATANO, Eiji TAKIMOTO, Masayuki TAKEDA
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 10-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
We propose online prediction algorithms for data streams whose characteristics may change over time. Our algorithms are applications of online learning with experts. In particular, our algorithms combine, as experts, base predictors over sliding windows of different lengths. As a result, our algorithms are guaranteed to be competitive with the base predictor using the best fixed-length sliding window in hindsight.
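A small sketch of the experts idea follows: each expert predicts the next value as the mean of a sliding window of a different length, and an exponentially weighted average aggregates them. The mean predictor, the squared loss, and the learning rate are illustrative choices, not those analyzed in the paper.

```python
import numpy as np

def hedge_over_windows(stream, window_lengths=(1, 2, 4, 8, 16), eta=2.0):
    w = np.ones(len(window_lengths))
    predictions = []
    for t, y in enumerate(stream):
        experts = np.array([np.mean(stream[max(0, t - L):t]) if t > 0 else 0.0
                            for L in window_lengths])
        predictions.append(float(w @ experts / w.sum()))    # aggregated forecast
        w *= np.exp(-eta * (experts - y) ** 2)              # exponential weight update
        w /= w.sum()                                        # normalize to avoid underflow
    return predictions

# a stream whose level shifts halfway through
stream = np.concatenate([np.full(50, 1.0), np.full(50, 5.0)]) \
         + np.random.default_rng(0).normal(0, 0.3, 100)
preds = hedge_over_windows(stream)
print("last prediction:", preds[-1])
```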
-
Takafumi KANAMORI, Taiji SUZUKI, Masashi SUGIYAMA
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 11-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
Density ratio estimation has gathered a great deal of attention recently since it can be used for various data processing tasks. In this paper, we consider three methods of density ratio estimation: (A) the numerator and denominator densities are separately estimated and then the ratio of the estimated densities is computed, (B) a logistic regression classifier discriminating denominator samples from numerator samples is learned and then the ratio of the posterior probabilities is computed, and (C) the density ratio function is directly modeled and learned by minimizing the empirical Kullback-Leibler divergence. We first prove that when the numerator and denominator densities are known to be members of the exponential family, (A) is better than (B) and (B) is better than (C). Then we show that once the model assumption is violated, (C) is better than (A) and (B). Thus in practical situations where no exact model is available, (C) would be the most promising approach to density ratio estimation.
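As a brief sketch of approach (B) described above, the code trains a logistic-regression classifier to discriminate denominator samples from numerator samples and converts its posterior probabilities into a density-ratio estimate; the Gaussian toy data is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_nu = rng.normal(0.0, 1.0, size=(500, 1))   # numerator samples ~ p_nu
x_de = rng.normal(0.5, 1.2, size=(500, 1))   # denominator samples ~ p_de

X = np.vstack([x_nu, x_de])
y = np.concatenate([np.ones(len(x_nu)), np.zeros(len(x_de))])  # 1 = numerator
clf = LogisticRegression().fit(X, y)

def ratio(x):
    """Estimate p_nu(x) / p_de(x) = (n_de / n_nu) * P(nu | x) / P(de | x)."""
    p = clf.predict_proba(np.atleast_2d(x))[:, 1]
    return (len(x_de) / len(x_nu)) * p / (1 - p)

print(ratio([[0.0], [1.0]]))
```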
-
Masanori KAWAKITA, Jun'ichi TAKEUCHI
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 12-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
This paper studies a technique to improve regression with unlabeled data. The key idea of our proposal is that semi-supervised learning can be recast as a regression problem under covariate shift. The weighted likelihood approach is a natural choice for estimating regression parameters under covariate shift. Literature [9] showed that the optimal choice of the weight function is the ratio of the labeled-data density to the unlabeled-data density. When this idea is applied to our setting, the optimal weight function trivially takes the value one everywhere. However, our proposal is to discard this optimal weight function and to estimate it instead. This is deeply related to the work of [5]. The resulting algorithm is shown to perform well in several experiments.
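A rough sketch of the idea is given below: even though labeled and unlabeled inputs share a distribution (so the "optimal" weight is identically one), an input-density ratio is estimated from the two samples and used as weights in regression. The logistic-regression ratio estimator and the linear model are illustrative stand-ins for the paper's choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
x_lab = rng.normal(size=(30, 1))                       # labeled inputs
y_lab = np.sin(x_lab).ravel() + 0.1 * rng.normal(size=30)
x_unl = rng.normal(size=(300, 1))                      # unlabeled inputs

# estimated ratio of unlabeled-input density to labeled-input density
clf = LogisticRegression().fit(np.vstack([x_unl, x_lab]),
                               np.r_[np.ones(len(x_unl)), np.zeros(len(x_lab))])
p = clf.predict_proba(x_lab)[:, 1]
weights = (len(x_lab) / len(x_unl)) * p / (1 - p)

# weighted-likelihood (here: weighted least squares) regression
model = LinearRegression().fit(x_lab, y_lab, sample_weight=weights)
print(model.coef_, model.intercept_)
```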
-
Viet Anh NGUYEN, Akihiro YAMAMOTO
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 13-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
We study the problem of mining closed frequent tree patterns from tree databases that are updated regularly over time. Frequent tree mining, like frequent itemset mining, is often a very time-consuming process, and thus it is undesirable to mine from scratch when the change to the database is small. The set of previously mined patterns, which can also be considered a description of the database, should be reused as much as possible to compute newly emerging patterns. In this paper, we propose a novel and efficient incremental mining algorithm for closed frequent labeled ordered trees. We adopt a divide-and-conquer strategy and apply different mining techniques in different parts of the mining process. No additional scan of the whole database is needed, and only a relatively small amount of information from the previous mining iteration has to be maintained. Our experimental study on real-life datasets demonstrates the efficiency and scalability of our algorithm.
-
Keiko YAMAMOTO, Satoru HAYAMIZU, Atsuyuki KAMEYAMA, Yoshikazu UCHIYAMA ...
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 14-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
This paper describes the problem of estimating disease risks from a large health checkup database. The proposed method uses a naive Bayesian classifier extended with a two-dimensional kernel density estimation technique. The framework is tested by estimating examinees' risks for three diseases: hypertension, diabetes, and dyslipidemia. Combining attribute interactions with the naive Bayesian method shows considerable improvement in the estimation experiments.
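A condensed sketch of the idea follows: a naive Bayes risk estimator in which one interacting attribute pair is modeled jointly with a 2-D kernel density estimate instead of independently. The attribute choice, the synthetic data, and the use of gaussian_kde are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_risk_model(X, y, pair=(0, 1)):
    """X: (n, d) checkup attributes, y: 0/1 disease label, pair: jointly modeled columns."""
    pair = list(pair)
    rest = [j for j in range(X.shape[1]) if j not in pair]
    models = {}
    for c in (0, 1):
        Xc = X[y == c]
        models[c] = {
            "prior": float(np.mean(y == c)),
            "joint": gaussian_kde(Xc[:, pair].T),            # 2-D KDE for the interacting pair
            "marg": [gaussian_kde(Xc[:, j]) for j in rest],  # 1-D KDEs for the other attributes
        }
    def risk(x):
        x = np.asarray(x, dtype=float)
        lik = {}
        for c, m in models.items():
            lik[c] = (m["prior"] * m["joint"](x[pair])[0]
                      * np.prod([kde(x[j])[0] for kde, j in zip(m["marg"], rest)]))
        return lik[1] / (lik[0] + lik[1])                    # estimated P(disease | attributes)
    return risk

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] + rng.normal(size=400) > 0).astype(int)
print("estimated risk:", round(fit_risk_model(X, y)((1.0, 1.0, 0.0, 0.0)), 3))
```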
-
Atsunori MINAMIKAWA, Hiroyuki YOKOYAMA
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 15-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
In this paper, we propose a method for estimating egograms from weblog text data. An egogram is a personality model that illustrates the ego states of a user. In our method, features appropriate for egogram estimation are selected using the information gain of each word contained in the weblog text, and estimation is performed by multinomial naive Bayes classifiers. We evaluate our method in several classification scenarios and show its effectiveness.
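A short sketch of such a pipeline is shown below: word features are selected by information gain (approximated here with mutual information) and an ego-state label is predicted with multinomial naive Bayes. The toy texts and labels are placeholders, not the weblog corpus or label scheme used in the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["you should follow the rules", "let me help you with that",
         "let us look at the facts", "this is so much fun", "I will do as you say"]
labels = ["CP", "NP", "A", "FC", "AC"]   # dominant ego state per author (toy labels)

model = make_pipeline(
    CountVectorizer(),
    SelectKBest(mutual_info_classif, k=10),   # information-gain-style word selection
    MultinomialNB(),
)
model.fit(texts, labels)
print(model.predict(["please help me look at the facts"]))
```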
-
Yohji AKAMA, Yasutaka UWANO
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 16-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
We study the empirical spectral distribution of so-called large-dimensional random matrices. Using empirical process theory and measure concentration inequalities, we provide a sufficient condition for the sum of the largest eigenvalues of the sample covariance matrix to be consistent, in the limit where the sample size n tends to infinity and the dimension d of the data varies with n.
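The snippet below is only a quick numerical illustration of the quantity studied: the sum of the m largest eigenvalues of the sample covariance matrix as n grows with d = d(n). The Gaussian data and the choice d(n) = n/4 are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
for n in (200, 800, 3200):
    d = n // 4                              # dimension growing with the sample size
    X = rng.normal(size=(n, d))             # n samples of dimension d (population cov = I_d)
    S = X.T @ X / n                         # sample covariance matrix
    eigvals = np.linalg.eigvalsh(S)
    print(f"n={n:5d}, d={d:4d}, sum of top {m} eigenvalues:",
          round(float(eigvals[-m:].sum()), 3))
```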
-
Kohei HAYASHI, Takashi TAKENOUCHI, Tomohiro SHIBATA, Yuki KAMIYA, Dais ...
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 17-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
We study tensor-based Bayesian probabilistic modeling of heterogeneously attributed multi-dimensional arrays each of which assumes a different exponential-family distribution. Simulation experiments show that our method outperforms other methods such as PARAFAC and Tucker decomposition in missing-values prediction for cross-national statistics. We further show that the method is applicable to discover anomalies in heterogeneous office-logging data.
-
Y. NISHIMORI
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 18-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
We review algorithms and theory of manifold learning in machine learning.
-
Ai AZUMA, Masashi SHIMBO, Yuji MATSUMOTO
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 19-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
When we apply machine learning or data mining techniques to sequential data, it is often necessary to take a summation over all possible sequences. In practice, such a summation cannot be calculated directly from its definition. Although the ordinary forward-backward algorithm provides an efficient way to compute it, it is applicable only to quite limited types of summations. In this paper, we propose general algebraic frameworks that generalize the forward-backward algorithm. We show some examples falling within these frameworks and their importance.
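A compact sketch of the algebraic view follows: the forward recursion is written over an abstract semiring, so the same code computes a sum over all state sequences (sum-product) or the best sequence score (max-product). The Semiring class, the forward function, and the toy HMM-style scores are illustrative assumptions, not the paper's framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Semiring:
    plus: Callable[[float, float], float]
    times: Callable[[float, float], float]
    zero: float
    one: float

sum_product = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
max_product = Semiring(max, lambda a, b: a * b, 0.0, 1.0)

def _reduce(sr, xs):
    acc = sr.zero
    for x in xs:
        acc = sr.plus(acc, x)
    return acc

def forward(init, trans, emit, obs, sr):
    """Generalized forward recursion: combines all state paths under semiring sr."""
    states = range(len(init))
    alpha = [sr.times(init[s], emit[s][obs[0]]) for s in states]
    for o in obs[1:]:
        alpha = [sr.times(
                     _reduce(sr, [sr.times(alpha[sp], trans[sp][s]) for sp in states]),
                     emit[s][o])
                 for s in states]
    return _reduce(sr, alpha)

init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 1]
print("total probability:", forward(init, trans, emit, obs, sum_product))
print("best path score:  ", forward(init, trans, emit, obs, max_product))
```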
-
Yoshiharu MAENO
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 20-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
A method is presented for discovering the network topology and transmission parameters behind an infectious disease outbreak from a given time-sequence dataset. A likelihood function is derived analytically from the equations that describe the stochastic process of reaction and diffusion in a metapopulation network. The method is potentially applicable to discovering the networks that mediate the diffusion of rumors, information, new ideas, or influence.
-
Akihiro INOKUCHI, Takashi WASHIO
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 21-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS
-
[in Japanese]
Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages c01-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021
RESEARCH REPORT / TECHNICAL REPORT
FREE ACCESS