JSAI Technical Report, Type 2 SIG

A taste guess of the user and the contents recommendation that used a similar degree of the contents evaluation information

Kenta SUZUKI, Rei HAMAKAWA

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 01-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_01

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

This paper proposes a method for recommending Web content (novels, comics) that matches a user's taste, based on review similarity, to a user who has already evaluated Web content. In recent years, many studies have recommended Web content by learning the user's taste. These studies present personalized information in order to recommend content that matches the tastes of diverse users. Our method infers the user's taste from reviews of content obtained from that user. Beforehand, the sentences of content reviews are classified into "sentences about the content" and "sentences expressing the reviewer's impression" and stored in the system. The method then recommends content that matches the user's taste by comparing, class by class, the reviews stored in the system with the reviews obtained from the user.

Knowledge Discovery by Co-clustering of Customers and Items Based on a Large Data Fusion

Tsukasa ISHIGAKI, Takeshi TAKENAKA, Yoichi MOTOMURA

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 02-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_02

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

In this report, we describe a method for knowledge discovery from a probabilistic structure model built by large-scale data fusion concerning buying behavior in daily life. We propose a latent class model that segments customers and items into categories, estimated from ID-POS data and questionnaire data on customers' lifestyles and personalities. The variables, which include these category labels and the features of customers and items, are modeled as a Bayesian network for knowledge discovery.

Regression Analysis Using Lasso-Random Forest

Masatoshi NAKAMURA, Toshio SHIMOKAWA, Masashi GOTO

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 03-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_03

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

[in Japanese]

Post-Process for Scientific Visualization

Susumu SHIRAYAMA

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 04-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_04

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

Owing to the volume of data generated by recent computations and experiments, it is quite difficult to extract useful information from these data, even with scientific or information visualization techniques. Methods and methodologies for extracting useful information from such data need to be considered. Several concepts of very-large-scale visualization have been proposed in this situation. Most of them are based on high-performance computing techniques or highly efficient devices for computer graphics. Although such studies have succeeded in visualizing ultra-scale data, several issues remain unsolved. In this paper, we introduce a flexible visualization methodology based on a "post-visualization process", which includes a human recognition process and quantitative evaluation of visualized results. Finally, we describe the possibility that a visualization agent designed from a process model helps to reduce the difficulty of handling huge data.

An Analysis of Dataset Similarity by using Objective Rule Evaluation Indices

Hidenao ABE, Shusaku TSUMOTO

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 05-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_05

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

In this paper, we present a method for characterizing given datasets based on objective rule evaluation indices and classification rule learning algorithms. In transfer learning approaches, most methods for detecting the limitations use performance indices of sets of classifiers, such as the accuracies of classifier sets. However, the indices of each individual classifier are also useful. With this in mind, we performed a case study to identify the similarity of datasets even when the datasets have totally different attribute sets, in comparison with a conventional data characterization technique.

The introduction of stochastic gradient boosting and the predictive model for telecommunication marketing (From the analysis of KDD Cup 2009)

Junichi KOBAYASHI, Kazuaki KOMOTO

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 06-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_06

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

Stochastic gradient boosting is a boosting method invented by Jerome H. Friedman and is known to be a very powerful method for building predictive models in some cases. In fact, FEG won second prize in KDD Cup 2009 using this method. We survey the methodology of stochastic gradient boosting and present our analytical procedure for KDD Cup 2009, which is a good example of the effectiveness of stochastic gradient boosting.
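
As a rough illustration of the general technique named in this abstract, the following Python sketch fits regression stumps to the residuals of a squared-loss model, each round on a random subsample drawn without replacement. It is not the authors' KDD Cup 2009 procedure: the 1-D stump learner, the function names (`fit_stump`, `fit_sgb`, `sgb_predict`), and all parameter defaults are assumptions made for this example.

```python
import random

def fit_stump(xs, residuals):
    """Least-squares regression stump on 1-D inputs: (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:
        # All inputs identical: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, lmean, rmean = stump
    return lmean if x <= threshold else rmean

def fit_sgb(xs, ys, n_rounds=100, learning_rate=0.1, subsample=0.5, seed=0):
    """Boosting for squared loss; each round sees only a random subsample (assumed defaults)."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    k = min(len(xs), max(1, int(subsample * len(xs))))
    for _ in range(n_rounds):
        idx = rng.sample(range(len(xs)), k)          # subsample without replacement
        sub_x = [xs[i] for i in idx]
        sub_r = [ys[i] - preds[i] for i in idx]      # negative gradient of squared loss
        stump = fit_stump(sub_x, sub_r)
        stumps.append(stump)
        # Shrunken update applied to every training point, sampled or not.
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def sgb_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)
```

Setting `subsample=1.0` in this sketch recovers ordinary gradient boosting; the random subsampling is what makes the procedure "stochastic".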

A Consideration about Catoni's Inductive PAC-Bayesian Learning

Takanori AYANO, Joe SUZUKI

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 07-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_07

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

[in Japanese]

A significance test based upon PCA

Y-h. TAGUCHI

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 08-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_08

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

Detection of genes that are differentially expressed between distinct conditions is an important task in bioinformatics. Recently, epigenetic markers have turned out to have a more direct relationship with phenotypes than gene expression does. In this talk, we will demonstrate how well epigenetic markers can be used to detect differences between conditions. In particular, using PCA is more efficient for achieving this task.

A generalized version of Chow-Liu algorithm for data mining.

[in Japanese]

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 09-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_09

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

We propose a learning algorithm for nonparametric estimation and online prediction for general stationary ergodic sources. We divide the real space R into a set A of finite subsets and transform a given sequence in R into a sequence in A, in order to encode the latter using universal coding for finite sequences with distortion. We prepare infinitely many such A and mix the estimated measures to obtain a measure of sequences in R, which may be either discrete or continuous. If the sequence is emitted by a stationary ergodic source, then the Kullback-Leibler information divided by the sequence length n converges to zero as n goes to infinity. In particular, for continuous sources, the method does not require the existence of a probability density function. In this sense, this paper extends Ryabko's universal measure. The measure can be used for online prediction to estimate the next value given the past sequence.

Adaptive Online Prediction with Weighted Windows

Shinichi YOSHIDA, Kohei HATANO, Eiji TAKIMOTO, Masayuki TAKEDA

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 10-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_10

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

We propose online prediction algorithms for data streams whose characteristics might change over time. Our algorithms are applications of online learning with experts. In particular, our algorithms combine base predictors over sliding windows of different lengths as experts. As a result, our algorithms are guaranteed to be competitive with the base predictor with the best fixed-length sliding window in hindsight.
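
The idea of treating sliding windows of different lengths as experts can be sketched as follows. This is a minimal illustration, not the algorithm from the paper: the base predictor (a window mean), the exponentially weighted averaging rule, the squared loss, and the names `window_mean` and `predict_stream` are all assumptions chosen for the example, and the inputs are assumed to be scaled to roughly [0, 1].

```python
import math

def window_mean(history, w):
    """Assumed base predictor: mean of the last w observations (0.0 if none yet)."""
    recent = history[-w:]
    return sum(recent) / len(recent) if recent else 0.0

def predict_stream(stream, windows=(1, 4, 16), eta=0.5):
    """Exponentially weighted average of sliding-window experts under squared loss."""
    weights = [1.0 / len(windows)] * len(windows)
    history, predictions = [], []
    for y in stream:
        expert_preds = [window_mean(history, w) for w in windows]
        total = sum(weights)
        predictions.append(sum(wt * p for wt, p in zip(weights, expert_preds)) / total)
        # Multiplicative update: each expert is penalized by its squared loss.
        weights = [wt * math.exp(-eta * (p - y) ** 2)
                   for wt, p in zip(weights, expert_preds)]
        norm = sum(weights)
        weights = [wt / norm for wt in weights]   # renormalize to avoid underflow
        history.append(y)
    return predictions, weights
```

On a stream that jumps from 0 to 1, the shortest window adapts fastest, so its weight ends up largest; on a stationary noisy stream, longer windows would tend to win instead.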

Theoretical Analysis of Density Ratio Estimation

Takafumi KANAMORI, Taiji SUZUKI, Masashi SUGIYAMA

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 11-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_11

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

Density ratio estimation has gathered a great deal of attention recently since it can be used for various data processing tasks. In this paper, we consider three methods of density ratio estimation: (A) the numerator and denominator densities are separately estimated and then the ratio of the estimated densities is computed, (B) a logistic regression classifier discriminating denominator samples from numerator samples is learned and then the ratio of the posterior probabilities is computed, and (C) the density ratio function is directly modeled and learned by minimizing the empirical Kullback-Leibler divergence. We first prove that when the numerator and denominator densities are known to be members of the exponential family, (A) is better than (B) and (B) is better than (C). Then we show that once the model assumption is violated, (C) is better than (A) and (B). Thus in practical situations where no exact model is available, (C) would be the most promising approach to density ratio estimation.
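
To make approach (A) concrete, the sketch below estimates the numerator and denominator densities separately and then divides them. It assumes a 1-D Gaussian model fitted by maximum likelihood, which is one member of the exponential family; the model choice, the synthetic data, and the function names are assumptions for illustration and are not taken from the paper. Approaches (B) and (C) are not shown.

```python
import math
import random

def fit_gaussian(samples):
    """Maximum-likelihood mean and variance of a 1-D Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def density_ratio_two_step(numerator_samples, denominator_samples):
    """Approach (A): estimate each density separately, then take the ratio."""
    nu_mean, nu_var = fit_gaussian(numerator_samples)
    de_mean, de_var = fit_gaussian(denominator_samples)
    return lambda x: gaussian_pdf(x, nu_mean, nu_var) / gaussian_pdf(x, de_mean, de_var)

# Synthetic example: numerator ~ N(0, 1), denominator ~ N(0.5, 1).
rng = random.Random(0)
numerator = [rng.gauss(0.0, 1.0) for _ in range(5000)]
denominator = [rng.gauss(0.5, 1.0) for _ in range(5000)]
ratio = density_ratio_two_step(numerator, denominator)
```

Here the Gaussian model is correctly specified, which is the setting in which the abstract reports (A) to be preferable; if the true densities were not Gaussian, this estimator would inherit the model error of both fits.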

Improvement of regression with unlabeled data

Masanori KAWAKITA, Jun'ichi TAKEUCHI

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 12-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_12

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

This paper studies a technique for improving regression with unlabeled data. The key idea of our proposal is that semi-supervised learning can be recast as a regression problem under covariate shift. The weighted likelihood approach is a natural choice for estimating regression parameters under covariate shift. Reference [9] showed that the optimal choice of weight function is the ratio of the labeled data density to the unlabeled data density. When this idea is applied to our setting, the optimal weight function trivially always takes the value one. However, our proposal is to discard this optimal weight function and to estimate it instead. This is deeply related to the work in [5]. Experiments show that the resulting algorithm performs well.

Incremental Mining of Closed Frequent Subtrees

Viet Anh NGUYEN, Akihiro YAMAMOTO

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 13-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_13

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

We study the problem of mining closed frequent tree patterns from tree databases that are updated regularly over time. Frequent tree mining, like frequent itemset mining, is often a very time-consuming process, and thus it is undesirable to mine from scratch when the change to the database is small. The set of previously mined patterns, which can also be considered a description of the database, should be reused as much as possible to compute newly emerging patterns. In this paper, we propose a novel and efficient incremental mining algorithm for closed frequent labeled ordered trees. We adopt a divide-and-conquer strategy and apply different mining techniques in different parts of the mining process. No additional scan of the whole database is needed, and only a relatively small amount of information from the previous mining iteration has to be maintained. Our experimental study on real-life datasets demonstrates the efficiency and scalability of our algorithm.

Nonparametric Extension of Naive Bayesian Classifier with Large Health Checkup Database

Keiko YAMAMOTO, Satoru HAYAMIZU, Atsuyuki KAMEYAMA, Yoshikazu UCHIYAMA ...

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 14-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_14

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

This paper addresses the problem of estimating disease risks from a large health checkup database. The proposed method uses a naive Bayesian classifier extended with a two-dimensional kernel density estimation technique. The framework is tested by estimating examinees' risks for three diseases: hypertension, diabetes, and dyslipidemia. Combining attribute interactions with the naive Bayesian method shows considerable improvement in the estimation experiments.

Personality Estimation from Weblog Data Using Text Mining

Atsunori MINAMIKAWA, Hiroyuki YOKOYAMA

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 15-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_15

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

In this paper, we propose a method for estimating egograms from weblog text data. An egogram is a personality model that illustrates the ego states of users. In our method, features appropriate for egogram estimation are selected using the information gain of each word contained in the weblog text, and estimation is performed by multinomial naive Bayes classifiers. We evaluate our method in several classification scenarios and show its effectiveness.

On consistency of eigenvalues for principal component analysis

Yohji AKAMA, Yasutaka UWANO

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 16-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_16

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

We study the empirical spectral distribution of so-called large dimensional random matrices. By empirical process theory and measure concentration inequalities, we provide a sufficient condition for the sum of the largest eigenvalues of the sample covariance matrix to be consistent, in the limit of the sample size n with the dimension d of data in the sample varying along n.

Exponential Family Tensor Factorization for Missing Values Prediction and Anomaly Detection

Kohei HAYASHI, Takashi TAKENOUCHI, Tomohiro SHIBATA, Yuki KAMIYA, Dais ...

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 17-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_17

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

We study tensor-based Bayesian probabilistic modeling of heterogeneously attributed multi-dimensional arrays each of which assumes a different exponential-family distribution. Simulation experiments show that our method outperforms other methods such as PARAFAC and Tucker decomposition in missing-values prediction for cross-national statistics. We further show that the method is applicable to discover anomalies in heterogeneous office-logging data.

Manifold Learning and Nonlinear Dimension Reduction

Y. NISHIMORI

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 18-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_18

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

We review algorithms and theory of manifold learning in machine learning.

Semiring-based Generalization of the Forward-Backward Algorithm

Ai AZUMA, Masashi SHIMBO, Yuji MATSUMOTO

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 19-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_19

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

When we apply machine learning or data mining techniques to sequential data, it is often necessary to take a summation over all possible sequences. In practice, we cannot calculate such a summation directly from its definition. Although the ordinary forward-backward algorithm provides an efficient way to do so, it is applicable only to quite limited types of summations. In this paper, we propose a general algebraic framework for generalizing the forward-backward algorithm. We show some examples that fall within this framework and discuss their importance.
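
The kind of generalization described here can be illustrated with a forward pass parameterized by a semiring. The sketch below is not the framework proposed in the paper: it covers only the forward half, uses a small hidden-Markov-style chain with made-up parameters, and all names (`Semiring`, `forward`, `brute_force`) are hypothetical. Swapping the semiring changes what the same recursion computes.

```python
from collections import namedtuple
from itertools import product

Semiring = namedtuple("Semiring", ["plus", "times", "zero", "one"])

# Sum-product: total score of all state sequences.
SUM_PRODUCT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
# Max-product (on non-negative reals): score of the single best sequence.
MAX_PRODUCT = Semiring(max, lambda a, b: a * b, 0.0, 1.0)

def forward(semiring, init, trans, emit, observations):
    """Aggregate the scores of all state sequences, in time linear in the length."""
    states = range(len(init))
    alpha = [semiring.times(init[s], emit[s][observations[0]]) for s in states]
    for obs in observations[1:]:
        new_alpha = []
        for s in states:
            acc = semiring.zero
            for r in states:
                acc = semiring.plus(acc, semiring.times(alpha[r], trans[r][s]))
            new_alpha.append(semiring.times(acc, emit[s][obs]))
        alpha = new_alpha
    total = semiring.zero
    for a in alpha:
        total = semiring.plus(total, a)
    return total

def brute_force(semiring, init, trans, emit, observations):
    """Reference computation straight from the definition (exponential time)."""
    total = semiring.zero
    for path in product(range(len(init)), repeat=len(observations)):
        score = semiring.times(init[path[0]], emit[path[0]][observations[0]])
        for prev, cur, obs in zip(path, path[1:], observations[1:]):
            score = semiring.times(score, semiring.times(trans[prev][cur], emit[cur][obs]))
        total = semiring.plus(total, score)
    return total

# Made-up two-state example.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
```

The recursion is valid because `times` distributes over `plus` in both semirings, which is what lets the sum over exponentially many sequences be factored step by step.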

Profiling of a network behind diffusion phenomena

Yoshiharu MAENO

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 20-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_20

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

A method is presented for discovering the network topology and transmission parameters behind an infectious disease outbreak from a given time-sequence dataset. A likelihood function is derived analytically from the equations that describe the stochastic process of reaction and diffusion in a metapopulation network. The method is potentially applicable to discovering the networks that mediate the diffusion of rumors, information, new ideas, or influence.

[title in Japanese]

Akihiro INOKUCHI, Takashi WASHIO

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages 21-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_21

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

[in Japanese]

[title in Japanese]

[in Japanese]

Article type: SIG paper
2010 Volume 2010 Issue DMSM-A903 Pages c01-
Published: March 29, 2010
Released on J-STAGE: August 28, 2021

DOI: https://doi.org/10.11517/jsaisigtwo.2010.DMSM-A903_c01

RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS
