JSAI Technical Report, Type 2 SIG
Online ISSN : 2436-5556
Volume 2007, Issue DMSM-A702
The 5th SIG-DMSM
Displaying 1-17 of 17 articles from this issue
  • Paul SHERIDAN, Takeshi KAMIMURA, Hidetoshi SHIMODAIRA
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 01-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    This paper integrates scale-free network properties into statistical inference. This is accomplished by devising scale-free prior distributions, based on three well-known scale-free network models, in the framework of Gaussian graphical models. The new priors are compared with a random network prior via an extensive Markov chain Monte Carlo simulation. In addition, a numerical example that uses microarray data to infer a protein-protein interaction network is provided.

    Download PDF (450K)
  • Koki KYO
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 02-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    Recently, in the field of human-computer interaction, a model was developed for evaluating the performance of computer input devices as an alternative to the conventional Fitts' law. This model involves two factors, treated as a systematic factor and a human factor respectively, and is therefore called the SH-model. In this paper, to extend the range of application of the SH-model, we propose a new model using the Box-Cox transformation and then apply a Bayesian modeling method to estimate the learning effect in pointing tasks. We treat the parameters describing the learning effect as random variables and introduce smoothness priors for them. Illustrative results show that the newly proposed model can be applied satisfactorily, supporting the validity of our modeling method.

    Download PDF (458K)
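The Box-Cox transformation named in the abstract above has a standard textbook form, sketched below. This is only the generic transform, not the paper's extended SH-model or its Bayesian estimation; the function name `box_cox`, the sample values, and the choice of `lam = 0.5` are assumptions made for illustration.

```python
import math


def box_cox(x, lam):
    """Standard Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        # The limit of (x**lam - 1) / lam as lam -> 0 is log(x).
        return math.log(x)
    return (x ** lam - 1.0) / lam


# Hypothetical pointing times in seconds (made-up values, not from the paper).
times = [0.8, 1.1, 1.5, 2.4]
transformed = [box_cox(t, 0.5) for t in times]
```

With `lam = 1` the transform is a plain shift (`x - 1`), and with `lam = 0` it is the logarithm, so a single parameter interpolates between an untransformed and a log-transformed response.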
  • Masashi SUGIYAMA, Shinichi NAKAJIMA, Hisashi KASHIMA, Paul VONBUNAU, M ...
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 03-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    A situation where training and test samples follow different input distributions is called covariate shift. Under covariate shift, standard learning methods such as maximum likelihood estimation are no longer consistent, whereas variants weighted according to the ratio of test and training input densities are consistent. Therefore, accurately estimating the density ratio, called the importance, is one of the key issues in covariate shift adaptation. A naive approach to this task is to first estimate the training and test input densities separately and then estimate the importance by taking the ratio of the estimated densities. However, this naive approach tends to perform poorly, since density estimation is a hard problem, particularly in high-dimensional cases. In this paper, we propose a direct importance estimation method that does not require density estimates. Our method is equipped with a natural cross-validation procedure, and hence tuning parameters such as the kernel width can be objectively optimized. Simulations illustrate the usefulness of our approach.

    Download PDF (1024K)
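The naive approach described in the abstract above (estimate both input densities, then take their ratio) can be sketched as follows. This is not the direct importance estimation method the paper proposes. The one-dimensional Gaussian density model, the sample sizes, the random seed, and all function names are assumptions chosen to keep the example short.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_normal(xs):
    """Maximum likelihood estimates of the mean and standard deviation."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)


def naive_importance(train_x, test_x):
    """Importance w(x) = p_test(x) / p_train(x) from two separate density fits."""
    mu_tr, sd_tr = fit_normal(train_x)
    mu_te, sd_te = fit_normal(test_x)
    return [normal_pdf(x, mu_te, sd_te) / normal_pdf(x, mu_tr, sd_tr)
            for x in train_x]


def weighted_mean(values, weights):
    """Self-normalized importance-weighted average."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Synthetic inputs (assumed): training inputs ~ N(0, 1), test inputs ~ N(1, 1).
rng = random.Random(0)
train_x = [rng.gauss(0.0, 1.0) for _ in range(20000)]
test_x = [rng.gauss(1.0, 1.0) for _ in range(20000)]

weights = naive_importance(train_x, test_x)
# Reweighting the training inputs moves their average toward the test mean.
estimate = weighted_mean(train_x, weights)
```

The sketch works here only because the assumed density model is correct and the data are one-dimensional; the abstract's point is that separate density estimation becomes unreliable in high dimensions.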
  • Thanh Phuong NGUYEN, Tu Bao HO
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 04-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    This paper presents two machine-learning-based methods to solve two significant problems in bioinformatics: prediction of protein-protein interactions and prediction of disease genes. Protein-protein interactions (PPI) are intrinsic to almost all cellular processes, and various computational methods have recently offered opportunities to study PPI and related problems in molecular biology and medicine. We first use inductive logic programming (ILP) to predict PPI from integrative protein domain data and genomic/proteomic data. Starting with constructed, biologically significant background knowledge of more than 220,000 ground facts, we can induce significant ILP rules that predict protein-protein interactions better than other methods. We then use semi-supervised learning methods to exploit PPI data for predicting disease genes. In addition to the 3,053 disease genes known in the OMIM database, we found about fifty novel putative genes that potentially cause a number of diseases.

    Download PDF (539K)
  • Ryo YOSHIDA, Tomoyuki HIGUCHI, Seiya IMOTO, Satoru MIYANO
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 05-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    We address the problem of clustering and feature extraction for exceedingly high-dimensional data, referred to as n ≪ p data, where the dimensionality p of the feature space is much higher than the number of training samples n. For such a sparsely distributed dataset, direct application of conventional model-based clustering might be impractical due to over-learning. To overcome this limitation, we developed the mixed factors model in Yoshida et al. (2004), which was originally aimed at solving the over-learning problem in the unsupervised discriminant analysis of gene expression profiles. The idea is to extract the feature variables involved in the underlying group structure, and then to train an unsupervised discriminative classifier using the extracted features, which are projected onto the lower-dimensional factor space. By alternating projection and clustering, the method seeks an optimal direction of projection such that the overlap of the projected clusters is small. One main purpose of this paper is to elucidate the statistical machinery of the feature extraction system offered by the mixed factors model. In particular, we give the connection to Fisher's discriminant analysis and principal component analysis. After showing some theoretical consequences, we also attempt to present a more generic approach to clustering within the framework of kernel machine learning. With this extension, we can deal with much more complicated cluster shapes and with clustering on generic feature spaces.

    Download PDF (550K)
  • Jean-Philippe VERT
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 06-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    Several problems in chemistry, in particular in drug discovery, can be formulated as classification or regression problems over molecules. These molecules, when represented by their planar structure, can be seen as labeled graphs. One approach to solving such problems is to apply kernel methods, such as support vector machines, with labeled graphs as training patterns. This requires an implicit embedding of the labeled graphs into a Hilbert space, carried out in practice through the definition of a positive definite function over labeled graphs. In this work I review recent work that defines such positive definite kernels. In particular, we will see that although complete embeddings that separate non-isomorphic graphs, such as those obtained by counting all subgraphs or paths, are intractable in practice, fast approximations based on finite and infinite walk enumeration can be computed in polynomial time. These walk kernels and their variants give promising results on several benchmarks in computational chemistry.

    Download PDF (313K)
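The finite walk enumeration mentioned in the abstract above can be illustrated by counting pairs of label-matching walks of a fixed length in two labeled graphs, using dynamic programming over matching vertex pairs (the direct product graph). This is a minimal sketch of the general idea, not a kernel taken from the reviewed papers; the graph encoding (a label list plus adjacency lists), the function name `walk_kernel`, and the toy molecules are assumptions.

```python
def walk_kernel(labels1, adj1, labels2, adj2, length):
    """Count pairs of walks with `length` edges whose vertex labels match.

    labels1, labels2: vertex labels, indexed by vertex id.
    adj1, adj2: adjacency lists, adj[u] is the list of neighbours of u.
    """
    # Vertices of the product graph: pairs of vertices with equal labels.
    pairs = [(u, v)
             for u in range(len(labels1))
             for v in range(len(labels2))
             if labels1[u] == labels2[v]]
    # counts[(u, v)] = number of matching walk pairs starting at (u, v).
    counts = {p: 1 for p in pairs}  # walks with zero edges
    for _ in range(length):
        counts = {
            (u, v): sum(counts.get((u2, v2), 0)
                        for u2 in adj1[u]
                        for v2 in adj2[v])
            for (u, v) in pairs
        }
    return sum(counts.values())


# Toy labeled graphs (assumed): a C-C-O chain and a C-O fragment.
chain_labels, chain_adj = ["C", "C", "O"], [[1], [0, 2], [1]]
frag_labels, frag_adj = ["C", "O"], [[1], [0]]

k1 = walk_kernel(chain_labels, chain_adj, frag_labels, frag_adj, 1)
```

Each step of the loop touches every pair of neighbours once, so the cost is polynomial in the graph sizes and linear in the walk length, in contrast to the intractable subgraph or path counting the abstract refers to.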
  • Kosuke ISHIBASHI, Kohei HATANO, Masayuki TAKEDA
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 07-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    We consider online learning of linear classifiers that approximately maximize the 2-norm margin. Given a linearly separable sequence of instances, typical online learning algorithms such as Perceptron and its variants map them into an augmented space with an extra dimension, so that the instances are separated by a linear classifier without a constant bias term. However, this mapping might decrease the margin over the instances. In this paper, we propose a modified version of Li and Long's ROMMA that avoids such a mapping, and we show that our modified algorithm achieves a higher margin than previous online learning algorithms.

    Download PDF (596K)
  • Alexander J. SMOLA
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 08-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS
    Download PDF (311K)
  • Katsutoshi YADA
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 09-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    The purpose of this paper is to present a business application of knowledge discovery from stream data, using a string analysis technique to find useful rules in sales-area visiting patterns. We focus on the stationary state of a customer at a given sales area in a store. We apply the string analysis technique EBONSAI to sales-area visiting patterns in order to deal effectively with huge stream data. In experiments, we extract useful rules and knowledge about the characteristics of sales-area visiting patterns and verify the effectiveness of our method. We also discuss prediction accuracy and computing time, and clarify technical problems of EBONSAI to be addressed in the future.

    Download PDF (1230K)
  • Yu NAKANO, Takashi OKADA
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 10-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    The cascade model is a system for deriving characteristic rules, in which a rule condition is expressed as "IF main_condition ADDED ON preconditions" and its consequent part is expressed as a change in the distribution of the class attribute. The rule strength is given by the BSS (between-groups sum of squares) calculated from the class distributions before and after the application of the main condition. This paper proposes a generalization of the rule by incorporating conjunctive conditions into the main condition part. A fast and exhaustive method to enumerate all candidate rules is also implemented using the FP-Tree algorithm. Two rule selection schemes are introduced to decrease the number of rules. Experimental results on a few representative datasets are reported.

    Download PDF (467K)
  • Tetsuro NAMBA, Makoto HARAGUCHI, Yoshiaki OKUBO
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 11-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    In this paper we discuss a method for finding pseudo-biclusters in gene expression data. For time series data, a linear-time algorithm based on a suffix tree has been proposed. Although the algorithm can efficiently enumerate all maximal biclusters, we often observe many overlapping clusters. By combining such clusters, we can observe that all genes in the combined cluster behave quite similarly within a common time span but behave differently after that. We expect that such an observation would provide valuable suggestions to experts. From this point of view, we introduce the notion of pseudo-biclusters. A pseudo-bicluster consists of several maximal biclusters with some overlap. We design a polynomial-time algorithm for finding them with a suffix tree. Some experimental results for gene expression data of an ascidian (Hoya) are also presented, showing an interesting cluster that was actually extracted.

    Download PDF (878K)
  • Yuichi SHIRAISHI, Kenji FUKUMIZU, Shiro IKEDA
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 12-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    Combining binary classifiers for multi-class classification problems has become very popular since the invention of SVM and AdaBoost, which are known to be very effective for binary classification. In this paper, we theoretically analyze the ECOC approach, which is a standard combining method. We discuss the problem of combining binary classifiers from a game-theoretic point of view. First, we develop a general theorem for the condition of minimaxity, which is closely related to network flow theory. Applying this theorem, we show that the ECOC approach has the minimax property in the one-vs-one and one-vs-all cases.

    Download PDF (843K)
  • Hitohiro SHIOZAKI, Koji EGUCHI, Takenao OHKAWA
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 13-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    Conveying information about who, what, when and where is a primary purpose of news articles. To handle such information, statistical models that capture dependencies between named entities and topics can play an important role. Although some relationship between who and where is usually mentioned in a news story, no topic model has explicitly addressed the textual interactions between a who-entity and a where-entity. This paper presents a new statistical model that directly captures dependencies between topics, who-entities and where-entities mentioned in each article. We show, through our experiments, that this multi-entity-topic model performs better at making predictions on who-entities.

    Download PDF (658K)
  • Hiroshi KUWAJIMA, Takashi WASHIO
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 14-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS

    Efficient evaluation of similarity measures, e.g., correlations and kernels, among objects is one of the most important tasks required by major data mining techniques. However, the complexity of directly computing the similarity among n objects is at least O(n^2), which is practically intractable for large n if the computations are expensive due to the high dimensionality and/or highly complex structure of the objects. Moreover, direct similarity observations among all objects are often prohibitively expensive in some scientific fields. The objective of this paper is to propose techniques called "Column Reduction" and "Range Limited Column Reduction" that efficiently estimate the similarity measures among the objects by using a limited number of directly computed and/or observed similarity measures. These techniques exploit a property of the similarity matrix called "Positive Semi-Definiteness (PSD)." The superior performance of our approach in both efficiency and accuracy is demonstrated through an evaluation based on artificial and real-world data sets.

    Download PDF (511K)
  • [in Japanese], [in Japanese], [in Japanese], [in Japanese]
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 15-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS
  • [in Japanese], [in Japanese], [in Japanese], [in Japanese]
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages 16-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS
  • [in Japanese]
    Article type: SIG paper
    2007 Volume 2007 Issue DMSM-A702 Pages c01-
    Published: October 05, 2007
    Released on J-STAGE: August 28, 2021
    RESEARCH REPORT / TECHNICAL REPORT FREE ACCESS
    Download PDF (144K)