JSAI Technical Report, Type 2 SIG
Online ISSN : 2436-5556
Volume 2007, Issue DMSM-A702
5th Meeting of the SIG on Data Mining and Statistical Mathematics
  • Paul Sheridan, Takeshi Kamimura, Hidetoshi Shimodaira
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 01-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

    This paper integrates scale-free network properties into statistical inference by devising scale-free prior distributions, based on three well-known scale-free network models, within the framework of Gaussian graphical models. The new priors are compared with a random network prior via an extensive Markov chain Monte Carlo simulation. A numerical example that uses microarray data to infer a protein-protein interaction network is also provided.

  • Koki Kyo
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 02-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

    Recently, in the field of human-computer interaction, a model was developed for evaluating the performance of computer input devices as an alternative to the conventional Fitts' law. This model involves two factors, treated as a systematic factor and a human factor respectively, and is therefore called the SH-model. In this paper, to extend the range of application of the SH-model, we propose a new model that uses the Box-Cox transformation, and then apply a Bayesian modeling method to estimate the learning effect in pointing tasks. We treat the parameters describing the learning effect as random variables and introduce smoothness priors for them. Illustrative results show that the newly proposed model can be applied satisfactorily, which supports the validity of our modeling method.
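
For reference, the Box-Cox transformation named in this abstract has a standard closed form. The sketch below shows only that transformation, not the SH-model or the Bayesian procedure; the function name `box_cox` and the sample values are illustrative assumptions.

```python
import math


def box_cox(y, lam):
    """Standard Box-Cox transform of a positive value y with parameter lam.

    For lam == 0 the transform is defined as log(y), which is the limit of
    (y**lam - 1) / lam as lam approaches 0.
    """
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


# Hypothetical movement times (seconds), transformed with lam = 0.5.
transformed = [box_cox(t, 0.5) for t in (1.0, 4.0, 9.0)]
```

With `lam = 1` the transform is a shift by one, and with `lam = 0` it is the logarithm, so a single parameter moves between a linear and a log scale.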

  • Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bunau, M ...
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 03-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

    A situation where training and test samples follow different input distributions is called covariate shift. Under covariate shift, standard learning methods such as maximum likelihood estimation are no longer consistent, whereas variants weighted according to the ratio of test and training input densities are consistent. Therefore, accurately estimating the density ratio, called the importance, is one of the key issues in covariate shift adaptation. A naive approach to this task is to first estimate the training and test input densities separately and then estimate the importance by taking the ratio of the estimated densities. However, this naive approach tends to perform poorly, since density estimation is a hard problem, particularly in high-dimensional cases. In this paper, we propose a direct importance estimation method that does not require density estimates. Our method is equipped with a natural cross-validation procedure, and hence tuning parameters such as the kernel width can be objectively optimized. Simulations illustrate the usefulness of our approach.
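
To make the term "importance" concrete, here is a minimal sketch of the density ratio and of an importance-weighted average. It assumes, purely for illustration, that both input densities are known one-dimensional Gaussians, so it corresponds to the idealized version of the naive ratio-of-densities approach and not to the direct estimation method proposed in the paper. All function names and parameter values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance(x, train=(0.0, 1.0), test=(1.0, 1.0)):
    """Importance w(x) = p_test(x) / p_train(x) for assumed Gaussian inputs.

    `train` and `test` are (mean, std) pairs; the defaults are made up.
    """
    return gaussian_pdf(x, *test) / gaussian_pdf(x, *train)


def importance_weighted_mean(xs, losses, train=(0.0, 1.0), test=(1.0, 1.0)):
    """Average of per-sample losses, each weighted by its importance."""
    weights = [importance(x, train, test) for x in xs]
    return sum(w * l for w, l in zip(weights, losses)) / len(xs)
```

Training points that lie where test inputs are dense receive weights above 1, and points in regions the test distribution rarely visits are down-weighted.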

  • Thanh Phuong Nguyen, Tu Bao Ho
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 04-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

    This paper presents two machine-learning-based methods for two significant problems in bioinformatics: prediction of protein-protein interactions and prediction of disease genes. Protein-protein interactions (PPI) are intrinsic to almost all cellular processes, and recent computational methods offer new opportunities to study PPI and related problems in molecular biology and medicine. We first use inductive logic programming (ILP) to predict PPI from integrative protein domain data and genomic/proteomic data. Starting from biologically significant background knowledge of more than 220,000 ground facts, we induce significant ILP rules that predict protein-protein interactions better than other methods. We then use semi-supervised learning methods to exploit PPI data for predicting disease genes. In addition to the 3,053 disease genes known in the OMIM database, we found about fifty novel putative genes that may be involved in causing a number of diseases.

  • Ryo Yoshida, Tomoyuki Higuchi, Seiya Imoto, Satoru Miyano
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 05-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

    We address the problem of clustering and feature extraction for exceedingly high-dimensional data, referred to as n ≪ p data, where the dimensionality p of the feature space is much higher than the number of training samples n. For such a sparsely distributed dataset, direct application of conventional model-based clustering may be impractical because of over-learning. To overcome this limitation, we developed the mixed factors model in Yoshida et al. (2004), which was originally aimed at solving the over-learning problem in the unsupervised discriminant analysis of gene expression profiles. The idea is to extract the feature variables involved in the underlying group structure and then train an unsupervised discriminative classifier on the extracted features, which are projected onto the lower-dimensional factor space. By alternating projection and clustering, the method seeks an optimal direction of projection such that the overlap of the projected clusters is small. One main purpose of this paper is to elucidate the statistical machinery of the feature extraction system offered by the mixed factors model. In particular, we give the connection to Fisher's discriminant analysis and to principal component analysis. After showing some theoretical consequences, we also present a more generic approach to clustering within the framework of kernel machine learning. With this extension, we can deal with much more complicated cluster shapes and with clustering on generic feature spaces.

  • Jean-Philippe Vert
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 06-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

    Several problems in chemistry, in particular in drug discovery, can be formulated as classification or regression problems over molecules. These molecules, when represented by their planar structure, can be seen as labeled graphs. One approach to solving such problems is to apply kernel methods, such as support vector machines, with labeled graphs as training patterns. This requires an implicit embedding of the labeled graphs into a Hilbert space, carried out in practice through the definition of a positive definite function over labeled graphs. In this work I review recent work that defines such positive definite kernels. In particular, we will see that although complete embeddings that separate non-isomorphic graphs, such as those obtained by counting all subgraphs or paths, are intractable in practice, fast approximations based on finite and infinite walk enumeration can be computed in polynomial time. These walk kernels and their variants give promising results on several benchmarks in computational chemistry.

  • Kosuke Ishibashi, Kohei Hatano, Masayuki Takeda
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 07-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

    We consider online learning of linear classifiers that approximately maximize the 2-norm margin. Given a linearly separable sequence of instances, typical online learning algorithms such as the Perceptron and its variants map the instances into an augmented space with an extra dimension, so that they are separated by a linear classifier without a constant bias term. However, this mapping might decrease the margin over the instances. In this paper, we propose a modified version of Li and Long's ROMMA that avoids such a mapping, and we show that our modified algorithm achieves a higher margin than previous online learning algorithms.
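
The augmented-space mapping that this abstract refers to can be sketched with a plain Perceptron: each instance gets an extra constant coordinate, so the bias is learned as one more weight. This illustrates only the conventional mapping, not ROMMA or the modified algorithm proposed in the paper; the function names and the toy data are assumptions.

```python
def augment(x):
    """Map x to (x, 1) so that a bias-free classifier can represent a bias."""
    return list(x) + [1.0]


def train_perceptron(samples, max_epochs=100):
    """Classic Perceptron in the augmented space.

    `samples` is a list of (x, y) pairs with y in {-1, +1}.
    Returns the weight vector, whose last entry plays the role of the bias.
    """
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            xa = augment(x)
            score = sum(wi * xi for wi, xi in zip(w, xa))
            if y * score <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, xa)]
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: converged
            break
    return w


def predict(w, x):
    """Sign of the augmented inner product."""
    score = sum(wi * xi for wi, xi in zip(w, augment(x)))
    return 1 if score > 0 else -1


# Toy 1-D data that is separable only with a nonzero threshold.
toy = [([0.0], -1), ([1.0], -1), ([2.0], 1), ([3.0], 1)]
weights = train_perceptron(toy)
```

Because the margin is measured in the augmented space, it generally differs from the margin in the original space, which is the concern the abstract raises.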

  • Alexander J. Smola
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 08-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access
  • 矢田 勝俊
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 09-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

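    The purpose of this study is to propose a knowledge discovery system that applies string analysis techniques to stream data on customers' in-store purchasing behavior and thereby extracts useful knowledge from the sales-area visit pattern strings of positive and negative examples. Within customer shopping-path data, we focus on customers' stops at sales areas and propose representing these visit patterns as strings so that massive stream data can be handled efficiently. In our experiments, characteristic area visit patterns of customers who purchase more items were extracted, which demonstrates the usefulness of the approach. We also examined prediction accuracy, computation time, and indexing functions, and clarified the technical issues that remain. This paper suggests the potential of stream data in the marketing field and the usefulness of string analysis methods.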

  • 中野 優, 岡田 孝
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 10-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

    The cascade model is a system for deriving characteristic rules, in which a rule condition is expressed as "IF main_condition ADDED ON preconditions" and its consequent is expressed as a change in the distribution of the class attribute. The rule strength is given by the BSS (between-groups sum of squares), calculated from the class distributions before and after the main condition is applied. This paper proposes a generalization of the rule that incorporates conjunctive conditions into the main condition part. A fast and exhaustive method for enumerating all candidate rules is also implemented using the FP-Tree algorithm. Two rule selection schemes are introduced to reduce the number of rules. Experimental results on a few representative datasets are reported.

  • 難波徹郎, 原口 誠, 大久保 好章
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 11-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

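    In this study, we consider biclustering for time-series data, including gene expression data. Biclustering that takes the time-series nature into account usually clusters the rows and columns of the data matrix simultaneously and extracts, as maximal biclusters, groups of individuals that show similar variation over some contiguous time interval. In particular, it is known that, by using a suffix tree, these can be extracted in time linear in the size of the data matrix. We extend this framework with the aim of extracting biclusters that are biologically more interesting. Specifically, we introduce the notion of a pseudo-bicluster in an attempt to capture how a group of genes that show similar expression changes up to a certain time interval subsequently branches off and shows different changes, and we propose a polynomial-time algorithm that extracts such pseudo-biclusters using a suffix tree. Computational experiments on ascidian gene expression data confirm that pseudo-biclusters exhibiting the expected behavior are obtained.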

  • Yuichi Shiraishi, Kenji Fukumizu, Shiro Ikeda
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 12-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

    Combining binary classifiers for multi-class classification problems has become very popular since the invention of SVM and AdaBoost, which are known to be very effective for binary classification. In this paper, we theoretically analyze the ECOC approach, which is a standard combining method. We discuss the problem of combining binary classifiers from a game-theoretical point of view. First, we develop a general theorem for the condition of minimaxity, which is closely related to network flow theory. Applying this theorem, we show that the ECOC approach has the minimax property in the one-vs-one and one-vs-all cases.
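
As background for the ECOC approach discussed here, the sketch below builds the one-vs-all and one-vs-one code matrices and decodes a vector of binary classifier outputs by Hamming distance. It is a generic illustration under assumed conventions (entries +1/-1, with 0 meaning "class not used by this classifier", and zero entries ignored when decoding); it does not reproduce the minimax analysis of the paper, and the function names are hypothetical.

```python
def one_vs_all_code(n_classes):
    """Row i is the code word of class i: +1 in column i, -1 elsewhere."""
    return [[1 if j == i else -1 for j in range(n_classes)]
            for i in range(n_classes)]


def one_vs_one_code(n_classes):
    """One column per class pair (a, b): +1 for a, -1 for b, 0 for others."""
    pairs = [(a, b) for a in range(n_classes)
             for b in range(a + 1, n_classes)]
    return [[1 if c == a else -1 if c == b else 0 for a, b in pairs]
            for c in range(n_classes)]


def decode(code, outputs):
    """Pick the class whose code word disagrees least with the outputs.

    `outputs` holds one +1/-1 decision per binary classifier (column).
    Columns where the code word is 0 are skipped for that class.
    """
    def distance(row):
        return sum(1 for c, o in zip(row, outputs) if c != 0 and c != o)

    return min(range(len(code)), key=lambda k: distance(code[k]))
```

For three classes, `decode(one_vs_all_code(3), [-1, 1, -1])` selects class 1, since only the second binary classifier fired.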

  • Hitohiro Shiozaki, Koji Eguchi, Takenao Ohkawa
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 13-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

    Conveying information about who, what, when and where is a primary purpose of news articles. To handle such information, statistical models that capture dependencies between named entities and topics can play an important role. Although some relationship between who and where is usually mentioned in a news story, no topic model has explicitly addressed the textual interactions between a who-entity and a where-entity. This paper presents a new statistical model that directly captures dependencies between topics, who-entities and where-entities mentioned in each article. We show, through our experiments, that this multi-entity-topic model performs better at making predictions on who-entities.

  • Hiroshi KUWAJIMA, Takashi WASHIO
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 14-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

    Efficient evaluation of similarity measures, e.g., correlations and kernels, among objects is one of the most important tasks required by major data mining techniques. However, the complexity of directly computing the similarity among n objects is at least O(n^2), which is practically intractable for large n if the computations are expensive due to high dimensionality and/or highly complex structure of the objects. Moreover, direct similarity observations among all objects are often prohibitively expensive in some scientific fields. The objective of this paper is to propose techniques called "Column Reduction" and "Range Limited Column Reduction" that efficiently estimate the similarity measures among the objects by using a limited number of directly computed and/or observed similarity measures. These techniques make effective use of a property of the similarity matrix named "Positive Semi-Definiteness (PSD)." The superior performance of our approach in both efficiency and accuracy is demonstrated through an evaluation based on artificial and real-world data sets.
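
One way to see why positive semi-definiteness helps with unobserved similarities: for three objects with unit self-similarity (for example, correlations), the PSD condition on the 3x3 matrix confines the third pairwise value to an interval determined by the other two. The sketch below computes that interval from the determinant condition. It is a textbook consequence of PSD, offered as background only; it is not the Column Reduction technique of the paper, and the function name is an assumption.

```python
import math


def third_correlation_bounds(r_ab, r_bc):
    """Range of r_ac that keeps [[1, r_ab, r_ac], [r_ab, 1, r_bc],
    [r_ac, r_bc, 1]] positive semi-definite.

    The determinant 1 + 2*r_ab*r_bc*r_ac - r_ab**2 - r_bc**2 - r_ac**2 must
    be non-negative, which is a quadratic inequality in r_ac.
    """
    # max(0.0, ...) guards against tiny negative values from rounding.
    slack = math.sqrt(max(0.0, (1.0 - r_ab ** 2) * (1.0 - r_bc ** 2)))
    center = r_ab * r_bc
    return center - slack, center + slack
```

When the two known correlations are strong, the interval is narrow, so the third value is nearly determined without being computed; when both are zero, the interval is the full range from -1 to 1 and PSD alone says nothing.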

  • 中野慎也, 上野玄太, 中村和幸, 樋口知之
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 15-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

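    The particle filter is one of the methods increasingly being applied to sequential data assimilation, but it often suffers from ensemble degeneracy and may fail to work effectively. The merging particle filter was proposed to avoid this problem. In the merging particle filter, each particle of the ensemble representing the filtered distribution is generated as a weighted sum of multiple samples drawn from the ensemble representing the predictive distribution. When the weighted sum is taken, the weights are adjusted appropriately so that the ensemble retains the mean and covariance of the filtered distribution. There is some freedom in how these weights are assigned; in this study, we consider two weighting schemes, carry out data assimilation experiments with each, and compare the results. We found that, for the low-dimensional model of Lorenz (1963), more accurate estimation is obtained when one of the weights is set close to 1 and the other weights are set to small values. This is thought to indicate that such a choice of weights is more effective for representing a strongly non-Gaussian filtered distribution. For the relatively high-dimensional model of Lorenz and Emanuel (1998), we also found that, provided a large number of particles can be used in the ensemble, setting one weight close to 1 and the others to small values again gives better estimates, as in the low-dimensional case.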
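
The requirement that merging preserve the mean and covariance can be made concrete. If a merged particle is a weighted sum of n samples drawn independently from an ensemble that approximates the filtered distribution (an assumption of this sketch), then the mean is preserved when the weights sum to 1 and the covariance is preserved when their squares sum to 1. The sketch below checks one illustrative weight set for n = 3, with one weight near 1 and two smaller ones, and merges scalar particles. The specific values and function names are assumptions and are not claimed to be the schemes compared in the paper.

```python
import math


def merging_weights_n3():
    """One weight set for n = 3 with sum 1 and sum of squares 1."""
    s = math.sqrt(13.0)
    return (0.75, (s + 1.0) / 8.0, -(s - 1.0) / 8.0)


def merge_particles(resampled_sets, weights):
    """Combine n resampled ensembles of scalar particles element-wise.

    `resampled_sets` is a list of n equally long lists; merged particle i is
    the weighted sum of the i-th particle of every list.
    """
    size = len(resampled_sets[0])
    return [sum(a * ens[i] for a, ens in zip(weights, resampled_sets))
            for i in range(size)]


w = merging_weights_n3()
merged = merge_particles([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]], w)
```

Because the weights sum to 1, merging identical copies returns the original particles, while merging independently resampled copies produces new particle values and so counteracts degeneracy.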

  • 鹿島久嗣, 山崎一孝, 西郷浩人, 猪口明博
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. 16-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access

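    In this paper, we consider regression problems in which the target variable is given as a range, and we propose a probabilistic approach to this problem. Although the resulting optimization problem is difficult to solve directly, we give an approximate solution method based on the EM algorithm. We also demonstrate the effectiveness of the proposed approach through numerical experiments on benchmark datasets for two problems: price prediction and prediction of compound activity.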

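  • SIG on Data Mining and Statistical Mathematics
    Manuscript type: Research meeting material
    2007, Volume 2007, Issue DMSM-A702, p. c01-
    Published: 2007/10/05
    Released online: 2021/08/28
    Research report / technical report, free access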