This paper first reviews statistical methodology for causal inference, and its difficulties, in experimental, quasi-experimental, and observational studies. The review covers path analysis, graphical modeling, Rubin's counterfactual models, and propensity scores. It then examines how these statistical methods relate to causal discovery in data mining within computational science.
Ordered lists of objects are widely used as representational forms. Such ordered objects include Web search results and best-seller lists. Techniques for processing such ordinal data are being developed, particularly methods for the supervised ordering task: i.e., learning functions used to sort objects from sample orders. In this article, we propose dimension reduction methods specifically designed to improve prediction performance in supervised ordering tasks.
We present optimization approaches to semi-supervised classification based on the formulation of the Support Vector Machine (SVM) for the conventional supervised setting. We first introduce the Laplacian of a graph and the associated graph kernels, which are exploited in many semi-supervised classification methods. We then show that these methods can be derived naturally from the conventional SVM formulations combined with the graph kernels. The proposed optimization problems fully exploit the sparse structure of the graph Laplacian, which allows problems with a large number of data points to be solved in a practical amount of computation time. Numerical results indicate that our approaches achieve fairly high performance on large-scale problems.
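As a minimal illustration of the two objects named in the abstract above, the following sketch builds the unnormalized graph Laplacian L = D - W and one common graph kernel, the regularized Laplacian kernel K = (I + sigma L)^{-1}. The choice of kernel, the parameter name `sigma`, and the toy path graph are assumptions made for illustration; the abstract does not say which graph kernels the authors use. Dense NumPy arrays are used for brevity, whereas a large-scale implementation would presumably rely on sparse matrices.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W of a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def regularized_laplacian_kernel(W, sigma=1.0):
    """Graph kernel K = (I + sigma * L)^{-1} (an assumed, illustrative choice)."""
    L = graph_laplacian(W)
    return np.linalg.inv(np.eye(L.shape[0]) + sigma * L)

# Toy example: a path graph 0 - 1 - 2 - 3 with unit edge weights.
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

L = graph_laplacian(W)
K = regularized_laplacian_kernel(W, sigma=0.5)
```

A matrix such as `K` could then be passed to any SVM solver that accepts a precomputed kernel, with labeled and unlabeled points both appearing as graph nodes.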
In text mining, word frequency is an important element. Moreover, when extracting relationships between words, it is important to extract the dependency structure of a sentence. This paper proposes a semi-structure mining method that extracts frequent words together with their dependency structure from a large amount of text data. Our method represents dependencies as a tree structure whose nodes are sequences. In this way, the proposed method can extract patterns that conventional methods cannot.
The Auto-Regressive eXogenous input (ARX) model has been widely used in engineering to model the dynamic response of a system to exogenous factors. A difficulty in this modeling is determining an appropriate model order for the given data. In this paper, we develop a new and practical approach to determining the appropriate order. Moreover, we apply the developed technique to real marketing data and analyze the dynamic response characteristics of sales revenue to advertisement and sales promotion. In marketing studies, the static response of sales to exogenous factors such as advertisement and sales promotion has been analyzed. However, if the dynamic response of sales to exogenous factors can be modeled, more precise sales strategies can be designed.
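To make the order-determination problem in the abstract above concrete, here is a minimal sketch that fits an ARX model y[t] = sum_i a_i y[t-i] + sum_j b_j u[t-j] + e[t] by ordinary least squares and compares candidate orders with the Akaike information criterion (AIC). AIC is a standard baseline swapped in for illustration; it is not the new approach the authors propose, which the abstract does not describe. The function names, the simulated data, and the `max_order` bound are all hypothetical.

```python
import numpy as np

def fit_arx(y, u, na, nb, start=None):
    """Least-squares ARX fit; returns (coefficients, residuals).

    Coefficients are ordered [a_1..a_na, b_1..b_nb]. `start` is the first
    time index used as a regression target (must be >= max(na, nb)).
    """
    start = max(na, nb) if start is None else start
    X = np.array([
        np.concatenate([y[t - na:t][::-1], u[t - nb:t][::-1]])
        for t in range(start, len(y))
    ])
    target = y[start:]
    theta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return theta, target - X @ theta

def aic(resid, n_params):
    """Gaussian AIC up to an additive constant."""
    return len(resid) * np.log(np.mean(resid ** 2)) + 2 * n_params

def select_order(y, u, max_order=4):
    """Grid search over (na, nb) using AIC on a common set of targets."""
    best = None
    for na in range(1, max_order + 1):
        for nb in range(1, max_order + 1):
            _, resid = fit_arx(y, u, na, nb, start=max_order)
            score = aic(resid, na + nb)
            if best is None or score < best[0]:
                best = (score, na, nb)
    return best

# Simulated data from a known ARX(2, 1) system (illustrative only).
rng = np.random.default_rng(0)
n = 500
u = rng.normal(size=n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + 0.5 * u[t - 1] + 0.1 * rng.normal()

theta_true_order, _ = fit_arx(y, u, na=2, nb=1)
best_score, best_na, best_nb = select_order(y, u, max_order=4)
```

Fitting every candidate on the same targets (`start=max_order`) keeps the AIC values comparable across orders. In a marketing setting, `u` would stand for an advertising or promotion series and `y` for sales revenue.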
In this article, we present a new method for extracting frequent patterns from gene networks. The distinguishing feature of this method is its ability to extract embedded sub-DAGs from the data, whereas previous methods were limited to extracting induced sub-DAGs. Our algorithm builds upon our Dryade algorithm for mining closed frequent embedded attribute sub-trees and, by postprocessing its output, discovers closed frequent embedded attribute sub-DAGs with one root in the data. We have tested our method on real gene network data and confirmed the existence of specific embedded sub-DAGs that could not be found by previous algorithms limited to extracting induced sub-DAGs.
In this paper, we draw on proper noun extraction techniques developed in the text mining community and apply them to realize a sophisticated text retrieval engine. More concretely, we extract proper nouns from the target contents using those techniques and attach them, together with their categories, as metadata to the corresponding documents. Furthermore, we offer selected metadata as additional keywords during the retrieval session to reduce the number of documents retrieved. Finally, we conduct experimental studies to demonstrate the feasibility of our approach to effective content retrieval.
This paper reviews recent approaches to the "kernel method", viewed as a transformation of data into reproducing kernels.
This lecture reviews the nonparametric Bayesian approach to data partitioning for complex data analysis. Bayesian modeling gives us a principled approach to clustering a set of complex data into an unknown number of disjoint or overlapping subsets, each of which can be represented by a simple distribution. Nonparametric Bayesian models, specifically Dirichlet process mixture (DPM) models, enable us to define distributions over the countably infinite sets that arise in partitioning problems. The Infinite Relational Model (IRM), which is based on the DPM, is also presented as a real application of the DPM to relational data mining.
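The idea of a prior over partitions with an unknown number of groups, mentioned in the abstract above, can be illustrated with the Chinese restaurant process (CRP), the distribution over partitions induced by a Dirichlet process. The sketch below only samples a partition from the prior; it does not fit a DPM or the IRM to data. The function name `sample_crp` and the concentration value `alpha` are assumptions made for illustration.

```python
import numpy as np

def sample_crp(n, alpha, rng):
    """Draw a partition of n >= 1 items from a Chinese restaurant process.

    Returns (assignments, counts): the cluster label of each item and the
    size of each cluster. Item i joins existing cluster k with probability
    counts[k] / (i + alpha) and opens a new cluster with probability
    alpha / (i + alpha).
    """
    assignments = [0]
    counts = [1]
    for i in range(1, n):
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts

rng = np.random.default_rng(0)
assignments, counts = sample_crp(n=50, alpha=1.0, rng=rng)
```

The number of clusters is not fixed in advance: it grows with the number of items, and larger values of `alpha` tend to produce more clusters.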