A Randomness Based Analysis on the Data Size Needed for Removing Deceptive Patterns

Kazuya HARAGUCHI; Mutsunori YAGIURA; Endre BOROS; Toshihide IBARAKI

doi:10.1093/ietisy/e91-d.3.781

Regular Section


Keywords: frequent/infrequent item sets, association rules, knowledge discovery, probabilistic analysis


2008 Volume E91.D Issue 3 Pages 781-788


Abstract

We consider a data set in which each example is an n-dimensional Boolean vector labeled as true or false. A pattern is a co-occurrence of a particular combination of values on a given subset of the variables. We call a pattern good if it appears frequently in the true examples and infrequently in the false examples. In this paper, we discuss the problem of determining the data size needed to remove "deceptive" good patterns: in a small data set, many good patterns may appear simply by chance, independently of the underlying structure. Our hypothesis is that, to remove such deceptive good patterns, the data set should contain more examples than the size at which a random data set contains few good patterns. We support this hypothesis with computational studies. We also derive a theoretical upper bound on the needed data size under this hypothesis.
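The notion of a good pattern in a random data set can be illustrated with a minimal sketch. It is not the authors' procedure. The function names, the restriction to patterns over a fixed number of variables (`degree`), and the frequency thresholds `min_true_freq` and `max_false_freq` are all assumptions made for illustration, since the abstract does not say how "frequently" and "infrequently" are quantified.

```python
import itertools
import random


def count_good_patterns(true_examples, false_examples, n, degree,
                        min_true_freq, max_false_freq):
    """Count patterns on `degree` variables that match at least
    `min_true_freq` of the true examples and at most `max_false_freq`
    of the false examples. Both thresholds are illustrative assumptions."""
    count = 0
    for variables in itertools.combinations(range(n), degree):
        for values in itertools.product((0, 1), repeat=degree):
            # A pattern fixes the value of each chosen variable.
            def matches(x):
                return all(x[i] == v for i, v in zip(variables, values))

            true_freq = sum(matches(x) for x in true_examples) / len(true_examples)
            false_freq = sum(matches(x) for x in false_examples) / len(false_examples)
            if true_freq >= min_true_freq and false_freq <= max_false_freq:
                count += 1
    return count


def random_data_set(m, n, rng):
    """Draw m Boolean vectors of dimension n uniformly at random."""
    return [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(m)]


if __name__ == "__main__":
    rng = random.Random(0)
    n, degree = 8, 2
    # Labels carry no structure here, so every good pattern found is
    # deceptive in the sense of the abstract.
    for m in (10, 40, 160):
        true_examples = random_data_set(m, n, rng)
        false_examples = random_data_set(m, n, rng)
        good = count_good_patterns(true_examples, false_examples,
                                   n, degree, 0.4, 0.1)
        print(m, good)
```

Running the loop for increasing `m` shows how the count of chance good patterns in purely random data changes with data size, which is the quantity the hypothesis refers to. The specific sizes and thresholds above are arbitrary choices.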
