Since the late 1980s, Knowledge Discovery in Databases (KDD) has attracted much attention. Briefly, a KDD process can be divided into four steps: (1) data selection, (2) pre-processing (or data cleaning), (3) data mining, and (4) interpretation and evaluation. Among these, pre-processing is regarded as a particularly important step, because it prevents the KDD process from extracting meaningless rules. Generalization of databases, as used in attribute-oriented induction, can be considered a useful pre-processing method. Attribute-oriented induction is a data mining method that generalizes a database under an appropriate abstraction hierarchy in order to find meaningful knowledge. The hierarchy is designed so as to exclude rules that are meaningless from a particular point of view. However, there may be several ways of generalizing a database, depending on the user's intention. It is therefore important to provide a multi-layered abstraction hierarchy under which several generalizations are possible and well controlled. Indeed, databases that are too general or too specific are inappropriate for mining algorithms that aim to extract significant rules. From this viewpoint, we propose a generalization method that uses an information-theoretical measure to select an appropriate abstraction hierarchy. The hierarchy can be regarded as a layered abstraction, where each abstraction is defined as a grouping of attribute values at the concrete level. Furthermore, we consider a data selection method for extracting meaningful rules. The method controls weight values (called votes) to extract a subset of tuples from the original database. The subset should be selected so that it forms a relatively meaningful mass of tuples. The same discovery method is then applied to the subset to produce a hidden rule that holds for the subset rather than for the whole database.
Finally, we present a system, ITA (Information Theoretical Abstraction), based on these generalization and selection methods, and test it on a census database to show the effectiveness and validity of ITA.
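
As a rough illustration of selecting an abstraction by an information-theoretical measure, the following sketch generalizes one attribute under several candidate groupings and keeps the grouping that retains the most information about a target attribute. The abstract does not state which measure ITA uses; mutual information is assumed here purely for illustration, and the function names, attribute values, and groupings are all hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Mutual information I(X;Y) in bits for a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def generalize(rows, grouping):
    """Replace each concrete attribute value by its abstract group."""
    return [(grouping[value], target) for value, target in rows]


def select_abstraction(rows, candidates):
    """Return the name of the grouping that keeps the most information
    about the target attribute (an assumed criterion, not ITA's own)."""
    return max(
        candidates,
        key=lambda name: mutual_information(generalize(rows, candidates[name])),
    )


# Hypothetical census-like tuples: (occupation, income class).
rows = [
    ("teacher", "low"),
    ("professor", "high"),
    ("nurse", "low"),
    ("surgeon", "high"),
]

# Two hypothetical abstractions of the occupation attribute.
candidates = {
    "by_field": {
        "teacher": "education", "professor": "education",
        "nurse": "medicine", "surgeon": "medicine",
    },
    "by_rank": {
        "teacher": "practitioner", "nurse": "practitioner",
        "professor": "senior", "surgeon": "senior",
    },
}

best = select_abstraction(rows, candidates)
print(best)
```

In this toy data, grouping by field mixes both income classes inside every group, while grouping by rank separates them, so the second grouping is preferred under the assumed measure.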