The purpose of this paper is to re-examine research on text categorization and to discuss its future direction. Text categorization, the assignment of texts to predefined categories based on their content, involves many procedures. The basic elements that constitute automated text categorization are text structure, data size, feature extraction, feature selection, text representation, similarity measure, category representation, category assignment method, and evaluation method. Each element, and the relationships among the elements, were clarified through a review of previous research on text categorization. The results are as follows: a) text structure and feature selection have a large influence on the performance of text categorization; b) category representation and similarity measure are closely connected with each other; c) feature extraction, an important element, is influenced by outside factors, yet it also has a large influence on the performance of text categorization. Furthermore, text categorization for Web pages is discussed, and new problems with text structure and feature selection are addressed. For Web pages, text structure becomes a more important element for improving the performance of text categorization. Feature selection faces new problems, such as how to select features from small texts, in addition to the existing ones.