Abstract
In this paper, Support Vector Machines (SVMs) are used for multi-class classification of Chinese official documents. Several data retrieval techniques, including sentence segmentation, term weighting, and feature extraction, are adopted to implement our system. We observe that most misclassified documents are difficult to label because their contents are indistinguishable. Such indistinguishable documents should therefore be identified by the system in advance. To enhance classification accuracy and distinguishability, we first propose a general approach for identifying documents that are likely to be misclassified. We then present four one-against-all (OAA) SVM classification methods based on different strategies for learning from these indistinguishable or misclassified documents. These methods can identify misclassified (indistinguishable) documents in advance and achieve accurate classification. Our experiments show that adding both the indistinguishable documents and the misclassified ones to the training set increases classification accuracy, and that this strategy is the most suitable for Chinese official document classification.