Journal of Japan Industrial Management Association
Online ISSN : 2187-9079
Print ISSN : 1342-2618
ISSN-L : 1342-2618
A Theoretical Analysis of Document Classification based on a High-dimensional Vector Space Model: Asymptotic Analysis of Classification Performance and Distance Measures (Theory and Methodology)
Masayuki GOTO, Takashi ISHIDA, Makoto SUZUKI, Shigeichi HIRASAWA
JOURNAL FREE ACCESS

2010 Volume 61 Issue 3 Pages 97-106

Abstract
This paper addresses problems in document classification, an important application of text mining. Many models and algorithms have been proposed for text classification; one family of techniques uses a vector space model. In these methods, a digital document is represented as a point in a vector space constructed by applying morphological analysis and counting the frequency of each word in the document. Documents can then be classified using a distance measure between documents. However, the vector space model for document classification has specific characteristics. First, it is not easy to remove unnecessary words completely by automatic means; the presence of unnecessary words is characteristic of text mining problems. Second, the dimension of the word vector space is usually huge compared with the number of words appearing in a document. Although the frequency of each word appearing in a document is small in many cases, many kinds of such low-frequency words can usually be used to classify documents. In this paper, we evaluate the performance of document classification when unnecessary words are included in the word set. Moreover, we analyze the performance of distance measures between documents in a high-dimensional word vector space. The asymptotic results on the distance measures explain the fact, observed in many experiments, that classification using the empirical distance between documents calculated via the cosine measure is not particularly bad. The results also suggest that the KL-divergence is not useful for text mining problems.
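As a rough illustration of the two distance measures named in the abstract, the following minimal sketch computes the cosine measure and the KL-divergence between word-frequency vectors. It is not the authors' method: the whitespace tokenizer stands in for morphological analysis, and the function names `word_counts`, `cosine_measure`, and `kl_divergence` as well as the two sample sentences are assumptions made only for this example.

```python
import math
from collections import Counter


def word_counts(text):
    """Word-frequency vector; whitespace splitting is an assumed
    stand-in for real morphological analysis."""
    return Counter(text.lower().split())


def cosine_measure(a, b):
    """Cosine of the angle between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def kl_divergence(a, b):
    """KL(P_a || P_b) between empirical word distributions.
    Returns infinity when a word of `a` never occurs in `b`."""
    total_a = sum(a.values())
    total_b = sum(b.values())
    kl = 0.0
    for w, c in a.items():
        if b[w] == 0:  # Counter returns 0 for missing words
            return math.inf
        p = c / total_a
        q = b[w] / total_b
        kl += p * math.log(p / q)
    return kl


# Hypothetical sample documents, used only for illustration.
doc1 = word_counts("the cat sat on the mat")
doc2 = word_counts("the cat lay on the rug")

print(cosine_measure(doc1, doc2))  # finite value between 0 and 1
print(kl_divergence(doc1, doc2))   # infinite: "sat" and "mat" are absent from doc2
```

In this toy case the cosine measure stays finite, while the empirical KL-divergence is infinite as soon as one document contains a word the other lacks, which happens often when the vocabulary is large relative to document length. Whether this is the mechanism behind the paper's asymptotic result is not stated in the abstract.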
© 2010 Japan Industrial Management Association