In this paper, we apply the unsupervised learning method based on the EM algorithm, which Nigam et al. proposed for text classification, to the noun sense disambiguation problems taken up in the Japanese Translation Task of SENSEVAL2. The method treats the hidden labels of unlabeled data as missing values of the observed data, uses the Naive Bayes model as the generative model, and takes the conditional probabilities p(f|c) (where f is a feature and c is a label) as the parameters of the model. Running the EM algorithm over these parameters improves the learned classifier. In this study, we use only simple features for the classification, namely a few words surrounding the target word. In the experiments, the precision of the Naive Bayes classifier trained on labeled data alone was 58.2%. The precision of the decision list trained on the same data was 58.9%, which is the Ibaraki record in the Translation Task contest. Our unsupervised learning method improved the precision to 61.8% by using unlabeled data in addition to labeled data. Furthermore, after a small part of the labeled data was revised, the precision of the Naive Bayes classifier and of our unsupervised learning method improved to 62.3% and 68.2%, respectively.
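The procedure summarized above can be illustrated with a minimal sketch. It is not the implementation used in the experiments. It assumes a multinomial-style estimate of p(f|c) with add-one smoothing, a fixed number of EM iterations, and a toy data set for an ambiguous word with two invented senses ("finance" and "river"). All function names, the smoothing constant `alpha`, and the data are hypothetical and chosen only for illustration.

```python
import math

def posterior(feats, prior, cond):
    """E-step for one instance: p(c | features) under Naive Bayes."""
    log_p = {}
    for c in prior:
        # Features outside the training vocabulary are skipped.
        log_p[c] = math.log(prior[c]) + sum(
            math.log(cond[c][f]) for f in feats if f in cond[c])
    top = max(log_p.values())  # subtract the max for numerical stability
    expd = {c: math.exp(v - top) for c, v in log_p.items()}
    z = sum(expd.values())
    return {c: expd[c] / z for c in expd}

def m_step(docs, labels, vocab, alpha=1.0):
    """M-step: re-estimate p(c) and p(f|c) from (features, label weights)."""
    class_w = {c: alpha for c in labels}
    feat_w = {c: {f: alpha for f in vocab} for c in labels}
    for feats, weights in docs:
        for c in labels:
            class_w[c] += weights[c]
            for f in feats:
                feat_w[c][f] += weights[c]
    total = sum(class_w.values())
    prior = {c: class_w[c] / total for c in labels}
    cond = {}
    for c in labels:
        z = sum(feat_w[c].values())
        cond[c] = {f: feat_w[c][f] / z for f in vocab}
    return prior, cond

def train_em(labeled, unlabeled, labels, n_iter=10, alpha=1.0):
    """Fit Naive Bayes on labeled data, then refine it with unlabeled data."""
    vocab = sorted({f for feats, _ in labeled for f in feats}
                   | {f for feats in unlabeled for f in feats})
    # Labeled instances keep a fixed 0/1 label distribution throughout.
    hard = [(feats, {c: 1.0 if c == y else 0.0 for c in labels})
            for feats, y in labeled]
    prior, cond = m_step(hard, labels, vocab, alpha)
    for _ in range(n_iter):
        # E-step: the hidden labels of unlabeled data become soft labels.
        soft = [(feats, posterior(feats, prior, cond)) for feats in unlabeled]
        # M-step: update the parameters from labeled plus soft-labeled data.
        prior, cond = m_step(hard + soft, labels, vocab, alpha)
    return prior, cond

def classify(feats, prior, cond):
    """Return the most probable label for a list of context words."""
    post = posterior(feats, prior, cond)
    return max(post, key=post.get)

# Toy data (invented): context words around an ambiguous target word.
LABELS = ["finance", "river"]
LABELED = [(["money", "loan"], "finance"), (["water", "shore"], "river")]
UNLABELED = [["money", "deposit"], ["loan", "deposit"],
             ["water", "fishing"], ["shore", "fishing"]]

prior, cond = train_em(LABELED, UNLABELED, LABELS)
print(classify(["deposit"], prior, cond))
print(classify(["fishing"], prior, cond))
```

In this toy setting the words "deposit" and "fishing" never occur in the labeled data, so a classifier trained on labeled data alone cannot separate them. The unlabeled instances link them to words that are labeled, which is the effect the method relies on.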