IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Regular Section
Improving Text Categorization with Semantic Knowledge in Wikipedia
Xiang WANGYan JIARuhua CHENHua FANBin ZHOU
Author information
JOURNAL FREE ACCESS

2013 Volume E96.D Issue 12 Pages 2786-2794

Details
Abstract

Text categorization, especially short text categorization, is a difficult and challenging task since the text data is sparse and multidimensional. In traditional text classification methods, document texts are represented with “Bag of Words (BOW)” text representation schema, which is based on word co-occurrence and has many limitations. In this paper, we mapped document texts to Wikipedia concepts and used the Wikipedia-concept-based document representation method to take the place of traditional BOW model for text classification. In order to overcome the weakness of ignoring the semantic relationships among terms in document representation model and utilize rich semantic knowledge in Wikipedia, we constructed a semantic matrix to enrich Wikipedia-concept-based document representation. Experimental evaluation on five real datasets of long and short text shows that our approach outperforms the traditional BOW method.

Content from these authors
© 2013 The Institute of Electronics, Information and Communication Engineers
Previous article Next article
feedback
Top