2017 Volume 24 Issue 2 Pages 297-314
Social media texts are often written in a non-standard style and include many lexical variants such as insertions, phonetic substitutions, and abbreviations that mimic spoken language. The normalization of such a variety of non-standard tokens is one promising solution for handling noisy text. A normalization task is very difficult for the morphological analysis of Japanese text because there are no explicit boundaries between words. To address this issue, we propose a novel method herein for normalizing and morphologically analyzing Japanese noisy text. First, we extract character-level transformation patterns based on a character alignment model using annotated data. Next, we generate both character-level and word-level normalization candidates using character transformation patterns and search for the optimal path based on a discriminative model. Experimental results show that the proposed method exceeds conventional rule-based system in both accuracy and recall for word segmentation and POS (Part of Speech) tagging.