Host: The Japanese Society for Artificial Intelligence
Name : The 97th SIG-SLUD
Number : 97
Location : [in Japanese]
Date : March 08, 2023 - March 09, 2023
Pages 38-43
Posts on social networking services (SNS) are a valuable source of information because they contain a wide variety of content. However, SNS posts contain unique expressions that differ from those used in newspapers and other media. They are therefore difficult to analyze with conventional natural language processing and require special handling. In this study, we focus on one class of these unique expressions, Split-Characters. A Split-Character is a single character that has been divided and written as multiple characters. A previous study used OCR to process Split-Characters visually. However, because OCR identifies Split-Characters through character recognition alone, it does not use contextual information and does not consider whether the corrected sentence is appropriate. In this study, we propose methods for interpreting Split-Characters using contextual information. We use three models that capture context: N-gram, RNN, and BERT. We propose interpretation methods based on each of these models and verify whether they can convert Split-Characters into the correct characters.
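To make the idea of context-based interpretation concrete, the following is a minimal sketch of how an N-gram model could decide whether a character sequence should be merged into one character. It is not the method evaluated in the paper. The merge table, the toy corpus, the add-one smoothing, the per-transition length normalization, and all function names are assumptions introduced only for illustration.

```python
import math
from collections import Counter

# Hypothetical table: split sequence -> single character it may stand for.
MERGE_TABLE = {"ネ申": "神", "タヒ": "死"}

# Toy corpus standing in for real training text (illustrative assumption).
CORPUS = ["彼は神だ", "神に祈る", "タヒチに行く", "虫が死んだ"]


def train_bigram(sentences):
    """Count character unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams


def avg_log_prob(sentence, unigrams, bigrams):
    """Average add-one-smoothed bigram log-probability per transition.

    Averaging keeps shorter (merged) candidates from winning on length alone.
    """
    vocab = len(unigrams) + 1  # one extra slot for unseen characters
    chars = ["<s>"] + list(sentence) + ["</s>"]
    pairs = list(zip(chars, chars[1:]))
    total = sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab)) for a, b in pairs
    )
    return total / len(pairs)


def candidates(sentence):
    """Return the original sentence plus each variant with one sequence merged."""
    results = {sentence}
    for split, merged in MERGE_TABLE.items():
        start = sentence.find(split)
        while start != -1:
            results.add(sentence[:start] + merged + sentence[start + len(split):])
            start = sentence.find(split, start + 1)
    return results


def interpret(sentence, unigrams, bigrams):
    """Pick the candidate that the bigram model scores highest."""
    return max(candidates(sentence), key=lambda c: avg_log_prob(c, unigrams, bigrams))


if __name__ == "__main__":
    uni, bi = train_bigram(CORPUS)
    print(interpret("彼はネ申だ", uni, bi))
    print(interpret("タヒチに行く", uni, bi))
```

In this sketch, the same two-character sequence can be merged in one sentence and left alone in another, depending on which reading the surrounding characters make more probable. That is the kind of decision character recognition alone cannot make. An RNN or BERT scorer could in principle replace `avg_log_prob` without changing the candidate-generation step.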