Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
Discovering Unusual Word Usages with Masked Language Model via Pseudo-label Training
Tatsuya Aoki, Jey Han Lau, Hidetaka Kamigaito, Hiroya Takamura, Timothy Baldwin, Manabu Okumura
JOURNAL FREE ACCESS

2025 Volume 32 Issue 1 Pages 134-175

Abstract

User-generated texts contain not only non-standard words, such as b4 for before, but also unusual word usages, such as catfish for a person who uses a fake identity online. Handling such cases in natural language processing requires knowledge about the words involved. We present a neural model for detecting non-standard usages in social media text. To deal with the lack of training data for this task, we propose a method for synthetically generating pseudo non-standard examples from a corpus, which enables us to train the model without manually annotated training data and in any language. Experimental results on Twitter and Reddit datasets show that our proposed method outperforms existing methods and is effective across different languages.
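
The abstract does not say how the pseudo non-standard examples are generated. As a rough illustration only, the sketch below assumes one simple strategy: keep a word in its original context as a "standard" example, and place a randomly sampled different word in the same position as a "pseudo non-standard" example. The function name `make_pseudo_examples`, the label scheme, and the random-swap strategy are all hypothetical and are not taken from the paper.

```python
import random


def make_pseudo_examples(sentences, seed=0):
    """Build (tokens, position, label) triples from tokenized sentences.

    Assumption (not from the paper): a word in its original context is
    treated as standard usage (label 0), and a random different word put
    in that same context is treated as pseudo non-standard usage (label 1).
    """
    rng = random.Random(seed)
    # Vocabulary is just the set of tokens seen in the corpus.
    vocab = sorted({tok for sent in sentences for tok in sent})
    examples = []
    for sent in sentences:
        if not sent:
            continue  # nothing to sample from an empty sentence
        pos = rng.randrange(len(sent))
        # Original word in its own context: assumed standard usage.
        examples.append((list(sent), pos, 0))
        candidates = [w for w in vocab if w != sent[pos]]
        if not candidates:
            continue  # corpus has a single word type, so no swap is possible
        swapped = list(sent)
        swapped[pos] = rng.choice(candidates)
        # Different word in the same context: pseudo non-standard usage.
        examples.append((swapped, pos, 1))
    return examples
```

Under these assumptions no manual annotation is needed, since the labels follow from the construction itself. A classifier trained on such pairs would then be asked to flag real words whose context does not match their usual usage.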

© 2025 The Association for Natural Language Processing