Identification of Cybersecurity Specific Content Using Different Language Models

Otgonpurev Mendsaikhan; Hirokazu Hasegawa; Yukiko Yamaguchi; Hajime Shimada; Enkhbold Bataa

doi:10.2197/ipsjjip.28.623

Abstract

Given the sheer volume of digital text publicly available on the Internet, it is becoming increasingly challenging for security analysts to identify cyber threat-related content. In this research, we propose an autonomous system that identifies cyber threat information from publicly available information sources. We examined different language models for use as a cybersecurity-specific filter in the proposed system. Using domain-specific training data, we trained Doc2Vec and BERT models and compared their performance. According to our evaluation, the BERT-based Natural Language Filter identifies and classifies cybersecurity-specific natural language text with 90% accuracy.
