Identification of Cybersecurity Specific Content Using Different Language Models

Otgonpurev Mendsaikhan; Hirokazu Hasegawa; Yukiko Yamaguchi; Hajime Shimada; Enkhbold Bataa

doi:10.2197/ipsjjip.28.623

Keywords: cyber threat, NLP, text classification

Journal, free access

2020, Volume 28, pp. 623-632

Abstract

Given the sheer volume of digital text publicly available on the Internet, it is becoming increasingly challenging for security analysts to identify cyber-threat-related content. In this research, we propose an autonomous system that identifies cyber threat information from publicly available information sources. We examined different language models for use as a cybersecurity-specific filter in the proposed system. Using domain-specific training data, we trained Doc2Vec and BERT models and compared their performance. According to our evaluation, the BERT-based Natural Language Filter identifies and classifies cybersecurity-specific natural language text with 90% accuracy.
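
The filter-and-evaluate idea described in the abstract can be illustrated with a minimal sketch. It does not reproduce the authors' Doc2Vec or BERT models: a bag-of-words Naive Bayes classifier stands in for the language model so that the example runs with the Python standard library alone. The training snippets, the class name `NaiveBayesFilter`, and the helper `accuracy` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Toy, made-up snippets for illustration only (NOT the paper's training data).
# Label 1 = cybersecurity-related, label 0 = unrelated.
TRAIN = [
    ("new ransomware exploits unpatched vpn vulnerability", 1),
    ("attackers use phishing emails to steal credentials", 1),
    ("malware campaign targets banking servers with exploit", 1),
    ("critical vulnerability allows remote code execution", 1),
    ("local bakery wins award for sourdough bread", 0),
    ("football team celebrates championship victory", 0),
    ("new recipe for summer pasta salad", 0),
    ("city council approves park renovation budget", 0),
]


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayesFilter:
    """Multinomial Naive Bayes with add-one smoothing.

    Assumption: this is a simple stand-in for the Doc2Vec/BERT filters
    named in the abstract, not the authors' method.
    """

    def fit(self, examples):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = Counter()
        for text, label in examples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def _log_score(self, tokens, label):
        # Log prior plus smoothed log likelihood of each known token.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # tokens never seen in training are skipped
                score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def predict(self, text):
        tokens = tokenize(text)
        return 1 if self._log_score(tokens, 1) > self._log_score(tokens, 0) else 0


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(model.predict(text) == label for text, label in examples)
    return correct / len(examples)


if __name__ == "__main__":
    model = NaiveBayesFilter().fit(TRAIN)
    held_out = [
        ("phishing emails deliver ransomware", 1),
        ("bakery shares bread recipe", 0),
    ]
    print("held-out accuracy:", accuracy(model, held_out))
```

In a real pipeline the `predict` step would be replaced by a trained Doc2Vec or BERT classifier, and accuracy would be measured on a held-out portion of the domain-specific data rather than on a handful of toy sentences.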
