Article ID: 2023EDP7160
A backdoor sample attack causes a deep neural network to misclassify inputs that contain a specific trigger, because the network has been trained on malicious data into which the trigger was inserted. The backdoored network classifies trigger-free inputs correctly but misclassifies inputs containing the trigger. Such backdoor attacks have been studied mainly in the image domain, and defense research in the text domain remains insufficient. In this study, we propose a method for defending against textual backdoor samples using a detection model. The proposed method detects a textual backdoor sample by comparing the output of the target model with that of a model trained on the original training data. The method can therefore defend against attacks without requiring access to the entire training data. For the experimental setup, we used the TensorFlow library, with MR and IMDB as the experimental datasets. When a partial training set of 1,000 samples was used to train the detection model, the proposed method achieved detection rates of 79.6% on MR and 83.2% on IMDB.
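To make the comparison step concrete, the following is a minimal sketch of output-based detection, assuming both models are Keras classifiers with softmax outputs and that the decision is made by thresholding a per-sample divergence between the two predictive distributions. The function name `detect_backdoor_samples`, the use of KL divergence, and the threshold value are illustrative assumptions, not the paper's exact scoring rule.

```python
import numpy as np
import tensorflow as tf


def detect_backdoor_samples(target_model, reference_model, inputs, threshold=0.5):
    """Flag inputs whose predictions diverge between the (possibly
    backdoored) target model and a reference model trained on a clean
    partial subset of the original training data.

    The divergence measure and threshold are assumptions for this
    sketch; the paper compares model outputs but any per-sample
    distance over the two distributions could be substituted.
    """
    # Class-probability outputs from both models (softmax assumed).
    p_target = target_model.predict(inputs, verbose=0)
    p_reference = reference_model.predict(inputs, verbose=0)

    # Per-sample KL divergence between the two predictive distributions.
    eps = 1e-12
    kl = np.sum(p_target * np.log((p_target + eps) / (p_reference + eps)), axis=-1)

    # Inputs with large divergence are flagged as backdoor candidates:
    # a clean input should be classified similarly by both models,
    # while a triggered input shifts only the target model's output.
    return kl > threshold
```

In this sketch, the reference model plays the role of the detection model trained on the 1,000-sample partial training set; because it is trained on only a small clean subset, the defense does not need access to the full training data.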