Host: The Japanese Society for Artificial Intelligence
Name: 34th Annual Conference, 2020
Number: 34
Location: Online
Date: June 9, 2020 - June 12, 2020
A backdoor attack is a model poisoning attack against machine learning systems such as deep neural networks (DNNs). In a backdoor attack against an image classification system, an adversary creates tampered data that contain adversarial marks and injects them into the training dataset. A DNN model trained on the tampered training dataset achieves high classification accuracy on clean input data, but input data with the adversarial marks are misclassified as the adversary's target label. In this paper, we propose a countermeasure against the backdoor attack that utilizes knowledge distillation. The user of a DNN model distills clean knowledge from the backdoored model using clean unlabeled data. The distilled model achieves high classification accuracy without being affected by the backdoor. Furthermore, the user can identify the tampered data injected into the training dataset by comparing the classification results of the backdoored model and the distilled model.
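A minimal sketch of the two steps described above, assuming PyTorch. The model architectures, temperature T, learning rate, and the loader names (unlabeled_loader, dataset_loader) are illustrative assumptions, not the paper's exact setup.

    # Sketch: distill a clean student from a backdoored teacher, then
    # flag suspected tampered training samples by prediction disagreement.
    import torch
    import torch.nn.functional as F

    def distill_clean_knowledge(teacher, student, unlabeled_loader,
                                epochs=10, T=2.0, lr=1e-3, device="cpu"):
        """Train `student` to mimic `teacher` on clean unlabeled data.
        The adversarial marks never appear in this data, so the backdoor
        behavior is not transferred to the student."""
        teacher.to(device).eval()
        student.to(device).train()
        optimizer = torch.optim.Adam(student.parameters(), lr=lr)
        for _ in range(epochs):
            for x in unlabeled_loader:  # clean, unlabeled inputs only
                x = x.to(device)
                with torch.no_grad():
                    t_logits = teacher(x)
                s_logits = student(x)
                # Soften both output distributions with temperature T
                # and match them with a KL-divergence distillation loss.
                loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                                F.softmax(t_logits / T, dim=1),
                                reduction="batchmean") * (T * T)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return student

    def flag_tampered(teacher, student, dataset_loader, device="cpu"):
        """Flag training samples on which the backdoored teacher and the
        distilled student disagree; these are candidates for tampered data."""
        teacher.eval()
        student.eval()
        flags = []
        with torch.no_grad():
            for x, _ in dataset_loader:
                x = x.to(device)
                disagree = teacher(x).argmax(1) != student(x).argmax(1)
                flags.extend(disagree.cpu().tolist())
        return flags

The detection step relies on the two models agreeing on clean inputs: on a tampered sample, the backdoored model predicts the adversary's target label while the distilled model predicts the true class, so disagreement marks the sample as suspicious.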