Adversarial examples can be used to exploit vulnerabilities in neural networks and threaten their sensitive applications. Adversarial attacks evolve rapidly and quickly render obsolete any defense method that assumes a specific attack. This paper proposes a new defense method that does not assume a specific adversarial attack and shows that it can efficiently protect a network from a variety of adversarial attacks. Adversarial perturbations are small in magnitude; consequently, image quality recovery methods, which typically include a smoothing effect, are an effective way to remove them. The proposed method, called the denoising-based perturbation removal network (DPRNet), aims to eliminate perturbations generated by adversarial attacks on image classification tasks. DPRNet is an encoder–decoder network that is trained without any adversarial images yet can reconstruct the correct image from an adversarial one. To optimize DPRNet's parameters for eliminating adversarial perturbations, we also propose a new perturbation removal loss (PRloss) metric, which consists of a reconstruction loss and a Kullback–Leibler divergence loss that expresses the difference between the class probability distributions of the original image and the reconstructed image. To remove adversarial perturbations, the proposed network is trained on various types of distorted images using the proposed PRloss metric. Thus, DPRNet eliminates image perturbations, allowing the images to be classified correctly. We evaluate the proposed method on the MNIST, CIFAR-10, SVHN, and Caltech 101 datasets and show that it invalidates 99.8%, 95.1%, 98.7%, and 96.0%, respectively, of the adversarial images generated by several adversarial attacks on these datasets.
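To make the composition of PRloss concrete, the following is a minimal sketch of a loss with the structure described above: a reconstruction term plus a KL divergence between the class probability distributions of the original and reconstructed images. The MSE reconstruction term, the weighting factor `lam`, and the names `pr_loss`, `x_orig`, `x_recon` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pr_loss(x_orig: torch.Tensor,
            x_recon: torch.Tensor,
            logits_orig: torch.Tensor,
            logits_recon: torch.Tensor,
            lam: float = 1.0) -> torch.Tensor:
    """Sketch of a PRloss-style objective (assumed form, not the paper's exact definition)."""
    # Reconstruction term: how far the reconstructed image is from the original.
    recon = F.mse_loss(x_recon, x_orig)
    # KL term: difference between the classifier's class probability
    # distributions for the original and reconstructed images.
    p_orig = F.softmax(logits_orig, dim=1)
    log_p_recon = F.log_softmax(logits_recon, dim=1)
    kl = F.kl_div(log_p_recon, p_orig, reduction="batchmean")
    return recon + lam * kl
```

In such a formulation, the reconstruction term pulls the denoised image toward the clean input, while the KL term encourages the downstream classifier to assign the reconstructed image the same class probabilities as the original, which is the behavior the abstract attributes to PRloss.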