Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
Dataset Distillation with Attention Labels for Fine-tuning BERT
Aru Maekawa, Naoki Kobayashi, Kotaro Funakoshi, Manabu Okumura

2025 Volume 32 Issue 1 Pages 283-299

Abstract

Dataset distillation aims to create a small dataset of informative synthetic samples that can rapidly train neural networks while retaining the performance obtained with the original dataset. In this study, we focus on constructing distilled few-shot datasets for natural language processing (NLP) tasks to fine-tune pre-trained transformers. Specifically, we propose introducing attention labels, which efficiently distill knowledge from the original dataset and transfer it to transformer models via attention probabilities. We evaluated our dataset distillation methods on four NLP tasks and demonstrated that distilled few-shot datasets with attention labels achieve impressive performance for fine-tuning BERT. In particular, on AGNews, a four-class news classification task, our distilled few-shot dataset achieved up to 93.2% accuracy, 98.5% of that obtained with the full original dataset, even with only one sample per class and only one gradient step.
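To make the idea of attention labels concrete, the following is a minimal sketch (not the authors' released code) of the inner fine-tuning step: BERT is updated on a single distilled sample using a soft-label loss on the logits plus a KL term that pulls the model's attention probabilities toward the attention labels. The outer optimization that actually learns the distilled embeddings, soft labels, and attention labels is not shown, and all tensor names (`distilled_embeds`, `soft_label`, `attn_label`) and the unweighted loss sum are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import BertForSequenceClassification

# Four classes, as in AGNews (assumption: bert-base-uncased backbone).
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4)

seq_len, hidden, n_layers, n_heads = 32, 768, 12, 12

# One synthetic (distilled) sample, here filled with random values as a stand-in
# for the learned distilled data: input embeddings, a soft class label, and
# attention labels (target attention probabilities per layer).
distilled_embeds = torch.randn(1, seq_len, hidden)
soft_label = torch.softmax(torch.randn(1, 4), dim=-1)
attn_label = torch.softmax(
    torch.randn(n_layers, 1, n_heads, seq_len, seq_len), dim=-1)

outputs = model(inputs_embeds=distilled_embeds, output_attentions=True)

# Task loss: match the model's predictive distribution to the soft label.
task_loss = F.kl_div(F.log_softmax(outputs.logits, dim=-1),
                     soft_label, reduction="batchmean")

# Attention loss: match each layer's attention probabilities to its label.
attn_loss = sum(
    F.kl_div(torch.log(attn + 1e-12), attn_label[i], reduction="batchmean")
    for i, attn in enumerate(outputs.attentions)
) / n_layers

loss = task_loss + attn_loss          # equal weighting is an assumption
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss.backward()
optimizer.step()                      # the single gradient step from the abstract
```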

© 2025 The Association for Natural Language Processing