Over the past few years, convolutional neural networks (CNNs) have set the state of the art in a wide variety of supervised computer vision problems. Most research effort has focused on single-label classification, owing to the availability of the large-scale ImageNet dataset. Via pre-training on this dataset, CNNs have also been shown to outperform traditional methods for multi-label classification. Such methods, however, typically require many expensive forward passes to produce a multi-label distribution. Furthermore, due to the lack of a large-scale multi-label dataset, little effort has been invested in training CNNs from scratch with multi-label data. In this paper, we address both issues by introducing a multi-label cost function suited to deep CNNs, and a prediction method that requires only a single forward pass to produce multi-label predictions. We demonstrate our method on a newly introduced large-scale multi-label dataset of animation images, where it reaches 75.1% precision and 66.5% accuracy, making it suitable for automated annotation in practice. Additionally, we apply our method to the Pascal VOC 2007 dataset of natural images and show that our prediction method outperforms a comparable model at a fraction of the computational cost.
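The abstract does not give the exact form of the cost function or the prediction rule, so the following is only a minimal illustrative sketch of the general idea it describes: a CNN with one output per label, trained with a multi-label cost (here assumed to be per-label sigmoid cross-entropy), and multi-label prediction obtained from a single forward pass by thresholding the per-label probabilities. The backbone, NUM_LABELS, and THRESHOLD below are assumptions for illustration, not the paper's architecture or settings.

```python
# Illustrative sketch only; the paper's actual cost function and threshold are not
# specified in the abstract. This assumes per-label sigmoid + binary cross-entropy.
import torch
import torch.nn as nn

NUM_LABELS = 20   # e.g. the 20 Pascal VOC 2007 classes (assumption)
THRESHOLD = 0.5   # assumed decision threshold

# Hypothetical toy CNN ending in one logit per label.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, NUM_LABELS),   # raw logits, one per label
)

# Independent sigmoid cross-entropy over all labels (a common multi-label cost).
criterion = nn.BCEWithLogitsLoss()

# Training step on a dummy batch; targets are multi-hot vectors.
images = torch.randn(4, 3, 224, 224)
targets = torch.randint(0, 2, (4, NUM_LABELS)).float()
loss = criterion(model(images), targets)
loss.backward()

# Prediction: a single forward pass, then threshold the per-label probabilities.
with torch.no_grad():
    probs = torch.sigmoid(model(images))
    predicted_labels = probs > THRESHOLD   # boolean multi-label output per image
```

The key point the sketch illustrates is that all label scores come from one forward pass, in contrast to approaches that evaluate the network repeatedly (e.g. over many region proposals) to assemble a multi-label prediction.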