In this paper, we propose the combined use of transformed images and vision transformer (ViT) models transformed with a secret key. We show for the first time that models trained with plain images can be directly transformed into models for encrypted images on the basis of the ViT architecture, and that the performance of the transformed models is the same as that of models trained with plain images when test images encrypted with the key are used. In addition, the proposed scheme does not require any specially prepared data for training models or any network modification, so it also allows us to easily update the secret key. In an experiment, the effectiveness of the proposed scheme is evaluated in terms of performance degradation and model protection performance in an image classification task on the CIFAR-10 dataset.
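As a rough illustration of how a keyed image transformation can be paired with a matching transformation of a ViT model, the following is a minimal sketch that assumes block scrambling aligned with the ViT patch size and a corresponding permutation of the position embedding; the function names and the choice of transformation are assumptions for illustration, not the scheme defined in the paper.

```python
import numpy as np

def keyed_permutation(n, key):
    """Derive a fixed permutation of n elements from a secret key."""
    rng = np.random.default_rng(key)
    return rng.permutation(n)

def scramble_blocks(image, patch=16, key=1234):
    """Encrypt an image by shuffling its patch-size blocks with the key."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    blocks = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    blocks = blocks.reshape(gh * gw, patch, patch, c)
    blocks = blocks[keyed_permutation(gh * gw, key)]
    blocks = blocks.reshape(gh, gw, patch, patch, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(h, w, c)

def transform_position_embedding(pos_emb, key=1234):
    """Apply the same keyed permutation to the ViT position embedding
    (index 0 is the class token and is left in place)."""
    perm = keyed_permutation(pos_emb.shape[0] - 1, key)
    out = pos_emb.copy()
    out[1:] = pos_emb[1:][perm]
    return out
```

Only a user who encrypts test images with the same key sees the patches in the order the transformed model expects, which is what ties the model's performance to possession of the key.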
In this paper, we propose an access control method with a secret key for object detection models, for the first time, so that unauthorized users without the secret key cannot benefit from the performance of trained models. The method enables us not only to provide high detection performance to authorized users but also to degrade the performance for unauthorized users. The use of transformed images has been proposed for the access control of image classification models, but such images cannot be used for object detection models because of performance degradation. Accordingly, in this paper, selected feature maps, instead of input images, are encrypted with a secret key for training and testing models. In an experiment, the protected models provided authorized users with almost the same performance as non-protected models while also being robust against unauthorized access without the key.
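One simple way to encrypt a selected feature map with a key is a key-derived channel permutation, sketched below; this is only an assumption consistent with the abstract, and the transformation actually used in the paper may differ.

```python
import torch

def encrypt_feature_map(x, key, layer_id=0):
    """Permute the channels of a selected feature map (N, C, H, W) with a
    permutation derived from the secret key. The same permutation must be
    applied at both training and test time, so only users holding the key
    obtain the intended detection performance."""
    g = torch.Generator().manual_seed(key + layer_id)
    perm = torch.randperm(x.shape[1], generator=g)
    return x[:, perm]
```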
Recently, deep learning methods for guided image generation have been progressing. Many methods have been proposed to generate an animation of facial expression changes from a single face image by transferring facial expression information to that image. In particular, methods that use facial landmarks as the facial expression information can generate a wide variety of facial expressions. However, most of these methods focus on human faces rather than anime characters. Moreover, when we applied several existing methods to anime characters by training them on an anime character face dataset, they generated images with noise, even in regions where there was no change. The first order motion model (FOMM) is an image generation method that takes two images as input and transfers the facial expression or pose of one to the other. By explicitly calculating the difference between the two images on the basis of optical flow, FOMM can generate images with little noise in the unchanged regions. In this work, we focus on the face image generation aspect of FOMM. FOMM, however, cannot use a facial landmark image as the facial expression target because the appearances of a face image and a facial landmark image are quite different. Therefore, we propose an advanced FOMM method that uses facial landmarks as the facial expression target. In the proposed method, we change the input data and data flow so that facial landmarks can be used. Additionally, to generate face images whose expressions follow the target landmarks more closely, we introduce a landmark estimation loss, which is computed by comparing the landmarks detected from the generated image with the target landmarks. Our experiments on an anime character face image dataset demonstrated that our method is effective for landmark-guided face image generation for anime characters. Furthermore, our method outperformed other methods quantitatively and generated face images with less noise.
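A minimal sketch of a landmark estimation loss as described above follows; the detector interface and the use of an L1 distance are assumptions, since the abstract does not specify the norm or the detector.

```python
import torch

def landmark_estimation_loss(generated, target_landmarks, detector):
    """Penalize the distance between the landmarks detected on the generated
    face image and the target landmarks.
    generated:         (B, 3, H, W) generated face images
    target_landmarks:  (B, K, 2) target landmark coordinates
    detector:          callable returning (B, K, 2) landmarks for an image batch
    """
    pred_landmarks = detector(generated)
    return torch.mean(torch.abs(pred_landmarks - target_landmarks))
```

In training, this term would be added to the usual FOMM reconstruction objectives so that the generated expression follows the target landmarks more closely.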
Monocular depth estimation has improved drastically due to the development of deep neural networks (DNNs). However, recent studies have revealed that DNNs for monocular depth estimation contain vulnerabilities that can lead to misestimation when perturbations are added to the input. This study investigates whether DNNs for monocular depth estimation are vulnerable to misestimation when patterned light is projected onto an object with a video projector. To this end, this study proposes an evolutionary adversarial attack method with a multi-fidelity evaluation scheme that allows creating adversarial examples under black-box conditions while suppressing the computational cost. Experiments in both simulated and real scenes showed that the designed light pattern caused a DNN to misestimate objects as if they had moved farther back.
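The following is a minimal sketch of an evolutionary black-box search with multi-fidelity evaluation in the spirit of the abstract: candidate light patterns are screened with a cheap (low-fidelity) evaluation and only the most promising ones are re-evaluated at high fidelity. The score_fn interface, population settings, and mutation scheme are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def multi_fidelity_attack(score_fn, dim, pop=20, gens=50, keep=5, seed=0):
    """Evolutionary search for a light pattern (a vector in [0, 1]^dim) that
    maximizes the depth-estimation error of a black-box DNN.
    score_fn(pattern, fidelity) -> induced depth error, where fidelity="low"
    is a cheap approximate evaluation and fidelity="high" is the full one."""
    rng = np.random.default_rng(seed)
    best = rng.uniform(0, 1, dim)
    best_score = score_fn(best, fidelity="high")
    for _ in range(gens):
        # Mutate the current best pattern to form a candidate population.
        cands = np.clip(best + 0.1 * rng.standard_normal((pop, dim)), 0, 1)
        # Cheap screening pass over all candidates.
        cheap = np.array([score_fn(c, fidelity="low") for c in cands])
        # Re-evaluate only the top candidates at high fidelity.
        for c in cands[np.argsort(cheap)[-keep:]]:
            s = score_fn(c, fidelity="high")
            if s > best_score:
                best, best_score = c, s
    return best, best_score
```

The point of the two-level evaluation is that most of the budget is spent on the cheap screening, which keeps the number of expensive high-fidelity evaluations small.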
There have been many previous studies on fluency evaluation of spontaneous speech. However, most of them focus on lexical cues, and little emphasis is placed on how diverse acoustic features and deep end-to-end models contribute to improving performance. In this paper, we describe a multi-layer neural network that uses not only lexical features extracted from transcriptions but also utterance-level acoustic features from audio data. We also conduct experiments to investigate the performance of end-to-end approaches with mel-spectrogram input on this task. As the speech fluency evaluation task, we evaluate our proposed method on two binary classification tasks: fluent speech detection and disfluent speech detection. Speech data of around 10 seconds each, annotated with the three classes of “fluent,” “neutral,” and “disfluent,” are used for evaluation. According to two-way splits of these three classes, the fluent speech detection task is defined as binary classification of fluent vs. neutral and disfluent, while the disfluent speech detection task is defined as binary classification of fluent and neutral vs. disfluent. We then conduct experiments for a comparative evaluation of the multi-layer neural network with diverse features as well as the end-to-end models. For fluent speech detection, when comparing utterance-level disfluency-based, prosodic, and acoustic features with the multi-layer neural network, using only disfluency-based and prosodic features performs better. More specifically, performance improved considerably when all acoustic features were removed from the full feature set, whereas it degraded considerably when filler-related features were removed. Overall, however, the end-to-end Transformer+VGGNet model with mel-spectrogram input achieves the best results. For disfluent speech detection, the multi-layer neural network using disfluency-based, prosodic, and acoustic features without fillers achieves the best results. The end-to-end Transformer+VGGNet architecture also obtains high scores, although it is exceeded, with a statistically significant difference, by the best results of the multi-layer neural network. Thus, unlike in fluent speech detection, disfluency-based and prosodic features other than fillers are still necessary for disfluent speech detection.
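The two binary tasks derived from the three-class annotation can be written down directly; the snippet below only restates the splits described above, with illustrative label names.

```python
def to_binary_label(label, task):
    """Map the three-way annotation to the two binary tasks.
    fluent_detection:    fluent (positive) vs. neutral and disfluent
    disfluent_detection: disfluent (positive) vs. fluent and neutral
    """
    if task == "fluent_detection":
        return 1 if label == "fluent" else 0
    if task == "disfluent_detection":
        return 1 if label == "disfluent" else 0
    raise ValueError(f"unknown task: {task}")
```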
This paper describes a novel low-latency 4K 60 fps HEVC (high efficiency video coding)/H.265 multi-channel encoding system with content-aware bitrate control for live streaming. Adaptive bitrate (ABR) streaming techniques, such as MPEG-DASH (dynamic adaptive streaming over HTTP) and HLS (HTTP live streaming), have spread widely in Internet video streaming. Live content has increased with the expansion of streaming services, which has led to demands for traffic reduction and low latency. To reduce network traffic, we propose content-aware, dynamic, and seamless bitrate control that supports multi-channel real-time encoding for ABR, including 4K 60 fps video. Our method further supports chunked packaging transfer to provide low-latency streaming. We adopt a hybrid architecture consisting of hardware and software processing. The system consists of multiple 4K HEVC encoder LSIs, each of which can efficiently encode one 4K 60 fps video or up to four high-definition (HD) videos with the proposed bitrate control method. The software handles the packaging process according to the various streaming protocols. Experimental results indicate that our method reduces the encoding bitrates obtained with constant bitrate encoding by as much as 56.7%, and the streaming latency over MPEG-DASH is 1.77 seconds.
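Purely as an illustration of what "content-aware" bitrate allocation across ABR channels could look like, the sketch below splits a total bitrate budget in proportion to per-channel content complexity with a per-channel floor; the function, its parameters, and the proportional rule are assumptions and not the control law implemented in the system.

```python
def allocate_bitrates(complexities, total_kbps, min_kbps=500):
    """Split a total bitrate budget across ABR channels in proportion to an
    estimated content complexity per channel, guaranteeing a minimum bitrate
    for each channel (illustrative sketch only)."""
    floor = min_kbps * len(complexities)
    remain = max(total_kbps - floor, 0)
    total_c = sum(complexities) or 1.0
    return [min_kbps + remain * c / total_c for c in complexities]
```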
Paragraph segmentation is a text segmentation task. Iikura et al. achieved excellent results on paragraph segmentation by introducing focal loss into Bidirectional Encoder Representations from Transformers (BERT). In this study, we investigated paragraph segmentation on Daily News and Novel datasets. Building on the approach proposed by Iikura et al., we used an auxiliary loss to train the model and improve paragraph segmentation performance. Consequently, the average F1-score obtained by the approach of Iikura et al. was 0.6704 on the Daily News dataset, whereas that of our approach was 0.6801. Our approach thus improved the performance by approximately 1%. The performance improvement was also confirmed on the Novel dataset. Furthermore, the results of two-tailed paired t-tests indicated that the difference in performance between the two approaches was statistically significant.
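A minimal sketch of combining the focal loss used by Iikura et al. with an auxiliary loss is given below; the choice of auxiliary task and the weighting factor are assumptions for illustration, since the abstract does not specify them.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss for binary paragraph-boundary classification."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)          # probability assigned to the true class
    return ((1 - pt) ** gamma * ce).mean()

def total_loss(logits, targets, aux_logits, aux_targets, weight=0.5):
    """Main focal loss plus a weighted auxiliary cross-entropy loss
    (auxiliary task and weight are illustrative assumptions)."""
    return focal_loss(logits, targets) + weight * F.cross_entropy(aux_logits, aux_targets)
```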
Human skin visualization with a smartphone, based on deep learning, for the beauty industry is discussed. Skin was photographed with a medical camera that can simultaneously capture RGB and UV images of the same area. Smartphone RGB images were converted into versions similar to the medical RGB and UV images via a deep learning method called cycle-GAN, which was trained with the medical and smartphone images. After converting a smartphone image into a version similar to a medical RGB image using cycle-GAN, the processed image was further converted into a pseudo-UV image via a deep learning method called U-NET. Hidden age spots were effectively visualized in this image. RGB and UV images similar to medical images can thus be captured with a smartphone. Provided the deep learning networks have been trained, a medical camera is not required.
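The two-stage inference pipeline described above reduces to chaining the two trained models; the sketch below assumes both are available as callables, with illustrative names.

```python
def smartphone_to_pseudo_uv(smartphone_rgb, cyclegan_generator, unet):
    """Two-stage inference sketch: the cycle-GAN generator maps a smartphone
    RGB image to a pseudo medical-RGB image, and U-NET then maps that result
    to a pseudo-UV image in which hidden age spots become visible."""
    pseudo_medical_rgb = cyclegan_generator(smartphone_rgb)
    pseudo_uv = unet(pseudo_medical_rgb)
    return pseudo_uv
```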
In time difference of arrival (TDOA)-based signal source location estimation, geometrical errors are caused by the locations of multiple unmanned aerial vehicles (UAVs). Herein, we propose a divide-and-conquer algorithm to determine the optimal location for each UAV. Simulation results confirm that the multiple UAVs shifted to optimal positions and that the location estimation accuracy improved.
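For reference, the generic TDOA residuals that a localization solver minimizes can be written as below; this is the standard formulation with one UAV taken as the reference receiver, not the paper's specific solver or its divide-and-conquer placement algorithm.

```python
import numpy as np

def tdoa_residuals(source, uav_positions, measured_tdoa, c=3e8):
    """Residuals of TDOA-based localization.
    source:         candidate source position, shape (3,)
    uav_positions:  receiver positions, shape (M, 3); row 0 is the reference UAV
    measured_tdoa:  measured time differences w.r.t. the reference, shape (M-1,)
    Minimizing these residuals (e.g. by least squares) estimates the source."""
    d = np.linalg.norm(uav_positions - source, axis=1)
    predicted = (d[1:] - d[0]) / c
    return predicted - measured_tdoa
```

The geometrical error referred to above comes from how the UAV positions condition this system: poorly placed receivers make the residual surface flat in some directions, so the same measurement noise produces larger position errors.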
Constructing an accurate similarity graph is an important step in graph-based clustering. However, traditional methods have three drawbacks: inaccuracy of the similarity graph, vulnerability to noise and outliers, and the need for an additional discretization process. To eliminate these limitations, an entropy-regularized unsupervised clustering method based on the maximum correntropy criterion and adaptive neighbors (ERMCC) is proposed. 1) Information entropy and adaptive neighbors are combined to avoid trivial similarity distributions, and the ℓ0-norm and spectral embedding are introduced to construct a similarity graph with sparsity and strong segmentation ability. 2) The negative impact of non-Gaussian noise is reduced by reconstructing the error with correntropy. 3) The predicted label vector is obtained directly by computing the sparse strongly connected components of the similarity graph Z, which avoids an additional discretization process. Experiments conducted on six typical datasets show the effectiveness of the method.
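Step 3 above, reading labels directly from the learned graph, can be sketched with an off-the-shelf strongly-connected-components routine; this assumes Z is the learned n-by-n non-negative similarity matrix and is only an illustration of the idea, not the paper's implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def labels_from_similarity_graph(Z):
    """Obtain cluster labels directly from the sparse similarity graph Z by
    finding its strongly connected components, avoiding a separate
    discretization step."""
    n_clusters, labels = connected_components(csr_matrix(Z), connection="strong")
    return n_clusters, labels
```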
Lifelong language learning (LLL) aims at learning new tasks while retaining old tasks in the field of NLP. LAMOL is a recent LLL framework that follows data-free constraints. Previous works have built on LAMOL with additional computation, incurring more time cost or new parameters. However, a gap still remains between them and multi-task learning (MTL), which is regarded as the upper bound of LLL. In this paper, we propose Metacognitive Adaptation (Metac-Adapt), which adds almost no additional time cost or computational resources, to make the model generate better pseudo samples and then replay them. Experimental results demonstrate that Metac-Adapt is on par with MTL or better.
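For context on the replay mechanism that Metac-Adapt improves, the sketch below shows LAMOL-style pseudo-sample replay: the language model itself generates samples of earlier tasks, which are mixed with the new task's data. The lm_generate interface and the replay ratio are assumptions for illustration, not the exact Metac-Adapt procedure.

```python
import random

def build_training_data(new_task_samples, lm_generate, replay_ratio=0.2):
    """LAMOL-style replay sketch: generate pseudo samples of previous tasks
    with the language model itself and mix them with the new task's data.
    lm_generate(n) -> list of n generated pseudo samples."""
    n_pseudo = int(len(new_task_samples) * replay_ratio)
    pseudo = lm_generate(n_pseudo)
    mixed = list(new_task_samples) + pseudo
    random.shuffle(mixed)
    return mixed
```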