IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Volume E103.D, Issue 6
Displaying 1-27 of 27 articles from this issue
Special Section on Machine Vision and its Applications
  • Atsuto MAKI
    2020 Volume E103.D Issue 6 Pages 1208
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS
    Download PDF (116K)
  • Kohei SENDO, Norimichi UKITA
    Article type: PAPER
    2020 Volume E103.D Issue 6 Pages 1209-1216
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    This paper proposes a method for heatmapping people who are involved in a group activity. Such grouping of people is useful for understanding group activities. In prior work, people grouping is performed with simple, inflexible rules and schemes (e.g., based on proximity among people and with models representing only a fixed number of people). In addition, several previous grouping methods require the results of action recognition for individual people, which may be erroneous. In contrast, our proposed heatmapping method can group any number of people whose spatial arrangement changes dynamically, and it works independently of individual action recognition. The deep network of our proposed method has two input streams (i.e., RGB and human bounding-box images) and outputs a heatmap representing pixelwise confidence values of the people grouping. Extensive exploration of appropriate parameters was conducted in order to optimize the input bounding-box images. As a result, we demonstrate the effectiveness of the proposed method for heatmapping people involved in group activities.
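    A minimal sketch of the described two-stream idea is shown below. The backbone, channel counts, and fusion scheme are assumptions for illustration only and may differ from the network in the paper.

```python
# Minimal two-stream heatmap network sketch (PyTorch), assuming the RGB frame
# and a bounding-box mask image are fused by channel concatenation.
import torch
import torch.nn as nn

class TwoStreamHeatmapNet(nn.Module):
    def __init__(self):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.rgb_stream = stream(3)     # RGB frame
        self.box_stream = stream(1)     # human bounding-box image (binary mask)
        self.head = nn.Sequential(      # pixelwise grouping confidence in [0, 1]
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.Sigmoid())

    def forward(self, rgb, boxes):
        feat = torch.cat([self.rgb_stream(rgb), self.box_stream(boxes)], dim=1)
        return self.head(feat)          # (B, 1, H, W) heatmap

heatmap = TwoStreamHeatmapNet()(torch.rand(2, 3, 128, 128), torch.rand(2, 1, 128, 128))
```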

    Download PDF (4093K)
  • Kazuki KAWAMURA, Takashi MATSUBARA, Kuniaki UEHARA
    Article type: PAPER
    2020 Volume E103.D Issue 6 Pages 1217-1225
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    Action recognition using skeleton data (3D coordinates of human joints) is an attractive topic due to its robustness to the actor's appearance, camera's viewpoint, illumination, and other environmental conditions. However, skeleton data must be measured by a depth sensor or extracted from video data using an estimation algorithm, and doing so risks extraction errors and noise. In this work, for robust skeleton-based action recognition, we propose a deep state-space model (DSSM). The DSSM is a deep generative model of the underlying dynamics of an observable sequence. We applied the proposed DSSM to skeleton data, and the results demonstrate that it improves the classification performance of a baseline method. Moreover, we confirm that feature extraction with the proposed DSSM renders subsequent classifications robust to noise and missing values. In such experimental settings, the proposed DSSM outperforms a state-of-the-art method.

    Download PDF (1406K)
  • Songlin DU, Yuan LI, Takeshi IKENAGA
    Article type: PAPER
    2020 Volume E103.D Issue 6 Pages 1226-1235
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    High frame rate and ultra-low delay are the most essential requirements for building excellent human-machine-interaction systems. As a state-of-the-art local keypoint detection and feature extraction algorithm, A-KAZE shows high accuracy and robustness. Nonlinear scale space is one of the most important modules in A-KAZE, but it not only introduces at least one frame of delay but is also not hardware friendly. This paper proposes a hardware-oriented nonlinear scale space for a high frame rate and ultra-low delay A-KAZE matching system. In the proposed matching system, one part of the nonlinear scale space is moved temporally forward and calculated in the previous frame (proposal #1), so that the processing delay is reduced to less than 1 ms. To recover the matching accuracy affected by proposal #1, pre-adjustment of the nonlinear scale (proposal #2) is proposed: the previous two frames are used for motion estimation to predict the motion vector between the previous and current frames. For further improvement of matching accuracy, pixel-level pre-adjustment (proposal #3) is proposed, in which the pre-adjustment is refined from block level to pixel level so that each pixel is assigned a unique motion vector. Experimental results show that the proposed matching system achieves an average matching accuracy higher than 95%, which is 5.88% higher than the existing high frame rate and ultra-low delay matching system. As for hardware performance, the proposed matching system processes VGA videos (640×480 pixels/frame) at 784 frames per second (fps) with a delay of 0.978 ms/frame.
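    The following sketch illustrates the general idea of motion-compensated pre-adjustment of a scale-space level computed from the previous frame. The constant-velocity motion prediction and nearest-neighbour warping are assumptions, not the paper's actual prediction and hardware implementation.

```python
# Pixel-level pre-adjustment sketch: warp a scale-space level built from frame
# t-1 using a dense motion field extrapolated from frames t-2 and t-1.
import numpy as np

def predict_motion(flow_prev):              # flow from frame t-2 to t-1, shape (H, W, 2)
    return flow_prev                         # assume constant velocity for t-1 -> t

def warp(image, flow):
    H, W = image.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(xs - flow[..., 0], 0, W - 1).astype(int)
    src_y = np.clip(ys - flow[..., 1], 0, H - 1).astype(int)
    return image[src_y, src_x]               # nearest-neighbour warp of the scale-space image

scale_prev = np.random.rand(480, 640)        # nonlinear scale level built from frame t-1
flow_prev = np.random.randn(480, 640, 2)     # motion estimated between frames t-2 and t-1
scale_adjusted = warp(scale_prev, predict_motion(flow_prev))
```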

    Download PDF (3274K)
  • Songlin DU, Zhe WANG, Takeshi IKENAGA
    Article type: PAPER
    2020 Volume E103.D Issue 6 Pages 1236-1246
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    A high frame rate and ultra-low delay matching system plays an increasingly important role in human-machine interaction, because it guarantees a high-quality experience for users. Existing image matching algorithms always generate mismatches, which heavily weaken the performance of human-machine-interactive systems. Although many mismatch removal algorithms have been proposed, few of them achieve real-time speed with high frame rate and low delay, because of complicated arithmetic operations and iterations. This paper proposes a high frame rate and ultra-low delay mismatch removal system based on temporal constraints and block weighting judgement. The proposed method first uses two temporal constraints (proposal #1 and proposal #2) to find some true matches, and then uses these true matches to generate block weights (proposal #3). Proposal #1 finds correct matches by checking a triangle route formed by three adjacent frames. Proposal #2 further reduces the mismatch risk by performing an additional matching pass in the opposite direction. Finally, proposal #3 classifies the remaining unverified matches as correct or incorrect according to the weight of each block. Software experiments show that the proposed mismatch removal system achieves state-of-the-art accuracy in mismatch removal. Hardware experiments indicate that the designed image processing core achieves real-time processing of 784 fps VGA (640×480 pixels/frame) video on a field programmable gate array (FPGA), with a delay of 0.858 ms/frame.
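    A small sketch of the triangle-route idea in proposal #1 follows. The match representation, tolerance value, and verification rule are assumptions used only to illustrate the consistency check; the paper's hardware-oriented formulation may differ.

```python
# Triangle-route consistency check over frames t-2, t-1 and t: a direct
# t-2 -> t match is kept only if the route via t-1 lands at the same keypoint.
import numpy as np

def triangle_consistent(match_t2_t1, match_t1_t, match_t2_t, kp_t, tol_px=1.5):
    """Return t-2 -> t matches verified by the three-frame triangle route."""
    verified = []
    for i, j in match_t2_t1.items():          # i: keypoint in frame t-2, j: keypoint in frame t-1
        k_via, k_direct = match_t1_t.get(j), match_t2_t.get(i)
        if k_via is None or k_direct is None:
            continue
        if np.linalg.norm(kp_t[k_via] - kp_t[k_direct]) <= tol_px:
            verified.append((i, k_direct))    # treated as a "true" match for block weighting
    return verified

kp_t = np.array([[10.0, 20.0], [11.0, 21.0], [50.0, 60.0]])
print(triangle_consistent({0: 0, 1: 2}, {0: 1, 2: 2}, {0: 1, 1: 2}, kp_t))
```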

    Download PDF (1785K)
  • Takeru OBA, Norimichi UKITA
    Article type: PAPER
    2020 Volume E103.D Issue 6 Pages 1247-1256
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    This paper proposes a method to create various training images for instance segmentation in a semi-supervised manner. In our proposed learning scheme, a few 3D CG models of target objects and a large number of images retrieved by keywords from the Internet are employed for initial model training and model update, respectively. Instance segmentation requires pixel-level annotations as well as object class labels in all training images. A possible solution to reduce the huge annotation cost is to use synthesized images as training images. While image synthesis using a 3D CG simulator can generate the annotations automatically, it is difficult to prepare a variety of 3D object models for the simulator. Another possible solution is semi-supervised learning. Semi-supervised learning such as self-training uses a small set of supervised data and a huge number of unsupervised data. In our method, the supervised images are given by the 3D CG simulator. From the unsupervised images, we have to select only correctly-detected annotations. To select them, we propose to quantify the reliability of each detected annotation based on its silhouette as well as its textures. Experimental results demonstrate that the proposed method can generate a greater variety of training images and thereby improve instance segmentation.

    Download PDF (3638K)
  • Takuya MATSUMOTO, Kodai SHIMOSATO, Takahiro MAEDA, Tatsuya MURAKAMI, K ...
    Article type: PAPER
    2020 Volume E103.D Issue 6 Pages 1257-1264
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    This paper proposes a framework for automatically annotating the keypoints of a human body in images for learning 2D pose estimation models. Preparing ground-truth annotations for supervised learning is difficult and cumbersome in most machine vision tasks. While considerable contributions in the community provide us with a huge number of pose-annotated images, they mainly focus on people wearing common clothes, whose body keypoints are relatively easy to annotate. This paper, on the other hand, focuses on annotating people wearing loose-fitting clothes (e.g., Japanese kimono) that occlude many body keypoints. In order to automatically and correctly annotate these people, we reuse the 3D coordinates of the keypoints observed without loose-fitting clothes, which can be captured by a motion capture system (MoCap). These 3D keypoints are projected to an image in which the body pose under loose-fitting clothes is similar to the one captured by the MoCap. Pose similarity between bodies with and without loose-fitting clothes is evaluated with the 3D geometric configurations of MoCap markers that are visible even with loose-fitting clothes (e.g., markers on the head, wrists, and ankles). Experimental results validate the effectiveness of our proposed framework for human pose estimation.
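    The projection step can be illustrated with a standard pinhole camera model, as in the sketch below. The intrinsic and extrinsic values are placeholders, and the pose-similarity matching used to pick the target image is not shown.

```python
# Projecting MoCap 3D keypoints into an image with a pinhole camera model.
import numpy as np

K = np.array([[800.0, 0.0, 320.0],            # assumed intrinsics
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 3.0])   # assumed camera pose (world -> camera)

def project(points_3d):
    cam = points_3d @ R.T + t                 # world coordinates -> camera coordinates
    uv = cam @ K.T                            # camera coordinates -> homogeneous pixels
    return uv[:, :2] / uv[:, 2:3]             # perspective division -> pixel coordinates

keypoints_3d = np.random.rand(17, 3)          # e.g., 17 body joints from the MoCap
keypoints_2d = project(keypoints_3d)          # 2D annotations for the clothed image
```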

    Download PDF (1645K)
  • Hitoshi NISHIMURA, Naoya MAKIBUCHI, Kazuyuki TASAKA, Yasutomo KAWANISH ...
    Article type: PAPER
    2020 Volume E103.D Issue 6 Pages 1265-1275
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    Multiple human tracking is widely used in various fields such as marketing and surveillance. The typical approach associates human detection results between consecutive frames using the features and bounding boxes (position+size) of detected humans. Some methods use an omnidirectional camera to cover a wider area, but ID switches often occur in the association of detections due to the following two factors: i) the feature is adversely affected because the bounding box includes many background regions when a human is captured from an oblique angle, and ii) the position and size change dramatically between consecutive frames because the distance metric is non-uniform in an omnidirectional image. In this paper, we propose a novel method that accurately tracks humans with an association metric for omnidirectional images. The proposed method has two key points: i) for feature extraction, we introduce local rectification, which reduces the effect of background regions in the bounding box, and ii) for distance calculation, we describe the positions in a world coordinate system in which the distance metric is uniform. In the experiments, we confirmed that the Multiple Object Tracking Accuracy (MOTA) improved by 3.3 on the LargeRoom dataset and by 2.3 on the SmallRoom dataset.

    Download PDF (2783K)
  • Kenta NISHIYUKI, Jia-Yau SHIAU, Shigenori NAGAE, Tomohiro YABUUCHI, Ko ...
    Article type: PAPER
    2020 Volume E103.D Issue 6 Pages 1276-1286
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    Driver drowsiness estimation is one of the important tasks for preventing car accidents. Most existing approaches perform binary classification, i.e., they classify whether a driver is significantly drowsy or not. Multi-level drowsiness estimation, which detects not only significant drowsiness but also moderate drowsiness, is helpful for a safer and more comfortable car system. Existing approaches are mostly based on conventional temporal measures that extract temporal information related to eye states, and these measures mainly focus on detecting significant drowsiness for binary classification. For multi-level drowsiness estimation, we propose two temporal measures: average eye closed time (AECT) and soft percentage of eyelid closure (Soft PERCLOS). Existing approaches are also based on a time-domain convolutional neural network (CNN) whose layers are linked sequentially; such a network extracts features mainly at a single temporal resolution. We found that features focusing on multiple temporal resolutions are effective for multi-level drowsiness estimation, and we propose a parallel linked time-domain CNN to extract these multi-temporal features. We collected our own dataset in a real environment and evaluated the proposed methods on it. Compared with existing temporal measures and network models, our system outperforms the existing approaches on the dataset.
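    The sketch below gives an illustrative computation of the two temporal measures from a per-frame eyelid-closure signal. The closure threshold and exact definitions are assumptions; the paper's formulations may differ.

```python
# Illustrative AECT and Soft PERCLOS over a sequence of eyelid-closure degrees in [0, 1].
import numpy as np

def aect(eye_closure, fps, closed_thresh=0.8):
    """Average Eye Closed Time: mean duration (s) of continuous closed-eye runs."""
    runs, run = [], 0
    for closed in (eye_closure >= closed_thresh):
        if closed:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    return float(np.mean(runs)) / fps if runs else 0.0

def soft_perclos(eye_closure):
    """Soft PERCLOS: average of the continuous eyelid-closure degree."""
    return float(np.mean(eye_closure))

closure = np.clip(np.random.rand(1800), 0, 1)   # 60 s of per-frame closure degree at 30 fps
print(aect(closure, fps=30), soft_perclos(closure))
```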

    Download PDF (1103K)
Special Section on Knowledge-Based Software Engineering
  • Fumihiro KUMENO
    2020 Volume E103.D Issue 6 Pages 1287
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS
    Download PDF (77K)
  • Mondheera PITUXCOOSUVARN, Takao NAKAGUCHI, Donghui LIN, Toru ISHIDA
    Article type: PAPER
    2020 Volume E103.D Issue 6 Pages 1288-1296
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    In machine translation (MT) mediated human-to-human communication, it is not an easy task to select the languages and translation services to be used, as the users have various language backgrounds and skills. Our previous work introduced the best-balanced machine translation mechanism (BBMT) to automatically select the languages and translation services so as to equalize the language barriers of participants and to guarantee their equal opportunities in joining conversations. To assign proper languages, however, the mechanism needs information on the participants' language skills, typically their language test scores. Since it is important to keep test scores confidential, as well as other sensitive information, this paper introduces agents that exchange encrypted information and use secure computation, so that the agents can select the languages and translation services without violating privacy. Our contribution is to introduce a multi-agent system with secure computation that can protect the privacy of users in multilingual communication. To the best of our knowledge, this is the first attempt to introduce multi-agent systems and secure computing to this area. The key idea is to model interactions among agents who deal with users' sensitive data, and to distribute calculation tasks to three different types of agents, together with data encryption, so that no agent is able to access or recover participants' scores.
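    As a rough illustration of the general idea (computing on test scores that are distributed across agents without any single agent seeing them), the sketch below uses additive secret sharing over three shares. This is not the paper's actual protocol; the agent roles and the encryption scheme there differ.

```python
# Minimal additive secret-sharing sketch: three shares per score, computation on shares.
import random

P = 2**61 - 1                                   # arithmetic is done modulo a prime

def share(secret, n=3):
    """Split a score into n random shares that sum to the secret (mod P)."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

alice_score, bob_score = 82, 74                 # never revealed to any single agent
alice_shares, bob_shares = share(alice_score), share(bob_score)

# Each agent locally combines its own shares; only the aggregate needed for
# language selection (here, a score difference) is reconstructed.
diff_shares = [(a - b) % P for a, b in zip(alice_shares, bob_shares)]
print(reconstruct(diff_shares))                 # 8 = 82 - 74
```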

    Download PDF (618K)
  • Yutaka MATSUNO, Toshinori TAKAI, Shuichiro YAMAMOTO
    Article type: PAPER
    2020 Volume E103.D Issue 6 Pages 1297-1308
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    Assurance cases are documents for arguing that systems satisfy required properties such as safety and security in a given environment, based on sufficient evidence. As systems become complex and networked, the importance of assurance cases has become significant. However, we observe that creating assurance cases has some essential difficulties, and unfortunately assurance cases do not seem to have been widely used in industry. To address this problem, we have been developing assurance case creation methods and holding workshops based on them. This paper presents an assurance case creation method called “D-Case Steps,” which is based on the d* framework [1], an agent-based assurance case method, and reports the results of the workshops. The results indicate that our workshops have been improved and that our activities facilitate the use of assurance cases in Japan. This paper is an extended version of [2]; we add detailed background and related work, workshop results and evaluation, and lessons learned from our decade of experience.

    Download PDF (2169K)
  • Hiroyuki NAKAGAWA, Hironori SHIMADA, Tatsuhiro TSUCHIYA
    Article type: PAPER
    2020 Volume E103.D Issue 6 Pages 1309-1318
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    Goal modeling is a method that describes requirements structurally. It mainly consists of two tasks: extraction of goals and organization of the extracted goals. Generally, the process of goal modeling requires intensive manual intervention and higher modeling skills than usual requirements description does. In order to mitigate this problem, we propose a method that provides systematic support for constructing goal models. In the method, the requirements analyst answers questions, and a goal model is semi-automatically constructed based on the answers. We develop a prototype tool that implements the proposed method and apply it to two systems. The results demonstrate the feasibility of the method.

    Download PDF (1436K)
  • Yukasa MURAKAMI, Masateru TSUNODA, Koji TODA
    Article type: PAPER
    2020 Volume E103.D Issue 6 Pages 1319-1327
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    To enhance the accuracy of predicting the number of faults, many studies have proposed various prediction models. A model is built using a dataset collected from past projects, and the number of faults is predicted using the model and the data of the current project. Datasets sometimes have many data points where the dependent variable, i.e., the number of faults, is zero. When a multiple linear regression model is built from such a dataset, the model may not be fit properly. To avoid this problem, the Tobit model is considered effective for predicting software faults; it assumes that the range of the dependent variable is limited, and the model is built based on this assumption. Similar to the Tobit model, the Poisson regression model assumes there are many data points whose dependent-variable value is zero. Also, log-transformation is sometimes applied to enhance the accuracy of the model. Additionally, ensemble methods are effective for enhancing the prediction accuracy of the models. We evaluated the prediction accuracy of the methods separately for the cases where the number of faults is zero and where it is not. In the experiment, our proposed ensemble method showed the highest accuracy: Pred25 was 21% when the number of faults was not zero, and 45% when it was zero.
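    As a concrete example of one of the count models compared in the paper, the sketch below fits a Poisson regression to synthetic fault counts. The features, data, and the paper's ensemble weighting are not reproduced here.

```python
# Poisson regression for fault-count prediction (statsmodels), on synthetic data
# in which many modules have zero faults.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.random((200, 3))                            # e.g., size/complexity metrics of modules
y = rng.poisson(lam=np.exp(X @ [1.0, 0.5, -0.3]))   # many data points are zero

poisson_fit = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()

new_module = sm.add_constant(rng.random((1, 3)), has_constant='add')
print(poisson_fit.predict(new_module))              # predicted number of faults
```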

    Download PDF (1271K)
Regular Section
  • Rongcun WANG, Shujuan JIANG, Kun ZHANG, Qiao YU
    Article type: PAPER
    Subject area: Software Engineering
    2020 Volume E103.D Issue 6 Pages 1328-1338
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    Software fault localization, one of the essential activities in program debugging, helps software developers identify the locations of faults in a program, thus reducing the cost of debugging. Spectrum-based fault localization (SBFL), one of the representative localization techniques, has been intensively studied. The technique calculates, with a suspiciousness formula, the probability that each program entity is faulty. The accuracy of SBFL is not always as satisfactory as expected because it neglects the contextual information of statement executions. Therefore, we propose five rules, i.e., random, maximum coverage, minimum coverage, maximum distance, and minimum distance, to further improve the accuracy of SBFL. The five rules can effectively use the contextual information of statement executions, and they can be implemented on top of traditional suspiciousness-formula-based SBFL techniques with little effort. We empirically evaluated the impact of the rules on 17 suspiciousness formulas. The results show that all five rules can significantly improve the ranking of faulty statements; for faults that are difficult to locate, the improvement is even more remarkable. Overall, the rules reduce the number of statements examined by an average of more than 19%. Compared with the other rules, the minimum coverage rule generates better results, which indicates that applying the test case with the minimum coverage capability is more effective for fault localization.
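    For readers unfamiliar with SBFL, the sketch below computes one classic suspiciousness formula (Ochiai) from a coverage matrix. The 17 formulas and the five context-based rules studied in the paper are not reproduced here.

```python
# Ochiai suspiciousness from a test-coverage matrix and pass/fail outcomes.
import numpy as np

coverage = np.array([[1, 0, 1, 1],      # rows: test cases, cols: statements
                     [1, 1, 0, 1],
                     [0, 1, 1, 1]])
failed = np.array([1, 0, 1])            # 1 = failing test case

ef = coverage[failed == 1].sum(axis=0)              # executed by failing tests
ep = coverage[failed == 0].sum(axis=0)              # executed by passing tests
nf = failed.sum() - ef                              # failing tests that skip the statement

ochiai = ef / np.sqrt((ef + nf) * (ef + ep) + 1e-12)
ranking = np.argsort(-ochiai)                       # statements, most suspicious first
print(ochiai, ranking)
```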

    Download PDF (335K)
  • Yoshihiko OMORI, Takao YAMASHITA
    Article type: PAPER
    Subject area: Information Network
    2020 Volume E103.D Issue 6 Pages 1339-1354
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    In this paper, we propose homomorphic encryption based device owner equality verification (HE-DOEV), a new method to verify whether the owners of two devices are the same. The proposed method is expected to be used for credential sharing among devices owned by the same user. Credential sharing is essential to improve the usability of devices with hardware-assisted trusted environments, such as a secure element (SE) and a trusted execution environment (TEE), for securely storing credentials such as private keys. In the HE-DOEV method, we assume that the owner of every device is associated with a public key infrastructure (PKI) certificate issued by an identity provider (IdP), where a PKI certificate is used to authenticate the owner of a device. In the HE-DOEV method, device owner equality is collaboratively verified by user devices and IdPs that issue PKI certificates to them. The HE-DOEV method verifies device owner equality under the condition where multiple IdPs can issue PKI certificates to user devices. In addition, it can verify the equality of device owners without disclosing to others any privacy-related information such as personally identifiable information and long-lived identifiers managed by an entity. The disclosure of privacy-related information is eliminated by using homomorphic encryption. We evaluated the processing performance of a server needed for an IdP in the HE-DOEV method. The evaluation showed that the HE-DOEV method can provide a DOEV service for 100 million users by using a small-scale system in terms of the number of servers.
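    The underlying idea of an equality check under additively homomorphic encryption can be sketched as below, using the python-paillier library as an assumed stand-in. The actual HE-DOEV protocol, its message flow between devices and IdPs, and its identifier construction are more involved than this.

```python
# Blinded equality test under Paillier encryption: decrypting r*(E(a)-E(b))
# reveals only whether a == b, not the identifiers themselves.
import secrets
from phe import paillier  # assumed dependency: python-paillier

public_key, private_key = paillier.generate_paillier_keypair()

owner_id_a = 1234567890          # identifier derived for device A's owner (illustrative)
owner_id_b = 1234567890          # identifier derived for device B's owner (illustrative)

enc_a = public_key.encrypt(owner_id_a)
enc_b = public_key.encrypt(owner_id_b)

r = secrets.randbelow(2**64) + 1          # random blinding factor
blinded_diff = (enc_a - enc_b) * r        # homomorphic subtraction and scaling

print(private_key.decrypt(blinded_diff) == 0)   # True iff the owners match
```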

    Download PDF (1736K)
  • Tian XIE, Hongchang CHEN, Tuosiyu MING, Jianpeng ZHANG, Chao GAO, Shao ...
    Article type: PAPER
    Subject area: Artificial Intelligence, Data Mining
    2020 Volume E103.D Issue 6 Pages 1355-1361
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    In partial label data, the ground-truth label of a training example is concealed in a set of candidate labels associated with the instance. As the ground-truth label is inaccessible, it is difficult to train the classifier via the label information. Consequently, manifold structure information is adopted, under the assumption that neighboring/similar instances in the feature space have similar labels in the label space. However, real-world data may not fully satisfy this assumption. In this paper, a partial label metric learning method based on a likelihood-ratio test is proposed to make partial label data satisfy the manifold assumption. Moreover, the proposed method needs no objective function and treats data pairs asymmetrically. The experimental results on several real-world PLL datasets indicate that the proposed method outperforms existing partial label metric learning methods in terms of classification accuracy and disambiguation accuracy while costing less time.

    Download PDF (1546K)
  • Zizheng JI, Zhengchao LEI, Tingting SHEN, Jing ZHANG
    Article type: PAPER
    Subject area: Artificial Intelligence, Data Mining
    2020 Volume E103.D Issue 6 Pages 1362-1370
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    Joint representation learning of knowledge graphs and text has become an important approach to improving the quality of knowledge graphs, which is beneficial to machine learning, data mining, and artificial intelligence applications. However, previous work suffers severely from noise in the text when modeling the text information. To overcome this problem, this paper mines high-quality reference sentences of the entities in the knowledge graph to enhance the representation ability of the entities. A novel framework for joint representation learning of knowledge graphs and text information based on reference-sentence noise reduction is proposed, which embeds the entities, the relations, and the words into a unified vector space. The proposed framework consists of a knowledge graph representation learning module, a textual relation representation learning module, and a textual entity representation learning module. Experiments on entity prediction, relation prediction, and triple classification tasks are conducted; the results show that the proposed framework can significantly improve the performance of mining and fusing the text information. In particular, compared with the state-of-the-art method [15], the proposed framework improves H@10 by 5.08% and 3.93% in the entity prediction and relation prediction tasks, respectively, and improves accuracy by 5.08% in the triple classification task.

    Download PDF (673K)
  • Chen CHEN, Huaxin XIAO, Yu LIU, Maojun ZHANG
    Article type: PAPER
    Subject area: Artificial Intelligence, Data Mining
    2020 Volume E103.D Issue 6 Pages 1371-1379
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    Pedestrian detection is a critical problem in computer vision with significant impact on many real-world applications. In this paper, we introduce a fast dual-task pedestrian detector with integrated segmentation context (DTISC), which predicts pedestrian locations as well as pixel-wise segmentation. The proposed network has three branches: two main branches can independently complete their tasks, while useful representations from each task are shared between the two branches via the integration branch. Each branch is based on a fully convolutional network and is proven effective in its own task. We optimize the detection and segmentation branches on separate ground truths. With reasonable connections, the shared features introduce additional supervision and clues into each branch; consequently, the two branches are fused in feature space, increasing their robustness and comprehensiveness. Extensive experiments on pedestrian detection and segmentation benchmarks demonstrate that our joint model improves the performance of detection and segmentation over state-of-the-art algorithms.

    Download PDF (2385K)
  • Haoran LI, Binyu WANG, Jisheng DAI, Tianhong PAN
    Article type: PAPER
    Subject area: Artificial Intelligence, Data Mining
    2020 Volume E103.D Issue 6 Pages 1380-1387
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    The Homotopy algorithm provides a very powerful approach to selecting the best regularization term for the l1-norm minimization problem, but it lacks provisions for handling singularities. The singularity problem may be frequently encountered in practical implementations if the measurement matrix contains duplicate columns, approximate columns, or columns that are linearly dependent in kernel space. The existing method for handling Homotopy singularities introduces a high-dimensional random ridge term into the measurement matrix, which has at least two shortcomings: 1) it is very difficult to choose a proper ridge term that applies to several different measurement matrices; and 2) the high-dimensional ridge term may cumulatively degrade the recovery performance in large-scale applications. To get around these shortcomings, a modified ridge-adding method is proposed to deal with the singularity problem; it introduces a low-dimensional random ridge vector into the l1-norm minimization problem directly. Our method provides a much simpler implementation, and it can alleviate the degradation caused by the ridge term because the dimension of the ridge term in the proposed method is much smaller than in the original one. Moreover, the proposed method can be further extended to handle SVMpath initialization singularities. Theoretical analysis and experimental results validate the performance of the proposed method.

    Download PDF (362K)
  • Ruilin PAN, Chuanming GE, Li ZHANG, Wei ZHAO, Xun SHAO
    Article type: PAPER
    Subject area: Office Information Systems, e-Business Modeling
    2020 Volume E103.D Issue 6 Pages 1388-1394
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    Collaborative filtering (CF) is one of the most popular approaches to building recommender systems (RS) and has been extensively implemented in many online applications. However, it still suffers from the new-user cold-start problem, in which users have only a small number of item interactions or purchase records in the system, resulting in poor recommendation performance. We therefore design a new similarity model that can fully utilize the limited rating information of cold-start users. We first construct a new metric, Popularity-Mean Squared Difference, which considers the influence of popular items, the average difference between two users' common ratings, and non-numerical rating information. The second new metric, Singularity-Difference, represents the degree to which two users' preferences for items deviate; it uses the distribution of the similarity of their co-ratings as a weight to adjust this deviation. Finally, we take users' personal rating preferences into account by introducing the mean and variance of user ratings. Experimental results on three real-life datasets, MovieLens, Epinions and Netflix, demonstrate that the proposed model outperforms seven popular similarity methods in terms of MAE, precision, recall and F1-measure under the new-user cold-start condition.
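    To make the flavor of such similarity metrics concrete, the sketch below combines the mean squared difference of co-ratings with a simple popularity down-weighting. This formula is illustrative only; the paper's Popularity-Mean Squared Difference and Singularity-Difference definitions differ.

```python
# Illustrative user-user similarity: popularity-weighted mean squared difference
# over co-rated items, mapped to a similarity score in (0, 1].
import numpy as np

ratings = np.array([[5, 4, 0, 1],       # rows: users, cols: items, 0 = unrated
                    [4, 5, 0, 2],
                    [1, 0, 5, 4]], dtype=float)

def similarity(u, v, ratings):
    mask = (ratings[u] > 0) & (ratings[v] > 0)         # co-rated items only
    if not mask.any():
        return 0.0
    popularity = (ratings > 0).sum(axis=0)[mask]       # how many users rated each item
    weights = 1.0 / popularity                         # popular items count less
    msd = np.average((ratings[u, mask] - ratings[v, mask]) ** 2, weights=weights)
    return 1.0 / (1.0 + msd)                           # higher is more similar

print(similarity(0, 1, ratings), similarity(0, 2, ratings))
```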

    Download PDF (2390K)
  • Daisuke SAITO, Nobuaki MINEMATSU, Keikichi HIROSE
    Article type: PAPER
    Subject area: Speech and Hearing
    2020 Volume E103.D Issue 6 Pages 1395-1405
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    This paper describes a novel approach to flexible control of speaker characteristics using a tensor representation of multiple Gaussian mixture models (GMMs). In voice conversion studies, realizing conversion from/to an arbitrary speaker's voice is one of the important objectives. For this purpose, eigenvoice conversion (EVC) based on an eigenvoice GMM (EV-GMM) was proposed. In EVC, a speaker space is constructed based on GMM supervectors, which are high-dimensional vectors derived by concatenating the mean vectors of each speaker GMM. In the speaker space, each speaker is represented by a small number of weight parameters of eigen-supervectors. In this paper, we revisit the construction of the speaker space by introducing tensor factor analysis of the training data set. In our approach, each speaker is represented as a matrix whose rows and columns correspond to the dimensions of the mean vector and the Gaussian components, respectively. The speaker space is derived by tensor factor analysis of the set of these matrices. Our approach solves an inherent problem of the supervector representation and improves the performance of voice conversion. In addition, the effects of speaker adaptive training before factorization are also investigated. Experimental results of one-to-many voice conversion demonstrate the effectiveness of the proposed approach.
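    The sketch below shows the general shape of this representation: each speaker is a (mean-dimension × component) matrix, the matrices are stacked into a 3-way tensor, and low-rank bases are obtained from the tensor's unfoldings. This is a generic truncated HOSVD illustration, not the paper's exact estimation procedure.

```python
# Stacking speaker GMM mean matrices into a 3-way tensor and extracting
# per-mode bases; each speaker then gets a small weight matrix.
import numpy as np

S, D, M = 20, 24, 64                     # speakers, mean-vector dim, Gaussian components
tensor = np.random.randn(S, D, M)        # speaker matrices stacked along the first mode

def mode_basis(tensor, mode, rank):
    unfolding = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
    U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
    return U[:, :rank]                   # leading left singular vectors of the unfolding

U_dim = mode_basis(tensor, 1, rank=8)    # basis over the mean-vector dimension
U_comp = mode_basis(tensor, 2, rank=16)  # basis over the Gaussian components

# Each speaker is represented by a small weight matrix in the two bases.
weights = np.einsum('sdm,dr,mq->srq', tensor, U_dim, U_comp)
print(weights.shape)                     # (20, 8, 16)
```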

    Download PDF (909K)
  • Yi-ze LE, Yong FENG, Da-jiang LIU, Bao-hua QIANG
    Article type: PAPER
    Subject area: Image Processing and Video Processing
    2020 Volume E103.D Issue 6 Pages 1406-1413
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    Metric learning aims to generate similarity-preserving low-dimensional feature vectors from input images. Most existing supervised deep metric learning methods define a carefully-designed loss function that constrains the relative positions of samples in the projected lower-dimensional space. In this paper, we propose a novel architecture called the Naive Similarity Discriminator (NSD), which learns the distribution of easy samples and predicts their probability of being similar. Our purpose lies in encouraging the generator network to generate vectors at fitting positions whose similarity can be distinguished by our discriminator. Adequate comparison experiments were performed to demonstrate the ability of the proposed model on retrieval and clustering tasks, with precision within a specific radius, normalized mutual information, and F1 score as evaluation metrics.

    Download PDF (861K)
  • Guangyuan LIU, Daokun CHEN
    Article type: LETTER
    Subject area: Information Network
    2020 Volume E103.D Issue 6 Pages 1414-1418
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    Survivable virtual network embedding (SVNE) is one of the major challenges of network virtualization. In order to improve the utilization of substrate network (SN) resources while guaranteeing virtual network (VN) topology connectivity under SN link failures, we first establish an Integer Linear Programming (ILP) model for the case where the SN supports path splitting. We then design a novel survivable VN topology protection method based on particle swarm optimization (VNE-PSO), which redefines the parameters and related operations of particles and uses the embedding overhead as the fitness function. Simulation results show that the solution significantly improves the long-term average revenue of the SN and the acceptance rate of VN requests, and reduces the embedding time compared with existing work.

    Download PDF (265K)
  • Fengli SHEN, Zhe-Ming LU
    Article type: LETTER
    Subject area: Artificial Intelligence, Data Mining
    2020 Volume E103.D Issue 6 Pages 1419-1422
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    This Letter proposes an autoencoder model supervised by semantic similarity for zero-shot learning. With the help of semantic similarity vectors of seen and unseen classes and a classification branch, our experimental results on two datasets are 7.3% and 4% better than the state of the art for conventional zero-shot learning in terms of averaged top-1 accuracy.

    Download PDF (153K)
  • Zaiyu PAN, Jun WANG
    Article type: LETTER
    Subject area: Pattern Recognition
    2020 Volume E103.D Issue 6 Pages 1423-1426
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    A pre-trained deep convolutional neural network (DCNN) is adopted as a feature extractor to extract feature representations of vein images for hand-dorsa vein recognition. Specifically, a novel selective deep convolutional feature is proposed to obtain a more representative and discriminative feature representation. Extensive experiments on the lab-made database achieve state-of-the-art recognition results, which demonstrates the effectiveness of the proposed model.

    Download PDF (789K)
  • Danyang LIU, Ji XU, Pengyuan ZHANG
    Article type: LETTER
    Subject area: Speech and Hearing
    2020 Volume E103.D Issue 6 Pages 1427-1430
    Published: June 01, 2020
    Released on J-STAGE: June 01, 2020
    JOURNAL FREE ACCESS

    End-to-end (E2E) multilingual automatic speech recognition (ASR) systems aim to recognize multilingual speech in a unified framework. In the current E2E multilingual ASR framework, the output prediction for a specific language lacks constraints on the scope of the output modeling units. In this paper, a language supervision training strategy is proposed that uses language masks to constrain the neural network output distribution. To simulate the multilingual ASR scenario where language identity information is unknown, a language identification (LID) classifier is applied to estimate the language masks. On four Babel corpora, the proposed E2E multilingual ASR system achieved an average absolute word error rate (WER) reduction of 2.6% compared with the multilingual baseline system.
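    The masking idea can be sketched as below: output units outside the target language's unit set are suppressed before prediction. The per-language unit sets are assumptions for illustration; the paper's LID-based mask estimation and training strategy are not shown.

```python
# Constraining output units with a language mask over the logits.
import numpy as np

vocab = ["<blank>", "a", "b", "c", "x", "y", "z"]
lang_units = {"lang1": {0, 1, 2, 3}, "lang2": {0, 4, 5, 6}}   # assumed unit sets

def apply_language_mask(logits, lang):
    mask = np.full(len(vocab), -np.inf)
    mask[list(lang_units[lang])] = 0.0        # allowed units keep their scores
    return logits + mask                      # disallowed units are suppressed

logits = np.random.randn(len(vocab))
print(np.argmax(apply_language_mask(logits, "lang2")))   # always a lang2 unit (or blank)
```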

    Download PDF (280K)