Deep learning has been applied to optical music sheet recognition (OMR). However, OMR processing of diverse sheet-music images still lacks the precision needed to be widely applicable. We propose a measure-based multimodal deep-learning-driven assembly (MMdA) method enabling end-to-end OMR processing of diverse images, including inclined photographs. In this method, measures are extracted by a deep-learning model, aligned, and resized, and the musical-symbol components within them are then inferred by multiple deep-learning models applied in sequence or in parallel. Standardizing each measure enables efficient training of the deep-learning models and accurate adjustment of the five staff lines in each measure, so that locally inclined sheet-music images can be precisely positioned. A score can thus be reproduced from an inclined image with the proposed MMdA method, which current OMR applications cannot do. Multiple musical-symbol-component feature-category deep-learning models with a small number of feature types can represent a diverse set of notes and other musical symbols, including chords. The proposed MMdA method provides a solution to end-to-end OMR processing and enhances the utility of OMR for sheet-music photographs taken with mobile phones.
Local governments have high expectations for dialogue systems that provide residents and visitors with dialogue on multiple tasks, such as tourist information, town-office information, and daily-life information. There is also an expectation that such dialogue systems be given the personalities of characters owned by local governments and provide chat dialogue for residents. However, to give a character's personality to a dialogue system, task-dialogue data and a dialogue manager that reflect the personality are essential. Building on studies of role-play-based question-answering dialogue systems that reproduce characters, we propose a method for collecting dialogue data that enables task dialogue, together with a dialogue-management and response-selection method that facilitates task dialogue. We constructed a dialogue system that imitates a certain local-government character using the proposed method and evaluated its effectiveness through a laboratory experiment and a demonstration experiment. The results showed that our dialogue system performed statistically significantly better in both task-oriented and chat-oriented dialogue.
Natural language inference (NLI) in the legal domain is the task of predicting entailment between a premise, i.e., a law, and a hypothesis, which is a statement regarding a legal issue. Current state-of-the-art approaches to NLI with pre-trained language models do not perform well in the legal domain, presumably due to a discrepancy in the level of abstraction between the premise and hypothesis and the convoluted nature of legal language. Difficulties specific to the legal domain include the following: 1) the premise and hypothesis tend to be extensive in length; 2) the premise comprises multiple rules, of which only one is related to the hypothesis, so only a small fraction of the statements is relevant for determining entailment while the rest is noise; and 3) the premise is often abstract and written in legal terms, whereas the hypothesis describes a concrete case and tends to be written with more ordinary vocabulary. These problems are accentuated by the scarcity of such data in the legal domain due to the high cost of creating it.
Pre-trained language models have been shown to be effective on natural language inference tasks in the legal domain. However, previous methods do not provide an explanation for their decisions, even though explanations are especially desirable in knowledge-intensive domains such as law.
This study leverages the characteristics of legal texts and decomposes the overall NLI task into two simpler sub-steps. Specifically, we regard the hypothesis as a pair of a condition and a consequence and train a conditional language model to generate the consequence from a given premise and condition. The trained model can be regarded as a knowledge source that generates a consequence for a query consisting of the premise and the condition. Because the model is trained on entailment examples only, it should generate a consequence similar to the original one when given an entailment example, and a dissimilar one when given a contradiction example. We then train a classifier that compares the generated consequence with the consequence part of the hypothesis to judge whether they are similar or dissimilar. Experimental results on datasets derived from the Japanese bar exam show significant improvements in accuracy over prior methods.
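The generate-then-compare step can be sketched as follows. This is a toy illustration only: a bag-of-words cosine similarity with a fixed threshold stands in for the trained comparison classifier, and all function names and example sentences are hypothetical.

```python
from collections import Counter
import math

def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict_entailment(generated: str, hypothesis_consequence: str,
                       threshold: float = 0.5) -> str:
    """Label as entailment if the consequence generated from the
    premise+condition query resembles the hypothesis's consequence."""
    sim = cosine_bow(generated, hypothesis_consequence)
    return "entailment" if sim >= threshold else "contradiction"
```

In the actual method a learned classifier replaces the threshold, but the control flow, generate a consequence, then compare it with the hypothesis's consequence, is the same.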
In Internet advertising, text is added to increase an ad's appeal to viewers. However, some advertising documents contain inappropriate expressions. Wording that exaggerates the efficacy of a product or that presents a recommendation by a medical professional may violate the Pharmaceutical Affairs Law and the Act against Unjustifiable Premiums and Misleading Representations. A system that can effectively and quickly detect problematic advertisements is therefore required. Some advertisements cannot be properly classified from word statistics alone, so information beyond word statistics must be embedded in the document vector. The advertising documents targeted in this study have characteristics such as “biases in the positions of specific words” and “periodic occurrence of specific words.” Frequently appearing words in problematic documents (especially cosmetics advertisements) have strongly biased word positions, resulting in a complex multimodal distribution of occurrence positions. Embedding word-order and word-period information in document vectors is therefore considered very effective for identifying problematic advertising documents.
In recent years, the effectiveness of the BERT model has been recognized in various natural language processing tasks. However, faster models are required for application to Internet advertising. Therefore, as a means of achieving both inference speed and discrimination performance, we propose a document feature based on the discrete Fourier transform (DFT) of word vectors weighted by an index previously proposed in a study that categorized Chinese Internet advertisements. In addition, we employ complex-valued support vector machines as discriminative models that can handle complex numbers and generalize well even with small amounts of data.
Although the discrimination performance of the proposed model is somewhat inferior to that of ALBERT and BERT, it is higher than that of DistilBERT, XGBoost, and LightGBM. The inference speed of the proposed model is somewhat slower than that of XGBoost and LightGBM and needs improvement, but it is faster than that of DistilBERT. These results indicate that the proposed model is promising for Internet applications. In addition, we found that when the index proposed in the previous study on Chinese advertisements was applied to Japanese advertisements, it emphasized the word vectors of specific nouns and verbs.
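The feature extraction described above, a DFT over word positions that turns positional bias and periodicity into frequency-domain coefficients, can be sketched for a single embedding dimension. This is a minimal illustration: the weighting index and real word vectors are replaced by toy scalars, and the function name is hypothetical.

```python
import cmath

def weighted_dft(values, weights):
    """DFT of a weighted scalar sequence, e.g., one embedding dimension of a
    document's word vectors, each scaled by a per-word importance weight.
    Coefficient k captures how strongly the weighted signal repeats with
    period n/k over the n word positions."""
    x = [v * w for v, w in zip(values, weights)]
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

For example, a word that occurs at every other position of a four-word document (indicator sequence `[1, 0, 1, 0]`, unit weights) puts all of its non-DC energy into coefficient k = 2, which is exactly the kind of periodicity signal the proposed document feature is meant to expose.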
The ability to understand the surrounding environment compositionally, by decomposing it into its individual components, is an important cognitive ability. Human beings decompose arbitrary entities into parts based on their semantics or functionality and recognize those parts as “objects.” This kind of object-recognition ability is fundamental to planning. Recently, research on “scene interpretation” has been conducted using deep generative models, building models that recognize the environment compositionally. The objective of this paper is to extend scene-interpretation methods. Existing methods are restricted to simple images and cannot deal with complex images such as real photographs and heavily textured images. This is because previous work is fully unsupervised and the objective function simply minimizes reconstruction error; the models therefore have no clues about objects, unlike models that leverage supervised information or inductive bias. In this research, we propose a method to decompose scenes as intended using minimal auxiliary information to identify objects. We build a model that uses the background as auxiliary information to separate the representations of background and foreground, and we show that our method can deal with datasets that are difficult for existing methods.
This paper aims to generate expressive speech for integration with robot and AI-character dialogue systems. To generate expressive speech, some researchers have proposed using labels that express specific dialogue acts and emotions (i.e., speaking-style information). Our approach is to use the speaking-style information as an intermediate representation and to independently train a model that infers the speaking-style information from text and a speech-synthesis model. Using the style-inference model, we construct a method that can generate expressive speech for text in the dialogue domain, outside the scope of the speech-synthesis training data. The method first estimates the labels corresponding to the speaking-style information for the input text. The estimated labels and the input text are then used to generate speech with a speech-synthesis model. Experiments show that our method effectively improves the accuracy of text classification of speaking-style labels. Subjective evaluation experiments show that our method can produce more expressive speech than conventional methods.
In this paper, we report the results of an investigation of the effects of compliments by a conversational agent that differ in the use of positive evaluative words and in whether a rationale for the compliment is given. We consider four types of compliment expressions: explicit expressions that include positive evaluative words and implicit expressions that do not, each given with or without a concrete rationale. How compliments are received and which expressions are preferred differ depending on the recipient, so we investigated the relationship between the forms of compliment expression and recipients with different characteristics. In an online experiment, 500 participants watched videos of the conversational agents interacting and evaluated the dialogue videos as if they themselves were interacting with the agent. The results showed that those who received implicit compliments with rationales perceived the conversational agent as significantly more intelligent than those who received explicit compliments. The rationale was more effective than the explicit compliment among female participants. Preference for the implicit compliment with a rationale was positively correlated with extraversion and openness scores; however, it had the opposite effect on those with high neuroticism scores.
An explainable model is proposed to automatically explain when and what kinds of behaviors affected interlocutors’ impressions in group discussions. Focusing on self-reported scores of impressions such as the atmosphere and the enjoyment felt during the discussions, we tackle a new problem of identifying both the influential behaviors and the timings at which they were related to impression formation. To that end, this paper formulates the problem as identifying the behavioral features that contributed to the impression prediction and detecting the timings when such behaviors occurred most frequently. To solve this problem, this paper presents a two-fold framework consisting of a prediction model using random forest regressors, followed by an explanation model using a SHAP analysis. The prediction part employs functional head-movement features, which can capture various aspects of interactive conversational behaviors. The key feature of the explanation part is the temporal decomposition of the features’ contributions, obtained by integrating the SHAP values with the temporal distributions of the occurrence probabilities of the behavioral features, which are computed by kernel density estimation from the detected samples of head movements and their functions. This decomposition exploits the dual additivity of the SHAP values and the functional head-movement features over time. Finally, the influential behaviors and their timings are identified from the temporal local maxima of the features’ contribution distributions. Targeting 17 four-female group discussions included in the SA-Off corpus, this paper presents case studies and quantitative evaluations that compared the reported overall scores with two-minute-wise scores at the timings found by the model. These results confirmed the predictability and explainability of the proposed model.
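The temporal decomposition described above, distributing a feature's SHAP value over time in proportion to the feature's occurrence density, can be sketched as follows. This is a minimal sketch under simplifying assumptions: one feature, a hand-rolled Gaussian kernel density estimate, and hypothetical function names.

```python
import math

def gaussian_kde(samples, t, bandwidth=1.0):
    """Kernel density estimate at time t from occurrence timestamps."""
    n = len(samples)
    return sum(math.exp(-0.5 * ((t - s) / bandwidth) ** 2)
               for s in samples) / (n * bandwidth * math.sqrt(2 * math.pi))

def contribution_curve(shap_value, samples, times, bandwidth=1.0):
    """Spread one feature's SHAP value over time via its occurrence density.
    Local maxima of the curve indicate when the behavior most plausibly
    influenced the reported impression."""
    return [shap_value * gaussian_kde(samples, t, bandwidth) for t in times]
```

Because SHAP values are additive across features, the per-feature curves can simply be summed to obtain an overall time-resolved explanation, which is the property the framework relies on.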
In recent years, various applications such as semantic search, question answering, and dialogue systems using large-scale knowledge graphs such as Wikidata and DBpedia have been studied. This study focuses on Question Answering over Knowledge Graphs (KGQA). To promote and evaluate KGQA studies, multiple question-answering datasets based on large-scale knowledge graphs have been constructed. CSQA is a question-answering benchmark based on Wikidata. CSQA targets interactive question answering and contains 10 question types, some of which require inference, such as comparison and set operations. In a previous study, a multi-task semantic-parsing model that converts utterances into logical forms expressing operations on a database answered with high accuracy. The conversion to a logical form is performed by defining a grammar for deriving logical forms and predicting, from the utterance, the order in which the grammar rules are applied. To learn a model for this prediction, it is necessary to search, among the logical forms the grammar can generate, for one that answers the question. However, since the search space is enormous, depending on the search method, problems such as a low search success rate and incorrect logical forms occur, which may adversely affect the learning of the model. In this research, we propose a method that searches for logical forms with a high success rate in a short time by using patterns of logical forms. The proposed method consists of a logical-form-pattern search unit, a logical-form search unit, and a logical-form determination unit. The pattern search unit searches for logical-form patterns that can generate a logical form that answers the question. The logical-form search unit searches only the logical forms that can be created from the obtained patterns. The determination unit decides which logical form to use for learning.
For each logical-form pattern, the determination unit counts the number of questions that can be answered by a logical form generated from that pattern. Then, among the logical forms found for each question, it selects the one generated from the pattern with the largest count. In the evaluation, we searched for logical forms for each of the 10 question types in the dataset, with 5,000 questions each, and compared the search success rate and the search time. We also compared the question-answering accuracy when training the existing system on the search results. The search success rate increased for 9 of the 10 question types in the dataset used in the experiment, the search time was reduced for 7 of the 10, and the question-answering accuracy improved for 8 of the 10.
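The determination step, prefer the logical form whose pattern answers the most questions overall, can be sketched as follows. This is a minimal sketch with hypothetical names and toy pattern identifiers; the actual patterns and logical forms come from the grammar-based search.

```python
from collections import Counter

def select_logical_forms(candidates_per_question):
    """candidates_per_question: one list per question, each containing
    (pattern_id, logical_form) pairs found by the search.
    Returns one logical form per question, preferring the candidate whose
    pattern can answer the most questions overall."""
    # Count, for each pattern, how many questions it can answer.
    support = Counter()
    for cands in candidates_per_question:
        for pid in {pid for pid, _ in cands}:
            support[pid] += 1
    # For each question, keep the candidate from the best-supported pattern.
    return [max(cands, key=lambda c: support[c[0]])[1]
            for cands in candidates_per_question]
```

Favoring widely applicable patterns makes the selected training targets more consistent across questions, which is the motivation stated for the determination unit.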
A novel machine learning framework is proposed to automatically recognize the synergetic functions of Aizuchi and the head movements of listeners in conversations. Listeners’ head movements, such as nodding, and Aizuchi, i.e., listeners’ short back-channel utterances, play a variety of functions, such as signaling listening, agreement, and emotions. This paper presents a functional Aizuchi corpus and analyzes it with the functional head-movement corpus that the authors previously created. The analysis reveals synergetic relationships between Aizuchi and head movements, including reinforcement, multiplexing, and complementation. This paper then defines a functional category system called synergetic functions, which classifies reinforcement and multiplexing as product functions and complementation as sum functions. Next, several models using convolutional neural networks (CNNs) are designed to recognize such synergetic functions from the time series of the listeners’ prosodic features and head poses. More specifically, we compare several architectures that employ early/late feature fusion and single/two-stage decision-making. The experimental results show that the proposed models achieved a maximum F1-score of 0.71 for the product function of Aizuchi’s continuer and head-movement back-channel, and 0.88 for a sum function called back-channel acknowledgment that was complementarily expressed by head movements and Aizuchi. These results confirm the potential of the proposed framework.
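The early/late fusion distinction compared above can be sketched as follows. This is a minimal structural sketch with hypothetical names; the toy `classify` callables stand in for the CNN branches, and early fusion is shown as plain feature concatenation.

```python
def early_fusion(prosody, head_pose, classify):
    """Early fusion: concatenate the two modalities' feature vectors
    and run a single classifier over the joint representation."""
    return classify(prosody + head_pose)

def late_fusion(prosody, head_pose, classify_p, classify_h):
    """Late fusion: run a separate classifier per modality and combine
    the decisions (here, by averaging the scores)."""
    return 0.5 * (classify_p(prosody) + classify_h(head_pose))
```

Early fusion lets the classifier model cross-modal interactions directly, while late fusion keeps the branches independent and merges only at decision time; the paper's single/two-stage decision-making variants are a further axis on top of this choice.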