Collection of Meta Information with User-Generated Question Answer Pairs and its Reflection for Improving Expressibility in Response Generation

This paper concerns the problem of realizing consistent personalities in neural conversational modeling by using user-generated question-answer pairs as training data. Within the framework of role play-based question-answering, we collected single-turn question-answer pairs for particular characters from online users. We also collected meta information associated with the question-answer pairs, such as emotion and intimacy. We verified the quality of the collected data and, by subjective evaluation, verified its usefulness in training neural conversational models that generate responses reflecting the meta information, especially emotion.

In this method, fans of a particular character voluntarily provide question-answer pairs by playing the role of that character. It has been demonstrated that high-quality question-answer pairs can be collected efficiently in this way and that dialogue systems exhibiting consistent personalities can be realized with them (Higashinaka et al. 2018).
This paper extends that work and aims to collect question-answer pairs for particular characters together with other pieces of information (called "meta information"), such as emotion and intimacy levels. The aim of collecting this additional data is to realize dialogue systems whose utterances can be controlled to reflect such meta information (Zhou and Wang 2018; Song et al. 2019). This is a useful feature when we want to realize systems that are affective and can become more intimate as an interaction progresses (Zhou and Wang 2018). We verify the quality of the collected data and empirically show that conversational models exhibiting consistent personalities as well as meta information, especially emotion, can be successfully realized using voluntarily provided user-generated question-answer pairs. Note that, in this paper, we regard personality as the character-ness of a real or fictional character.
In what follows, we first describe the idea of role play-based question-answering, followed by our data collection of question-answer pairs and meta information. Then, in Section 3, we describe our approach for training conversational models that take the meta information into account. In Section 4, we describe experiments conducted to verify our approach, in which we performed both objective and subjective evaluations using the data collected for three characters. Finally, we summarize the paper and mention future work.

Data Collection using Role Play-based Question-answering
Role play-based question-answering (Higashinaka et al. 2013b, 2018) is a data collection framework in which multiple users (typically fans) play the role of a certain character and respond to questions from online users (who can also be fans). Since fans are knowledgeable about the character and find it amusing to answer questions in the role of their favorite character, and since online users can ask their favorite character various questions, this framework can motivate users to voluntarily provide dialogue data centering around a particular character. Higashinaka et al. (2018) showed that fans are highly motivated to provide data and that the collected data are of sufficiently high quality to realize dialogue systems exhibiting consistent personalities.
In this study, we use this framework to collect question-answer pairs together with other pieces of meta information. More specifically, we collect emotion and intimacy levels from fans in addition to question-answer pairs.

Data collection including meta information
We collected dialogue data (single-turn question-answer pairs) for three famous characters in Japan: Ayase Aragaki (Ayase), Hime Tanaka (Hime), and Hina Suzuki (Hina). Ayase is a fictional character in the novel series "Ore no imouto ga konnani kawaii wakeganai" (My Little Sister Can't Be This Cute). Her character is often referred to as a "yandere"; according to Wikipedia, yandere characters are mentally unstable, incredibly deranged, and use extreme violence or brutality as an outlet for their emotions. Hime and Hina are virtual YouTubers who form a duo called "HIMEHINA." Hime's character is friendly, while Hina has a goofy and laid-back character.
Question-answer pairs were collected on the websites established in the characters' fan communities. Figures 1 and 2 show screenshots of the websites for Ayase and Hime, respectively. The website for Hina is identical to that for Hime except that the images used were those of Hina.
Users can ask the characters questions by using a text-field interface, and users who want to play the role of the characters can post answers. Users can post questions and answers at any time; that is, the interaction is asynchronous. Multiple answers can be posted to the same question.
In addition, users can input meta information when posting their answers. The meta information we collected was of two kinds: emotion and intimacy.
Emotion is a label provided for an answer. It indicates the emotion behind the answer, such as angry or happy. The list of emotions differs for each character. There are 10 emotion labels for Ayase, including "Normal," "Stumped," and "Angry"; these were decided on the basis of the emotions she exhibits in the novel series in which she appears. For Hime and Hina, we used slightly different emotion labels decided on the basis of their behavior on YouTube. We employed the notion of basic emotions (Ekman 1992) for ease of annotation by online users.
Intimacy is a label provided for an answer. It indicates how close the respondent feels to the questioner, and its value is a discrete integer from 1 (least intimate; the intimacy level for a stranger) to 5 (most intimate; the intimacy level for a family member).
Since the intimacy feature had not yet been developed when we collected the data for Ayase, we collected only emotion labels for Ayase, and both kinds of meta information for Hime and Hina.
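To make the shape of the collected data concrete, the following is a hypothetical record illustrating a question-answer pair with its meta information; the field names, answer text, and storage format are our own illustration, not the actual schema of the collection websites.

```python
# Hypothetical record illustrating a collected question-answer pair with
# meta information; field names and values are illustrative, not the
# actual schema used by the websites.
qa_pair = {
    "character": "Hime",
    "question": "Do you like Hime!?",  # posted by an online user
    "answer": "Of course I do!",       # posted by a fan playing Hime
    "emotion": "Joyful",               # label from Hime's emotion list
    "intimacy": 4,                     # integer from 1 (stranger) to 5 (family)
}
```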

Statistics of collected data
The statistics of the collected question-answer pairs are shown in Table 2. There were 15,179, 12,746, and 10,739 pairs collected for Ayase, Hime, and Hina, respectively. Each dataset was collected within a period shorter than one month, indicating that the framework is efficient for collecting dialogue data. Note that the users who provided data were not paid for their effort; the work was entirely voluntary.
As for the meta information, the distributions of emotion labels for Ayase are shown in Figure 3, and those for Hime and Hina are shown in Figure 4. The emotion labels for Ayase are fairly evenly distributed, whereas "Joyful" is dominant for Hime and Hina, reflecting their personalities. The distributions of intimacy labels for Hime and Hina are shown in Figure 5.

Procedure
To confirm the quality of the question-answer pairs collected from online users, we conducted a subjective evaluation using human judges, none of whom were authors. We asked potential judges to rate their knowledge level about each character on a five-point Likert scale (1: not knowledgeable, 5: very knowledgeable) and used only those whose self-declared knowledge level was three or more. As a result, eleven, nine, and nine judges participated in the evaluation of the question-answer pairs for Ayase, Hime, and Hina, respectively. The judges rated each answer by their degree of agreement with the following statements on a five-point Likert scale (1: strongly disagree, 5: strongly agree).
Naturalness: The answer is appropriate as the character's response.
Reflection: The answer reflects the meta information (emotion or intimacy).
When judging naturalness, the judges were shown pairs of a question and a user-generated answer. For this evaluation, 100 unique question-answer pairs were randomly selected from the collected data.
When judging the reflection of meta information, the judges were shown a tuple of a question, the meta information, and the user-generated answer. For this evaluation, 100 unique tuples were randomly selected from the collected data. As a control, we prepared 100 unique tuples whose meta information was randomly replaced with different meta information.

Results
Table 3 shows the evaluation results. In terms of naturalness, all three characters attained high scores. This shows that, even though role play-based question-answering does not pay users for their efforts, it can be used to collect appropriate responses for characters, conforming to the results of Higashinaka et al. (2018). Here, we define characteristic responses as those having a naturalness score of 3 or more. Under this definition, the percentages of characteristic responses in the collected data for Ayase, Hime, and Hina were 86%, 99%, and 98%, respectively, indicating that the responses are of high naturalness.
For the reflection of emotion, looking at the "Actual" scores, the emotion labels appear to be of good quality. Comparing "Actual" with "Random" (randomly replaced emotion labels), we observe a substantial drop, meaning that utterances and emotions are well associated in our data. For intimacy, the results were different: although the "Actual" scores were high, the "Random" scores were also high (with only a slight drop), meaning that utterances and intimacy levels are not as strongly associated as utterances and emotions; it may be difficult for humans to accurately recognize the level of intimacy from utterances.

Table 3 Results of human evaluation for collected question-answer pairs. Scores were averaged over all judges. Standard deviations are shown in parentheses.

Figure 6 shows the proportion of score differences between judges (i.e., score differences between all judge pairs) for naturalness and reflection of meta information. For meta information, we used only the "Actual" results in this analysis. Overall, most score differences are 1 or less, meaning that the judges' decisions are similar to each other.

Fig. 6 Score differences between judges
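As a rough illustration of the agreement analysis behind Figure 6, the following sketch computes absolute score differences over all judge pairs; the scores shown are placeholders, and aligning judges' scores by item is an assumption about how the analysis was organized.

```python
# A sketch of the inter-judge agreement analysis behind Figure 6:
# absolute score differences over all pairs of judges, per item.
from itertools import combinations

def pairwise_differences(scores_by_judge):
    """scores_by_judge: one list of scores per judge, aligned by item."""
    diffs = []
    for a, b in combinations(scores_by_judge, 2):
        diffs.extend(abs(x - y) for x, y in zip(a, b))
    return diffs

judges = [[4, 3, 5], [4, 4, 5], [3, 3, 4]]  # placeholder Likert scores
diffs = pairwise_differences(judges)
share_small = sum(d <= 1 for d in diffs) / len(diffs)  # proportion of diffs <= 1
```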

Training Conversational Models that Reflect Meta Information
To test whether it is possible to generate utterances that reflect the meta information collected by role play-based question-answering, we train neural conversational models using the collected question-answer pairs together with the meta information. Figure 7 gives an overview of our training procedure. Since we want to generate utterances that reflect meta information, we adopt a model architecture that can take such additional information into account during decoding. Below, we explain how we train such conversational models.

Pre-training with meta information
The amount of data collected for Ayase, Hime, and Hina may be too small to learn a generation model from scratch. Therefore, we decided to pre-train the models with a large amount of data. In this study, we used the large number of general question-answer pairs that we had collected previously when developing our question-answering system (Higashinaka et al. 2013a). This dataset, "General QA pairs," was collected via crowdsourcing: crowdworkers were given topics and wrote questions and answers related to them, with each human intelligence task (HIT) asking a worker to provide 10 question-answer pairs for a topic. The dataset contains 500K question-answer pairs. For example, for the topic Mt. Fuji, it includes pairs such as Q: "Do you know the height of Mt. Fuji?" A: "It's 3,776 meters," and Q: "Is Mt. Fuji the highest mountain in Japan?" A: "Yes, it is."

Meta information (i.e., emotion and intimacy labels) is not included in General QA pairs, which may be problematic in the later fine-tuning stage because pre-training without meta information might lead to models that ignore it. Therefore, we trained classifiers for meta information on our data for Ayase, Hime, and Hina and automatically annotated General QA pairs with such meta information. We call the dataset annotated in this way "Auto-annotated general QA pairs." By pre-training with Auto-annotated general QA pairs, a model is more likely to take meta information into account appropriately when fine-tuned on the data with meta information.
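A rough sketch of this auto-annotation step follows, assuming trained classifiers that expose a predict() method (see the classifier subsection below); the helper and field names here are hypothetical.

```python
# A sketch of auto-annotating General QA pairs with meta information,
# assuming trained emotion/intimacy classifiers with a predict() method.
# The helper and field names are hypothetical, for illustration only.
def annotate_general_qa(qa_pairs, emotion_clf, intimacy_clf):
    annotated = []
    for question, answer in qa_pairs:
        annotated.append({
            "question": question,
            "answer": answer,
            "emotion": emotion_clf.predict(answer),    # label inferred from the answer
            "intimacy": intimacy_clf.predict(answer),  # level inferred from the answer
        })
    return annotated
```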

Generation models
Our generation models are based on a dual-source encoder-decoder model originally developed for automatic post-editing in machine translation. Such a model takes source text in the source language and, as additional information, a tentative machine translation result for that text; it then outputs target text in the target language. We use this model for our generation models, with meta information as the additional information instead of tentative machine translation results.
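A hedged sketch of how a training example might be arranged for such a dual-source model is given below: the question fills the usual source slot and the meta information fills the slot that would otherwise hold the tentative translation. The exact serialization used with the actual toolkit is not specified in the text, so the field names (borrowed from post-editing conventions: src, mt, pe) are an assumption.

```python
# A sketch of one training example for the dual-source model. Field names
# follow common automatic post-editing conventions (src = source text,
# mt = tentative-translation slot, pe = target) and are an assumption.
def make_dual_source_example(question, meta_label, answer):
    return {
        "src": question,    # first source: the user's question
        "mt": meta_label,   # second source: meta information, e.g., an emotion label
        "pe": answer,       # target: the character's answer
    }

example = make_dual_source_example("Do you like Hime!?", "Favorable", "Of course I do!")
```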

Classifiers for meta information
We need classifiers for meta information to create Auto-annotated general QA pairs. To realize them, we built BERT-based classifiers with an additional multi-layer perceptron (MLP) layer, using the representation encoded by BERT as input. We used bert-base-
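A minimal sketch of such a classifier is shown below, assuming a Hugging Face Transformers-style BERT encoder; the pre-trained model name is a placeholder, since the exact model identifier is cut off in the text above.

```python
# A minimal sketch of a BERT-based meta-information classifier with an
# MLP head on the [CLS] representation. The pre-trained model name is a
# placeholder assumption; the paper's exact identifier is truncated above.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MetaInfoClassifier(nn.Module):
    def __init__(self, encoder_name: str, num_labels: int, hidden: int = 256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        self.mlp = nn.Sequential(            # MLP layer on top of BERT
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]    # [CLS] token representation
        return self.mlp(cls)

# Usage with a placeholder model name and 10 emotion labels (as for Ayase):
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = MetaInfoClassifier("bert-base-multilingual-cased", num_labels=10)
batch = tokenizer(["I will call the police!"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
predicted_label = logits.argmax(dim=-1)      # index of the predicted emotion
```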

Models for comparison
We trained our conversational models and evaluated their performance. We trained three models for comparison:

wo-Meta: pre-trained using General QA pairs (without meta information) and fine-tuned without meta information using the role play-based QA pairs.

w-Meta: pre-trained using General QA pairs (without meta information) and fine-tuned with meta information using the role play-based QA pairs.

w-Meta+Anno: pre-trained using Auto-annotated general QA pairs (with meta information) and fine-tuned with meta information using the role play-based QA pairs.

We assumed that, by comparing wo-Meta and w-Meta, we could check whether the meta information collected during role play-based question-answering is useful for generating responses that reflect it, and that, by comparing w-Meta and w-Meta+Anno, we could check whether the automatic annotation of meta information is useful for pre-training.
Note that the aim of this paper is to verify whether utterances that reflect meta information can be generated from user-generated question-answer pairs. We used OpenNMT-APE for training the models with default parameters. OpenNMT-APE implements the dual-source BERT encoder-decoder model, which allows for the incorporation of additional information. We used the same BERT and sentencepiece models as in Section 3.3. The collected data were randomly split into train, development, and test sets at a ratio of 8/1/1. Tables 7 and 8 show the results of the automatic evaluation against the test data for emotion and intimacy, respectively. We used perplexity, distinct-1,2, and BLEU-1,2,3,4 as evaluation metrics (Liu et al. 2016). Perplexity measures the adequacy of a language model, the distinct metrics measure the diversity of expressions in generated utterances, and the BLEU metrics measure the accuracy of generated utterances in terms of lexical overlap with references.
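For concreteness, the following is a minimal sketch of the distinct-n metric (unique n-grams over total n-grams across all generated responses); whitespace tokenization is a simplifying assumption, not necessarily the tokenization used in the experiments.

```python
# A minimal sketch of distinct-n: the ratio of unique n-grams to total
# n-grams over all generated responses. Whitespace tokenization is a
# simplifying assumption for illustration.
from collections import Counter

def distinct_n(responses, n):
    ngrams = Counter()
    for resp in responses:
        tokens = resp.split()  # assumption: whitespace tokenization
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(ngrams.values())
    return len(ngrams) / total if total > 0 else 0.0

responses = ["yes it is", "yes it is", "no it is not"]
print(distinct_n(responses, 1))  # higher = more diverse unigrams
print(distinct_n(responses, 2))  # higher = more diverse bigrams
```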

Procedure
To assess the quality of the generated responses, we conducted a subjective evaluation. The procedure was the same as that in Section 2.3.1. We used judges whose self-declared knowledge level about the characters was three or more. As a result, eleven, twelve, and twelve judges participated in the evaluation for Ayase, Hime, and Hina, respectively. Note that this was a different set of judges from the experiment in Section 2.3.1.
The judges rated each output answer by their degree of agreement with the following statements on a five-point Likert scale (1: strongly disagree, 5: strongly agree).
Naturalness: The answer is appropriate as the character's response.
Reflection: The answer reflects the meta information (emotion or intimacy).
When judging naturalness, the judges were shown pairs of an input question and an answer output by each generation model. The input to the models comprised 100 questions (with meta information when the model required it) randomly sampled from the test data. The answers to each question were presented in randomly shuffled order. We asked the judges to evaluate each output independently and to give the same score when models generated the same response to a question.
When judging the reflection of meta information, the judges were shown a tuple of an input question, the meta information, and the output answer. We prepared two sets of 100 questions as input to the models. One set comprised random samples from the test data. The other set also comprised random samples from the test data, but with unseen meta information: this set was created by artificially replacing the original meta information with different meta information.
For instance, for a tuple with the input question "Do you like Hime!?" and the meta information "Favorable," the meta information was forcibly replaced with another label (e.g., "Surprised") randomly selected from the labels excluding the original one ("Favorable"). This set made it possible to test the robustness of the models against unseen (possibly discrepant) meta information. We call the former condition "Seen" and the latter "Unseen."
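The replacement used to build the "Unseen" set can be sketched as follows; the function name is ours, and uniform sampling over the remaining labels is an assumption consistent with the description above.

```python
# A sketch of constructing the "Unseen" condition: replace the original
# meta label with one drawn from the remaining labels. Uniform sampling
# is an assumption consistent with the description above.
import random

def randomize_label(original, label_set):
    return random.choice([lab for lab in label_set if lab != original])

labels = ["Favorable", "Surprised", "Joyful"]   # illustrative label set
unseen = randomize_label("Favorable", labels)   # e.g., "Surprised"
```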

Results for naturalness
Table 9 shows the results of the human evaluation for naturalness. Among the three models, w-Meta had the best score. However, the differences between the models were small and not statistically significant (Wilcoxon rank-sum test with Bonferroni correction). The three models showed the same level of naturalness, which is good for w-Meta+Anno because it means higher distinct scores can be achieved without a loss of naturalness.
The naturalness of generated responses needs to be sufficiently high in order to guarantee the naturalness of utterances that incorporate meta information such as emotion and intimacy. Table 3 shows that, in the collected data, the naturalness of the responses for Ayase, Hime, and Hina was 3.62, 3.97, and 3.99, respectively. Since the data evaluated there were produced by humans, these figures can be considered the upper bound of naturalness. As shown in Table 9, the naturalness of the responses generated by the proposed models (w-Meta and w-Meta+Anno) was at worst 3.33, 3.76, and 3.68, respectively, reaching 90% or more of the upper bound. This indicates that the naturalness of the generated responses was sufficiently high.
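For reference, the significance test described above can be sketched as follows using SciPy's rank-sum test; the score arrays are placeholders, and scaling the p-value is one standard way to apply the Bonferroni correction.

```python
# A sketch of the Wilcoxon rank-sum test with Bonferroni correction used
# for the model comparisons. The score arrays are placeholders.
from scipy.stats import ranksums

scores_model_a = [3, 4, 2, 5, 3, 4]  # placeholder per-item Likert scores
scores_model_b = [4, 4, 3, 5, 4, 5]  # placeholder per-item Likert scores

stat, p = ranksums(scores_model_a, scores_model_b)
n_comparisons = 3                         # e.g., three pairwise model comparisons
significant = (p * n_comparisons) < 0.01  # Bonferroni-corrected at p < 0.01
```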

Results for reflection of emotion
Table 10 shows the results of the human evaluation for the reflection of emotion. We can see that w-Meta+Anno performed best. For Ayase and Hime, w-Meta+Anno significantly outperformed wo-Meta; for Hina, we did not observe a significant difference, but the performance was still better than that of wo-Meta. This indicates the effectiveness of our pre-training with Auto-annotated general QA pairs, at least for emotion. Comparing "Seen" and "Unseen," the scores for "Seen" were generally higher, which is reasonable, indicating the difficulty of handling utterances combined with unseen emotions.

Table 10 Results of human evaluation for reflection of emotion. Scores were averaged over all judges. Asterisks (**) indicate that the value is significantly better than that of wo-Meta (p < 0.01; Wilcoxon rank-sum test with Bonferroni correction).

Tables 11, 12, and 13 show the human evaluation scores for each emotion label for Ayase, Hime, and Hina, respectively. Here, we used the "Seen" data for this analysis. Note that the numbers of samples differ across emotion labels, reflecting the distribution of the original data. For Ayase, w-Meta+Anno shows a significant improvement over wo-Meta for "Angry" and "Happy"; for Hime, it significantly outperformed wo-Meta for "Surprised" and "Angry." These two characters seem to have achieved their overall improvement by successfully exhibiting several core emotions.

Table 13 Results of human evaluation for reflection of each emotion for Hina. Scores were averaged over all judges. See Table 11 for the notation of the table.

The lower performance for Hina is presumably because Hina had few characteristic phrases to learn, whereas Ayase and Hime had typical phrases related to specific emotions (e.g., "I will call the police!" for Ayase's "Angry" and "W-What is that!?" for Hime's "Surprised"). Figure 8 shows the percentages of unique tokens found in the responses for each emotion label. For Ayase's "Angry" and Hime's "Surprised," the percentage for "Generated" is much lower than that for "Gold," suggesting that the models learned typical phrases for these emotions. In contrast, Hina shows smaller drops for all emotion labels, suggesting that her model did not acquire typical expressions for each emotion.

Fig. 8 Percentages of unique tokens for each emotion. "Gold" indicates the gold responses for the test data and "Generated" indicates the generated responses for the test data.

Results for reflection of intimacy

Table 14 shows the results of the human evaluation for the reflection of intimacy. Neither w-Meta nor w-Meta+Anno was superior: w-Meta performed almost the same as wo-Meta, and w-Meta+Anno was slightly worse than wo-Meta. Tables 15 and 16 show the human evaluation scores for each intimacy label for Hime and Hina, respectively. Here, we used the "Seen" data for this analysis. The differences between the models were small and not statistically significant for any intimacy label for either character. Figure 9 shows the percentages of unique tokens for each intimacy label. Both characters showed only small drops for all intimacy labels, suggesting that their generation models could not learn typical expressions associated with each intimacy label. Another reason may be that it is difficult for humans to correctly recognize the level of intimacy from utterances alone, as discussed in Section 2.4.
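Assuming the Figure 8/9 metric is the number of distinct tokens divided by the total number of tokens over all responses with a given label (our reading of "percentages of unique tokens"), a sketch is:

```python
# A sketch of the unique-token percentage in Figures 8 and 9, assuming it
# is (distinct tokens) / (total tokens) over all responses with a label.
# Whitespace tokenization is a simplification for illustration.
def unique_token_percentage(responses):
    tokens = [tok for resp in responses for tok in resp.split()]
    return 100.0 * len(set(tokens)) / len(tokens) if tokens else 0.0

angry = ["I will call the police!", "I will call the police!"]
print(unique_token_percentage(angry))  # low value indicates repetitive phrasing
```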

Summary and Future Work
The purpose of this study was to verify whether natural utterances reflecting meta information (emotion and intimacy) can be realized for a character from the question-answer pairs and meta information obtained by role play-based question-answering. We performed experiments using data for three famous characters in Japan. For training the generation models, we proposed pre-training with data automatically annotated with meta information, which generally led to better performance.
As for emotion, the subjective evaluation results indicate that utterances reflecting meta information can be generated, although some differences were found depending on the character; it seems difficult to generate appropriate emotion-reflecting responses for a character who lacks typical expressions. We also found that it was difficult to generate responses that reflect intimacy. Although further investigation is required, we attribute this to the lack of typical phrases related to intimacy and to the way humans recognize intimacy in utterances.
For future work, we want to improve the quality of our generation models, since there seems to be much room for improvement compared with human performance. It may also be interesting to apply other pre-training methods (Yang et al. 2019; Liu et al. 2020) and to incorporate knowledge of the characters in question (Ghazvininejad et al. 2018) in order to enhance the character-ness of the generated utterances. We also want to examine the relationship between the naturalness of a generated response and the degree to which the meta information can be reflected. Although we focused on the character-ness of real or fictional characters, we also want to apply our method to personality in general. It is of particular interest to us whether other kinds of meta information can be collected and used for response generation. Realizing workable dialogue agents based on our work is one of our next steps, which would require the consideration of emotion and intimacy over multiple turns.
Finally, it should be noted that our proposed method carries the risk of being used without the approval of copyright holders. For the appropriate development and application of the proposed method, appropriate procedures for using the characters need to be followed in order to avoid legal and ethical problems.

(Received August 1, 2020)
(Revised November 8, 2020)
(Accepted December 12, 2020)