Annotating a Driving Experience Corpus with Behavior and Subjectivity

Personality is an important internal framework that we use when we communicate with others; thus, it is critical for vehicle-driver communication. Evaluating an individual’s personality requires associating behavior with subjectivity. To develop a vehicle-driver communication system that is applicable to actual driving settings, a corpus is needed that includes colloquial, everyday experiential expressions along with their emotions, polarity, and sentiments. In addition, the corpus must cover a wide range of subjective measures, such as human judgements, perceptions, and cognitions during driving. We therefore construct a driving experience corpus (DEC) that consists of 253 blog articles (7,831 sentences) with the following four manually annotated tags: (1) driving experience (DE), (2) other’s behavior (OB), (3) self-behavior (SB), and (4) subjectivity (SJ). In this paper, we describe the guidelines, the corpus specification, and the agreement analysis between annotators. We identified three difficulties: the extended self, important information, and voice in mind. We conducted an automatic annotation experiment on the corpus using Conditional Random Fields (CRF). The results indicated F-scores of .768, .478, .534, and .749 for DE, OB, SB, and SJ, respectively, on the test set. Our error analysis reveals difficulties in


Introduction
Given the drastic improvements in machine learning, agents, and robots, systems now share or take over tasks that used to be fully controlled by humans. Many companies, for example, are competing to realize automatic driving and are conducting test drives on roads. However, it is critical that systems understand individual human differences and predict human behavior, especially when a mistake may cause a serious or fatal accident. Systems must catch the attention of the driver and prompt him or her to take the appropriate action. Thus, it is important for systems to communicate with their users in accordance with their individual differences. Tapus et al. (2008), for example, demonstrated that participants were more engaged in rehabilitation exercises with robots whose behavior and messages matched the participants' personalities. We regard it as important that a system communicates with its user in language adapted to the user's personality in order to bridge the gaps between the system's requirements and the user's attention and actions.
A personality is defined as an "abstraction used to describe and explain the coherent patterning over time and space of affects, cognitions, desires and the resulting behaviors that an individual experiences and expresses" (Revelle & Condon, 2015, p. 70). We consciously or unconsciously evaluate others' personalities, predict their next behaviors or utterances, and sometimes adjust our communication style as a result. Thus, understanding the human personality and related behaviors is critical for implementing such a system. The definition of a personality suggests two important things that a human or a system needs to understand: behaviors and the psychological reactions associated with them. In other words, behavior and subjectivity are the critical determinants.
A corpus designed for this goal is more critical than anything else, and it must include the following three characteristics:
• Colloquial and daily expressions under actual driving experiences,
• Subjectivity (opinions, perceptions, cogitations, thinking, and emotions),
• Driving behaviors.
Thus, for the first step, we construct a Driving Experience Corpus (DEC) in Japanese, which consists of 253 blog articles (7,831 sentences). In this study, we report the process of corpus construction (Figure 2), including the problems that occurred and their solutions, the evaluation of the annotation, and an experiment to automatically tag a text using a Conditional Random Fields-based model trained on the DEC.

Figure 1: Example of the annotation
Figure 2: Process of annotation

Related Works
In this section, we briefly introduce the related works on personality, driving and annotations.
In the field of Natural Language Processing, several studies predict the personalities of authors from social networking service (SNS) texts, both in English (e.g., Golbeck et al., 2011; Park et al., 2015; Plank & Dirky, 2011; Schwartz et al., 2013) and in Japanese (e.g., Nasukawa et al., 2016; Nasukawa & Kamijo, 2017). These studies, however, focus on the detection of authors' personalities, while our goal is to associate behaviors and personalities.

Traffic, Transportation and Driving
Previous studies related to driving in Japan focused mainly on driving actions, traffic rules, and ontologies (e.g., Takayama et al., 2017; Kawabe et al., 2015; Suzuki et al., 2015; Taira et al., 2014). Their datasets, however, are not publicly available. Furthermore, they lack a human-centered or human experiential focus.
Regarding annotated corpora in this domain, four Japanese corpora have been constructed: (1) the University of Tsukuba Corpus Tagged with

Table 1: Tags and definitions

Driving Experience (DE): A car driving experience is an experience that contains one or both of the following, and the tag specifies the range of this experience: 1. descriptions and impressions of the blog author's own driving scene, and 2. scenes in which the blog authors themselves do not drive a car but describe other people driving a car, including their impressions of the scene.
Other's Behavior (OB): This tag refers to the actions, behaviors, and objects that are to be manipulated by humans who are not the authors.
Self-Behavior (SB): This tag refers to the actions, behaviors, and objects that are to be manipulated by the authors, including the present and the past, the date, the time, and the place.
Subjectivity (SJ): This tag refers to the evaluation of the behaviors. It refers to the emotions, cognitions, thoughts, judgments, and predictions held by the author as a result of actions.

(2011) constructed the Kyoto University and NTT Blog Corpus of 249 blog articles (4,186 sentences) with sentiment information. These blogs were written by 81 university students on the following four themes: sightseeing in Kyoto, cellphones, sports, and gourmet food. Nakazawa et al. (2018) constructed a phrase-sentiment review corpus (59,758 phrases) with all manually annotated polarity information from the Tsukuba Corpus. These corpora do not focus on driving, thinking and cognition.
Our corpus is different from the existing ones in the following three perspectives:
• It includes colloquial and daily expressions.
• It has a wide range of subjectivity, including emotions, polarity, sentiments, human judgements, perceptions, and cognitions.
• It covers driving experiences.

Guidelines
The guidelines were constructed by an annotation team consisting of the first author, three experienced annotators, and the second author as a supervisor. Using 11 blog articles about driving experiences, we repeatedly annotated the articles, discussed the results, and revised the guidelines.

Definition of Tags
In this project, we focus on behaviors and subjectivity, especially when driving a car. To this end, we first tag a car driving experience (see the DE example) and annotate the experiences with the behavior and subjectivity tags. The detailed definitions are presented in Table 1.

Driving Experience Tag
To exclude texts that are unrelated to car driving experiences, such as events or opinions, we prepared the DE tag.

Behavior and Subjectivity Tags
The behavior and subjectivity tags are attached only within each DE range. In our guidelines, "behavior" refers to an observable behavior: people who are present in the same scene can hear or see it in the same way as others do, and so it is considered to be recognized in common. Behaviors are divided into two tags according to who performs the behavior, the author or others, because this distinction is very important. The system needs to differentiate its user from others in terms of both safety and communication. The risk of an accident may be notably high if the system fails to deliver a message accurately.
Meanwhile, psychological reactions or mental states such as thinking and feeling are all regarded as subjectivity. The behavior and subjectivity tags are basically annotated within one sentence. The tagged units, however, vary from a word to a phrase or a clause, depending purely on semantics (Figure 3). The subsequent sentences provide an example of one sentence with different units of annotation.

Difficult Cases
In the discussion process, we found three difficulties: (1) the extended self or we-ness, (2) important information, and (3) voice in mind.
Extended self or we-ness: Unlike in usual situations, it is sometimes quite difficult to differentiate "me" from "you/other people" when there are multiple people in a car.
In the following sentences, the author was NOT driving the car. She was just sitting in the passenger seat while her husband was driving.
Although she did not have any control over the car, she was drawn into the trouble through no choice or fault of her own. Thus, we regarded this extended self as self-behavior. On the other hand, "screaming" in the last sentence is something that only the author did, which is self-behavior as originally defined.
Important information: There were many cases in which the annotators were at a loss about, or hesitated to ignore, important information related to driving behaviors. Typically, this hesitation arises when times and weather conditions appear at the very beginning of an experience, far from the driving behavior, because we know that they often affect driving behaviors.
Another typical example is an important incident or piece of information that appears in a sentence very close to a tag and affects the preceding or subsequent tags. The rising water temperature gauge is just a fact and is neither a behavior nor subjectivity. However, it specifically led the author to turn on the hazard lights and stop the car.
We resolved this issue by annotating such important information only when it affects behaviors or subjectivity and occurs just before or after the main tag. This very close information is easily connected with the main behavior and subjectivity, whereas for distant information it is difficult to determine to what extent it influences the behavior and subjectivity. For the example sentences, we annotate the first sentence about the water temperature gauge with the same tag as the following sentence, which is SB.
Voice in mind: The texts in parentheses (utterances) were mostly included only when they have nominatives and/or verbs as behaviors. The first "embarrassing…" in the example sentence was not annotated with any tags. However, these expressions often appear in the blog writings, and so we annotated such expressions with the subjectivity tag as a voice in mind.

32nd Pacific Asia Conference on Language, Information and Computation Hong Kong, 1-3 December 2018 Copyright 2018 by the authors

Corpus Annotation
Manual annotation was performed by three annotators using the brat annotation tool (Stenetorp et al., 2012). We prepared a driving-related corpus by filtering 10 million blog articles with driving-related words. After 250 driving-related blog articles were selected from this corpus, the annotators started the actual annotation. For the first round, the 250 articles were divided into three piles and each annotator completed their part. For the second round, the second annotator reviewed the annotated texts and corrected annotation errors if necessary. Any disagreements were discussed. These processes resulted in a total of 261 annotated articles (11 of them were those used for making the guidelines).
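As an illustration of how such annotations might be read back for analysis, the following minimal sketch parses text-bound lines in brat's standoff format into tagged character spans. The example text, the `Span` type, and the function name are hypothetical and are not taken from the DEC; discontinuous spans are collapsed to their outer extent, which is an assumption made here for simplicity.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    tag: str    # e.g. DE, OB, SB, or SJ
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    text: str   # surface string recorded by brat


def parse_brat_ann(ann_text: str) -> List[Span]:
    """Collect text-bound annotations (lines whose ID starts with 'T')."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, notes, and other annotation types
        _ann_id, type_and_offsets, surface = line.split("\t", 2)
        tag, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous spans look like "0 5;8 12"; keep the outer extent
        # (a simplifying assumption of this sketch).
        fragments = [fragment.split() for fragment in offsets.split(";")]
        spans.append(Span(tag, int(fragments[0][0]), int(fragments[-1][1]), surface))
    return spans


# Hypothetical English stand-in for a blog sentence pair.
EXAMPLE_TEXT = "I hit the brake. It was scary."
EXAMPLE_ANN = (
    "T1\tDE 0 30\tI hit the brake. It was scary.\n"
    "T2\tSB 0 16\tI hit the brake.\n"
    "T3\tSJ 17 30\tIt was scary.\n"
    "#1\tAnnotatorNotes T3\tvoice in mind?\n"
)

for span in parse_brat_ann(EXAMPLE_ANN):
    print(span.tag, span.start, span.end, span.text)
```

Keeping the offsets alongside the tag makes it straightforward to check that the SB and SJ spans fall inside a DE range.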

Inter-Annotator Agreement
To evaluate the guidelines, two annotators annotated 15 articles each. Precision, recall, and F-scores were calculated for the following tags, using only the articles in which all the DEs matched (9 articles): (1) DE, (2) OB, (3) SB, and (4) SJ (Table 3). We used the tags that were annotated by the two annotators described in the previous section as the correct data to evaluate the agreement. The results indicate high agreement, which suggests that the guidelines allow for consistent annotations.
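The agreement measures above can be sketched as follows. This is a minimal illustration that assumes exact matching of (tag, start, end) spans against the reference annotation; the matching criterion actually used for Table 3 is not stated here, and the spans in the example are invented.

```python
from typing import Dict, Set, Tuple

Span = Tuple[str, int, int]  # (tag, start offset, end offset)


def prf_by_tag(reference: Set[Span], candidate: Set[Span]) -> Dict[str, Tuple[float, float, float]]:
    """Per-tag precision, recall, and F-score under exact span matching."""
    scores = {}
    for tag in sorted({span[0] for span in reference | candidate}):
        ref = {span for span in reference if span[0] == tag}
        cand = {span for span in candidate if span[0] == tag}
        true_positives = len(ref & cand)
        precision = true_positives / len(cand) if cand else 0.0
        recall = true_positives / len(ref) if ref else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f_score)
    return scores


# Invented spans: the annotators agree on DE and SB but not on the SJ boundary.
reference = {("DE", 0, 30), ("SB", 0, 16), ("SJ", 17, 30)}
candidate = {("DE", 0, 30), ("SB", 0, 16), ("SJ", 20, 30)}

for tag, (p, r, f) in prf_by_tag(reference, candidate).items():
    print(f"{tag}: P={p:.3f} R={r:.3f} F={f:.3f}")
```

Exact matching is the strictest choice; a partial-overlap criterion would give higher scores for boundary disagreements such as the SJ span in the example.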

Classification
We used Conditional Random Fields (CRF) (Lafferty et al., 2001) to conduct the automatic annotation experiments using our annotated corpus.
For these experiments, CRF++ version 0.561 was used. CRF++ is a sequential tagger that requires a template file that specifies the combinations of features.
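To give a sense of what such a template file contains, the sketch below generates unigram templates in CRF++'s `%x[row,col]` notation for a window of tokens around the current position, plus the bigram template `B`. The column layout and the window size are assumptions chosen for illustration; the template actually used in the experiments is not described here.

```python
def build_template(n_columns: int, window: int = 2) -> str:
    """Generate a CRF++ feature template as a string.

    Each feature column (e.g. column 0 = word, column 1 = POS) gets one
    unigram template per relative position in [-window, +window].
    """
    lines = []
    index = 0
    for column in range(n_columns):
        for offset in range(-window, window + 1):
            lines.append(f"U{index:02d}:%x[{offset},{column}]")
            index += 1
    lines.append("B")  # bigram template over adjacent output labels
    return "\n".join(lines) + "\n"


# Hypothetical layout: word and POS columns, one token of context on each side.
print(build_template(n_columns=2, window=1), end="")
```

Additional feature columns, such as a dependency or lexicon flag column, would be covered by increasing `n_columns`.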

Feature Representations
The basic features that are fed to the CRF are the sets of word identities and parts of speech (POSs). In addition to these basic features, we prepared the following five features.
Dependency Relations: Dependency relations carry critical information in Japanese, and so they were also included. If a word is not the last one within its phrase, it is given its following word. If a word is the last one in its phrase, it is given the content word of its parent phrase (Figure 4).
Driving Behavior Phrases (DBP): Using the DBWs, we extracted the driving behavior predicate phrases from three resources: (1) the same writings as mentioned above, (2) two driving license textbooks for rules and driving (Toyota Nagoya Education Center Inc., 1998, 2004), and (3) driving-related blogs. For (1) and (2), we collected all of the predicates using predicate-argument structure analysis with KNP. For (3), we collected the predicates whose predicate or cases include DBWs. This approach resulted in 7,931 total phrases after the human evaluation. The list includes, for example, ハンドルを切る 'turn a steering wheel' and ブレーキを踏む 'step on the brake.'
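The dependency relation rule described above can be sketched as follows. The `Phrase` structure and the `ROOT` placeholder for a phrase without a parent are assumptions introduced for this illustration; in practice the phrase boundaries, content words, and parent links would come from a dependency parser.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Phrase:
    words: List[str]        # surface words in order
    content_word: str       # content word of the phrase
    parent: Optional[int]   # index of the parent phrase, None for the root


def dependency_features(phrases: List[Phrase], root_value: str = "ROOT") -> List[str]:
    """Return one dependency feature per word, in sentence order."""
    features = []
    for phrase in phrases:
        last = len(phrase.words) - 1
        for position in range(len(phrase.words)):
            if position < last:
                # Not the last word of its phrase: use the following word.
                features.append(phrase.words[position + 1])
            elif phrase.parent is not None:
                # Last word of its phrase: use the parent's content word.
                features.append(phrases[phrase.parent].content_word)
            else:
                # Root phrase has no parent (placeholder is an assumption).
                features.append(root_value)
    return features


# ブレーキを踏む 'step on the brake': [ブレーキ を] depends on [踏む].
sentence = [
    Phrase(words=["ブレーキ", "を"], content_word="ブレーキ", parent=1),
    Phrase(words=["踏む"], content_word="踏む", parent=None),
]
print(dependency_features(sentence))
```

The resulting list can be written as an extra column in the tagger's input so that each word carries its dependency feature.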

Personality words (1): We extended 109 personality-related words (Iwai et al., 2017) to 487 words using a word2vec model trained on 200 million web-crawled Japanese sentences. These extended personality words are flagged only when the zero anaphora resolution results of the Japanese dependency and case structure analyzer KNP indicate that the nominative is human.
Personality words (2): We extended 142 Japanese personality-related words (Iwai et al., 2017, 2018) to 665 words using the same word2vec model as for Personality words (1). Of these, 395 words were selected based on human evaluations, and they express more direct personality traits such as 心配 'anxious' and 同調 'agreeable.' Conversely, Personality words (1) includes more suggestive ones, such as 文化 'culture' and 美術 'arts.' These words are flagged only when the zero anaphora resolution results of KNP indicate that the nominative is human.
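The lexicon expansion and flagging steps can be illustrated with a small sketch. It uses hand-made toy vectors and plain cosine similarity in place of a trained word2vec model, and the `top_k` and `min_sim` settings, the neighbour word 不安, and the boolean `nominative_is_human` input (standing in for the analyzer's zero anaphora output) are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_lexicon(seeds: Set[str], vectors: Dict[str, List[float]],
                   top_k: int = 2, min_sim: float = 0.7) -> Set[str]:
    """Add each seed's nearest neighbours that pass a similarity threshold."""
    expanded = set(seeds)
    for seed in seeds:
        if seed not in vectors:
            continue
        scored = sorted(
            ((cosine(vectors[seed], vec), word)
             for word, vec in vectors.items() if word not in seeds),
            reverse=True,
        )
        expanded.update(word for sim, word in scored[:top_k] if sim >= min_sim)
    return expanded


def personality_flags(tokens: List[str], lexicon: Set[str],
                      nominative_is_human: bool) -> List[bool]:
    """Flag lexicon words only when the nominative is judged to be human."""
    return [nominative_is_human and token in lexicon for token in tokens]


# Toy two-dimensional vectors; real embeddings would come from word2vec.
toy_vectors = {"心配": [1.0, 0.1], "不安": [0.9, 0.2], "道路": [0.0, 1.0]}
lexicon = expand_lexicon({"心配"}, toy_vectors)
print(sorted(lexicon))
print(personality_flags(["不安", "道路"], lexicon, nominative_is_human=True))
```

In the procedure described above, the automatically expanded candidates for Personality words (2) were additionally filtered by human evaluation, which this sketch does not model.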

Experimental Setup
We divided the dataset of 253 articles into training, development and test data and conducted experiments. First, we used all of the features together for the training. After that step, we determined the parameters using the development data. Next, we applied the best model to the test data for the final evaluation.

Parameters
We adjusted two major parameters, the regularization algorithm and the hyperparameter C, to find the best model. Using the development data, we tested all the combinations of the two parameters (regularization algorithm: L1/L2 and hyperparameter: 0.001/0.01/0.1/1/10). The cutoff threshold of features was not adjusted because our datasets are not very large. We compared the performances and chose L1 and set C to 1.
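The parameter search described above amounts to a small grid of ten training runs. The following sketch only builds the corresponding `crf_learn` command lines, using the `-a` (regularization algorithm) and `-c` (hyperparameter C) options; the file names are hypothetical, and the commands are printed rather than executed.

```python
from itertools import product
from typing import List

ALGORITHMS = ["CRF-L1", "CRF-L2"]
C_VALUES = [0.001, 0.01, 0.1, 1, 10]


def crf_learn_commands(template: str = "template",
                       train: str = "train.data",
                       model_prefix: str = "model") -> List[List[str]]:
    """Build one crf_learn invocation per (algorithm, C) combination."""
    commands = []
    for algorithm, c_value in product(ALGORITHMS, C_VALUES):
        model = f"{model_prefix}.{algorithm}.c{c_value}"
        commands.append(
            ["crf_learn", "-a", algorithm, "-c", str(c_value), template, train, model]
        )
    return commands


for command in crf_learn_commands():
    print(" ".join(command))
```

Each resulting model would then be scored on the development data, and the best-scoring combination (L1 with C set to 1 in the experiments above) kept for the test evaluation.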

Results and Discussion
In our first experiment, we compared the F-scores of each tag using the development data while increasing the training data step by step from 50 articles to 209 articles (Figure 5). We integrated OBs and SBs into one tag, integrated behavior (BH), and BH performed much better than OB and SB individually (from .543 to .635). This suggests that the CRF model has learned the behavioral texts but finds it difficult to identify the correct nominatives. We applied the model to the test data, and Table 4 lists the performances on the development and test data. The scores of the development data are the same as those presented in the results using the 209 articles in Figure 5. The test performance also indicated similar learning patterns to the development data. DE, SJ, and BH (DE=.768, SJ=.749, and BH=.698) performed well compared with SB and OB.
Error Analysis. The error analysis on the development data revealed the following two interesting aspects:
• Interpretation of the nominatives (IN), and
• Factuality failure (FF).
Both errors arise because the system does not have the same ability as a human to read a sentence and comprehend its use, context and knowledge. When humans see and understand a driving behavior, they regard it as an extended other's behavior or self-behavior. In the examples of IN1 and IN2, the CRF model classified the cars as human agents but did not correctly classify whether the behaviors should be attributed to the self or to others, although both texts include "I" information.
Another type of error involves the factuality evaluation in each experience. Human annotators read and understand texts as a whole, together with other information in the experiences and situations, whereas the CRF relies on the literal semantics. In FF1, '(I/we) stop the vehicle and move forward after confirming our safety' seems to be a very typical expression that refers to an SB. In this instance, however, the text urges the author and readers to be cautious because the author was involved in a car accident. Thus, the annotators tagged it as "SJ." FF2 is the opposite case, in which an SB was tagged as an SJ. The expression "The sudden stop was miraculous." would seem quite subjective if this sudden stop had not actually occurred. The author actually bluffed his/her way out of the car accident. Both cases require the system to use the preceding information.
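The integration of OB and SB into a single BH tag can be sketched as a simple relabeling step. Whether the labels carry chunk prefixes such as `B-` and `I-`, and whether the merge was applied before training or only at evaluation time, is not stated above; the sketch handles both prefixed and bare labels as an assumption.

```python
from typing import List

BEHAVIOR_TAGS = {"OB", "SB"}


def merge_behavior_label(label: str) -> str:
    """Map OB and SB (with or without a chunk prefix) to the integrated BH tag."""
    prefix, separator, tag = label.rpartition("-")
    if tag in BEHAVIOR_TAGS:
        return f"{prefix}{separator}BH"
    return label


def merge_behavior(labels: List[str]) -> List[str]:
    return [merge_behavior_label(label) for label in labels]


# Hypothetical label sequence mixing self-behavior, other's behavior, and subjectivity.
print(merge_behavior(["B-SB", "I-SB", "O", "B-OB", "B-SJ", "SB"]))
```

Comparing scores before and after this relabeling separates errors in detecting behavioral text from errors in deciding whose behavior it is.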

Conclusions and Future Work
We constructed a Driving Experience Corpus by annotating behaviors and subjectivity. With the tagged textual data, we constructed a model that sought to understand driving scenes, behaviors as actors or observers, and their subjectivity as humans recognize them, which helps the system to communicate with its users in a more colloquial and human-like manner. The annotation scheme and guidelines reflect ideas and knowledge from psychology. The high inter-annotator agreement shows that the guidelines are applied consistently.
The discrepancies encountered in the process of constructing the guidelines reflect essential characteristics of Japanese blogs and SNS texts. Thus, our guidelines are also useful to those who prepare blogs and SNS texts in Japanese.
Moreover, it is important that the DEC targets driving experiences in Japanese and in any other language. The accurate interpretation of nominatives and factuality recognition are highly important in the driving environment, which entails risks of accidents that sometimes lead to injuries and/or death. Understanding drivers' behaviors and subjectivity helps the system to evaluate their personality. Accurate personality evaluation enhances people's feelings of being understood (Oishi et al., 2012). Drivers' feelings of being understood by the system enhance the reliability and trustworthiness of the system, and improve the likelihood that they follow the system's messages or cautions.
For our future work, we are currently working on (1) expanding the corpus size with human annotations, (2) conducting further CRF experiments with more features such as modality, polarity, pronouns, zero-anaphora and emotions, and (3) conducting experiments with more recent machine learning methods such as Bidirectional LSTM-CRF (e.g., Huang, Xu, & Yu, 2015; Ma & Hovy, 2016; Reimers & Gurevych, 2017). The errors suggest that simple sequential tagging is not good enough to represent contexts that are located far away.