One well-known aspect of multisensory communication is the integration of auditory and visual information in face-to-face speech perception, as demonstrated by the McGurk effect, in which heard speech is altered by mismatching visual mouth movements. Susceptibility to the McGurk effect varies with several factors, including the intelligibility of the auditory speech. Here I focus on perceivers' language background as a factor influencing the degree to which visual speech is used. When auditory speech is highly intelligible, native Japanese speakers tend to rely on it, showing less visual influence than native English speakers. Such interlanguage differences are not apparent at 6 years of age but develop by 8 years as visual influence increases in native English speakers. Native English speakers appear to acquire robust lipreading ability developmentally, such that adult English speakers can lipread monosyllables faster than they can hear them, whereas no such visual precedence is observed in native Japanese speakers. This interlanguage difference is being corroborated by event-related potential and functional magnetic resonance imaging studies.
Previous studies indicated that looking to the right side facilitates intermodal audiovisual matching of speech in infants (MacKain et al., 1983; Patterson & Werker, 1999). This study investigated this side bias for sounds containing rapidly changing temporal elements, i.e., a bilabial trill (BT). In Experiment 1, 8-month-old infants were presented with pairs of faces articulating a BT and a whistle (WL), respectively. Infants who listened to the BT showed successful audiovisual matching, orienting longer toward the sound-specified BT face, but only when that face was presented on the infants' right side. This right-side bias was confirmed by the number of gaze shifts toward the BT face when it appeared in the right visual field. No such spatial asymmetry was exhibited by infants who listened to the WL or by infants presented with no sound. Experiment 2 tested 5-month-old infants on audiovisual matching of the BT and revealed that this matching is acquired later than 5 months of age. The strong right-side bias found in Experiment 1 is discussed in terms of left-hemisphere dominance in processing the rapidly changing sound components of the BT.
Information from faces and voices plays an important role in social communication. As studies of speech perception have shown, facial and vocal signals are integrated even in the perception of emotion. This paper reviews studies on the multisensory perception of emotion from faces and voices, and then introduces recent studies on cultural differences in such multisensory perception. It is emphasized that the combination of faces and voices enriches the expression of emotion.
Human beings perceive others' emotions from facial expressions and speech prosody. Although these signals arrive through different modalities, they inevitably interact in emotion perception. We investigated the cross-modal modulation of emotion using facial expressions and nonsense emotional vocalizations, and examined the effects of varying the strength and reliability of the emotions contained in the stimuli. We found that the emotions perceived in faces and voices were modulated in the direction of the simultaneously presented but to-be-ignored voices and faces, respectively. This cross-modal modulation was more pronounced when the dominant emotion in the judged stimulus was consistent with the rated emotion, especially for voice ratings. The strength of the emotions in the judged stimulus had no effect on the cross-modal modulation. When the reliability of the judged stimulus was degraded by applying a spatial-frequency low-pass filter to the faces, the cross-modal modulation was no longer apparent even when the dominant emotion in the judged stimulus was consistent with the rated emotion. These results suggest that the cross-modal modulation of emotion is not fully accounted for by a weak fusion model of linear summation, and that nonlinear components reflecting the particular emotions involved and the congruency of multimodal emotional signals must be considered.
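The abstract above does not specify how the spatial-frequency low-pass filtering of the face stimuli was implemented; a minimal sketch of one standard approach, an ideal circular low-pass mask applied in the Fourier domain, is shown below (the function name, cutoff value, and grating demo are illustrative assumptions, not the authors' procedure):

```python
import numpy as np

def lowpass_spatial_frequency(image, cutoff_cycles):
    """Remove spatial frequencies above `cutoff_cycles` (cycles per image)."""
    h, w = image.shape
    fy = np.fft.fftfreq(h) * h          # vertical frequency, cycles per image
    fx = np.fft.fftfreq(w) * w          # horizontal frequency, cycles per image
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    mask = radius <= cutoff_cycles      # ideal circular low-pass mask
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * mask))

# Demo: a fine grating (20 cycles/image) is removed by an 8-cycle cutoff,
# while a coarse grating (2 cycles/image) passes through unchanged.
x = np.arange(64)
coarse = np.sin(2 * np.pi * 2 * x / 64)[None, :].repeat(64, axis=0)
fine = np.sin(2 * np.pi * 20 * x / 64)[None, :].repeat(64, axis=0)
print(lowpass_spatial_frequency(coarse, 8).std())  # ~0.707 (preserved)
print(lowpass_spatial_frequency(fine, 8).std())    # ~0 (removed)
```

Such filtering blurs fine facial detail (edges, wrinkles) while preserving the coarse configuration of the face, which is how it reduces the reliability of the facial emotion signal.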
The ability to recognize the emotional states of others is a fundamental social skill. In this study, we investigated the extent to which complex emotions can be inferred from facial or vocal cues in speech. Several sentences were prepared that were intended to thank, blame, apologize to, or congratulate others. Japanese university students uttered these sentences with congruent or incongruent emotional states while being recorded with a video camera. The speakers' friends and strangers were shown these videos in a single modality (face or voice only) and asked to rate the speakers' perceived emotional states. The results showed that the raters discriminated congruent from incongruent message conditions, and that this discrimination depended largely on vocal rather than facial cues. The results also showed that familiarity with the target person modulated how emotional states were inferred. These results suggest that we can detect subtle emotional nuances of others in spoken interaction, and that we use facial and vocal information in somewhat different ways.
In interpersonal communication, vocal affect often reveals the speaker's relational attitudes. Because knowing partners' relational attitudes is crucial for subsequent social interaction, people may automatically allocate attention to vocal affect, especially when they are relationally engaged. The present work examined cross-culturally whether automatic attention to vocal affect is enhanced by mere exposure to schematic faces, which are ubiquitous in social interaction and serve as cues indicating social engagement. Japanese and American participants judged the verbal meaning of emotionally spoken emotional words while ignoring the vocal tone. Consistent with previous studies, interference from to-be-ignored vocal affect was significantly greater for Japanese than for American participants. Moreover, as predicted, interference was also greater when participants were exposed to schematic human faces while listening to the stimulus utterance, regardless of culture, suggesting that across cultures attention to vocal tone is heightened even by mere exposure to schematic faces. Implications for future work are discussed.
An existing study in cognitive psychology found that people recall less of utterance content when the content and the manner of utterance are inconsistent in valence (positivity/negativity). The aim of this research is to explore whether a similar phenomenon occurs in human-robot interaction. A psychological experiment (N = 45) was conducted with a 2 × 2 between-subjects design crossing polite/impolite phrasing and polite/impolite posture of a small humanoid robot. The results showed that participants' recall scores for content uttered by the robot, as well as their impression ratings of the robot, were lower in the conditions where the robot's phrasing and posture were inconsistent in politeness. The paper discusses implications of this phenomenon for the design of interaction between robots and humans.
Sound symbolism is the phenomenon whereby speech sounds themselves evoke images related to sensory experiences. The articulatory mediation hypothesis of sound symbolism emphasizes the cross-modal relationship between the movements of the articulatory organs during pronunciation and the resulting proprioceptive sensations. Despite the importance of this relationship for explaining how sound symbolism arises, experimental evidence for the connection between pronunciation and sensory experience remains scarce. The purpose of this paper is to demonstrate stimulus-response compatibility between the pronunciation of voiced/voiceless consonants and the brightness of visual stimuli, using a stimulus-response compatibility task. In the experiment, reaction times in the congruent condition (e.g., brightness with pronunciation of voiceless consonants, darkness with pronunciation of voiced consonants) were significantly shorter than in the incongruent condition (e.g., darkness with voiceless consonants, brightness with voiced consonants). This result demonstrates stimulus-response compatibility between the pronunciation of voiced/voiceless consonants and the brightness of visual stimuli, and provides an important foothold for considering the relationship between pronunciation and sensory experience as a factor in sound symbolism.
We can perceive texture through vision as well as through touch. In this study, we investigated how tactile texture is perceived from image shading, using an experiment in which participants freely described their impressions of texture images with onomatopoeia. We prepared visual texture images rendered by computer graphics from 3D surface-asperity data under controlled lighting conditions. We used onomatopoeic description to evaluate the shaded textures, on the assumption that the statistical distribution of the consonants of onomatopoeia represents tactile information about the surface. To compare the characteristics of the images with the onomatopoeic texture evaluations, we also carried out texture image analysis with the gray-level co-occurrence matrix and examined which geometric components correspond to the perceived texture of the image. The results revealed that participants associated impressions of homogeneity with the coarseness and smoothness of the texture image, and impressions of friction with edge strength and roughness. Our results suggest that representing texture impressions by the statistical distribution of consonants in onomatopoeia can express qualitative evaluations of shaded textures.
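The gray-level co-occurrence matrix (GLCM) analysis mentioned above can be sketched as follows: a minimal numpy implementation of the GLCM and two standard texture statistics derived from it, contrast and homogeneity. The function names, the 8-level quantization, and the checkerboard demo are illustrative assumptions, not the authors' code.

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Normalized gray-level co-occurrence matrix for pixel offset (dx, dy).
    `img` must contain integer gray levels in [0, levels)."""
    h, w = img.shape
    # Reference pixels and their neighbours at the given offset.
    ref = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    nbr = img[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    m = np.zeros((levels, levels))
    np.add.at(m, (ref.ravel(), nbr.ravel()), 1)  # count co-occurring pairs
    return m / m.sum()

def contrast(p):
    i, j = np.indices(p.shape)
    return np.sum(p * (i - j) ** 2)        # high for rough, edgy textures

def homogeneity(p):
    i, j = np.indices(p.shape)
    return np.sum(p / (1 + np.abs(i - j)))  # high for smooth textures

smooth = np.zeros((16, 16), dtype=int)          # uniform surface
rough = np.indices((16, 16)).sum(0) % 2 * 7     # fine checkerboard
print(contrast(glcm(smooth)), contrast(glcm(rough)))  # → 0.0 49.0
```

Statistics like these capture the edge strength and coarseness of the shading, which is what allows them to be compared against the consonant distributions of the onomatopoeic descriptions.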
This study explores the hierarchical order in the directionality of synaesthetic metaphors by means of a psychological experiment. Previous studies present varied models of the hierarchy among sense modalities in synaesthetic metaphor. Among the unsettled questions, we address the separability and positions of "temperature", "shape", and "taste" in Japanese synaesthetic metaphors, as well as their directions of extension. We carried out a questionnaire-based comprehensibility test with 245 Japanese noun phrases, each consisting of an adjective and a noun drawn from seven modalities (temperature, touch, smell, taste, shape, color, sound). Our results show three points concerning synaesthetic phrases in Japanese: (i) the modality of temperature is distinguished from that of touch, and temperature is located at the left end of the scale in the directionality model; (ii) the modality of shape is separable from color, and its position in the directionality model is closer to touch than to color; (iii) the modality of taste does not become the target of extension from higher modalities such as sound or color, and color does not become the source of extension to lower modalities such as touch, taste, and smell. Comparing these results with previously hypothesized directionality models reveals cross-linguistic variations as well as similarities.
In this paper, we investigate the linguistic factors affecting the production of gestures in discourse, using a Japanese three-party conversation corpus. We develop statistical models that best predict the occurrence and the viewpoint of gestures in the corpus, based on syntactic factors such as zero pronouns, person, onomatopoeia, and verb type, as well as discourse factors such as backward-looking centers, preferred centers, and center-transition patterns. We show that not only syntactic but also discourse factors influence the occurrence of gestures, suggesting that gesture production is affected by mental imagery at the level of discourse as well as the utterance. We also show that the viewpoint of gestures, by contrast, is affected only by syntactic factors, suggesting that different kinds of linguistic factors are involved in different aspects of gesture production.
Regular classes include some children with reading difficulties, who appear to have problems with the perceptual or cognitive skills involved in reading. In this study, we developed the multimedia learning-support system "Touch & Read" to assist their reading. The system can magnify the text, highlight the current line, and read the text aloud to present the information auditorily. We introduced the system into a regular class and investigated how to support the learning of children with reading difficulties. Before the introduction, we tested the children's decoding skills and visuoperceptual functions to identify the causes of their reading difficulties. We then provided Touch & Read to the children for use in the regular class and observed how the children with reading difficulties used it. The results suggest that the children could use the system to compensate for their perceptual or cognitive weaknesses and achieve more efficient learning.
We investigated how breast cancer patients perceive medical risk and how that perception develops. A wealth of anecdotal evidence, together with interviews with both medical professionals and patients, led us to hypothesize that patients' risk attitudes develop from incipience, wherein they are preoccupied by optimism about complete treatment, to recurrence, wherein they become less optimistic and accept the realistic need to live with the disease. One hundred breast cancer patients were recruited via voluntary patient organizations in Japan. The participants responded to either a five-page web questionnaire or a paper-and-pencil survey, judging the likelihood of risky treatment events happening to them, such as a poor treatment outcome, side effects of chemotherapy, recurrence, and medical accidents. The results showed that breast cancer patients as a whole did not differ in optimism about medical risk between the two treatment stages. However, the result differed by patients' current treatment status: only in the regular-treatment group (defined as seeing a doctor once or twice a month) were patients more optimistic at the incipient stage than at the recurrence stage; there was no such difference in the no-treatment group (defined as receiving a follow-up examination once a year). There was no difference in optimism about medical risk among formally (medically) categorized cancer stages. Confirmatory factor analysis further revealed three distinct factors into which patients' optimism about risk divided, namely “recurrence,” “aggressive treatment,” and “medical accident.” We thus uncovered a possible structure underlying medical risk perception and successfully replicated the results of past research. Implications and possible extensions are discussed.