We propose a new cognitive visual model of movie recognition based on our previous findings on psychophysical phenomena. The model posits two key functions of the movie-recognition process. First, a continuous movie sequence is divided and perceived as a series of short event segments; the movie is coded segment by segment and structured as a contextual association among the segments. Second, the knowledge structure built from the context of previously viewed movies is used to predict the ongoing movie context and to guide online segmentation. We compared our cognitive model with a previously proposed theoretical model of movie processing. The experimental results supported our hypothesis that an adaptive learning mechanism for online movie segmentation would be effective as an intelligent, knowledge-based component of future movie-analysis systems.
We present a new method for halftoning images that scans the pixels in random order and diffuses the quantization errors with a coefficient-reversed bilateral filter. The random traversal avoids the artificial dot patterns that typically appear in conventional halftones produced by raster-scanning the pixels, and the coefficient-reversed bilateral filter effectively enhances edges. We apply this halftoning method to the non-photorealistic rendering of stippled images.
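The core idea above can be sketched in a few lines. The following is a minimal, illustrative implementation of random-order error diffusion only: each pixel, visited in random order, is thresholded and its quantization error is spread to its not-yet-processed neighbours. Uniform neighbour weights stand in for the paper's coefficient-reversed bilateral filter, whose exact form is not given here.

```python
import numpy as np

def random_order_error_diffusion(img, seed=0):
    """Halftone a grayscale image (values in [0, 1]) by visiting pixels
    in random order and diffusing each quantization error equally to the
    4-neighbours that have not been processed yet.
    NOTE: uniform weights are an illustrative stand-in for the paper's
    coefficient-reversed bilateral filter."""
    h, w = img.shape
    work = img.astype(float).copy()
    out = np.zeros_like(work)
    done = np.zeros((h, w), dtype=bool)
    rng = np.random.default_rng(seed)
    for idx in rng.permutation(h * w):      # random pixel traversal
        y, x = divmod(idx, w)
        out[y, x] = 1.0 if work[y, x] >= 0.5 else 0.0
        err = work[y, x] - out[y, x]
        done[y, x] = True
        # only neighbours that still await processing receive the error
        nbrs = [(y + dy, x + dx)
                for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0))
                if 0 <= y + dy < h and 0 <= x + dx < w
                and not done[y + dy, x + dx]]
        for ny, nx in nbrs:
            work[ny, nx] += err / len(nbrs)
    return out
```

Because the traversal order is randomized, the output dots carry no raster-scan structure, which is the artifact-avoidance property the abstract describes.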
A novel method is proposed for exact prediction of the optical reconstruction of full-parallax computer-generated holograms (CGHs). The wave field emitted from the CGH is numerically propagated to a pupil, and image formation by a lens is then simulated by a wave-optics procedure. Two numerical techniques for free-space propagation, shifted-Fresnel diffraction and rotational transformation, are used in the calculation. The method thus enables the viewpoint to be moved and the gaze point to be changed. This technique allows us to confirm the effects of non-diffracted light and the conjugate image of full-parallax, large-scale CGHs prior to their fabrication.
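For reference, the free-space propagation step can be written in the standard angular-spectrum form of wave optics (an illustrative formulation consistent with shifted-Fresnel techniques, not the paper's exact discretization), where U is the complex field, z the propagation distance, and k_x, k_y the spatial frequencies:

```latex
U(x, y; z) = \mathcal{F}^{-1}\!\left\{ \mathcal{F}\{U(x, y; 0)\}
  \exp\!\left( i z \sqrt{k^2 - k_x^2 - k_y^2} \right) \right\},
\qquad k = \frac{2\pi}{\lambda}
```

The simulated lens then applies a quadratic phase factor to this propagated field before a final propagation to the image plane.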
We developed a new signal-processing technique for a 3D real-time imager based on the time-of-flight method. A code-signal modulation technique solves two problems with distance measurement. The first is measurement ambiguity: with the conventional CW modulation technique, the signal phase repeats cyclically with propagation distance. The second is the mixture of signals reflected from a glass panel and from the object behind the glass. The code-signal modulation technique with a noise filter excludes the ambiguous signals reflected from background objects at distances beyond the modulation period, and a signal-correction process extracts the intended signal reflected from the object behind the glass. This novel signal processing further extends the utility of the 3D real-time imager.
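The CW ambiguity that the code modulation removes is easy to see numerically: a phase-based range estimate wraps every half modulation wavelength, d_max = c / (2·f_mod). A minimal sketch (the modulation frequency below is a hypothetical example, not a figure from the paper):

```python
# Speed of light [m/s]
C = 299_792_458.0

def cw_distance(true_distance, f_mod):
    """Distance as recovered from the phase of a CW-modulated signal.
    The phase repeats every half modulation wavelength, so the estimate
    aliases with period d_max = c / (2 * f_mod)."""
    d_max = C / (2.0 * f_mod)
    return true_distance % d_max
```

At 20 MHz modulation, d_max is about 7.49 m, so a target at 10 m is misread as being at about 2.51 m. Code-signal modulation avoids this by correlating against a code whose autocorrelation does not repeat within the working range.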
We propose a newly developed roll-type optical memory (RoCAM). This multilayered memory has a cylindrical shape and consists of a recording layer and a transparent layer wound onto a shaft. RoCAM has five advantages: the medium is easily fabricated, the groove structures are easily implemented, recording and reading can be performed in parallel, the rotation is stable, and the linear velocity is constant. We report these advantageous features and the one-dimensional parallel signal readout of RoCAM.
It is widely believed that excessive binocular disparity produces diplopia without clear depth perception. However, recent studies have reported that diplopic images with very large disparity appear clearly in depth when they move. It is currently unclear whether this facilitation of stereoscopic depth by target motion in diplopic images requires the involvement of both eyes. A monocular target stimulating the nasal or temporal retina of either eye appears in depth as if it had uncrossed or crossed disparity, respectively (monoptic depth). In the present study we examined the dynamic properties of monoptic depth and binocular stereopsis. Two small circular targets were presented 5 degrees above and below a fixation point and oscillated horizontally in counterphase. For binocular stereopsis, disparities of equal magnitude and opposite polarity were applied to the two targets. For monoptic depth, the target for one eye was removed. The results revealed that target motion facilitated binocular stereopsis but not monoptic depth. These findings suggest that corresponding target images stimulating both eyes are necessary for depth of large magnitude to be perceived during motion, in spite of diplopia.
A recent finding showed that infants with autism tend not to look at their mother's eyes. The condition can therefore be screened by showing the infant a video of the mother's face on a display and examining how far the infant's gaze distribution deviates from the mother's eyes. In the present study, to make large-scale examination practical, we developed a system that detects the pupil positions in the mother's face video, captured by a color camera, in real time using our robust pupil-detection technique. The system consists of one color camera and two monochrome cameras with near-infrared light sources; all cameras are calibrated. The monochrome cameras determine the 3D positions of the mother's pupil centers, which are then transformed into coordinates in the color camera image. These processes run at 60 fps, and the experimental results show precise pupil-center detection in the color face videos.
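The transform from triangulated 3D pupil centers into the color image is a standard calibrated-camera projection. A minimal sketch, assuming a pinhole model with known extrinsics (R, t) and intrinsics K for the color camera (the paper's actual calibration pipeline is not detailed here):

```python
import numpy as np

def project_to_color_image(X, R, t, K):
    """Project a 3D pupil center X (coordinates of the monochrome
    stereo rig) into the color camera image.
    R, t : rotation and translation from rig to color-camera frame.
    K    : 3x3 color-camera intrinsic matrix.
    Illustrative pinhole projection, not the paper's exact pipeline."""
    Xc = R @ X + t          # rig coordinates -> color-camera coordinates
    uvw = K @ Xc            # perspective projection (homogeneous)
    return uvw[:2] / uvw[2]  # pixel coordinates (u, v)
```

The resulting (u, v) gives the search region in the color frame, so the robust pupil detector only needs to refine a small window rather than scan the whole image.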
We developed a communication aid, called VUTE, for people who have difficulty with spoken communication, such as elderly people with hearing impairments, deaf people, and foreign visitors. We present an overview of VUTE 2010, which can be used at railway stations. The system displays picture symbols on a portable data terminal and prompts the user to select the appropriate symbol; finally, it outputs sentences corresponding to the user's answers. We also carried out an evaluation test with seven subjects in their fifties and sixties and confirmed that VUTE can output sentences corresponding to situations arranged in advance, without using written characters or voice.
We investigated the effect of network delay on viewpoint changes in free-viewpoint video transmission by conducting a Quality of Experience (QoE) assessment. We address two transmission methods: synthesized-image transmission and depth-and-image transmission. We assessed image quality, interactivity of viewpoint change, and comprehensive quality as QoE factors. The results indicate that synthesized-image transmission yields higher image quality, whereas depth-and-image transmission is advantageous in terms of interactivity. Because which method is superior in comprehensive quality depends on the characteristics of the video content and the camera work used in the rendering process, the method should be chosen to suit the situation.
We have been studying a real-time speech-to-caption system that uses speech recognition with a repeat-speaking method: a repeat speaker listens to the lecturer's voice and speaks the lecturer's utterances back into a speech-recognition computer. Our system has achieved caption accuracy of about 97% for Japanese-to-Japanese conversion, and a voice-to-caption conversion time of about 4 seconds for English-to-English conversion at international conferences, although achieving this performance is costly. In human communication, speech understanding depends not only on verbal information but also on non-verbal information such as the speaker's gestures and face and mouth movements. We therefore sought a suitable way to display captions and images of the speaker's facial movements, after briefly buffering the information in a computer, to achieve higher comprehension. In this paper, we investigated the relationship between the display sequence and display timing of captions containing speech-recognition errors and the speaker's facial-movement images. The results showed that displaying the caption before the speaker's face image improved comprehension of the captions. Displaying both simultaneously yielded an improvement of only a few percent over the baseline question sentences, and displaying the speaker's face image before the caption produced almost no change. The sequence displaying the caption 1 second before the speaker's face showed the largest improvement of all conditions for hearing-impaired participants.
The conflict between accommodation and vergence stimuli has been identified as a possible cause of visual fatigue when viewing stereoscopic images. We examined static and dynamic characteristics of accommodation and vergence responses while viewing stereoscopic displays and real objects to clarify the effect of stereoscopic images on visual function. We used an instrument based on a Shack-Hartmann wavefront sensor to measure accommodation and vergence responses simultaneously. As a static characteristic, accommodation responses to a static stereoscopic target with large binocular disparity deviated from those to the real target. As dynamic characteristics, step responses of accommodation showed considerable individual differences, and asymmetries were observed between the near-to-far and far-to-near step directions. These results suggest that both static and dynamic characteristics of accommodation and vergence responses must be examined to clarify the biological effect of stereoscopic images.
Conventional gaze-tracking systems are burdensome in that the user-calibration process requires the user to gaze at several targets on the PC screen. The proposed calibration procedure requires the user to gaze at only one target. The implemented system consists of four camera-calibrated, wide-view video cameras arranged around the screen, with near-infrared light-emitting diodes (LEDs) attached to each camera. The angle θ between the line of sight and the line connecting the pupil center and the camera (LEDs) is related to the vector from the pupil center to the corneal reflection detected in the video image. The one-point user calibration determines three parameters, which requires only three of the four cameras. Since gaze-detection precision generally worsens as θ grows, a weighted-mean method is proposed to determine the final gaze point precisely.
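The weighted-mean step can be sketched as follows: each camera yields an on-screen gaze estimate, and estimates with large θ are down-weighted. The inverse-square weighting below is an illustrative choice; the abstract specifies a weighted mean but not the exact weighting function.

```python
import numpy as np

def weighted_gaze_point(points, thetas, eps=1e-6):
    """Fuse per-camera gaze estimates into one final gaze point.
    points : (n, 2) array of on-screen gaze estimates, one per camera.
    thetas : angle (rad) between line of sight and pupil-camera line,
             per estimate; larger theta means a less reliable estimate.
    The weights 1/(theta^2 + eps) are an assumed, illustrative form."""
    points = np.asarray(points, dtype=float)
    w = 1.0 / (np.square(np.asarray(thetas, dtype=float)) + eps)
    return (w[:, None] * points).sum(axis=0) / w.sum()
```

With equal angles the result is the plain average; as one camera's θ grows, its estimate contributes progressively less to the final point.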