Abstract
When sequential visual flashes are accompanied by fewer or more sequential auditory pulses, the perceived number of flashes is correspondingly lower or greater than the actual number. These effects are termed 'flash fusion' and 'sound-induced flash illusion', respectively. Although neural correlates and a mathematical model of these illusions have been described previously, the specific underlying mechanisms are unknown. This study examined whether flash fusion arises from top-down and bottom-up temporal capture of vision by audition. One flash was made unique by a luminance increment, and observers reported which of the 'illusory' flashes was the unique one. The unique flash was generally captured by the pulse in its temporal vicinity, suggesting bottom-up temporal capture. However, when one auditory pulse was given a unique pitch, the unique flash was perceptually paired with that pulse, suggesting feature-based temporal capture. Moreover, this pairing of audiovisual features disappeared when the temporal location of the unique pitch was difficult for the observer to anticipate. These data indicate that flash fusion is a consequence of both top-down, feature-based temporal capture and bottom-up temporal capture.