We investigated multisensory emotion perception from a humanoid robot. In the experiment, participants were presented with video clips containing the robot's emotionally colored eyes and voice (Task 1) or body gesture and voice (Task 2), which were either congruent or incongruent in emotional content (e.g., a happy body gesture paired with a sad voice on an incongruent trial). Participants were instructed to judge the robot's emotion as either happiness or sadness. We examined the proportion of responses based on visual versus auditory cues to the robot's expression. Results showed that participants relied more on auditory cues than on visual cues in Task 1. However, this vocal superiority was not observed in Task 2. These results suggest that multisensory emotion perception from a robot differs depending on whether the cues are natural or artificial. We propose a model of multisensory emotion perception from a robot.