Previous studies suggest that nonhuman animals form concepts that integrate information from multiple sensory modalities, such as vision and audition. For instance, Adachi, Kuwahata, and Fujita (2007) demonstrated that dogs form an auditory-visual cross-modal representation of their owner. However, whether such multimodal concepts extend to more abstract, or collective, categories remains unknown. To address this question, we tested whether dogs are sensitive to the congruence between the genders suggested by the voice and the face of an unfamiliar person. We presented dogs with a photograph of a male or female human face on a monitor after playing the voice of a person either matching or mismatching in gender. Dogs looked at the photograph longer when the voice and face were incongruent in gender than when they were congruent, indicating expectancy violation. This result suggests that dogs spontaneously associate auditory and visual information to form a cross-modal concept of human gender. This is the first report showing that cross-modal representation in nonhuman animals extends to an abstract social category.