Body representation is one of the most fundamental issues for physical agents (humans, primates, and robots) performing various kinds of tasks. This paper proposes a method that constructs a cross-modal body representation from vision, touch, and proprioception. When the robot touches something, the resulting tactile sensation triggers the construction of a visual receptive field for body parts, which are located by visual attention based on a saliency map. Simultaneously, proprioceptive information is associated with this visual receptive field to realize the cross-modal body representation. Computer simulation results comparable to the activities of parietal neurons found in monkeys are given, and future issues are discussed.
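The association step outlined above (proprioceptive information bound to the touch-triggered visual receptive field) could be sketched as a touch-gated Hebbian update. This is purely an illustrative assumption, not the paper's actual implementation; the function name, learning rate, and vector encodings are hypothetical.

```python
import numpy as np

def update_association(W, visual, proprio, touch, lr=0.1):
    """Touch-gated Hebbian update (hypothetical sketch): when a tactile
    event occurs, strengthen the association between the saliency-selected
    visual receptive-field activation and the proprioceptive state."""
    if touch:
        W = W + lr * np.outer(proprio, visual)
    return W

# Toy example: 3-dim visual activation, 2-dim joint-angle encoding.
rng = np.random.default_rng(0)
W = np.zeros((2, 3))                 # cross-modal association weights
visual = rng.random(3)               # visual receptive-field activation
proprio = rng.random(2)              # proprioceptive (joint-angle) state
W = update_association(W, visual, proprio, touch=True)
```

Gating the update on tactile contact ensures associations form only when the attended visual region actually corresponds to a body part in contact, rather than to arbitrary salient background.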