Abstract
This paper introduces research on building multimodal AI, that is, AI that can handle images, voice, text, and other environmental information in a uniform way. Building such a system requires a database and an interface that can handle multimodal data uniformly, where multimodal data comprises images, voice, text, and other environmental information. Multimodal AI, as envisioned here, accumulates experience as humans do and can therefore make the same judgments and take the same actions as humans: it decides autonomously, on the basis of that experience, in accordance with environmental conditions and its objectives. As a first step toward human-like multimodal AI, it is necessary to create a database that can handle multimodal data.