2025 Volume 6 Issue 1 Pages 26-40
The field of Vision and Language includes multimodal understanding, which produces recognition results from combined visual and textual inputs; Image2Text, which generates text from visual input; and Text2Image, which generates visual content from text. Research in this field is accelerating. One example from the authors’ research is the development of an AI robot that collaborates with humans to create and transcend knowledge. This requires building a scientific foundation model that learns from scientific literature, conducts experiments autonomously, and improves through discussions with researchers. Other examples include automatically converting experimental procedures into manuals, AI-driven discovery of scientific laws and principles from data, and the discovery of new materials. For new materials, two approaches are being explored, one of which involves building generative AI that uses highly accurate decoders to generate crystal structures.