Host: The Japanese Society for Artificial Intelligence
Name: The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 38
Location: [in Japanese]
Date: May 28, 2024 - May 31, 2024
Achieving robots that can understand human language and autonomously determine actions based on it is a significant research challenge in robotics and machine learning. If robots can accurately grasp the intent behind abstract human instructions and execute appropriate control, assistance to humans and task-execution efficiency are expected to improve greatly. In this paper, we propose an imitation learning method for robot control that autonomously determines actions from human language instructions and goal images, named Vision-Language-conditioned Diffusion Policy (VLDP). Conventional language-based robot control methods have not adequately modeled the ambiguity and polysemy inherent in human language. VLDP addresses this issue by extracting semantics from the language instruction and goal image with a vision-language model and conditioning a Diffusion Policy on them. This enables the robot to generate multiple valid actions in response to linguistically ambiguous instructions. Our experiments evaluate the success rate of action generation from language instructions, generalization to unseen language instructions, and the multimodality of the actions generated by the proposed method.
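The conditioning scheme described above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the linear "encoders" stand in for a pretrained vision-language model, the linear denoiser stands in for the learned noise-prediction network, and all dimensions and step counts are made up; the paper's actual architecture and training procedure are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 32   # assumed joint vision-language embedding size
ACT_DIM = 7    # e.g. a 7-DoF arm action (illustrative)
HORIZON = 8    # length of the predicted action sequence
N_STEPS = 10   # number of reverse diffusion steps

# Stand-ins for a pretrained vision-language model's projections:
W_text = rng.normal(size=(EMB_DIM, 64))
W_img = rng.normal(size=(EMB_DIM, 128))

def encode_condition(instruction_feat, goal_image_feat):
    """Fuse language and goal-image features into one conditioning vector."""
    return np.tanh(W_text @ instruction_feat + W_img @ goal_image_feat)

# Stand-in for the learned denoiser eps_theta(a, t, cond):
W_a = rng.normal(size=(ACT_DIM, ACT_DIM)) * 0.1
W_c = rng.normal(size=(ACT_DIM, EMB_DIM)) * 0.1

def denoise_step(actions, cond):
    """One reverse step: predict noise given the condition and remove it."""
    eps_hat = actions @ W_a.T + cond @ W_c.T  # cond broadcasts over horizon
    return actions - (1.0 / N_STEPS) * eps_hat

def sample_actions(instruction_feat, goal_image_feat):
    """Sample an action sequence from pure noise, guided by the condition."""
    cond = encode_condition(instruction_feat, goal_image_feat)
    actions = rng.normal(size=(HORIZON, ACT_DIM))  # start from Gaussian noise
    for _ in range(N_STEPS):
        actions = denoise_step(actions, cond)
    return actions

acts = sample_actions(rng.normal(size=64), rng.normal(size=128))
print(acts.shape)  # → (8, 7)
```

Because sampling starts from random noise, repeated calls with the same instruction produce different action sequences, which is the mechanism behind the multimodal behavior the abstract attributes to diffusion-based policies.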