2024, Volume 2023, Issue SWO-062, Pages 10-
Recent multimodal platforms integrate modalities such as voice, text, images, and music. Text-to-Image generation is widely used, for example in anime character generation; its quality is now comparable to that of human creators, and it is becoming a viable alternative to them. Image-to-Video generation is also emerging. These approaches are all grounded in text and are gaining social acceptance. In contrast, there have been few attempts at the Image-to-Music and Music-to-Image modalities. Technically, such systems can be viewed as tokenizing multiple heterogeneous data types, such as audio, text, images, and music, individually, and then performing multimodal understanding and generation autoregressively with a large language model (LLM). One reason these models become black boxes is that their internal representations are disconnected from human senses, so it is important to interpret them from the perspective of knowledge graphs and ontologies. In this paper, we consider the interpretability of Image-to-Video and Image-to-Music generation and provide future prospects.
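The tokenize-then-autoregress scheme described above can be illustrated with a minimal sketch. All names, token-ID ranges, and the input values here are hypothetical, chosen only to show how per-modality codebook indices might be mapped into one shared vocabulary so that, e.g., Image-to-Music becomes ordinary next-token prediction over a single sequence:

```python
# Hypothetical sketch of a unified multimodal token stream.
# Modality names, offsets, and boundary-token IDs are illustrative
# assumptions, not any specific system's actual vocabulary.

# Each modality gets its own slice of the shared vocabulary,
# plus a boundary token marking where that modality's segment starts.
MODALITY_OFFSET = {"text": 0, "image": 1000, "audio": 2000, "music": 3000}
BOUNDARY = {"text": 9000, "image": 9001, "audio": 9002, "music": 9003}

def tokenize(modality, raw_ids):
    """Map per-modality codebook indices into the shared vocabulary."""
    offset = MODALITY_OFFSET[modality]
    return [BOUNDARY[modality]] + [offset + i for i in raw_ids]

def build_sequence(pairs):
    """Concatenate (modality, ids) pairs into one autoregressive stream."""
    sequence = []
    for modality, ids in pairs:
        sequence.extend(tokenize(modality, ids))
    return sequence

# Image-to-Music framed as autoregression: the model would condition on
# the image tokens and continue the stream with music tokens.
seq = build_sequence([("image", [3, 7, 7]), ("music", [12, 5])])
print(seq)  # [9001, 1003, 1007, 1007, 9003, 3012, 3005]
```

Because every modality lives in one flat vocabulary, a single LLM can model the whole stream; the flip side, as the paper argues, is that these interleaved token IDs carry no human-interpretable meaning on their own, which motivates grounding them in knowledge graphs and ontologies.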