JSAI Technical Report, Type 2 SIG
Online ISSN : 2436-5556
Examining the interpretability of minor modals in multimodal foundation model
Hiromitsu OTA

2024 Volume 2023 Issue SWO-062 Pages 10-

Abstract

Recent multimodal platforms are built around modalities such as voice, text, images, and music. Text-to-Image generation is widely used, for example in anime character generation; its quality is now comparable to that of human creators, and AI is becoming an alternative to them. Image-to-Video is also emerging. These pipelines are anchored in text and are gaining social acceptance. In contrast, there have been few attempts at the Image-to-Music and Music-to-Image modalities. Technically, such systems can be viewed as tokenizing multiple kinds of data, such as audio, text, images, and music, with separate tokenizers, and then performing multimodal understanding and generation autoregressively, in the manner of a large language model (LLM). One reason these models become black boxes is that their internal representations are disconnected from human senses, so it is important to interpret them from the perspective of knowledge graphs and ontologies. In this paper, we examine the interpretability of Image-to-Video and Image-to-Music and discuss future prospects.
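The tokenize-then-autoregress scheme described above can be sketched as follows. This is a minimal illustration, not the method of the paper: all tokenizer names, vocabulary sizes, and offsets are hypothetical stand-ins. The key idea shown is that each modality's codes are shifted into disjoint ranges of one shared vocabulary, so a single autoregressive model can treat the concatenated streams as one token sequence.

```python
# Hypothetical vocabulary sizes for three toy modality tokenizers.
TEXT_VOCAB, IMAGE_VOCAB, MUSIC_VOCAB = 1000, 512, 256
# Offsets place each modality's codes in a disjoint range of the shared vocab.
TEXT_OFF, IMAGE_OFF, MUSIC_OFF = 0, TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB

def tokenize_text(s: str) -> list[int]:
    # Toy stand-in for a subword text tokenizer.
    return [TEXT_OFF + (ord(c) % TEXT_VOCAB) for c in s]

def tokenize_image(pixels: list[int]) -> list[int]:
    # Toy stand-in for a VQ image tokenizer (pixel values 0-255 -> 512 codes).
    return [IMAGE_OFF + (p * IMAGE_VOCAB) // 256 for p in pixels]

def tokenize_music(midi_notes: list[int]) -> list[int]:
    # Toy stand-in for a music tokenizer (MIDI note numbers -> codes).
    return [MUSIC_OFF + (n % MUSIC_VOCAB) for n in midi_notes]

def build_sequence(text: str, pixels: list[int], notes: list[int]) -> list[int]:
    # Concatenate the modality streams into one sequence; an LLM trained on
    # such sequences performs cross-modal generation by continuing it.
    return tokenize_text(text) + tokenize_image(pixels) + tokenize_music(notes)

seq = build_sequence("hi", [0, 128, 255], [60, 64, 67])
# Because the ranges are disjoint, the modality of every token is recoverable,
# which is one entry point for the kind of interpretability discussed here.
```

Real systems use learned codebooks (e.g. vector-quantized autoencoders) rather than these arithmetic mappings, but the disjoint-range layout of the shared vocabulary is the same.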

© 2024 Authors