Multimodal learning is generally expected to make more accurate predictions than text-only analysis. Although various methods for fusing multimodal inputs have been proposed for sentiment analysis tasks, we found that fusion methods built on attention-based language models may be hindered from learning non-verbal modalities: because the non-verbal modalities are isolated from linguistic semantics and contexts, they are ill-suited to attending to the text modality during the fusion phase. To address this issue, we propose Word-Aware Modality Stimulation Fusion (WA-MSF), which facilitates the integration of non-verbal modalities with the text modality. The core concept of WA-MSF is the Modality Stimulation Unit layer (MSU-layer), which injects linguistic contexts and semantics into the non-verbal modalities, thereby instilling linguistic essence into them. Moreover, WA-MSF uses an MLP in the fusion phase to exploit the spatial and temporal representations of non-verbal modalities more effectively than transformer-based fusion. In our experiments, WA-MSF set a new state-of-the-art level of performance on sentiment prediction tasks.
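
A minimal sketch of the idea described above, not the authors' implementation: here "stimulation" is modeled as cross-attention in which a non-verbal modality queries the text representation, and fusion is a plain MLP over the concatenated, pooled streams. All module names, dimensions, and pooling choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalityStimulationUnit(nn.Module):
    """Hypothetical MSU-layer: injects linguistic context into a non-verbal
    modality via cross-attention (non-verbal queries attend to text keys/values)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nonverbal: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # nonverbal: (batch, T_nv, dim), text: (batch, T_txt, dim)
        stimulated, _ = self.attn(query=nonverbal, key=text, value=text)
        return self.norm(nonverbal + stimulated)  # residual keeps the original non-verbal signal

class MLPFusion(nn.Module):
    """MLP fusion over pooled text, acoustic, and visual features (in place of transformer fusion)."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, text, audio, visual) -> torch.Tensor:
        pooled = torch.cat([text.mean(1), audio.mean(1), visual.mean(1)], dim=-1)
        return self.mlp(pooled)  # scalar sentiment score per example

# Toy usage with random features standing in for real modality encoders
dim = 64
msu_a, msu_v = ModalityStimulationUnit(dim), ModalityStimulationUnit(dim)
fusion = MLPFusion(dim)
text = torch.randn(2, 20, dim)    # token-level text features
audio = torch.randn(2, 50, dim)   # acoustic frame features
visual = torch.randn(2, 30, dim)  # visual frame features
score = fusion(text, msu_a(audio, text), msu_v(visual, text))
print(score.shape)  # torch.Size([2, 1])
```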