The research field of human-like agents, often represented by animated characters, has become increasingly active in recent years. Since the motion of such agents influences users' impressions, an agent's ability to make appropriate gestures can be expected to improve the understandability of its utterances. However, the burden on content creators increases if they must decide when and which gestures the agent should make. This paper attempts to estimate appropriate gestures for a given utterance text using conditional random fields (CRFs), which can reduce the effort required of content creators. We create a dataset consisting of utterance texts and the corresponding gesture labels extracted from educational video content, and construct a gesture-labeling model with a CRF in a supervised manner. The model's performance in estimating which gestures appear is evaluated and compared with a simple existing baseline. In particular, we focus on metaphoric gestures, which often represent abstract concepts and are therefore expected to facilitate users' understanding of those concepts. We empirically confirm that the proposed model can distinguish metaphoric gestures from other gestures.
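As a rough illustration of the kind of sequence labeling a linear-chain CRF performs at inference time, the sketch below runs Viterbi decoding over a word sequence. The label set, the hand-set emission and transition scores, and the example sentence are all hypothetical stand-ins for a trained model's learned feature weights, not values from the paper's dataset.

```python
# Toy linear-chain decoding (Viterbi), the inference step of a
# CRF-style gesture labeler.  All scores below are illustrative.

LABELS = ["NONE", "BEAT", "METAPHORIC"]

# Emission score: how well a word's features fit a label.  In a real
# CRF these come from learned weights; here they are hand-set.
def emission(word, label):
    abstract_words = {"concept", "idea", "freedom"}  # hypothetical cue set
    if label == "METAPHORIC":
        return 3.0 if word in abstract_words else -1.0
    if label == "BEAT":
        return 0.5
    return 1.0  # NONE: weak default for ordinary words

# Transition score between consecutive labels: mild preference for
# keeping the same label across adjacent words.
TRANS = {(a, b): (0.5 if a == b else 0.0) for a in LABELS for b in LABELS}

def viterbi(words):
    """Return the highest-scoring label sequence for `words`."""
    # dp[label] = (best score of a path ending in `label`, that path)
    dp = {lab: (emission(words[0], lab), [lab]) for lab in LABELS}
    for w in words[1:]:
        new_dp = {}
        for lab in LABELS:
            # Best predecessor for `lab`, including the transition score.
            _, prev_score, prev_path = max(
                ((p, s + TRANS[(p, lab)], path) for p, (s, path) in dp.items()),
                key=lambda t: t[1],
            )
            new_dp[lab] = (prev_score + emission(w, lab), prev_path + [lab])
        dp = new_dp
    best = max(dp, key=lambda lab: dp[lab][0])
    return dp[best][1]

print(viterbi("this abstract concept matters".split()))
# -> ['NONE', 'NONE', 'METAPHORIC', 'NONE']
```

In this toy setup the abstract word "concept" pulls the decoder toward the METAPHORIC label, while the transition scores discourage spurious label switches elsewhere; a trained CRF balances the same two kinds of scores, but with weights learned from labeled utterance/gesture pairs.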