2017 Volume 2017 Issue AM-17 Pages 04-
Our goal is to develop a dialogue system that converses after visually understanding its surroundings. To this end, we developed Deep Watcher, a Japanese caption generation system, together with image datasets with captions. We generated captions with the Show and Tell model, which uses a CNN and an LSTM. We also manually evaluated the coincidence rate for the caption content and for five feature items. The coincidence rate for the content of the generated captions was 41.6%; among the feature items, gender had the highest rate, at 86.9%. The coincidence rate for the caption content was not high because of overfitting, but the result for the gender item shows that application to a dialogue system is possible.
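
The coincidence rates reported above are percentages of manual judgments that were marked as matching. As a minimal sketch of that arithmetic, assuming each judgment is recorded as a pair of an item name and a match flag, the per-item rate could be computed as below. The function name `coincidence_rates`, the record format, and the toy judgments are all hypothetical and are not taken from the paper; only the item name "gender" appears in the abstract.

```python
from collections import defaultdict


def coincidence_rates(judgments):
    """Return {item: percentage of judgments marked as matching}.

    `judgments` is an iterable of (item_name, is_match) pairs.
    This record format is an assumption made for illustration.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for item, is_match in judgments:
        totals[item] += 1
        if is_match:
            matches[item] += 1
    return {item: 100.0 * matches[item] / totals[item] for item in totals}


# Hypothetical toy judgments, NOT data from the paper.
sample = [
    ("content", True), ("content", False),
    ("gender", True), ("gender", True), ("gender", True), ("gender", False),
]

rates = coincidence_rates(sample)
for item, rate in rates.items():
    print(f"{item}: {rate:.1f}%")
```

Keeping one record per judgment, rather than pre-aggregated counts, makes it straightforward to add further feature items without changing the function.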