A Method for Aggregating Aspects by Abstracting Them into Superordinate Concepts in Questionnaire Sentiment Classification

doi:10.50987/jsod.22.1_2

Abstract

This study investigates the optimal similarity threshold for extracting only those title pairs whose documents have the same content, where similarity is calculated between vectors obtained from each title using NLP methods. A comparison of the similarities calculated by fastText, BERT, and SBERT revealed that the characteristics of document pairs rated as highly similar differ depending on the method. The authors therefore used the harmonic mean of the similarities calculated by the three methods. Of the 157 title pairs with a harmonic mean of 0.96 or higher, 141 (90%) had the same document content. This rate was higher than the rates obtained using the similarity calculated by each method on its own.
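
The combination step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`cosine_similarity`, `combined_similarity`, `is_same_content`) are hypothetical, cosine similarity is assumed as the vector similarity measure, and the per-method similarity values are made-up inputs rather than real fastText, BERT, or SBERT outputs. Only the use of the harmonic mean and the 0.96 threshold come from the abstract.

```python
import math
from statistics import harmonic_mean


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def combined_similarity(sims):
    """Harmonic mean of per-method similarities.

    The harmonic mean is only defined for positive values, so any
    non-positive similarity is treated as "not similar" (an assumption
    made for this sketch).
    """
    if any(s <= 0 for s in sims):
        return 0.0
    return harmonic_mean(sims)


def is_same_content(sims, threshold=0.96):
    """True if the combined similarity reaches the threshold (0.96 in the abstract)."""
    return combined_similarity(sims) >= threshold


# Illustrative (invented) similarities from three methods for two title pairs.
pair_a = [0.99, 0.98, 0.97]  # all methods agree the titles are close
pair_b = [0.99, 0.99, 0.80]  # one method disagrees

print(is_same_content(pair_a))
print(is_same_content(pair_b))
```

Because the harmonic mean is pulled toward the smallest value, a pair passes the threshold only when all three methods rate it as highly similar, which fits the abstract's observation that each method favors different kinds of pairs.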
