Abstract
This study investigates the optimal similarity threshold for extracting only those pairs of titles whose documents have the same content. Each title is converted to a vector using NLP methods, and the similarity between the vectors is calculated. A comparison of the similarities calculated with fastText, BERT, and SBERT revealed that the characteristics of document pairs rated as highly similar differ depending on the method. The authors therefore used the harmonic mean of the similarities calculated by the three methods. Of the 157 title pairs with a harmonic mean of 0.96 or higher, 141 (90%) had the same document content. This rate was higher than that obtained using the similarity from any single method.
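The combination step can be illustrated with a minimal sketch. It assumes that each title pair already has three similarity scores (one each from fastText, BERT, and SBERT) and that a non-positive score should force the combined score to 0.0; the abstract specifies neither. The names `harmonic_mean`, `select_pairs`, and `example_scores`, as well as the scores themselves, are hypothetical and are not the authors' code or data.

```python
from typing import Dict, List, Sequence, Tuple

THRESHOLD = 0.96  # harmonic-mean cutoff reported in the abstract

Pair = Tuple[str, str]


def harmonic_mean(similarities: Sequence[float]) -> float:
    """Return the harmonic mean of the given similarity scores.

    Assumption: the harmonic mean is undefined for non-positive values,
    so any score <= 0 makes the combined score 0.0.
    """
    if not similarities:
        raise ValueError("need at least one similarity score")
    if any(s <= 0.0 for s in similarities):
        return 0.0
    return len(similarities) / sum(1.0 / s for s in similarities)


def select_pairs(
    scores: Dict[Pair, Sequence[float]], threshold: float = THRESHOLD
) -> List[Pair]:
    """Keep the title pairs whose combined score reaches the threshold."""
    return [
        pair for pair, sims in scores.items()
        if harmonic_mean(sims) >= threshold
    ]


# Hypothetical scores in the order (fastText, BERT, SBERT).
example_scores: Dict[Pair, Sequence[float]] = {
    ("title A", "title B"): (0.98, 0.97, 0.99),  # all three methods agree
    ("title C", "title D"): (0.99, 0.99, 0.80),  # one method disagrees
}

selected = select_pairs(example_scores)
print(selected)
```

Because the harmonic mean is never larger than the arithmetic mean and is pulled toward the smallest input, a pair passes the threshold only when all three methods rate it as highly similar. In this sketch the second pair is rejected even though two of its three scores are 0.99.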