Journal of Japan Society of Directories
Online ISSN : 2436-5629
Print ISSN : 1882-9252
Survey of Thresholds for Determining Document Identity Using NLP Models
JOURNAL OPEN ACCESS

2024 Volume 22 Issue 1 Pages 2-9

Abstract
This study investigates the optimal similarity threshold for identifying pairs of titles that refer to documents with the same content. Each title is converted to a vector using an NLP method, and the similarity between the vectors is calculated. A comparison of the similarities calculated by fastText, BERT, and SBERT revealed that the characteristics of the document pairs rated as highly similar differ depending on the method. The authors therefore used the harmonic mean of the similarities calculated by the three methods. Of the 157 title pairs with a harmonic mean of 0.96 or higher, 141 (about 90%) had the same document content. This rate was higher than that obtained using the similarity calculated by any single method.
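The combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which similarity measure was used, so cosine similarity is assumed here, and the function names (`cosine_similarity`, `combined_similarity`, `same_document`) and the sample scores are hypothetical. Only the harmonic mean and the 0.96 threshold come from the abstract.

```python
import math
from statistics import harmonic_mean

THRESHOLD = 0.96  # harmonic-mean cutoff reported in the abstract


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_u * norm_v)


def combined_similarity(similarities):
    """Harmonic mean of per-model similarities.

    The harmonic mean is only defined for positive values, so any
    non-positive score makes the combined score 0.0 (an assumption
    of this sketch, not something stated in the abstract).
    """
    if any(s <= 0.0 for s in similarities):
        return 0.0
    return harmonic_mean(similarities)


def same_document(similarities, threshold=THRESHOLD):
    """True if the combined similarity reaches the threshold."""
    return combined_similarity(similarities) >= threshold


# Hypothetical per-model scores for one title pair, e.g. (fastText, BERT, SBERT).
print(same_document([0.98, 0.97, 0.99]))  # all three models agree
print(same_document([0.99, 0.99, 0.85]))  # one model disagrees
```

Because the harmonic mean is pulled toward the smallest value, a pair reaches the threshold only when all three models rate it as highly similar, which fits the abstract's observation that each method favors different kinds of pairs.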
© 2024 Japan Society of Directories