Abstract
This paper presents a simple and fast algorithm for approximate string matching in which string similarity is computed by a set similarity measure: cosine, Dice, Jaccard, or overlap coefficient. In this study, strings are represented as unordered sets of arbitrary features (e.g., tri-grams). By deriving necessary and sufficient conditions for approximate string matching, we show that the problem is exactly solvable as a τ-overlap join. We propose the CPMerge algorithm, which solves the τ-overlap join efficiently by making use of signatures in query features and a pruning condition. In addition, we describe implementation considerations for the algorithm. We measure the query performance of approximate string matching on three large-scale datasets consisting of English person names, Japanese unigrams, and biomedical entity/concept names. The experimental results demonstrate that the proposed method outperforms state-of-the-art methods, including Locality Sensitive Hashing and DivideSkip, on all the datasets. We also analyze the behavior of the proposed method on the datasets. We distribute SimString, a library implementation of the proposed method, under an open-source license.
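As an informal illustration of the setting described above, the following Python sketch represents strings as tri-gram sets, computes the four similarity measures, and checks a cosine threshold purely through the size of the feature overlap. It is a minimal sketch under stated assumptions, not the CPMerge algorithm or the SimString API: the function names, the `$` padding character, and the threshold name `alpha` are hypothetical choices made for this example, and duplicate tri-grams within a string are simply collapsed by the set.

```python
import math


def trigrams(s, n=3, pad="$"):
    """Return the set of character n-grams of s.

    Assumption: the string is padded with n-1 copies of `pad` on each side
    so that boundary characters appear in as many n-grams as inner ones.
    """
    padded = pad * (n - 1) + s + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x, y):
    return len(x & y) / len(x | y)


def overlap_coefficient(x, y):
    return len(x & y) / min(len(x), len(y))


def cosine_match_by_overlap(x, y, alpha):
    """Decide cosine(x, y) >= alpha using only |x|, |y| and |x & y|.

    Rearranging the cosine definition gives the equivalent condition
    |x & y| >= alpha * sqrt(|x| * |y|), i.e. a minimum required overlap.
    """
    return len(x & y) >= alpha * math.sqrt(len(x) * len(y))


x = trigrams("abc")  # {'$$a', '$ab', 'abc', 'bc$', 'c$$'}
y = trigrams("abd")  # {'$$a', '$ab', 'abd', 'bd$', 'd$$'}
shared = x & y       # {'$$a', '$ab'}
```

With these two strings the feature sets each contain five tri-grams and share two of them, so a cosine threshold translates into a required number of shared features. The same rearrangement can be carried out for the Dice, Jaccard, and overlap coefficients from their definitions.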