Host: Japan Society for Fuzzy Theory and Intelligent Informatics (SOFT)
Statistical data analysis using legacy databases often requires grouping mentions that refer to the same real-world entity. This pre-processing becomes particularly important for large-scale databases, where names vary so widely that the cost of generating dictionaries or normalization rules becomes infeasibly high. In this paper, we therefore investigate methods for automatic name matching and discuss the advantages and disadvantages of (i) a binary classifier that determines whether two mentions refer to the same entity and (ii) a graph-based clustering algorithm that disambiguates two similar mentions using their global features.
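As a rough illustration of how the two approaches relate, the following minimal sketch treats a pairwise decision as an edge in a graph and groups mentions by connected components. It is not the method evaluated in the paper: the function names `same_entity` and `cluster_mentions`, the string-similarity threshold of 0.8, and the use of connected components are all assumptions made for illustration. In particular, a fixed threshold stands in for a trained binary classifier, and connected components stand in for a clustering algorithm that uses global features.

```python
from difflib import SequenceMatcher
from itertools import combinations


def same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical stand-in for a trained pairwise binary classifier.

    Decides from surface string similarity alone whether two mentions
    refer to the same entity. The threshold is an assumed value.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold


def cluster_mentions(mentions, threshold: float = 0.8):
    """Group mentions by connected components of the pairwise-match graph.

    Each positive pairwise decision is an edge; union-find merges the
    endpoints, so matches propagate transitively across the whole graph.
    """
    parent = list(range(len(mentions)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if same_entity(mentions[i], mentions[j], threshold):
            parent[find(i)] = find(j)

    groups = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), []).append(mention)
    return list(groups.values())


if __name__ == "__main__":
    for group in cluster_mentions(["IBM Corp", "ibm corp", "Toyota"]):
        print(group)
```

The sketch also hints at the trade-off discussed above: a pairwise decision sees only two mentions at a time, whereas the graph step lets evidence from other mentions influence the final grouping, at the risk of chaining unrelated names together through a single false match.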