Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Associative Document Search using a Probabilistic Document Clustering
MAKOTO IWAYAMA, TAKENOBU TOKUNAGA

1998 Volume 5 Issue 1 Pages 101-117

Abstract
This paper presents a hierarchical clustering algorithm called HBC (Hierarchical Bayesian Clustering) for associative document search, the task of retrieving documents similar to a given query document. A major issue in realizing associative document search is the efficiency of searching for similar documents: for a collection of N documents, a straightforward exhaustive search takes O(N) time. In this paper we discuss cluster-based search, in which a document collection is automatically organized into a binary cluster tree and a query document is then compared with clusters rather than with every document. By searching the cluster tree top-down, search time can be reduced to O(log₂ N) on average. However, because the clustering algorithms adopted in previous cluster-based search frameworks used a similarity measure different from the one used in top-down document searching, the search accuracy of these frameworks was not promising. HBC, on the other hand, directly seeks the maximum search performance on the given document collection by maximizing its self recall. In an experiment using “Gendai yôgo no kisotisiki,” we verified the advantage of cluster-based search using HBC over the well-known cluster-based search using Ward's method. In an experiment using the “Wall Street Journal,” we also confirmed that cluster-based search using HBC is more noise tolerant than exhaustive search.
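
The top-down search over a binary cluster tree described in the abstract can be illustrated with a minimal sketch. This is not the authors' HBC implementation: the tree here is built by hand, and cosine similarity over summed term counts is an assumed stand-in for the probabilistic measure HBC uses. The names `Node`, `leaf`, `merge`, and `search` are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt
from typing import Optional


@dataclass
class Node:
    """Node of a binary cluster tree; a leaf holds exactly one document id."""
    centroid: Counter                 # summed term counts of all documents below
    doc_id: Optional[str] = None      # set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leaf(doc_id: str, text: str) -> Node:
    return Node(Counter(text.split()), doc_id=doc_id)


def merge(left: Node, right: Node) -> Node:
    """Join two clusters; the parent represents the union of their documents."""
    return Node(left.centroid + right.centroid, left=left, right=right)


def search(root: Node, query: str):
    """Descend from the root, always following the child more similar to the
    query. Returns the reached document id and the number of comparisons."""
    q = Counter(query.split())
    node, comparisons = root, 0
    while node.doc_id is None:
        comparisons += 2  # the query is compared with both child clusters
        node = max((node.left, node.right), key=lambda c: cosine(q, c.centroid))
    return node.doc_id, comparisons


# Hand-built toy tree over four documents (illustrative data only).
tree = merge(
    merge(leaf("d1", "stock market shares"), leaf("d2", "stock price trading")),
    merge(leaf("d3", "protein gene cell"), leaf("d4", "gene dna sequence")),
)
result = search(tree, "gene cell biology")
```

In this sketch the number of comparisons is twice the depth of the leaf that is reached, so it grows with log₂ N only when the tree is reasonably balanced. Because the descent commits to one branch at each level, a clustering whose similarity measure disagrees with the one used during search can send the query down the wrong subtree, which is the mismatch the abstract attributes to earlier frameworks.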
© The Association for Natural Language Processing