This paper presents a hierarchical clustering algorithm called HBC (Hierarchical Bayesian Clustering) for associative document search, that is, the retrieval of documents similar to a given query document. A major issue in realizing associative document search is the efficiency of finding similar documents: a straightforward exhaustive search over a collection of N documents takes O(N) time. We discuss cluster-based search, in which a document collection is automatically organized into a binary cluster tree and a query document is compared with clusters rather than with individual documents. By searching the cluster tree in the top-down direction, search time can be reduced to O(log₂ N) on average. However, because the clustering algorithms adopted in previous cluster-based search frameworks used a similarity measure different from the one used in top-down document search, the search accuracy of these frameworks was not promising. HBC, on the other hand, directly seeks the maximum search performance on the given document collection by maximizing its self recall. In an experiment using “Gendai yôgo no kisotisiki,” we verified the advantage of cluster-based search using HBC over the well-known cluster-based search using Ward's method. In a further experiment using the “Wall Street Journal,” we confirmed that cluster-based search using HBC is more noise tolerant than exhaustive search.
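As a rough illustration of why a top-down search of a binary cluster tree needs far fewer comparisons than an exhaustive scan, the following minimal sketch descends a hand-built tree, comparing the query with two child clusters at each level. It is not HBC itself: cosine similarity against averaged term vectors stands in for the paper's probabilistic measure, and the names `Node`, `merge`, `top_down_search`, and the four toy documents are all hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, Optional, Tuple

Vector = Dict[str, float]  # sparse term-weight vector (assumed representation)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (stand-in measure, not HBC's)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Node:
    centroid: Vector
    doc_id: Optional[str] = None      # set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(doc_id: str, vec: Vector) -> Node:
    return Node(centroid=vec, doc_id=doc_id)


def merge(a: Node, b: Node) -> Node:
    """Join two clusters; the centroid is a plain average of the children (toy choice)."""
    terms = set(a.centroid) | set(b.centroid)
    centroid = {t: (a.centroid.get(t, 0.0) + b.centroid.get(t, 0.0)) / 2 for t in terms}
    return Node(centroid=centroid, left=a, right=b)


def top_down_search(root: Node, query: Vector) -> Tuple[str, int]:
    """Follow the more similar child at each level; return (doc_id, comparisons made)."""
    node, comparisons = root, 0
    while node.doc_id is None:
        comparisons += 2  # one comparison per child cluster
        if cosine(query, node.left.centroid) >= cosine(query, node.right.centroid):
            node = node.left
        else:
            node = node.right
    return node.doc_id, comparisons


# Toy collection of four documents, clustered by hand for illustration.
d1 = leaf("d1", {"cluster": 1.0, "tree": 1.0})
d2 = leaf("d2", {"cluster": 1.0, "search": 1.0})
d3 = leaf("d3", {"stock": 1.0, "market": 1.0})
d4 = leaf("d4", {"stock": 1.0, "bond": 1.0})
root = merge(merge(d1, d2), merge(d3, d4))

query = {"stock": 1.0, "bond": 1.0, "yield": 1.0}
result = top_down_search(root, query)
```

In a balanced tree the number of comparisons grows with the tree depth rather than with the collection size, which is the source of the logarithmic average search time. The sketch also shows the weakness the abstract points to: if the tree is built with one similarity measure and searched with another, an early wrong turn cannot be undone.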