Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Counting documents that contain substrings more than κ times
KYOJI UMEMURA, AKIKO SANADA

2002 Volume 9 Issue 5 Pages 43-70

Abstract
The statistic we compute is dfκ: the number of documents that contain a given string more than κ times. We can hardly keep this statistic for all substrings, because doing so requires O(N²) space, where N is the size of the corpus. Yamamoto et al. show that it is possible to produce a table for κ=1 in O(N) space using a suffix array and the concept of “class of string”. However, this method cannot solve the problem for κ≥2. We present an algorithm that produces such a table for κ≥2, from which the statistic can then be computed. In this report, we explain dfκ and compare the proposed algorithm with simple methods. The algorithm takes O(N log N) time and O(N) space to produce the table, and O(log N) time to obtain the statistic from the table.
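
To make the definition of dfκ concrete, the following is a minimal Python sketch of a brute-force baseline, not the algorithm proposed in the paper. It assumes that "more than κ times" is read as "at least κ times", which is the reading under which κ=1 gives ordinary document frequency, and that overlapping occurrences are counted separately. The function names `df_k` and `naive_table` are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def df_k(documents, substring, k):
    """Count documents in which `substring` occurs at least k times.

    Overlapping occurrences are counted separately (an assumption of
    this sketch). This scans every document on each query.
    """
    if not substring:
        raise ValueError("substring must be non-empty")
    count = 0
    for doc in documents:
        occurrences = 0
        start = doc.find(substring)
        while start != -1:
            occurrences += 1
            if occurrences >= k:
                break  # threshold reached; no need to scan further
            start = doc.find(substring, start + 1)
        if occurrences >= k:
            count += 1
    return count


def naive_table(documents, k):
    """Tabulate df_k for every substring of every document.

    Each document of length n contributes up to n(n+1)/2 substrings,
    so the table grows quadratically with the corpus size. This is the
    space cost the abstract says cannot be afforded in practice.
    """
    table = Counter()
    for doc in documents:
        # Term frequency of every substring within this one document.
        per_doc = Counter(
            doc[i:j]
            for i in range(len(doc))
            for j in range(i + 1, len(doc) + 1)
        )
        for s, tf in per_doc.items():
            if tf >= k:
                table[s] += 1
    return table
```

For the toy corpus `["abab", "ab", "bbb"]`, `df_k(corpus, "ab", 1)` counts the two documents containing "ab", while `df_k(corpus, "ab", 2)` counts only "abab". The quadratic table built by `naive_table` is what a suffix-array-based table is meant to replace with O(N) space.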
© The Association for Natural Language Processing