An Algorithm for Counting the Number of Documents That Contain a String κ or More Times

梅村 恭司; 真田 亜希子

doi:10.5715/jnlp.9.5_43

Abstract

The statistic we compute is dfκ: the number of documents that contain a given string at least κ times. Keeping this statistic for all substrings is impractical, because doing so requires O(N²) space, where N is the size of the corpus. Yamamoto et al. showed that a table for κ=1 can be produced in O(N) space using a suffix array and the concept of a “class of string”. However, their method cannot handle the case κ≥2. We present an algorithm that works for κ≥2 and builds a table from which the statistic can be computed. In this report, we explain dfκ and compare the proposed algorithm with simple methods. The algorithm takes O(N log N) time and O(N) space to produce the table, and O(log N) time to obtain the statistic from the table.
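
To make the definition of dfκ concrete, the following is a minimal sketch of a naive baseline that scans every document for a single query string. It is not the table-based algorithm proposed in the paper. The function names `count_occurrences` and `df_k`, the toy corpus, and the choice to count overlapping occurrences are assumptions made for illustration.

```python
from typing import Iterable


def count_occurrences(text: str, s: str) -> int:
    """Count occurrences of s in text, allowing overlaps (an assumption)."""
    if not s:
        raise ValueError("query string must be non-empty")
    count = 0
    pos = text.find(s)
    while pos != -1:
        count += 1
        # Advance by one character so overlapping matches are also counted.
        pos = text.find(s, pos + 1)
    return count


def df_k(docs: Iterable[str], s: str, k: int) -> int:
    """Number of documents that contain s at least k times (naive scan)."""
    return sum(1 for doc in docs if count_occurrences(doc, s) >= k)


# Hypothetical toy corpus, used only for illustration.
docs = ["abab", "ab", "ba", "ababab"]
print(df_k(docs, "ab", 1))  # documents containing "ab" at least once
print(df_k(docs, "ab", 2))  # documents containing "ab" at least twice
```

This baseline rescans the whole corpus for each query, so it takes time proportional to the corpus size per lookup. The abstract's contribution is a precomputed O(N)-space table that answers such queries in O(log N) time.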
