IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Special Section on Foundations of Computer Science — New Trends in Algorithms and Theory of Computation —
Scalable Detection of Frequent Substrings by Grammar-Based Compression
Masaya NAKAHARAShirou MARUYAMATetsuji KUBOYAMAHiroshi SAKAMOTO
Author information
JOURNAL FREE ACCESS

2013 Volume E96.D Issue 3 Pages 457-464

Details
Abstract

A scalable pattern discovery by compression is proposed. A string is representable by a context-free grammar deriving the string deterministically. In this framework of grammar-based compression, the aim of the algorithm is to output as small a grammar as possible. Beyond that, the optimization problem is approximately solvable. In such approximation algorithms, the compressor based on edit-sensitive parsing (ESP) is especially suitable for detecting maximal common substrings as well as long frequent substrings. Based on ESP, we design a linear time algorithm to find all frequent patterns in a string approximately and prove several lower bounds to guarantee the length of extracted patterns. We also examine the performance of our algorithm by experiments in biological sequences and other compressible real world texts. Compared to other practical algorithms, our algorithm is faster and more scalable with large and repetitive strings.

Content from these authors
© 2013 The Institute of Electronics, Information and Communication Engineers
Previous article Next article
feedback
Top