Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
KWIC System on WEB Documents
SATOSHI SEKINE, YOSHIYUKI TAKEDA, KENJI YOSHIHIRA
JOURNAL FREE ACCESS

2005 Volume 12 Issue 4 Pages 245-252

Abstract
A KWIC (KeyWord In Context) system is a useful tool for investigating language usage. We developed a KWIC system for a very large Web text collection. The text data was extracted from about 350 gigabytes of Web pages and contains more than 10 billion characters. The pages were collected by a crawler over a period of about two months. The text data exceeds 4 gigabytes, the largest size that can be expressed in 32 bits, so we developed a suffix array indexer that can handle 40 bits; the system uses it to search for sentences containing the desired keywords. To show the usefulness of the system for learners of Japanese as a second language, we collected KWIC data for “TO-ITAMU (painful like)” and analyzed the onomatopoeia that appear before the expression.
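The idea of a suffix-array-based KWIC search with offsets wider than 32 bits can be illustrated with a minimal sketch. This is not the authors' indexer: the function names (`build_suffix_array`, `find_occurrences`, `kwic`, `pack_offset`, `unpack_offset`) are hypothetical, the naive sort-based construction is only suitable for toy inputs, and storing each offset in 5 bytes is one assumed reading of "can handle 40 bits".

```python
from typing import List

OFFSET_BYTES = 5  # assumption: 40-bit offsets stored as 5 bytes each


def pack_offset(offset: int) -> bytes:
    """Encode a text offset in 40 bits; raises OverflowError if it does not fit."""
    return offset.to_bytes(OFFSET_BYTES, "little")


def unpack_offset(raw: bytes) -> int:
    """Decode a 40-bit offset written by pack_offset."""
    return int.from_bytes(raw, "little")


def build_suffix_array(text: str) -> List[int]:
    """Naive construction: sort every suffix start offset lexicographically.

    Fine for a toy example; a multi-gigabyte corpus needs a dedicated
    suffix array construction algorithm instead.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], keyword: str) -> List[int]:
    """Return all start offsets of keyword, using binary search on the suffix array."""
    k = len(keyword)
    # Lower bound: first suffix whose k-character prefix is >= keyword.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < keyword:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose k-character prefix is > keyword.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= keyword:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def kwic(text: str, sa: List[int], keyword: str, width: int = 10) -> List[str]:
    """Format each hit with up to `width` characters of context on both sides."""
    lines = []
    for pos in find_occurrences(text, sa, keyword):
        left = text[max(0, pos - width):pos]
        right = text[pos + len(keyword):pos + len(keyword) + width]
        lines.append(f"{left:>{width}} [{keyword}] {right}")
    return lines


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    index = build_suffix_array(sample)
    for line in kwic(sample, index, "the"):
        print(line)
```

A 32-bit offset can address at most 2^32 bytes (4 gigabytes), while a 40-bit offset reaches 2^40 bytes (about 1 terabyte), which is why the wider offset matters for a corpus of this size. In this sketch, the left-context column of the KWIC output is the place where a preceding word, such as an onomatopoeia before the target expression, would be read off.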
© The Association for Natural Language Processing