A KWIC System for Web Documents

関根 聡; 武田 善行; 吉平 健治

doi:10.5715/jnlp.12.4_245

Abstract

A KWIC (KeyWord In Context) system is a useful tool for investigating language usage. We developed a KWIC system for a very large body of Web text. The text data was extracted from about 350 gigabytes of Web pages, which a crawler collected over a period of about two months, and contains more than 10 billion characters. The text data exceeds 4 gigabytes, the largest size that can be addressed with 32 bits. We therefore developed a suffix array indexer that can handle 40-bit indexes, and the system uses it to search for sentences containing the desired keywords. To show the usefulness of the system for learners of Japanese as a second language, we collected KWIC data for “TO-ITAMU (painful like)” and analyzed the onomatopoeia that appear before the expression.
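
The idea of a suffix-array KWIC search with offsets wider than 32 bits can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_suffix_array`, `pack_offset`, `unpack_offset`, `kwic`), the 5-byte big-endian storage of each 40-bit offset, and the naive sorting-based construction are all assumptions made for illustration. A real index over 10 billion characters would need a far more scalable construction method.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text, sorted lexicographically.

    Naive construction for illustration only; it does not scale to large corpora.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def pack_offset(offset):
    """Store one offset in 5 bytes (40 bits). Assumed layout, not from the paper."""
    if not 0 <= offset < (1 << 40):
        raise ValueError("offset does not fit in 40 bits")
    return offset.to_bytes(5, "big")


def unpack_offset(raw):
    """Inverse of pack_offset."""
    return int.from_bytes(raw, "big")


def kwic(text, suffix_array, keyword, width=10):
    """Return (left context, keyword, right context) for every occurrence."""
    n = len(keyword)

    def prefix(k):
        # First n characters of the k-th suffix in sorted order.
        start = suffix_array[k]
        return text[start:start + n]

    # Lower bound: first suffix whose n-character prefix is >= keyword.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < keyword:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose n-character prefix is > keyword.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= keyword:
            lo = mid + 1
        else:
            hi = mid
    last = lo

    # Report the hits in text order.
    positions = sorted(suffix_array[first:last])
    return [
        (text[max(0, p - width):p], text[p:p + n], text[p + n:p + n + width])
        for p in positions
    ]
```

A 32-bit offset can address at most 2^32 positions (4 gigabytes of single-byte units), whereas a 40-bit offset can address 2^40 (about 1 terabyte), which is why the wider offset is needed for text of this size. In this sketch, `kwic(text, build_suffix_array(text), keyword)` returns the context triples in the order the occurrences appear in the text.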
