Journal of the Acoustical Society of Japan (E)
Online ISSN : 2185-3509
Print ISSN : 0388-2861
ISSN-L : 0388-2861
Building a Thai part-of-speech tagged corpus (ORCHID)
Virach Sornlertlamvanich, Naoto Takahashi, Hitoshi Isahara

1999 Volume 20 Issue 3 Pages 189-198

Abstract

ORCHID (Open linguistic Resources CHanelled toward InterDisciplinary research) is an initiative aimed at building linguistic resources to support research in, but not limited to, natural language processing. Following an open architecture design, the resources must be fully compatible with similar resources, and the accompanying software tools must also be made available. This paper presents one result of the project: the construction of a Thai part-of-speech (POS) tagged corpus, which is a preliminary stage in the construction of a Thai speech corpus. The POS-tagged corpus is the result of collaborative research between the Communications Research Laboratory (CRL) in Japan and the National Electronics and Computer Technology Center (NECTEC) in Thailand, with technical support from the Electrotechnical Laboratory (ETL) in Japan. In this paper, we propose a new tagset based on the results of a prior multilingual machine translation project. The corpus is annotated on three levels: paragraph, sentence, and word. Text information is maintained in the form of text information lines and number lines, both of which are used in data retrieval. Both word segmentation and POS tagging were carried out using a probabilistic trigram model. In addition, rules for syllable demarcation were used to reduce the number of candidates considered when computing tagging probabilities.
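
As an illustration of the probabilistic trigram tagging mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the input is already segmented into words (the paper performs segmentation and tagging with the same kind of model), uses add-one smoothing, and is trained on a tiny English toy corpus with placeholder tags rather than the ORCHID tagset. The names `train`, `viterbi`, and `toy_corpus` are hypothetical.

```python
from collections import defaultdict
import math


def train(tagged_sentences):
    """Collect tag-trigram and word-emission counts from (word, tag) sentences."""
    tri = defaultdict(int)        # counts of (t[i-2], t[i-1], t[i])
    bi = defaultdict(int)         # counts of the context (t[i-2], t[i-1])
    emit = defaultdict(int)       # counts of (tag, word)
    tag_count = defaultdict(int)  # counts of each tag
    vocab = set()
    for sent in tagged_sentences:
        prev2, prev1 = "<s>", "<s>"  # sentence-start padding
        for word, tag in sent:
            tri[(prev2, prev1, tag)] += 1
            bi[(prev2, prev1)] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev2, prev1 = prev1, tag
    return tri, bi, emit, tag_count, vocab


def viterbi(words, model):
    """Return the most probable tag sequence under the trigram model."""
    tri, bi, emit, tag_count, vocab = model
    tags = sorted(tag_count)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unknown words

    def log_trans(a, b, c):
        # Add-one smoothed P(c | a, b); smoothing choice is an assumption.
        return math.log((tri[(a, b, c)] + 1) / (bi[(a, b)] + n_tags))

    def log_emit(tag, word):
        # Add-one smoothed P(word | tag).
        return math.log((emit[(tag, word)] + 1) / (tag_count[tag] + n_words))

    # State = last two tags; value = (log probability, best tag path so far).
    best = {("<s>", "<s>"): (0.0, [])}
    for word in words:
        nxt = {}
        for (a, b), (score, path) in best.items():
            for c in tags:
                s = score + log_trans(a, b, c) + log_emit(c, word)
                if (b, c) not in nxt or s > nxt[(b, c)][0]:
                    nxt[(b, c)] = (s, path + [c])
        best = nxt
    return max(best.values(), key=lambda v: v[0])[1]


# Toy training data with placeholder tags (not the ORCHID tagset).
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train(toy_corpus)
print(viterbi(["the", "cat", "barks"], model))
```

In this sketch the search state is the pair of most recent tags, so pruning the candidate set (as the syllable demarcation rules do in the paper) would directly shrink the number of states explored at each step.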

© The Acoustical Society of Japan