Abstract
High-quality MT systems and cross-lingual information retrieval systems need large translation dictionaries. Automatic extraction of translation patterns from parallel corpora is an efficient and accurate way to build such dictionaries, and various approaches have been proposed. This paper presents a practical translation pattern extraction method that greedily extracts translation patterns based on the co-occurrence frequency of word sequences between English and Japanese, and that can be effectively combined with manual confirmation or with extra linguistic resources such as chunking information and translation dictionaries. To reduce the time spent on pattern extraction, the paper also examines extracting probable translation patterns in incremental steps by gradually enlarging the unit into which the corpus is segmented. Our experiments using 8,000 sentences showed that the proposed method achieved an accuracy of 89% at a coverage of 85%, whereas the existing method achieved an accuracy of only 40% at a coverage of 79%; combining the proposed method with manual confirmation further improved the accuracy to 96% at a coverage of 85%. Our experiments using 16,000 sentences showed that dividing the corpus into quarters reduced the extraction time to 9 hours, whereas the non-dividing method required 16 hours.
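As a rough illustration of what greedy extraction based on co-occurrence frequency of word sequences could look like, the following minimal sketch may help. It is not the method evaluated in this paper: the Dice coefficient as the correlation score, the thresholds `min_freq` and `min_score`, the one-to-one constraint on sequences, and all function names are assumptions made only for illustration, and the sketch assumes sentences are already aligned and tokenized.

```python
from collections import Counter
from itertools import product


def ngrams(tokens, max_n=2):
    """Return the set of word sequences (as tuples) of length 1..max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def extract_patterns(pairs, max_n=2, min_freq=2, min_score=0.8):
    """Greedily pick (source, target) word-sequence pairs by co-occurrence.

    pairs: iterable of (source_tokens, target_tokens) aligned sentences.
    The Dice coefficient used here is an assumed scoring function.
    """
    src_freq, tgt_freq, co_freq = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_seqs, t_seqs = ngrams(src, max_n), ngrams(tgt, max_n)
        # Frequencies are counted per sentence pair, not per occurrence.
        src_freq.update(s_seqs)
        tgt_freq.update(t_seqs)
        co_freq.update(product(s_seqs, t_seqs))

    candidates = []
    for (s, t), c in co_freq.items():
        if c < min_freq:
            continue
        score = 2 * c / (src_freq[s] + tgt_freq[t])  # Dice coefficient
        if score >= min_score:
            candidates.append((score, c, s, t))

    # Greedy step: best score first, then higher co-occurrence count.
    candidates.sort(key=lambda x: (-x[0], -x[1]))
    used_src, used_tgt, patterns = set(), set(), []
    for score, _, s, t in candidates:
        if s in used_src or t in used_tgt:
            continue  # each sequence is assigned to at most one pattern
        used_src.add(s)
        used_tgt.add(t)
        patterns.append((" ".join(s), " ".join(t), score))
    return patterns
```

Calling `extract_patterns` on a list of aligned, tokenized sentence pairs returns `(source, target, score)` triples. Running it on successively larger portions of a corpus and removing already confirmed patterns would be one way to approximate the incremental scheme, though that step is not shown here.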