Advanced search results
Search query: "10th ANNIVERSARY BEST"
Showing results 1-8 of 8
  • Gongye Jin, Daisuke Kawahara, Sadao Kurohashi
    自然言語処理 (Journal of Natural Language Processing)
    2014, Vol. 21, No. 6, pp. 1163-1182
    Published: 2014/12/15
    Released: 2015/03/15
    Journal: free access
    Many knowledge acquisition tasks depend tightly on fundamental analysis technologies such as part-of-speech (POS) tagging and parsing. Dependency parsing, in particular, has been widely employed for acquiring knowledge related to predicate-argument structures. For such tasks, dependency parsing performance can determine the quality of the acquired knowledge, regardless of the target language. Reducing dependency parsing errors and selecting high-quality dependencies are therefore of primary importance. In this study, we present a language-independent approach for automatically selecting high-quality dependencies from automatic parses. By considering several aspects that affect the accuracy of dependency parsing, we created a set of features for supervised classification of reliable dependencies. Experimental results on seven languages show that our approach can effectively select high-quality dependencies from dependency parses.
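The supervised selection of reliable dependencies described above can be sketched as a small feature-based classifier. This is a minimal sketch under stated assumptions: the feature names, the perceptron learner, and the input format are illustrative, not the paper's actual feature set or classifier.

```python
# Minimal sketch: supervised selection of reliable dependency arcs.
# Features and learner are illustrative, not the paper's actual setup.

def features(dep):
    """Map one dependency arc to a sparse feature dict.

    `dep` is an assumed dict with head/modifier POS tags and distance.
    """
    return {
        "pos_pair=%s_%s" % (dep["head_pos"], dep["mod_pos"]): 1.0,
        "dist=%d" % min(dep["distance"], 5): 1.0,
        "root" if dep["head_pos"] == "ROOT" else "nonroot": 1.0,
    }

def train_perceptron(examples, epochs=10):
    """Simple (non-averaged) perceptron over (dep, is_reliable) pairs."""
    w = {}
    for _ in range(epochs):
        for dep, label in examples:
            f = features(dep)
            score = sum(w.get(k, 0.0) * v for k, v in f.items())
            pred = 1 if score >= 0 else 0
            if pred != label:
                delta = 1.0 if label == 1 else -1.0
                for k, v in f.items():
                    w[k] = w.get(k, 0.0) + delta * v
    return w

def select_reliable(deps, w):
    """Keep only arcs the classifier scores as reliable."""
    return [d for d in deps
            if sum(w.get(k, 0.0) * v
                   for k, v in features(d).items()) >= 0]
```

In this reading, the classifier acts as a filter over the parser's output, trading recall for precision of the retained arcs.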
  • Weiqi Gu, Haiyue Song, Chenhui Chu, Sadao Kurohashi
    Journal of Information Processing
    2023, Vol. 31, pp. 299-307
    Published: 2023
    Released: 2023/05/15
    Journal: free access

    Video-guided machine translation, a type of multimodal machine translation, aims to use video content as auxiliary information to address the word sense ambiguity problem in machine translation. Previous studies only use features from pre-trained action detection models as motion representations of the video, resolving verb sense ambiguity while neglecting noun sense ambiguity. To address this, we propose a video-guided machine translation system using both spatial and motion representations. For the spatial part, we propose a hierarchical attention network to model the spatial information from the object level to the video level. We investigate and discuss spatial features extracted from objects with pre-trained convolutional neural network models and spatial concept features extracted from object labels and attributes with pre-trained language models. We further investigate spatial feature filtering by referring to the corresponding source sentences. Experiments on the VATEX dataset show that our system achieves a 35.86 BLEU-4 score, 0.51 points higher than the single model of the SOTA method. Experiments on the How2 dataset further verify the generalization ability of our proposed system.
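The object-level-to-video-level pooling can be illustrated with a plain-Python sketch of hierarchical dot-product attention. The scoring function and the tiny vector dimensionality are assumptions for illustration; the paper's network learns its attention parameters.

```python
import math

# Minimal sketch of hierarchical attention pooling: attend over
# per-object vectors within each frame, then over the resulting
# frame vectors. Dot-product scoring is an illustrative stand-in.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, vectors):
    """Return the attention-weighted average of `vectors`."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors))
            for d in range(dim)]

def hierarchical_attention(query, video):
    """`video` is a list of frames; each frame is a list of per-object
    feature vectors. Pool objects into frame vectors, then pool the
    frame vectors into one video-level vector."""
    frame_vecs = [attend(query, objects) for objects in video]
    return attend(query, frame_vecs)
```

The two-stage structure is the point: object detail is summarized per frame before frames compete for attention at the video level.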

  • Chenhui Chu, Raj Dabre, Sadao Kurohashi
    Journal of Information Processing
    2018, Vol. 26, pp. 529-538
    Published: 2018
    Released: 2018/07/15
    Journal: free access

    Neural machine translation (NMT) has shown very promising results when large parallel corpora are available. However, for low-resource domains, vanilla NMT cannot give satisfactory performance because it overfits the small parallel corpora. Two categories of domain adaptation approaches have been proposed for low-resource NMT: adaptation using out-of-domain parallel corpora, and adaptation using in-domain monolingual corpora. In this paper, we conduct a comprehensive empirical comparison of the methods in both categories. For domain adaptation using out-of-domain parallel corpora, we further propose a novel domain adaptation method named mixed fine tuning, which combines two existing methods, fine tuning and multi-domain NMT. For domain adaptation using in-domain monolingual corpora, we compare two existing methods, language model fusion and synthetic data generation. In addition, we propose a method that combines these two categories. We empirically compare all the methods and discuss their benefits and shortcomings. To the best of our knowledge, this is the first comprehensive empirical comparison of domain adaptation methods for NMT.
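The data-preparation side of mixed fine tuning (continue training on a mix of the out-of-domain corpus and the oversampled in-domain corpus, with a domain tag on each source sentence) can be sketched as below. The tag strings and the size-matching oversampling rule are assumptions for illustration.

```python
# Minimal sketch of mixed-fine-tuning data preparation. Tag tokens
# ("<2out>", "<2in>") and the oversampling rule are illustrative.

def tag_corpus(pairs, tag):
    """Prepend a domain tag token to every source sentence."""
    return [("%s %s" % (tag, src), tgt) for src, tgt in pairs]

def mixed_fine_tuning_corpus(out_domain, in_domain):
    """Oversample the small in-domain corpus to roughly match the
    out-of-domain corpus size, then concatenate the tagged corpora.
    An NMT model pre-trained on `out_domain` would then continue
    training on this mixed corpus."""
    factor = max(1, len(out_domain) // max(1, len(in_domain)))
    mixed = tag_corpus(out_domain, "<2out>")
    mixed += tag_corpus(in_domain * factor, "<2in>")
    return mixed
```

Keeping the out-of-domain data in the fine-tuning mix is what distinguishes this from plain fine tuning, which trains on the in-domain corpus alone and overfits it.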

  • Chenhui Chu, Yu Shen, Fabien Cromieres, Sadao Kurohashi
    自然言語処理 (Journal of Natural Language Processing)
    2017, Vol. 24, No. 2, pp. 267-296
    Published: 2017/03/15
    Released: 2017/06/15
    Journal: free access

    Ideally, tree-to-tree machine translation (MT), which utilizes syntactic parse trees on both the source and target sides, could preserve non-local structure and thus generate fluent and accurate translations. In practice, however, high-quality parsers for both the source and target languages are difficult to obtain; moreover, even with high-quality parsers on both sides, the trees can be non-isomorphic because of differences in annotation criteria between the two languages. The lack of isomorphism between the parse trees makes it difficult to extract translation rules, which severely limits the performance of tree-to-tree MT. In this article, we present an approach that projects dependency parse trees from the language side with a high-quality parser to the side with a low-quality parser, to improve the isomorphism of the parse trees. We first project a subset of the dependencies with high confidence to make a partial parse tree, and then complement the remaining dependencies with partial parsing constrained by the already projected dependencies. Experiments conducted on the Japanese-Chinese and English-Chinese language pairs show that our proposed method significantly improves performance on both language pairs.
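The high-confidence projection step can be sketched by carrying dependency arcs across a word alignment, keeping only arcs whose endpoints are unambiguously aligned. The one-to-one-alignment criterion here is an assumed stand-in for the paper's actual confidence scoring.

```python
# Minimal sketch of projecting dependency arcs across a word
# alignment: an arc (head h, modifier m) on the source side is
# projected to (a[h], a[m]) on the target side only when both words
# have exactly one alignment link. The confidence criterion is an
# illustrative simplification.

def project_dependencies(src_arcs, alignment, tgt_len):
    """`alignment` maps each source index to a list of target indices.
    Returns the target-side arcs recoverable with high confidence;
    the remaining target words would later be attached by constrained
    partial parsing."""
    projected = []
    for head, mod in src_arcs:
        h_links = alignment.get(head, [])
        m_links = alignment.get(mod, [])
        if len(h_links) == 1 and len(m_links) == 1:
            th, tm = h_links[0], m_links[0]
            if th != tm and th < tgt_len and tm < tgt_len:
                projected.append((th, tm))
    return projected
```

Arcs that fail the criterion are deliberately left out: producing a partial tree and deferring the uncertain attachments is what the constrained-parsing stage is for.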

  • Chenhui Chu, Yu Shen, Fabien Cromieres, Sadao Kurohashi
    Information and Media Technologies
    2017, Vol. 12, pp. 172-201
    Published: 2017
    Released: 2017/09/15
    Journal: free access

    Ideally, tree-to-tree machine translation (MT), which utilizes syntactic parse trees on both the source and target sides, could preserve non-local structure and thus generate fluent and accurate translations. In practice, however, high-quality parsers for both the source and target languages are difficult to obtain; moreover, even with high-quality parsers on both sides, the trees can be non-isomorphic because of differences in annotation criteria between the two languages. The lack of isomorphism between the parse trees makes it difficult to extract translation rules, which severely limits the performance of tree-to-tree MT. In this article, we present an approach that projects dependency parse trees from the language side with a high-quality parser to the side with a low-quality parser, to improve the isomorphism of the parse trees. We first project a subset of the dependencies with high confidence to make a partial parse tree, and then complement the remaining dependencies with partial parsing constrained by the already projected dependencies. Experiments conducted on the Japanese-Chinese and English-Chinese language pairs show that our proposed method significantly improves performance on both language pairs.

  • Mo Shen, Daisuke Kawahara, Sadao Kurohashi
    自然言語処理 (Journal of Natural Language Processing)
    2016, Vol. 23, No. 3, pp. 235-266
    Published: 2016/06/15
    Released: 2016/09/15
    Journal: free access

    Chinese word segmentation is an initial and important step in Chinese language processing. Recent advances in machine learning techniques have boosted the performance of Chinese word segmentation systems, yet the identification of out-of-vocabulary words remains a major problem in this field. Recent research has attempted to address this problem by exploiting characteristics of frequent substrings in unlabeled data. We propose a simple yet effective approach for extracting a specific type of frequent substring, called maximized substrings, which provide good estimates of unknown word boundaries. In the task of Chinese word segmentation, we use these substrings, extracted from large-scale unlabeled data, to improve segmentation accuracy. The effectiveness of this approach is demonstrated through experiments using various datasets from different domains. In the task of unknown word extraction, we apply post-processing techniques that effectively reduce the noise in the extracted substrings. We demonstrate the effectiveness and efficiency of our approach by comparing the results with a widely applied Chinese word recognition method from a previous study.
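One plausible reading of "maximized substrings" is: frequent substrings whose count strictly exceeds the count of every one-character extension, so they cannot be grown without losing occurrences. The definition, the length cap, and the frequency threshold below are assumptions for illustration, not the paper's exact criteria.

```python
from collections import Counter

# Minimal sketch of extracting "maximized substrings" under an
# assumed definition: substrings that occur at least `min_count`
# times and strictly more often than any one-character extension.

def substring_counts(corpus, max_len=4):
    """Count all substrings up to `max_len` characters."""
    counts = Counter()
    for text in corpus:
        for i in range(len(text)):
            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                counts[text[i:j]] += 1
    return counts

def maximized_substrings(corpus, min_count=2, max_len=4):
    counts = substring_counts(corpus, max_len)
    result = []
    for s, c in counts.items():
        if c < min_count or len(s) == max_len:
            continue
        wider = [c2 for s2, c2 in counts.items()
                 if len(s2) == len(s) + 1 and s in s2]
        if all(c > c2 for c2 in wider):
            result.append(s)
    return result
```

Strings that keep their full count under extension (e.g. a fragment that only ever occurs inside one longer string) are rejected, which is why such substrings tend to align with word boundaries.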

  • Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
    自然言語処理 (Journal of Natural Language Processing)
    2015, Vol. 22, No. 3, pp. 139-170
    Published: 2015/06/16
    Released: 2015/12/14
    Journal: free access
    Parallel corpora are crucial for statistical machine translation (SMT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more available, many studies have been conducted to extract parallel sentences from them for SMT. Parallel sentence extraction relies heavily on bilingual lexicons, which are also very scarce. We propose a parallel sentence extraction system based on unsupervised bilingual lexicon extraction, which first extracts bilingual lexicons from comparable corpora and then extracts parallel sentences using those lexicons. Our bilingual lexicon extraction method combines a topic model and context-based methods in an iterative process. The proposed method does not rely on any prior knowledge, and its performance can be improved iteratively. The parallel sentence extraction method uses a binary classifier for parallel sentence identification, and the extracted bilingual lexicons improve the classifier's performance. Experiments conducted on Wikipedia data indicate that the proposed bilingual lexicon extraction method greatly outperforms existing methods, and the extracted bilingual lexicons significantly improve parallel sentence extraction performance for SMT.
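The lexicon-driven identification step can be sketched as a coverage-threshold filter: score a candidate sentence pair by the fraction of source words whose lexicon translation appears on the target side. The real system trains a binary classifier over richer features; the threshold rule and lexicon format below are illustrative assumptions.

```python
# Minimal sketch of lexicon-based parallel sentence identification.
# The coverage score and fixed threshold stand in for the paper's
# trained binary classifier.

def coverage(src_words, tgt_words, lexicon):
    """Fraction of source words with a lexicon translation present
    in the target sentence."""
    tgt = set(tgt_words)
    hits = sum(1 for w in src_words
               if any(t in tgt for t in lexicon.get(w, [])))
    return hits / len(src_words) if src_words else 0.0

def extract_parallel(pairs, lexicon, threshold=0.5):
    """Keep candidate pairs whose source-to-target coverage clears
    the threshold."""
    return [(src, tgt) for src, tgt in pairs
            if coverage(src, tgt, lexicon) >= threshold]
```

This also shows why lexicon quality matters: a better lexicon raises coverage scores for true parallel pairs, improving the extractor's precision-recall trade-off.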
  • Mo Shen, Daisuke Kawahara, Sadao Kurohashi
    Information and Media Technologies
    2016, Vol. 11, pp. 181-212
    Published: 2016
    Released: 2016/12/15
    Journal: free access

    Chinese word segmentation is an initial and important step in Chinese language processing. Recent advances in machine learning techniques have boosted the performance of Chinese word segmentation systems, yet the identification of out-of-vocabulary words remains a major problem in this field. Recent research has attempted to address this problem by exploiting characteristics of frequent substrings in unlabeled data. We propose a simple yet effective approach for extracting a specific type of frequent substring, called maximized substrings, which provide good estimates of unknown word boundaries. In the task of Chinese word segmentation, we use these substrings, extracted from large-scale unlabeled data, to improve segmentation accuracy. The effectiveness of this approach is demonstrated through experiments using various datasets from different domains. In the task of unknown word extraction, we apply post-processing techniques that effectively reduce the noise in the extracted substrings. We demonstrate the effectiveness and efficiency of our approach by comparing the results with a widely applied Chinese word recognition method from a previous study.
