This research proposes a context-restricted Skip-gram model that acquires synonyms by exploiting various properties of the context words. The original Skip-gram model learns the word vector of each target word from all the context words around it. In contrast, the proposed context-restricted Skip-gram model learns multiple types of word vectors for each target word by limiting the context words to those of specific parts of speech or those at specific relative positions. The proposed method calculates cosine similarities over the multiple word vector types and combines these similarities using a linear support vector machine. Because the method is a weighted linear summation of simple models, it is highly interpretable, which enables us to investigate how strongly each property of the context words influences synonym acquisition. Moreover, the method is highly extensible because the conditions of context restriction can easily be changed or added. Experimental results on actual Japanese corpora showed that the proposed method, which aggregates multiple context-restricted models, outperformed the conventional single Skip-gram model. In addition, the estimated weights of the various properties of the context words appropriately elucidated some grammatical characteristics of the Japanese language.
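The aggregation step described above can be sketched as follows. This is a minimal, self-contained illustration with synthetic vectors: the three "context-restricted spaces" are hypothetical stand-ins for spaces such as noun-only contexts or position −1 contexts, and a simple perceptron stands in for the paper's linear SVM (both learn a weighted linear combination of the per-space cosine similarities).

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

random.seed(0)
DIM, SPACES = 50, 3  # three hypothetical context restrictions

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

def synonym_pair():
    # Synthetic synonyms: nearly identical vectors in every restricted space.
    base = [rand_vec() for _ in range(SPACES)]
    near = [[x + random.gauss(0.0, 0.1) for x in v] for v in base]
    return base, near

def unrelated_pair():
    # Synthetic non-synonyms: independent vectors, cosine near zero.
    return [rand_vec() for _ in range(SPACES)], [rand_vec() for _ in range(SPACES)]

pairs = [(synonym_pair(), 1) for _ in range(20)] + \
        [(unrelated_pair(), -1) for _ in range(20)]
# Feature vector of a word pair: one cosine similarity per restricted space.
feats = [([cosine(a[i], b[i]) for i in range(SPACES)], y) for (a, b), y in pairs]

# Perceptron training: a stand-in for the linear SVM in the abstract.
w, bias = [0.0] * SPACES, 0.0
for _ in range(100):
    for x, y in feats:
        if y * (sum(wi * xi for wi, xi in zip(w, x)) + bias) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            bias += y

accuracy = sum(
    ((sum(wi * xi for wi, xi in zip(w, x)) + bias) > 0) == (y > 0)
    for x, y in feats
) / len(feats)
```

After training, the learned weights `w` play the role the abstract highlights for interpretability: each weight indicates how much its context restriction contributes to the synonym decision.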
The surge of social media use, such as Twitter, introduces new opportunities for understanding and gauging public mood across different cultures. However, the diversity of expression in social media poses a considerable challenge to opinion mining, given the limited accuracy of sentiment classification and the lack of intercultural comparisons. Previous Twitter sentiment corpora carry only global polarities, which prevents deeper investigation of the mechanisms underlying the expression of feelings in social media, especially the role and influence of rhetorical phenomena. To this end, we construct an annotated corpus for multilingual Twitter sentiment understanding that covers three languages (English, Japanese, and Chinese) and four international topics (iPhone 6, Windows 8, Vladimir Putin, and Scottish Independence), comprising 5,422 tweets. Further, we propose a novel annotation scheme built on the idea of separating emotional signals from rhetorical context: in addition to global polarity, it identifies rhetorical devices, emotional signals, degree modifiers, and subtopics. Next, to address the low inter-annotator agreement of previous corpora, we propose a pivot-dataset comparison method that effectively improves the agreement rate. With its manually annotated rich information, our corpus can serve as a valuable resource for developing and evaluating automated sentiment classification, intercultural comparison, rhetoric detection, and related tasks. Finally, based on our observations and analysis of the corpus, we present three key conclusions. First, languages differ in their emotional signals and rhetorical devices, and the finding that cultures hold different opinions about the same objects is reconfirmed. Second, each rhetorical device has its own characteristics, influences global polarity in its own way, and has an inherent structure that helps to model the sentiment it represents. Third, the models of the expression of feelings in different languages are rather similar, suggesting the possibility of unifying multilingual opinion mining at the sentiment level.
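To make the idea of separating emotional signals from rhetorical context concrete, a single record under such a scheme might look like the following sketch. All field names, the example tweet, and the labels here are hypothetical illustrations, not the corpus's actual format; the point is that a literal positive signal combined with an irony label yields a negative global polarity.

```python
# Hypothetical record shape: a literal positive emotional signal ("great")
# is flipped to negative global polarity by the irony rhetorical device.
record = {
    "text": "Oh great, my iPhone 6 bent again. Just what I needed!",
    "language": "en",
    "topic": "iPhone 6",
    "global_polarity": "negative",
    "rhetorical_devices": ["irony"],
    "emotional_signals": [{"span": "great", "literal_polarity": "positive"}],
    "degree_modifiers": [],
    "subtopics": [],
}

# The annotation layers named in the abstract, as top-level fields.
REQUIRED = {"global_polarity", "rhetorical_devices",
            "emotional_signals", "degree_modifiers", "subtopics"}
is_valid = REQUIRED <= record.keys()
```

A classifier trained only on the literal signal would label this tweet positive; the separate rhetorical-device layer is what lets a model recover the true polarity.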
Ideally, tree-to-tree machine translation (MT) that utilizes syntactic parse trees on both the source and target sides could preserve non-local structure and thus generate fluent and accurate translations. In practice, however, high-quality parsers for both the source and target languages are difficult to obtain; moreover, even when such parsers exist, the resulting trees can still be non-isomorphic because the annotation criteria of the two languages differ. This lack of isomorphism makes it difficult to extract translation rules and severely limits the performance of tree-to-tree MT. In this article, we present an approach that projects dependency parse trees from the language side with a high-quality parser to the side with a low-quality parser, thereby improving the isomorphism of the parse trees. We first project the dependencies with high confidence to form a partial parse tree, and then complement the remaining dependencies by partial parsing constrained by the already projected dependencies. Experiments conducted on the Japanese-Chinese and English-Chinese language pairs show that our proposed method significantly improves performance on both pairs.
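The two-stage projection can be sketched as below. This is a toy illustration under stated assumptions: "high confidence" is approximated by keeping only 1-to-1 alignment links, and a naive nearest-attached-word heuristic stands in for the paper's constrained partial parsing step.

```python
def project_dependencies(src_heads, align):
    """Stage 1: project dependencies whose head and dependent are both
    covered by a (hypothetical) high-confidence 1-to-1 word alignment.

    src_heads: {dependent_index: head_index} on the source side (root omitted)
    align:     {source_index: target_index}, 1-to-1 links only
    Returns a partial {dependent_index: head_index} tree on the target side.
    """
    return {align[d]: align[h]
            for d, h in src_heads.items()
            if d in align and h in align}

def complement(partial, n_target, root):
    """Stage 2: attach the remaining target words, respecting the already
    projected edges. A nearest-attached-word heuristic is used here as a
    stand-in for constrained partial parsing."""
    heads = dict(partial)
    attached = set(heads) | {root}
    for i in range(n_target):
        if i not in attached:
            heads[i] = min(attached, key=lambda j: abs(j - i))
            attached.add(i)
    return heads

# Toy example: source tree with root 1 (0<-1, 1->2, 2->3), partial alignment.
src_heads = {0: 1, 2: 1, 3: 2}
align = {0: 0, 1: 1, 3: 2}  # source word 2 is unaligned
partial = project_dependencies(src_heads, align)
full = complement(partial, n_target=4, root=1)
```

Only the edge whose endpoints are both aligned survives projection; the complementation stage then attaches the leftover target words so that every non-root token receives a head.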
Social media texts are often written in a non-standard style and include many lexical variants, such as insertions, phonetic substitutions, and abbreviations that mimic spoken language. Normalizing such a variety of non-standard tokens is one promising solution for handling noisy text. Normalization is particularly difficult in the morphological analysis of Japanese text because there are no explicit boundaries between words. To address this issue, we propose a novel method for jointly normalizing and morphologically analyzing Japanese noisy text. First, we extract character-level transformation patterns based on a character alignment model trained on annotated data. Next, we generate both character-level and word-level normalization candidates using these transformation patterns and search for the optimal path based on a discriminative model. Experimental results show that the proposed method exceeds a conventional rule-based system in both accuracy and recall for word segmentation and POS (part-of-speech) tagging.
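The candidate-generation-plus-path-search pipeline can be sketched as follows. This toy uses romanized English strings without word boundaries to mimic the Japanese setting; the pattern table and the unit costs are illustrative assumptions, whereas the paper learns its patterns from character alignments and scores paths with a discriminative model.

```python
# Hypothetical transformation patterns (non-standard -> normalized) and a
# tiny lexicon; both stand in for resources learned from annotated data.
PATTERNS = {"soooo": "so", "gud": "good"}
LEXICON = {"so", "good", "morning", "night"}

def candidates(text, i):
    """Yield (end_index, normalized_form) for segments starting at i,
    covering both verbatim lexicon matches and pattern rewrites."""
    for j in range(i + 1, len(text) + 1):
        seg = text[i:j]
        if seg in LEXICON:
            yield j, seg
        if seg in PATTERNS:
            yield j, PATTERNS[seg]

def normalize(text):
    """Viterbi-style search for the cheapest joint segmentation and
    normalization path; rewrites cost more than verbatim matches."""
    INF = float("inf")
    best = [(INF, None, None)] * (len(text) + 1)  # (cost, backpointer, word)
    best[0] = (0, None, None)
    for i in range(len(text)):
        if best[i][0] == INF:
            continue  # position i is unreachable
        for j, norm in candidates(text, i):
            cost = best[i][0] + (1 if text[i:j] == norm else 2)
            if cost < best[j][0]:
                best[j] = (cost, i, norm)
    if best[len(text)][0] == INF:
        return None  # no full-cover path found
    # Recover the best path by following backpointers.
    words, i = [], len(text)
    while i > 0:
        _, prev, norm = best[i]
        words.append(norm)
        i = prev
    return words[::-1]
```

For example, `normalize("gudmorning")` segments the unspaced string and rewrites the variant in one pass, which is the joint behavior the abstract describes.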