The surge in the use of social media platforms such as Twitter introduces new opportunities for understanding and gauging public mood across different cultures. However, the diversity of expression in social media poses a considerable challenge to opinion mining, given the limited accuracy of sentiment classification and the lack of intercultural comparisons. Previous Twitter sentiment corpora carry only global polarity labels, which prevents deeper investigation of the mechanisms underlying the expression of feelings in social media, especially the role and influence of rhetorical phenomena. To this end, we construct an annotated corpus for multilingual Twitter sentiment understanding that encompasses three languages (English, Japanese, and Chinese) and four international topics (iPhone 6, Windows 8, Vladimir Putin, and Scottish Independence); the corpus comprises 5,422 tweets. Further, we propose a novel annotation scheme built on the idea of separating emotional signals from rhetorical context: in addition to global polarity, it identifies rhetorical devices, emotional signals, degree modifiers, and subtopics. Next, to address the low inter-annotator agreement of previous corpora, we propose a pivot dataset comparison method that effectively improves the agreement rate. With its manually annotated rich information, our corpus can serve as a valuable resource for developing and evaluating automated sentiment classification, intercultural comparison, rhetoric detection, and related tasks. Finally, based on our observations and analysis of the corpus, we draw three key conclusions. First, languages differ in their emotional signals and rhetorical devices, reconfirming that cultures hold different opinions about the same objects. Second, each rhetorical device has its own characteristics, influences global polarity in its own way, and has an inherent structure that helps to model the sentiment it conveys.
Third, the models of how feelings are expressed in different languages are rather similar, suggesting the possibility of unifying multilingual opinion mining at the sentiment level.
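To make the annotation scheme concrete, the record below sketches what one annotated tweet might look like under the scheme described above. The field names, label sets, and the example tweet are illustrative assumptions, not the corpus's actual schema.

```python
# Illustrative sketch of a per-tweet annotation record; field names and
# label values are assumptions for exposition, not the corpus's real schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TweetAnnotation:
    text: str
    global_polarity: str                                        # "positive" / "negative" / "neutral"
    rhetorical_devices: List[str] = field(default_factory=list)  # e.g. irony, rhetorical question
    emotional_signals: List[str] = field(default_factory=list)   # surface cues: sentiment words, emoticons
    degree_modifiers: List[str] = field(default_factory=list)    # intensifiers/diminishers, e.g. "so"
    subtopic: Optional[str] = None                               # finer-grained target within the topic

# Hypothetical example: irony flips the literal positive signal.
ann = TweetAnnotation(
    text="The iPhone 6 camera is *so* amazing... sure.",
    global_polarity="negative",
    rhetorical_devices=["irony"],
    emotional_signals=["amazing"],
    degree_modifiers=["so"],
    subtopic="camera",
)
```

Separating the emotional signal ("amazing") from the rhetorical context ("irony") lets a model explain why the global polarity differs from the literal sentiment of the words.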
Ideally, tree-to-tree machine translation (MT) that utilizes syntactic parse trees on both the source and target sides could preserve non-local structure and thus generate fluent, accurate translations. In practice, however, high-quality parsers for both the source and target languages are difficult to obtain; moreover, even when high-quality parsers are available on both sides, the resulting trees can be non-isomorphic because of differences in annotation criteria between the two languages. This lack of isomorphism makes it difficult to extract translation rules and severely limits the performance of tree-to-tree MT. In this article, we present an approach that projects dependency parse trees from the language side with a high-quality parser to the side with a low-quality parser, improving the isomorphism of the parse trees. We first project a subset of dependencies with high confidence to form a partial parse tree, and then complete the remaining dependencies with partial parsing constrained by the already projected dependencies. Experiments on the Japanese-Chinese and English-Chinese language pairs show that the proposed method significantly improves performance on both language pairs.
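The two-stage projection idea can be sketched as follows. This is a minimal illustration under simplifying assumptions: the alignment is given as one-to-one links (treated as high confidence), and stage 2 uses a nearest-neighbor attachment as a stand-in for the authors' constrained partial parsing.

```python
# Sketch of two-stage dependency projection. Assumptions: one-to-one word
# alignments stand in for "high confidence" links, and stage 2's nearest-
# attachment heuristic stands in for constrained partial parsing.

def project_dependencies(src_heads, alignment, tgt_len):
    """Project source dependency arcs onto the target sentence.

    src_heads: src_heads[i] is the head index of source word i (-1 = root).
    alignment: dict mapping source word index -> target word index.
    Returns a list of target head indices.
    """
    tgt_heads = [None] * tgt_len
    # Stage 1: project only arcs whose dependent and head are both aligned,
    # yielding a partial (high-confidence) target tree.
    for s_dep, s_head in enumerate(src_heads):
        if s_dep not in alignment:
            continue
        t_dep = alignment[s_dep]
        if s_head == -1:
            tgt_heads[t_dep] = -1          # root carries over
        elif s_head in alignment:
            tgt_heads[t_dep] = alignment[s_head]
    # Stage 2: complete the remaining dependencies without disturbing the
    # projected arcs (here: attach to the nearest already-attached word).
    attached = [i for i, h in enumerate(tgt_heads) if h is not None]
    for i, h in enumerate(tgt_heads):
        if h is None and attached:
            tgt_heads[i] = min(attached, key=lambda j: abs(j - i))
    return tgt_heads
```

For a three-word source sentence whose middle word is the root, with only the first two words aligned, stage 1 fixes two arcs and stage 2 attaches the leftover target word to its nearest projected neighbor.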