Recognition of sarcasm in microblogging is important for a range of NLP applications, such as opinion mining. However, this is a challenging task, as the real meaning of a sarcastic sentence is the opposite of its literal meaning. Furthermore, microblogging messages are short and usually written in a free style that may include misspellings, grammatical errors, and complex sentence structures. This paper proposes a novel method for identifying sarcasm in tweets. It combines two supervised classifiers: a Support Vector Machine (SVM) using N-gram features and an SVM using our proposed features. Our features represent the intensity and contradictions of sentiment in a tweet, derived by sentiment analysis. The sentiment contradiction feature also considers coherence among multiple sentences in the tweet, which is identified automatically by our proposed method using unsupervised clustering and an adaptive genetic algorithm. Furthermore, a method for identifying the concepts of unknown sentiment words is used to compensate for gaps in the sentiment lexicon. Our method also considers punctuation and the special symbols that are frequently used in Twitter messages. Experiments on two datasets demonstrated that our proposed system outperformed baseline systems on one dataset while producing comparable results on the other, achieving sarcasm identification accuracies of 82% and 76%, respectively.
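The kinds of features this abstract describes (sentiment intensity, a positive/negative contradiction flag, and punctuation cues) can be sketched as follows. This is a minimal illustration with a toy lexicon and hypothetical feature names, not the paper's actual feature set or implementation:

```python
# Sketch of sentiment-based tweet features: intensity, polarity
# contradiction, and punctuation cues. The lexicon and feature names
# are toy assumptions for illustration only.
LEXICON = {"love": 1.0, "great": 0.8, "awful": -0.9, "hate": -1.0}

def tweet_features(tweet):
    tokens = [t.strip(".,!?") for t in tweet.lower().split()]
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    pos = sum(s for s in scores if s > 0)
    neg = sum(s for s in scores if s < 0)
    return {
        "intensity": pos - neg,                      # total sentiment strength
        "contradiction": int(pos > 0 and neg < 0),   # both polarities present
        "exclamations": tweet.count("!"),            # punctuation cue
        "ellipses": tweet.count("..."),              # special-symbol cue
    }

f = tweet_features("Great, my flight is delayed again... I hate this!")
```

In a full system, such a feature vector would feed the second SVM, alongside the N-gram SVM.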
Patent claim sentences, despite their legal importance in patent documents, still pose difficulties for state-of-the-art statistical machine translation (SMT) systems owing to their extreme length and their special sentence structure. This paper describes a method for improving the translation quality of claim sentences by taking into account the features specific to the claim sublanguage. Our method overcomes the issue of special sentence structure by transferring the sublanguage-specific sentence structure from the source language to the target language, using a set of synchronous context-free grammar rules. It also overcomes the issue of extreme length by taking sentence components as the processing unit for SMT. An experiment demonstrates that our proposed method significantly improves translation quality, raising RIBES scores by over 25 points in all four translation directions: English-to-Japanese, Japanese-to-English, Chinese-to-Japanese, and Japanese-to-Chinese. Alongside the improvement in RIBES scores, BLEU scores improve by approximately five points for the English-to-Japanese and Japanese-to-English directions and by approximately 1.5 points for the Chinese-to-Japanese and Japanese-to-Chinese directions.
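The core mechanism, applying a synchronous context-free grammar rule so that the claim skeleton is reordered on the target side while the components are translated separately, can be illustrated with a toy rule. The rule and fillers below are invented for illustration (and the fillers are left in English, whereas in practice each slot would hold an SMT translation); this is not the paper's actual grammar:

```python
# Toy synchronous rule: both sides share indexed nonterminal slots, so
# filling the source slots also fixes the target-side ordering.
# (Invented rule, not from the paper's grammar.)
RULE = {
    "src": ["[1]", "comprising", "[2]"],   # English claim skeleton
    "tgt": ["[2]", "を備える", "[1]"],      # Japanese skeleton, reordered
}

def apply_rule(rule, fillers):
    """Substitute component fillers into both sides of a synchronous rule.
    In a real system each filler would itself be translated by SMT."""
    def fill(side):
        return " ".join(fillers.get(tok, tok) for tok in side)
    return fill(rule["src"]), fill(rule["tgt"])

src, tgt = apply_rule(RULE, {"[1]": "a device", "[2]": "a sensor and a display"})
```

Translating each slot independently is what keeps the SMT processing units short, addressing the extreme-length issue.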
By using knowledge bases, question answering (QA) systems can now answer questions accurately over a variety of topics. However, knowledge bases exist for only a few major languages, so it is often necessary to build QA systems that answer questions in one language based on an information source in another (cross-lingual QA: CLQA). Machine translation (MT) is one tool for achieving CLQA, and it is intuitively clear that a better MT system improves QA accuracy. However, it is not clear whether an MT system that is better for human consumption is also better for CLQA. In this paper, we investigate the relationship between manual and automatic translation evaluation metrics and CLQA accuracy by creating a data set using both manual and machine translation and then performing CLQA on this data set. We find that QA accuracy is closely related to a metric that considers word frequency, and through manual analysis we identify two properties of translation results that affect CLQA accuracy: mistranslation of content words and omission of question-type words. In addition, we show that a metric that correlates highly with CLQA accuracy can be used to improve CLQA accuracy by choosing an appropriate translation result from among the translation candidates.
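The idea of a frequency-aware metric, and of using it to pick the best candidate translation, can be sketched as below. The scoring function is a toy information-weighted overlap in the spirit of frequency-sensitive metrics such as NIST; the corpus counts and candidates are hypothetical, and this is not the paper's exact metric:

```python
import math
from collections import Counter

# Toy frequency-aware metric: matched words that are rare in the corpus
# earn more credit, so content words dominate the score.
def info_weighted_overlap(candidate, reference, corpus_counts, corpus_size):
    cand, ref = candidate.lower().split(), reference.lower().split()
    matched = Counter(cand) & Counter(ref)   # multiset intersection
    score = 0.0
    for word, n in matched.items():
        p = corpus_counts.get(word, 1) / corpus_size
        score += n * -math.log2(p)           # rarer words carry more information
    return score / max(len(cand), 1)

# Selecting among translation candidates by the metric (hypothetical data).
counts = {"the": 5000, "einstein": 3, "was": 4000, "born": 40, "where": 900}
reference = "where was einstein born"
candidates = ["where was einstein born", "the einstein was"]
best = max(candidates,
           key=lambda c: info_weighted_overlap(c, reference, counts, 10000))
```

A mistranslated content word like "einstein" costs far more under this weighting than a common function word, which matches the abstract's observation about content-word errors.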
Document similarity measurement techniques are used to evaluate both content and writing style. In text summarization and machine translation, evaluation measures have been proposed for comparing a system-generated summary or translation of a source text with a human-generated one. Distance metrics defined over morphemes or morpheme sequences can be used to evaluate or capture different writing styles. In this study, we discuss the relations among the equivalence properties of mathematical metrics, similarities, kernels, ordinal scales, and correlations. In addition, we investigate the behavior of techniques for measuring content and style similarities on several corpora with similar content. The analysis results obtained using different document similarity measurement techniques indicate the instability of the evaluation systems.
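One representative similarity of the kind discussed here is cosine similarity over short character-sequence counts, a simple stand-in for morpheme-sequence measures of style (the paper's actual measures operate on morphemes; this sketch uses character bigrams for self-containment):

```python
import math
from collections import Counter

# Cosine similarity over character-bigram counts: a minimal example of
# a sequence-based similarity that is sensitive to surface style.
def bigrams(text):
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine_similarity(a, b):
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[k] * vb[k] for k in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Note that cosine similarity is symmetric and bounded in [0, 1] for count vectors, but the distance 1 - cos does not satisfy the triangle inequality in general, which is exactly the kind of equivalence-property distinction among metrics, similarities, and kernels the study examines.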
In statistical machine translation, the pivot translation approach enables translation of language pairs with little or no parallel data by introducing a third language for which data exist. In particular, the triangulation method, which combines source-pivot and pivot-target translation models into a source-target model, is known for its high translation accuracy. However, the conventional triangulation method discards information about the pivot phrases, which is therefore not used in the translation process. In this research, we propose a novel approach that remembers the pivot phrases in the triangulation stage and uses a pivot language model as an additional information source in the translation phase. Experimental results on the United Nations Parallel Corpus showed significant improvements in all tested combinations of languages.
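The conventional triangulation that this work extends marginalizes over pivot phrases, estimating p(target | source) as the sum over pivot phrases of p(target | pivot) * p(pivot | source). A minimal sketch with toy probabilities (the pivot identities are exactly what is discarded here and what the proposed method retains):

```python
# Conventional phrase-table triangulation: marginalize over pivot
# phrases to build a source->target table. Probabilities are toy values.
src_pivot = {"maison": {"house": 0.7, "home": 0.3}}            # p(pivot|source)
pivot_tgt = {"house": {"casa": 0.9},
             "home": {"casa": 0.6, "hogar": 0.4}}              # p(target|pivot)

def triangulate(src_pivot, pivot_tgt):
    table = {}
    for s, pivots in src_pivot.items():
        for p, p_ps in pivots.items():
            for t, p_tp in pivot_tgt.get(p, {}).items():
                table.setdefault(s, {})
                # sum over pivot phrases: p(t|s) += p(t|p) * p(p|s)
                table[s][t] = table[s].get(t, 0.0) + p_ps * p_tp
    return table

table = triangulate(src_pivot, pivot_tgt)
```

After this summation, the table no longer records whether "casa" came through "house" or "home"; keeping that pivot-phrase information available for a pivot language model is the gap the proposed method addresses.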