In this paper, we propose a context-restricted Skip-gram model that exploits various attribute information of surrounding words when acquiring synonyms based on the distributional hypothesis. The conventional Skip-gram model learns a word vector for each target word from its surrounding words (context). In contrast, the proposed context-restricted Skip-gram model restricts the surrounding words to those with a particular part of speech or those occurring at a particular position, and learns a separate word vector under each restriction; each word therefore holds multiple word vectors reflecting the various restrictions. The proposed method computes the cosine similarity between each pair of corresponding word vectors and combines these similarities by supervised learning with a linear support vector machine and synonym-pair data, yielding a synonym classifier. Because the method is constructed as a linear combination of simple models, it is highly interpretable, which allows us to analyze how the various attributes of surrounding words affect synonym acquisition. The restrictions are also easy to change, so the method is highly extensible. Experiments on a real corpus show that combining many context-restricted Skip-gram models improves the accuracy of synonym acquisition compared with the plain Skip-gram model. An investigation of the weights assigned to the various word attributes also shows that the method appropriately captures linguistic properties of Japanese.
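To make the combination step concrete, the following is a minimal Python sketch of the classifier construction described above: one cosine-similarity feature per restricted embedding, combined with a linear SVM. The dictionary-of-embeddings interface, the variable names, and the use of scikit-learn's LinearSVC are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_features(word_a, word_b, restricted_models):
    """One feature per restriction: the cosine similarity of the word pair
    in that restricted embedding space."""
    return [cosine(model[word_a], model[word_b])
            for model in restricted_models.values()]

def train_synonym_classifier(restricted_models, labeled_pairs):
    """restricted_models: dict mapping a restriction label (e.g. "noun-context",
    "position=-1") to a {word: vector} embedding trained under that restriction.
    labeled_pairs: [((word_a, word_b), is_synonym), ...] -- hypothetical data."""
    X = np.array([pair_features(a, b, restricted_models)
                  for (a, b), _ in labeled_pairs])
    y = np.array([label for _, label in labeled_pairs])
    clf = LinearSVC()
    clf.fit(X, y)
    # clf.coef_ holds one weight per restriction, which is what makes the
    # combined model interpretable.
    return clf
```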
The surge in the use of social media such as Twitter introduces new opportunities for understanding and gauging public mood across different cultures. However, the diversity of expression in social media presents a considerable challenge to this task of opinion mining, given the limited accuracy of sentiment classification and the lack of intercultural comparisons. Previous Twitter sentiment corpora have only global polarity labels attached to them, which prevents deeper investigation of the mechanisms underlying the expression of feelings in social media, especially the role and influence of rhetorical phenomena. To this end, we construct an annotated corpus for multilingual Twitter sentiment understanding that encompasses three languages (English, Japanese, and Chinese) and four international topics (iPhone 6, Windows 8, Vladimir Putin, and Scottish Independence); the corpus comprises 5,422 tweets. Further, we propose a novel annotation scheme that embodies the idea of separating emotional signals from rhetorical context and that, in addition to global polarity, identifies rhetoric devices, emotional signals, degree modifiers, and subtopics. Next, to address the low inter-annotator agreement of previous corpora, we propose a pivot dataset comparison method that effectively improves the agreement rate. With its rich manual annotation, our corpus can serve as a valuable resource for the development and evaluation of automated sentiment classification, intercultural comparison, rhetoric detection, and related tasks. Finally, based on observations and analysis of our corpus, we draw three key conclusions. First, languages differ in their emotional signals and rhetoric devices, and the finding that cultures hold different opinions about the same objects is reconfirmed. Second, each rhetoric device has its own characteristics, influences global polarity in its own way, and has an inherent structure that helps to model the sentiment it expresses. Third, the models of the expression of feelings in different languages are rather similar, suggesting the possibility of unifying multilingual opinion mining at the sentiment level.
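As a rough illustration of the annotation scheme, a single annotated tweet might be represented as follows; the field names and the container type are assumptions made for illustration and do not reflect the corpus's actual file format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AnnotatedTweet:
    """A minimal sketch of one record under the scheme described above."""
    text: str
    language: str             # "en", "ja", or "zh"
    topic: str                # e.g. "iPhone 6", "Scottish Independence"
    global_polarity: str      # e.g. "positive" / "negative" / "neutral"
    rhetoric_devices: List[str] = field(default_factory=list)   # e.g. ["irony"]
    emotional_signals: List[str] = field(default_factory=list)  # e.g. ["great", ":)"]
    degree_modifiers: List[str] = field(default_factory=list)   # e.g. ["very"]
    subtopics: List[str] = field(default_factory=list)          # e.g. ["battery life"]
```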
Ideally, tree-to-tree machine translation (MT), which uses syntactic parse trees on both the source and target sides, could preserve non-local structure and thus generate fluent and accurate translations. In practice, however, high-quality parsers for both the source and target languages are difficult to obtain; moreover, even when high-quality parsers are available on both sides, the resulting trees can still be non-isomorphic because of differences in annotation criteria between the two languages. The lack of isomorphism between the parse trees makes it difficult to extract translation rules, which severely limits the performance of tree-to-tree MT. In this article, we present an approach that projects dependency parse trees from the language side with a high-quality parser to the side with a low-quality parser in order to improve the isomorphism of the parse trees. We first project the dependencies with high confidence to form a partial parse tree, and then complement the remaining dependencies with partial parsing constrained by the already projected dependencies. Experiments on the Japanese-Chinese and English-Chinese language pairs show that the proposed method significantly improves performance on both language pairs.
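The two-stage projection can be sketched as follows. The word-alignment and confidence representations, and the `constrained_parse` callable standing in for any parser that can keep a given set of arcs fixed, are simplifying assumptions rather than the authors' implementation.

```python
def project_high_confidence(source_deps, alignments, threshold=0.9):
    """Stage 1: copy source dependencies whose confidence exceeds the threshold
    and whose head and dependent are both word-aligned, yielding a partial
    target-side tree (target dependent index -> target head index).

    source_deps: list of (head_idx, dep_idx, confidence) triples.
    alignments:  dict mapping source token index -> target token index."""
    partial = {}
    for head, dep, conf in source_deps:
        if conf >= threshold and head in alignments and dep in alignments:
            partial[alignments[dep]] = alignments[head]
    return partial

def complete_tree(target_tokens, partial, constrained_parse):
    """Stage 2: let a parser choose heads for the remaining tokens while
    keeping the projected arcs fixed (constrained partial parsing)."""
    return constrained_parse(target_tokens, fixed_arcs=partial)
```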
When analyzing noisy Japanese text such as social media posts, many words that do not appear in the morphological analysis dictionary occur, so analysis errors increase compared with text such as newspaper articles. Among such unknown words, for those derived from known dictionary words, joint analysis with orthographic normalization, which considers normalized forms during analysis, has been shown to be effective. In this study, we focus on the acquisition of string normalization patterns, which has not previously received attention, and statistically extract such patterns from annotated data. Expanding the dictionary-word candidates with the statistically extracted string normalization patterns and character-type normalization and then performing morphological analysis, we obtained results with higher recall and precision than the conventional method.
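As a rough sketch of the pattern-extraction idea, the following Python snippet counts substring rewrites observed in annotated (noisy form, normalized form) pairs and uses the frequent ones to propose candidate dictionary forms for the analyzer's lattice. The use of difflib and the frequency cutoff are illustrative assumptions, not the paper's statistical extraction procedure.

```python
from collections import Counter
import difflib

def extract_patterns(annotated_pairs, min_count=2):
    """Count substring rewrites (noisy span -> normalized span) observed in
    annotated (noisy, normalized) string pairs and keep the frequent ones."""
    counts = Counter()
    for noisy, normal in annotated_pairs:
        matcher = difflib.SequenceMatcher(None, noisy, normal)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != "equal":
                counts[(noisy[i1:i2], normal[j1:j2])] += 1
    return {pattern: c for pattern, c in counts.items() if c >= min_count}

def expand_candidates(surface, patterns):
    """Apply each pattern to a surface form to propose dictionary-word
    candidates to be added to the morphological analyzer's lattice."""
    candidates = {surface}
    for src, dst in patterns:
        if src and src in surface:
            candidates.add(surface.replace(src, dst))
    return candidates
```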