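Japanese Journal of Forensic Science and Technology
Technical Report
Introducing Scoring to Author Identification Using Text Mining
Effects of the Number of Characters, Number of Texts, and Stylistic Features on Score Distributions
財津 亘 (Wataru Zaitsu), 金 明哲 (Mingzhe Jin)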

Vol. 22 (2017), No. 2, pp. 91-108

Abstract

Author identification through text mining aims to judge whether the author of a suspected text is the same as the author of a set of control texts. This study examined the validity of a scoring method for author identification. In one unit of analysis, we conducted 18 analyses (six writing-style features × three multivariate analyses) on one suspected text from a blogger, control texts from one blogger, and irrelevant texts from four bloggers. The writing-style features were (1) rate of usage of non-independent words, (2) part-of-speech bigrams, (3) bigrams of postpositional particles, (4) positioning of commas, (5) rate of usage of Kanji, Hiragana, etc., and (6) sentence length. The multivariate analyses were (1) principal component analysis, (2) correspondence analysis, and (3) multidimensional scaling. Scores were obtained from the arrangement of the texts on two dimensions. If the convex hull polygon (CHP) of the control texts overlapped that of the irrelevant texts, the score was 0. If the two CHPs did not overlap, the score was +2 when the suspected text fell inside the CHP of the control texts, +1 when it fell outside that CHP but near a control text, and −1 when it fell near an irrelevant text. We totaled the scores within one unit of analysis (18 results) and analyzed the total scores of 240 units of analysis for 10 bloggers under the following design: 2 (author combination of suspected and control texts: same, different) × 4 (number of characters: 250, 500, 1000, 1500) × 3 (number of control and irrelevant texts: 3, 6, 9). The results indicated that the scoring method was able to identify the authors. The effect of the number of characters on AUC was statistically significant, whereas the effect of the number of texts was not. Furthermore, the rate of usage of non-independent words and part-of-speech bigrams were particularly useful for identifying authors.
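
The scoring rule described in the abstract can be sketched in a few lines of Python. This is only an illustrative reading of the abstract, not the authors' implementation. The function names (`convex_hull`, `convex_overlap`, `score_arrangement`, `total_score`) are hypothetical, and several details are assumptions because the abstract does not specify them: "near" is taken to mean the smaller Euclidean distance to the closest control or irrelevant text, ties go to −1, and touching hulls or boundary points count as overlapping or inside.

```python
from math import dist


def _cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def convex_overlap(a, b):
    """Separating-axis test for two convex vertex lists (a single point is allowed).

    Edge directions and the coordinate axes are tested in addition to edge
    normals so that degenerate hulls (points, segments) are handled too.
    Assumption: shapes that merely touch are reported as overlapping.
    """
    axes = [(1.0, 0.0), (0.0, 1.0)]
    for poly in (a, b):
        for i in range(len(poly)):
            (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
            dx, dy = x2 - x1, y2 - y1
            axes += [(dx, dy), (-dy, dx)]
    for ax, ay in axes:
        proj_a = [x * ax + y * ay for x, y in a]
        proj_b = [x * ax + y * ay for x, y in b]
        if max(proj_a) < min(proj_b) or max(proj_b) < min(proj_a):
            return False  # found a separating axis
    return True


def score_arrangement(suspect, control, irrelevant):
    """Score one two-dimensional arrangement: 0, +2, +1, or -1."""
    hull_c, hull_i = convex_hull(control), convex_hull(irrelevant)
    if convex_overlap(hull_c, hull_i):
        return 0  # control and irrelevant CHPs overlap
    if convex_overlap(hull_c, [suspect]):
        return 2  # suspected text lies inside the control CHP
    # Assumption: "near" means the closest single text by Euclidean distance.
    nearest_c = min(dist(suspect, p) for p in control)
    nearest_i = min(dist(suspect, p) for p in irrelevant)
    return 1 if nearest_c < nearest_i else -1  # assumption: ties score -1


def total_score(arrangements):
    """Sum the scores over the arrangements of one unit of analysis (18 in the study)."""
    return sum(score_arrangement(s, c, i) for s, c, i in arrangements)
```

Each element passed to `total_score` is a `(suspect, control, irrelevant)` triple of 2D coordinates, one per combination of writing-style feature and multivariate analysis. Under this reading, the total for one unit of analysis ranges from −18 to +36.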

© 2017 Japanese Association of Forensic Science and Technology