日本法科学技術学会誌
Online ISSN : 1881-4689
Print ISSN : 1880-1323
ISSN-L : 1880-1323

この記事には本公開記事があります。本公開記事を参照してください。
引用する場合も本公開記事を引用してください。

テキストマイニングを用いた筆者識別へのスコアリング導入
―文字数やテキスト数,文体的特徴が得点分布に及ぼす影響―
財津 亘金 明哲
著者情報
ジャーナル フリー 早期公開

論文ID: 715

この記事には本公開記事があります。
詳細
抄録

 Author identification through text-mining aims to judge whether an author suspected of writing a certain text is same as that of control texts. This study examined the validity of scoring for author identification. In one unit of analysis, we conducted 18 analyses (six writing styles×three multivariate analyses) across one suspected text of a blogger, one control text of a blogger, and irrelevant texts of four bloggers. The writing style factors were (1) rate of usage of non-independent words, (2) bigram of parts-of-speech, (3) bigram of postpositional particles, (4) positioning of commas, (5) rate of usage of Kanji, Hiragana et al., and (6) sentence length. We completed (1) principal components analysis, (2) corresponding analysis, and (3) multi-dimensional scaling. We obtained scores from arrangements of texts on two dimensions, convex hull polygon (CHP) consisting of control texts was overlapped with that of irrelevant texts (a score of 0). Besides not overlapping each CHP of control and irrelevant texts, (a score of +2) a suspected text arranged into CHP of control texts, (a score of +1) one not arranged into CHP of control texts but near a control text, and (a score of −1) one near an irrelevant text. We totaled the scores in one unit of analysis (18 results) and analyzed the total scores of the 240 units of analysis for 10 bloggers under the following design: 2 (author combination of suspected and control texts: same, different)×4 (number of characters: 250, 500, 1000, 1500)×3 (number of control and irrelevant texts: 3, 6, 9). The results indicated the scoring method was able to identify the authors. AUCs of number of characters were statistically significant, but the number of texts was not significant. Furthermore, rate of usage of non-independent words and parts-of-speech were quite useful to identify authors.

著者関連情報
© 2017 日本法科学技術学会
feedback
Top