Japanese Journal of Forensic Science and Technology
Online ISSN : 1881-4689
Print ISSN : 1880-1323
ISSN-L : 1880-1323
Technical Note
Introduction of scoring for author identification by text mining: Effects of the number of characters and texts, and the features of writing style
Wataru Zaitsu, Mingzhe Jin

2017 Volume 22 Issue 2 Pages 91-108

Abstract

Author identification through text mining aims to judge whether the author suspected of writing a certain text is the same as the author of the control texts. This study examined the validity of a scoring method for author identification. In one unit of analysis, we conducted 18 analyses (six writing-style features × three multivariate analyses) across one suspected text of a blogger, one control text of a blogger, and irrelevant texts of four bloggers. The writing-style features were (1) rate of usage of non-independent words, (2) bigram of parts-of-speech, (3) bigram of postpositional particles, (4) positioning of commas, (5) rate of usage of Kanji, Hiragana, etc., and (6) sentence length. The multivariate analyses were (1) principal component analysis, (2) correspondence analysis, and (3) multidimensional scaling. Scores were obtained from the arrangement of the texts in two dimensions. When the convex hull polygon (CHP) of the control texts overlapped that of the irrelevant texts, the score was 0. When the two CHPs did not overlap, the score was +2 if the suspected text was arranged inside the CHP of the control texts, +1 if it was not inside the CHP of the control texts but was near a control text, and −1 if it was near an irrelevant text. We totaled the scores within each unit of analysis (18 results) and analyzed the total scores of the 240 units of analysis for 10 bloggers under the following design: 2 (author combination of suspected and control texts: same, different) × 4 (number of characters: 250, 500, 1000, 1500) × 3 (number of control and irrelevant texts: 3, 6, 9). The results indicated that the scoring method was able to identify the authors. AUCs were statistically significant for the number of characters, but not for the number of texts. Furthermore, the rate of usage of non-independent words and parts-of-speech were quite useful for identifying authors.
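The scoring rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: "near" is read as the Euclidean nearest neighbour in the two-dimensional plot, ties are scored −1, the irrelevant texts of all bloggers are pooled into a single CHP, and each group must contain at least three non-collinear points. The function names (`convex_hull`, `hulls_overlap`, `score`) are hypothetical.

```python
from math import dist

def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise convex hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

def hulls_overlap(h1, h2):
    """Separating-axis test for two convex polygons (touching counts as overlap)."""
    for hull in (h1, h2):
        n = len(hull)
        for i in range(n):
            (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % n]
            ax, ay = y1 - y2, x2 - x1  # normal of the current edge
            p1 = [ax * x + ay * y for x, y in h1]
            p2 = [ax * x + ay * y for x, y in h2]
            if max(p1) < min(p2) or max(p2) < min(p1):
                return False  # a separating axis exists
    return True

def score(suspected, control, irrelevant):
    """Score one analysis: 0, +2, +1, or -1, following the rule in the abstract."""
    hc, hi = convex_hull(control), convex_hull(irrelevant)
    if len(hc) < 3 or len(hi) < 3:
        raise ValueError("each group needs at least three non-collinear points")
    if hulls_overlap(hc, hi):
        return 0
    if inside(hc, suspected):
        return 2
    # Assumption: "near" means the nearest text by Euclidean distance.
    d_control = min(dist(suspected, p) for p in control)
    d_irrelevant = min(dist(suspected, p) for p in irrelevant)
    return 1 if d_control < d_irrelevant else -1

# Made-up coordinates, for illustration only.
control = [(0, 0), (2, 0), (1, 2)]
irrelevant = [(10, 10), (12, 10), (11, 12)]
scores = [score(s, control, irrelevant) for s in [(1, 0.5), (3, 3), (9, 9)]]
```

Under this reading, the total for one unit of analysis is the sum of `score` over its 18 analyses, so it can range from −18 to +36.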

© 2017 Japanese Association of Forensic Science and Technology