To ensure the reliability of evaluations obtained through crowdsourcing services, this study examined methods for selecting qualified evaluators and reliable ratings, using emotional ratings of nonverbal vocalizations collected via a crowdsourcing service. To assess the effectiveness of these methods, emotional ratings were also collected in an in-person laboratory listening experiment. Three filtering criteria were examined: (a) excluding evaluators who rated more than 45% of their assigned samples with the same value, (b) excluding evaluators who took less than 7 seconds to rate each assigned sample, and (c) excluding rating instances associated with a low self-reported confidence score. The results showed that the crowdsourced listening test exhibited tendencies similar to the in-person test, with high correlation coefficients of 0.873 for arousal, 0.739 for pleasantness, and 0.704 for dominance when evaluators who took less than 7 seconds to rate each speech sample were eliminated. However, the differences in the correlation coefficients between the filtered and non-filtered scores were only 0.001–0.007. Moreover, the results revealed that self-reported confidence scores can be used to eliminate unreliable ratings, but the correlations improved only marginally.
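The sketch below illustrates one way the three filtering criteria could be implemented. It is a minimal illustration under stated assumptions, not the authors' code: the DataFrame layout, the column names (evaluator_id, rating, response_time_s, confidence), the use of mean response time for criterion (b), and the numeric confidence threshold for criterion (c) are assumptions not specified in the abstract.

```python
# Sketch of the three exclusion criteria described in the abstract.
# Column names and thresholds beyond 45% and 7 s are illustrative assumptions.
import pandas as pd


def filter_ratings(df: pd.DataFrame,
                   same_value_ratio: float = 0.45,
                   min_seconds: float = 7.0,
                   min_confidence: int = 2) -> pd.DataFrame:
    """Return the rating instances that survive the three filters."""
    # (a) Exclude evaluators whose most frequent rating value covers more
    #     than 45% of their assigned samples (straight-lining check).
    mode_share = (df.groupby("evaluator_id")["rating"]
                    .apply(lambda r: r.value_counts(normalize=True).iloc[0]))
    straight_liners = set(mode_share[mode_share > same_value_ratio].index)

    # (b) Exclude evaluators whose response time per sample falls below
    #     7 seconds (here approximated by the mean response time).
    mean_time = df.groupby("evaluator_id")["response_time_s"].mean()
    too_fast = set(mean_time[mean_time < min_seconds].index)

    kept = df[~df["evaluator_id"].isin(straight_liners | too_fast)]

    # (c) Exclude individual rating instances with low self-reported confidence.
    return kept[kept["confidence"] >= min_confidence]
```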