Filtering Method for Twitter Streaming Data Using Human-in-the-Loop Machine Learning

Yu Suzuki

doi:10.2197/ipsjjip.27.404

Abstract

A large number of texts are posted daily on social media. However, only a small portion of these texts is informative for a specific purpose. For example, to collect a set of tweets for a marketing strategy, we need to collect a large number of tweets related to a specific topic with high accuracy. If we filter the texts accurately, we can continuously obtain fresh and useful information in real time. In a keyword-based approach, filters are constructed using keywords, but selecting appropriate keywords is often difficult. In this work, we propose a method for filtering texts that are related to specific topics using a classification method based on crowdsourcing and machine learning. In our approach, we construct a text classifier using fastText and then use crowdsourcing to annotate whether the tweets are related to the topics. Constructing an accurate classifier normally requires a large amount of training data, and preparing this data is costly and time-consuming. To construct an accurate classifier from a small amount of training data, we consider two strategies for selecting the tweets that the crowdsourcing participants should assess: an optimistic approach and a pessimistic approach. We then reconstruct the text classifier using the annotated texts and classify the tweets again. As this loop is repeated, the accuracy of the classifier improves, and we obtain useful information without having to specify keywords. Experimental results demonstrate that our proposed system is adequate for filtering social media streams. Moreover, we found that the pessimistic approach performs better than the optimistic approach.
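The classify, annotate, and retrain loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: a tiny bag-of-words Naive Bayes model stands in for fastText so that the sketch needs no external libraries, the `ask_crowd` callback is a hypothetical stand-in for the crowdsourcing step, and, because the abstract does not define the two selection strategies, "optimistic" is assumed to mean selecting the tweets the classifier scores as most likely related, while "pessimistic" is assumed to mean selecting the tweets the classifier is least certain about.

```python
import math
from collections import Counter


def train(labeled):
    """Train a tiny bag-of-words Naive Bayes model (a stand-in for fastText)."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, related in labeled:
        counts[related].update(text.lower().split())
        docs[related] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab


def prob_related(model, text):
    """Estimate the probability that a text is related to the topic."""
    counts, docs, vocab = model
    total_docs = docs[True] + docs[False]
    log_score = {}
    for label in (True, False):
        total_words = sum(counts[label].values())
        # Laplace smoothing on the class prior and on the word likelihoods.
        score = math.log((docs[label] + 1) / (total_docs + 2))
        for word in text.lower().split():
            score += math.log(
                (counts[label][word] + 1) / (total_words + len(vocab) + 1)
            )
        log_score[label] = score
    # Turn the two log scores into a probability for the "related" class.
    return 1.0 / (1.0 + math.exp(log_score[False] - log_score[True]))


def select_for_annotation(model, pool, k, strategy):
    """Pick k unlabeled texts to send to the crowd workers."""
    scored = [(prob_related(model, text), text) for text in pool]
    if strategy == "optimistic":
        # ASSUMPTION: texts the classifier scores as most likely related.
        scored.sort(key=lambda pair: -pair[0])
    elif strategy == "pessimistic":
        # ASSUMPTION: texts the classifier is least certain about.
        scored.sort(key=lambda pair: abs(pair[0] - 0.5))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [text for _, text in scored[:k]]


def run_loop(seed, pool, ask_crowd, rounds, k, strategy):
    """Repeat: classify, select, annotate via the crowd, retrain."""
    labeled = list(seed)
    pool = list(pool)
    model = train(labeled)
    for _ in range(rounds):
        if not pool:
            break
        for text in select_for_annotation(model, pool, k, strategy):
            labeled.append((text, ask_crowd(text)))  # hypothetical crowd call
            pool.remove(text)
        model = train(labeled)  # rebuild the classifier with the new labels
    return model, labeled
```

A call such as `run_loop(seed, pool, ask_crowd, rounds=5, k=10, strategy="pessimistic")` would return the retrained model together with every annotated text. In a real system, `train` and `prob_related` would be replaced by the fastText classifier and `ask_crowd` by a crowdsourcing task.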
