2021 Volume 29 Pages 157-165
The quantity of data available for analysis, including data collected by sensors and wearable devices, has been increasing enormously. However, to obtain accurate analysis results, data pre-processing, such as outlier detection, handling of missing data, and preparing data recorded by different measuring instruments in different units, is essential. Considering that pre-processing consumes 80% of analyst resources, we previously proposed a method to address this problem. The method integrates machine learning based on Bayesian inference with human knowledge by using a programming-by-example approach. However, when the model is generated and updated at different sites, the previous method is problematic in two ways: 1) all sites have to use the same features that were defined when the model was generated, and 2) there is no helpful process for generating new training data from features, without using inference data, when the model is updated. This prompted us to propose APREP-S, which has flexible feature processes and a process for updating the model using a clustering method. We evaluate the accuracy of the imputation and the similarity of the trends by comparing the results of APREP-S with the original data and with those of other existing methods. The results show that APREP-S returns the optimal methods in terms of both accuracy and similarity.