2020, Volume 28, pp. 724-732
Handling large-scale data in text formats such as XML, JSON, and CSV is important because these formats appear very often in data exchange. For such data, ad hoc data extraction, rather than ingestion into a database, is highly desirable. The main challenge in ad hoc data extraction is to provide both programmability, so that various types of data can be handled intuitively, and performance sufficient for large-scale data. To address this challenge, we developed CENTAURUS, a dynamic parser generator library for parallel ad hoc data extraction. This paper presents the design and implementation of CENTAURUS. Experimental results on ad hoc data extraction show that CENTAURUS outperformed fast dedicated C++ parser libraries for XML and JSON and achieved excellent scalability with actions implemented in Python.