IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Regular Section
ParaLite: A Parallel Database System for Data-Intensive Workflows
Ting CHENKenjiro TAURA
著者情報
ジャーナル フリー

2014 年 E97.D 巻 5 号 p. 1211-1224

詳細
抄録
To better support data-intensive workflows which are typically built out of various independently developed executables, this paper proposes extensions to parallel database systems called User-Defined eXecutables (UDX) and collective queries. UDX facilitates the description of workflows by enabling seamless integrations of external executables into SQL statements without any efforts to write programs confirming to strict specifications of databases. A collective query is an SQL query whose results are distributed to multiple clients and then processed by them in parallel, using arbitrary UDX. It provides efficient parallelization of executables through the data transfer optimization algorithms that distribute query results to multiple clients, taking both communication cost and computational loads into account. We implement this concept in a system called ParaLite, a parallel database system based on a popular lightweight database SQLite. Our experiments show that ParaLite has several times higher performance over Hive for typical SQL tasks and has 10x speedup compared to a commercial DBMS for executables. In addition, this paper studies a real-world text processing workflow and builds it on top of ParaLite, Hadoop, Hive and general files. Our experiences indicate that ParaLite outperforms other systems in both productivity and performance for the workflow.
著者関連情報
© 2014 The Institute of Electronics, Information and Communication Engineers
前の記事 次の記事
feedback
Top