IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Regular Section
ParaLite: A Parallel Database System for Data-Intensive Workflows
Ting CHEN, Kenjiro TAURA

2014 Volume E97.D Issue 5 Pages 1211-1224

Abstract

To better support data-intensive workflows, which are typically built out of various independently developed executables, this paper proposes two extensions to parallel database systems: User-Defined eXecutables (UDX) and collective queries. UDX facilitates the description of workflows by enabling seamless integration of external executables into SQL statements, without any effort to write programs conforming to the strict specifications of databases. A collective query is an SQL query whose results are distributed to multiple clients and then processed by them in parallel, using arbitrary UDX. It provides efficient parallelization of executables through data transfer optimization algorithms that distribute query results to multiple clients, taking both communication cost and computational load into account. We implement these concepts in ParaLite, a parallel database system based on the popular lightweight database SQLite. Our experiments show that ParaLite achieves several times higher performance than Hive for typical SQL tasks and a 10x speedup over a commercial DBMS for executables. In addition, this paper studies a real-world text processing workflow and builds it on top of ParaLite, Hadoop, Hive, and general files. Our experience indicates that ParaLite outperforms the other systems in both productivity and performance for this workflow.

© 2014 The Institute of Electronics, Information and Communication Engineers