IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Special Section on Intelligent Information and Communication Technology and its Applications to Creative Activity Support
Grid-Based Parallel Algorithms of Join Queries for Analyzing Multi-Dimensional Data on MapReduce
Miyoung JANG, Jae-Woo CHANG
JOURNAL FREE ACCESS

2018 Volume E101.D Issue 4 Pages 964-976

Abstract

Join processing of large-scale datasets in MapReduce environments has recently become an important issue. However, existing MapReduce-based join algorithms incur excessive overhead for constructing and updating the data index. Moreover, their similarity computation cost is high because they partition data without considering the data distribution. In this paper, we propose two grid-based join algorithms for MapReduce. First, we propose a similarity join algorithm that evenly distributes join candidates using a dynamic grid index, which partitions data by considering both the data density and the similarity threshold. We use a bottom-up approach that merges initial grid cells into partitions and assigns them to MapReduce jobs. Second, we propose a k-NN join query processing algorithm for MapReduce. To reduce the data transmission cost, we determine an optimal grid cell size by considering the data distribution of randomly selected samples. Then, we perform the k-NN join by assigning only the related join data to each reducer. Through performance analysis, we show that our similarity join algorithm and our k-NN join algorithm outperform existing algorithms by up to 10 times in terms of query processing time.
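
To make the general idea of a grid-based similarity join concrete, the following is a minimal single-process sketch in Python. It is not the authors' algorithm: it assumes a fixed uniform grid whose cell side equals the similarity threshold, whereas the paper describes a dynamic grid index that merges cells according to data density. The function name `grid_similarity_join` and the parameter `eps` are hypothetical, and the "map" and "reduce" steps are only simulated with a dictionary rather than run on a MapReduce cluster.

```python
from collections import defaultdict
from itertools import combinations, product
from math import dist, floor


def grid_similarity_join(points, eps):
    """Return sorted index pairs (i, j), i < j, whose Euclidean distance is <= eps.

    Assumption: a fixed uniform grid with cell side eps (not a dynamic grid).
    """
    # Simulated "map" step: key every point by the grid cell that contains it.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(floor(c / eps) for c in p)
        cells[key].append(idx)

    dims = len(points[0]) if points else 0
    zero = (0,) * dims
    # Keep only the "forward" half of the neighbor offsets so that each pair
    # of adjacent cells is compared exactly once.
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if o > zero]

    result = set()
    # Simulated "reduce" step: compare points within a cell, then against
    # points in forward-adjacent cells. Points farther apart than one cell
    # in any dimension cannot be within eps, so they are never compared.
    for key, members in cells.items():
        for i, j in combinations(members, 2):
            if dist(points[i], points[j]) <= eps:
                result.add((min(i, j), max(i, j)))
        for off in offsets:
            neighbor = tuple(k + o for k, o in zip(key, off))
            for i in cells.get(neighbor, ()):
                for j in members:
                    if dist(points[i], points[j]) <= eps:
                        result.add((min(i, j), max(i, j)))
    return sorted(result)
```

Calling `grid_similarity_join(points, 1.0)` on a list of coordinate tuples returns the qualifying index pairs. Because the cell side equals `eps`, two points within the threshold always fall in the same cell or in adjacent cells, which is what allows the pruning. A density-aware grid such as the one described in the abstract would additionally balance the number of candidates assigned to each partition.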

© 2018 The Institute of Electronics, Information and Communication Engineers