IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Volume E99.D, Issue 12
Displaying 1-38 of 38 articles from this issue
Special Section on Parallel and Distributed Computing and Networking
• Yasuhiko Nakashima
2016 Volume E99.D Issue 12 Pages 2858-2859
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS
• Keisuke MASHITA, Maya TABUCHI, Ryohei YAMADA, Tomoaki TSUMURA
Article type: PAPER
Subject area: Architecture
2016 Volume E99.D Issue 12 Pages 2860-2870
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Lock-based thread synchronization techniques have been commonly used in parallel programming on multi-core processors. However, locks can cause deadlocks and poor scalability, so Transactional Memory (TM) has been proposed and studied for lock-free synchronization. With TM, transactions are executed speculatively in parallel as long as they do not encounter any conflicts on shared variables. On general HTMs (hardware implementations of TM), transactions that have conflicted with each other once will conflict repeatedly if they are executed in parallel again, and the performance of the HTM declines. To address this problem, in this paper we propose a conflict prediction technique that avoids conflicts before transactions are executed, based on historical conflict data. Experimental results show that the execution time of HTM is reduced by up to 59.2%, and by 16.8% on average, with 16 threads.
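The history-based prediction idea can be sketched as follows. This is our own minimal illustration, not the paper's hardware design: a table records pairs of transactions that have conflicted before, and a transaction is predicted to conflict (and can be deferred) if any currently running transaction appears with it in the table.

```python
# Hypothetical sketch of history-based conflict prediction for an HTM
# scheduler. The class name and policy are illustrative assumptions:
# transaction pairs that conflicted once are predicted to conflict again.

class ConflictPredictor:
    def __init__(self):
        self.history = set()  # unordered pairs of transaction IDs that conflicted

    def record_conflict(self, tx_a, tx_b):
        self.history.add(frozenset((tx_a, tx_b)))

    def predicts_conflict(self, tx, running):
        """Predict a conflict if tx previously conflicted with any running transaction."""
        return any(frozenset((tx, r)) in self.history for r in running)

pred = ConflictPredictor()
pred.record_conflict("T1", "T2")
assert pred.predicts_conflict("T1", {"T2", "T3"})  # seen before -> defer
assert not pred.predicts_conflict("T3", {"T2"})    # no history -> run in parallel
```

A real HTM would implement this table in hardware and age out stale entries; the sketch only shows the lookup that happens before a transaction starts.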

• Hiroshi NAKAHARA, Tomoya OZAKI, Hiroki MATSUTANI, Michihiro KOIBUCHI, ...
Article type: PAPER
Subject area: Architecture
2016 Volume E99.D Issue 12 Pages 2871-2880
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

The recent increase in non-recurring engineering costs (design, mask, and test costs) has made large System-on-Chip (SoC) designs difficult to develop, especially with advanced technology. We radically explore an approach to cheap and flexible chip stacking using the inductive-coupling ThruChip Interface (TCI). In order to connect a large number of small chips to build a large-scale system, novel chip stacking methods called linear stacking and staggered stacking are proposed. They enable the system to be extended in the x and/or y dimensions, not only the z dimension. Here, a novel chip stacking layout and its deadlock-free routing design for the cases of single-core chips and multi-core chips are shown. A 256-node network formed by the proposed stacking improves latency by 13.8% and the performance of the NAS Parallel Benchmarks by 5.4% on average compared with a 2D mesh.

• Yuan HE, Masaaki KONDO, Takashi NAKADA, Hiroshi SASAKI, Shinobu MIWA, ...
Article type: PAPER
Subject area: Architecture
2016 Volume E99.D Issue 12 Pages 2881-2890
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Networks-on-Chip (NoCs) play an important role in modern and future multi-core processors, as they strongly affect both the performance and power consumption of the entire chip. To date, many optimization techniques have been developed to improve NoC bandwidth, latency, and power consumption. However, a clear answer to how these techniques affect energy efficiency is yet to be found, since each comes with its own benefits and overheads, and there are many of them. Thus the question arises of when and how such optimization techniques should be applied. To solve this problem, we build a runtime framework that throttles these optimization techniques based on concise performance and energy models. With the help of this framework, we can successfully make adaptive selections among multiple optimization techniques to further improve the performance or energy efficiency of the network at runtime.

• MinSeong CHOI, Takashi FUKUDA, Masahiro GOSHIMA, Shuichi SAKAI
Article type: PAPER
Subject area: Architecture
2016 Volume E99.D Issue 12 Pages 2891-2900
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

The time taken for processor simulation can be drastically reduced by selecting simulation points, which are dynamic sections obtained from the simulation results of processors. The overall behavior of a program can be estimated by simulating only these sections. Existing methods for selecting simulation points, such as SimPoint, are deductive and based on the idea that dynamic sections executing the same static section of the program belong to the same phase. However, there are counterexamples to this idea. This paper proposes an inductive method, which selects simulation points from the results obtained by pre-simulating several processors with distinctive microarchitectures, based on the assumption that sections in which all the distinctive processors have similar instructions-per-cycle (IPC) values belong to the same phase. We evaluated the first 100G instructions of the SPEC 2006 programs. Our method achieved an IPC estimation error of approximately 0.1% by simulating approximately 0.05% of the 100G instructions.
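The inductive grouping idea can be illustrated with a toy sketch. This is our own simplification, not the paper's clustering procedure: each interval is described by a vector of IPC values measured on several distinctive pre-simulated processors, intervals with close vectors are treated as one phase, and only one representative per phase needs full simulation.

```python
# Illustrative sketch (our own, with an assumed tolerance parameter) of
# phase grouping by IPC vectors across distinctive processors.

def group_phases(ipc_vectors, tol=0.05):
    """Greedy grouping: an interval joins a phase whose representative's
    IPC vector differs by less than tol in every component."""
    phases = []  # list of (representative_index, member_indices)
    for j, v in enumerate(ipc_vectors):
        for rep, members in phases:
            if all(abs(a - b) < tol for a, b in zip(ipc_vectors[rep], v)):
                members.append(j)
                break
        else:
            phases.append((j, [j]))
    return phases

# Intervals 0 and 2 behave alike on both processors -> one phase.
ipc = [(1.20, 0.80), (0.40, 0.30), (1.22, 0.81)]
phases = group_phases(ipc)
assert len(phases) == 2
assert phases[0][1] == [0, 2]
```

Weighting each representative's IPC by its phase size then yields the whole-program IPC estimate.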

• Tatsuya KAWAMOTO, Xin ZHOU, Jacir L. BORDIM, Yasuaki ITO, Koji NAKANO
Article type: PAPER
Subject area: Architecture
2016 Volume E99.D Issue 12 Pages 2901-2910
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Algorithms requiring fast manipulation of multiple-length numbers are usually implemented in hardware. However, hardware implementation, for instance using an HDL (Hardware Description Language), is a laborious task, and the quality of the solution relies heavily on designer expertise. The main contribution of this work is to present a flexible-length-arithmetic processor based on the FDFM (Few DSP slices and Few Memory blocks) approach that supports arithmetic operations on multiple-length numbers using an FPGA (Field Programmable Gate Array). The proposed processor has been implemented on the Xilinx Virtex-6 FPGA. Arithmetic instructions of the proposed processor architecture include addition, subtraction, and multiplication of integers exceeding 64 bits. To reduce the burden of implementing algorithms directly on the FPGA, applications requiring multiple-length arithmetic operations are written in a C-like language and translated into a machine program. The machine program is then transferred to and executed on the proposed architecture. A 2048-bit RSA encryption/decryption implementation has been used to assess the effectiveness of the proposed approach. Experimental results show that a 2048-bit RSA encryption on the proposed architecture takes only 2.2 times longer than a direct FPGA implementation. Furthermore, by employing multiple FDFM cores for the same task, the computing time is reduced considerably.
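The kind of operation such a processor provides can be sketched in software: a multiple-length number is a little-endian array of fixed-width words (limbs), and addition propagates a carry across limbs. The limb width and helper names below are our own illustration, not the FDFM instruction set.

```python
# Software sketch of multiple-length addition on 64-bit limbs
# (illustrative only; the FDFM processor does this in hardware).

BITS = 64
MASK = (1 << BITS) - 1

def add_ml(a, b):
    """Add two equal-length little-endian limb arrays with carry propagation."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & MASK)
        carry = s >> BITS
    out.append(carry)
    return out

def to_limbs(n, count):
    return [(n >> (BITS * i)) & MASK for i in range(count)]

def from_limbs(limbs):
    return sum(x << (BITS * i) for i, x in enumerate(limbs))

a, b = (1 << 100) + 7, (1 << 100) + 11
assert from_limbs(add_ml(to_limbs(a, 2), to_limbs(b, 2))) == a + b
```

Subtraction and schoolbook multiplication follow the same limb-by-limb pattern with borrow and partial-product accumulation.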

• Takashi YOKOTA, Kanemitsu OOTSU, Takeshi OHKAWA
Article type: PAPER
Subject area: Interconnection network
2016 Volume E99.D Issue 12 Pages 2911-2922
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

State-of-the-art parallel computers, which are growing in parallelism, place heavy demands on their interconnection networks. Although a wide spectrum of research and development efforts on effective and practical interconnection networks has been reported, the problem is still open. One of the largest issues is congestion control, which aims to maximize network performance in terms of throughput and latency. Throttling, or injection limitation, is one of the central ideas in congestion control. We have proposed a new class of throttling method, Entropy Throttling, founded on an entropy concept for packets. The throttling method is partly successful; however, its potential has not been sufficiently explored. This paper aims at exploring the capabilities of the Entropy Throttling method via comprehensive evaluation. The major contributions of this paper are to introduce the two ideas of a hysteresis function and a guard time, and to clarify wide-ranging performance characteristics in steady and unsteady communication situations. By introducing these new ideas, we extend the Entropy Throttling method. The extended methods improve communication performance by up to 3.17 times in the best case and 1.47 times on average compared with non-throttling cases in collective communication, while sustaining steady communication performance.
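The entropy idea can be made concrete with a small sketch. This is our own minimal interpretation, not the paper's exact metric: compute the entropy of the distribution of buffered packets across routers, and throttle injection when entropy drops, i.e., when packets concentrate in a few congested nodes.

```python
# Illustrative entropy-based throttling trigger (threshold is an assumption).
from math import log2

def packet_entropy(counts):
    """Shannon entropy of the packet distribution over routers."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * log2(p) for p in probs)

def should_throttle(counts, threshold=1.5):
    return packet_entropy(counts) < threshold

uniform = [4, 4, 4, 4]     # packets spread evenly: high entropy
congested = [13, 1, 1, 1]  # packets piling up at one router: low entropy
assert packet_entropy(uniform) == 2.0
assert packet_entropy(congested) < packet_entropy(uniform)
assert should_throttle(congested) and not should_throttle(uniform)
```

The hysteresis function and guard time introduced in the paper would then smooth this on/off decision over time rather than switching on a raw threshold.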

• Ryo HAMAMOTO, Chisa TAKANO, Hiroyasu OBATA, Kenji ISHIDA
Article type: PAPER
Subject area: Wireless system
2016 Volume E99.D Issue 12 Pages 2923-2933
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Wireless Local Area Networks (WLANs) based on the IEEE 802.11 standard are increasingly used. Access Points (APs) are being established in various public places, such as railway stations and airports, as well as in private residences, and public WLAN services continue to grow. Throughput prediction for an AP in a multi-rate environment, i.e., predicting the amount of data received at an AP (including retransmitted packets), is an important issue for wireless network design, and also for solving AP placement and selection problems. To realize such prediction, we previously proposed an AP throughput prediction method that considers terminal distribution. We compared the predicted throughput of the proposed method with that of a method using linear-order computation and confirmed its performance, not with a network simulator but by numerical computation. However, it is necessary to consider the impact of CSMA/CA in the MAC layer, because throughput is greatly influenced by frame collisions. In this paper, we derive an effective transmission rate that accounts for CSMA/CA and frame collisions. We then compare the throughput obtained with the network simulator NS2 against the value predicted by the proposed method. Simulation results show that the maximum relative error of the proposed method is approximately 6% for UDP and 15% for TCP, compared with approximately 17% and 21%, respectively, for the existing method.

Article type: PAPER
Subject area: Sensor network
2016 Volume E99.D Issue 12 Pages 2934-2942
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

This paper provides a mobile-agent-based distributed variational Bayesian (MABDVB) algorithm for density estimation in sensor networks. It is assumed that the sensor measurements can be statistically modeled by a common Gaussian mixture model. In the proposed algorithm, mobile agents move along routes through the network and compute local sufficient statistics from local measurements. Afterwards, the global sufficient statistics are updated using these local sufficient statistics, and this procedure is repeated until convergence is reached. The parameters of the density function are then approximated from the global sufficient statistics. Convergence of the proposed method is also studied analytically, and it is shown that the estimated parameters eventually converge to their true values. Finally, the proposed algorithm is applied to one- and two-dimensional data sets to show its promising performance.
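The sufficient-statistics bookkeeping behind this scheme can be sketched as follows. This is our own minimal illustration (a single Gaussian component rather than a mixture, and plain maximum-likelihood moments rather than the variational update): each node summarizes its data as sufficient statistics, an agent accumulates them along its route, and the aggregated estimate matches the centralized one.

```python
# Illustrative sketch of agent-aggregated sufficient statistics.
# One-component, non-variational simplification of the MABDVB idea.

def local_stats(xs):
    """Zeroth-, first-, and second-order sufficient statistics of one node's data."""
    return (len(xs), sum(xs), sum(x * x for x in xs))

def merge(s1, s2):
    """What the mobile agent does as it visits nodes: add statistics."""
    return tuple(a + b for a, b in zip(s1, s2))

def mean_var(stats):
    n, s, ss = stats
    mean = s / n
    return mean, ss / n - mean * mean

node_a, node_b = [1.0, 2.0, 3.0], [4.0, 5.0]
agent = merge(local_stats(node_a), local_stats(node_b))
assert mean_var(agent) == mean_var(local_stats(node_a + node_b))
```

The key property is that the statistics are additive, so the agent never needs to carry raw measurements between nodes.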

• Yuichi NAKAMURA, Akira MORIGUCHI, Masanori IRIE, Taizo KINOSHITA, Tosh ...
Article type: PAPER
Subject area: Sensor network
2016 Volume E99.D Issue 12 Pages 2943-2955
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

To reduce the server load and communication costs of machine-to-machine (M2M) systems, sensor data are aggregated in M2M gateways. Aggregation logic is typically programmed in the C language and embedded into the firmware. However, developing aggregation programs is difficult for M2M service providers because it requires gateway-specific knowledge and consideration of resource issues, especially RAM usage. In addition, modifying the aggregation logic requires firmware updates, which are risky. We propose a rule-based sensor data aggregation system, called the complex sensor data aggregator (CSDA), for M2M gateways. The functions comprising the data aggregation process are subdivided into the categories of filtering, statistical calculation, and concatenation. The proposed CSDA supports this aggregation process in three steps: input, periodic data processing, and output. The behavior of each step is configured by an XML-based rule. The rule is stored in the data area of flash ROM and is updatable over the Internet without a firmware update. In addition, to stay within the memory limit specified by the M2M gateway's manufacturer, the number of threads and the size of the working memory are static after startup, and the size of the working memory can be adjusted by configuring the sampling setting of a buffer for sensor data input. The proposed system is evaluated in an M2M gateway experimental environment. Results show that developing CSDA configurations is much easier than programming in C, with configuration size decreasing by 10%. In addition, the performance evaluation demonstrates the proposed system's ability to operate on M2M gateways.

• Tatsuyuki MATSUSHITA, Shinji YAMANAKA, Fangming ZHAO
Article type: PAPER
Subject area: Distributed system
2016 Volume E99.D Issue 12 Pages 2956-2967
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Peer-to-peer (P2P) networks have attracted increasing attention for the distribution of large-volume and frequently accessed content. In this paper, we mainly consider the problem of key leakage in secure P2P content distribution. In secure content distribution, content is encrypted so that only legitimate users can access it. Usually, users (peers) cannot be fully trusted in a P2P network, because malicious ones might leak their decryption keys. If decryption keys are redistributed, copyright holders may incur great losses caused by free riders who access content without purchasing it. To decrease the damage caused by key leakage, the individualization of encrypted content is necessary: individualization means that a different (set of) decryption key(s) is required for each user to access the content. In this paper, we propose a P2P content distribution scheme resilient to key leakage that achieves this individualization of encrypted content. We show the feasibility of our scheme by conducting a large-scale P2P experiment in a real network.

• Nannan QIAO, Jiali YOU, Yiqiang SHENG, Jinlin WANG, Haojiang DENG
Article type: PAPER
Subject area: Distributed system
2016 Volume E99.D Issue 12 Pages 2968-2977
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

In this paper, a discrete particle swarm optimization (PSO) method is proposed to solve the multi-objective task assignment problem in a distributed environment. The optimization objectives are the makespan of task execution and the budget incurred by resource occupation. A two-stage approach is designed as follows. In the first stage, several artificial particles are added to the initialized swarm to guide the search direction. In the second stage, we redefine the operators of the discrete PSO to implement addition, subtraction, and multiplication. In addition, a fuzzy-cost-based elite selection is used to improve computational efficiency. Evaluation shows that the proposed algorithm achieves Pareto improvements over state-of-the-art algorithms.
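Redefining PSO operators for a discrete assignment space can be sketched as follows. These operator definitions are our own illustration (the paper's exact definitions may differ): a position is a task-to-node assignment, subtraction yields the list of differing entries (a "velocity"), scalar multiplication keeps a random fraction of them, and addition applies them to a position.

```python
# Illustrative discrete PSO operators for task assignment (assumed forms).
import random

def subtract(best, pos):
    """Velocity: the per-task edits needed to move pos toward best."""
    return [(i, b) for i, (b, p) in enumerate(zip(best, pos)) if b != p]

def multiply(coef, velocity, rng):
    """Keep each edit with probability coef."""
    return [e for e in velocity if rng.random() < coef]

def add(pos, velocity):
    """Apply the surviving edits to the position."""
    new = list(pos)
    for i, v in velocity:
        new[i] = v
    return new

rng = random.Random(0)
pos, best = [0, 1, 2, 0], [0, 2, 2, 1]   # task i -> node assignment
moved = add(pos, multiply(0.9, subtract(best, pos), rng))
# Every entry of the new position comes from either pos or best.
assert all(m in (p, b) for m, p, b in zip(moved, pos, best))
```

A full algorithm would combine velocities toward the personal best and the global best and then apply the elite selection the abstract mentions.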

• Yunkai DU, Naijie GU, Xin ZHOU
Article type: PAPER
Subject area: Distributed system
2016 Volume E99.D Issue 12 Pages 2978-2985
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

The Petri Net (PN) is a frequently used model for deadlock detection. Among the various detection methods for PNs, reachability analysis is the most accurate, since it never produces false positives or false negatives. Although it suffers from the well-known state space explosion problem, reachability analysis is appropriate for small- and medium-scale programs. To mitigate the explosion problem, several techniques have been proposed to accelerate reachability analysis, such as net reduction and abstraction. However, these techniques target general PNs and do not take the particularities of the application into consideration, so their optimization potential is not fully developed. In this paper, the features of mutual-exclusion-based programs are considered, and several strategies are proposed to accelerate reachability analysis: a customized net reduction rule reduces the scale of the PN, while two marking compression methods and two pruning methods reduce the volume of the reachability graph. Reachability analysis on a PN can report only one deadlock per path; moreover, the reported deadlock may be a false alarm behind which real deadlocks are hidden. To improve detection efficiency, we propose a deadlock recovery algorithm so that more deadlocks can be detected in a shorter time. To validate these methods, a prototype was implemented and applied to the SPLASH-2 benchmarks. The experimental results show that these methods significantly accelerate reachability analysis for mutual-exclusion-based deadlock detection.

• Shunji FUNASAKA, Koji NAKANO, Yasuaki ITO
Article type: PAPER
Subject area: GPU computing
2016 Volume E99.D Issue 12 Pages 2986-2994
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

The main contribution of this paper is to present a work-optimal parallel algorithm for LZW decompression and to implement it on a CUDA-enabled GPU. Since sequential LZW decompression creates a dictionary table by reading the codes in a compressed file one by one, it is not easy to parallelize. We first present a work-optimal parallel LZW decompression algorithm on the CREW-PRAM (Concurrent-Read Exclusive-Write Parallel Random Access Machine), a standard theoretical parallel computing model with shared memory. We then present an efficient implementation of this parallel algorithm on a GPU. The experimental results show that our GPU implementation performs LZW decompression in 1.15 milliseconds for a grayscale TIFF image with 4096×3072 pixels stored in the global memory of a GeForce GTX 980. On the other hand, sequential LZW decompression of the same image stored in the main memory of an Intel Core i7 CPU takes 50.1 milliseconds. Thus, for this image, our parallel LZW decompression in GPU global memory is 43.6 times faster than sequential LZW decompression in CPU main memory. To show the applicability of our GPU implementation, we evaluated the SSD-to-GPU data loading time for three scenarios. The experimental results show that the scenario using our LZW decompression on the GPU is faster than the others.
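The sequential dependency the abstract refers to is visible in the textbook LZW decoder below (this is the standard sequential algorithm, not the paper's parallel one): each code read may reference a dictionary entry created only by the immediately preceding step, which is exactly what naive parallelization breaks.

```python
# Standard sequential LZW decoder (textbook algorithm, shown to illustrate
# the dictionary-building dependency that makes parallelization hard).

def lzw_decompress(codes, alphabet):
    table = {i: [s] for i, s in enumerate(alphabet)}
    prev = table[codes[0]]
    out = list(prev)
    for code in codes[1:]:
        # The tricky case: the code being read may be the entry that is
        # only now being added to the table (code == len(table)).
        entry = table[code] if code in table else prev + [prev[0]]
        out.extend(entry)
        table[len(table)] = prev + [entry[0]]
        prev = entry
    return "".join(out)

# "ABABABA" compresses to codes 0,1,2,4 over the alphabet "AB".
assert lzw_decompress([0, 1, 2, 4], "AB") == "ABABABA"
```

The paper's CREW-PRAM algorithm removes this step-by-step dependency by computing dictionary entries in parallel; the sketch above is only the baseline being parallelized.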

• Lucas Saad Nogueira NUNES, Jacir Luiz BORDIM, Yasuaki ITO, Koji NAKANO
Article type: PAPER
Subject area: GPU computing
2016 Volume E99.D Issue 12 Pages 2995-3003
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

The closeness of a match is an important measure with a number of practical applications, including computational biology, signal processing, and text retrieval. The approximate string matching (ASM) problem asks for the substring of a string Y of length n that is most similar to a string X of length m. It is well known that ASM can be solved by a dynamic programming technique that computes a table of size m×n. The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU. The proposed GPU implementation relies on warp shuffle instructions, which accelerate communication between threads without resorting to shared memory accesses. Although O(mn) memory accesses are necessary to reach all elements of an m×n table, the proposed implementation performs only $O(\frac{mn}{w})$ memory access operations, where w is the warp size. Experimental results on a GeForce GTX 980 GPU show that the proposed implementation, called w-SCAN, provides a speed-up of over two-fold in computing the ASM compared with another prominent alternative.
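The m×n dynamic programming table mentioned above is the standard ASM formulation, sketched below as a sequential baseline (not the warp-shuffle GPU kernel itself): row 0 is all zeros so a match may begin anywhere in Y, and the answer is the minimum of the last row over all end positions.

```python
# Textbook DP for approximate string matching: best edit distance of X
# against any substring of Y (free start and end positions in Y).

def asm_distance(x, y):
    m, n = len(x), len(y)
    # d[i][j] = best edit distance of x[:i] vs any substring of y ending at j.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # substitute or match
                          d[i - 1][j] + 1,         # delete from x
                          d[i][j - 1] + 1)         # insert into x
    return min(d[m])  # best over all end positions in y

assert asm_distance("abc", "xxabcxx") == 0  # exact occurrence
assert asm_distance("abc", "xxaxcxx") == 1  # one substitution away
```

The GPU version computes the same recurrence but passes cells between threads with warp shuffles, which is where the O(mn/w) memory-access bound comes from.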

• Takumi HONDA, Yasuaki ITO, Koji NAKANO
Article type: PAPER
Subject area: GPU computing
2016 Volume E99.D Issue 12 Pages 3004-3012
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique: we assign each multiple-length multiplication to one warp of 32 threads. In parallel processing with multiple threads, it is usually costly to synchronize thread execution and to communicate between threads. With the warp-synchronous programming technique, however, the execution of the threads in a warp can be synchronized instruction by instruction, without any barrier synchronization operations, and inter-thread communication can be performed by warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on an NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over a sequential CPU implementation. Moreover, using this 1024-bit multiplication as a subroutine for larger bit sizes, the GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.

• Atsushi OHTA, Ryota KAWASHIMA, Hiroshi MATSUO
Article type: PAPER
Subject area: Database system
2016 Volume E99.D Issue 12 Pages 3013-3023
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Many distributed systems use a replication mechanism for reliability and availability, while application developers have to consider the minimum consistency requirement of each application. Therefore, a replication protocol that supports multiple consistency models is required. Multi-Consistency Data Replication (McRep) is a proxy-based replication protocol that supports multiple consistency models. However, McRep has a potential problem: the replicator, which relays all request and reply messages between clients and replicas, can become a performance bottleneck and a Single Point of Failure (SPoF). In this paper, we bring McRep's multi-consistency support mechanism to a combined state-machine and deferred-update replication protocol to eliminate the performance bottleneck and SPoF. The state-machine and deferred-update protocols are well-established approaches for fault-tolerant data management systems, but each can ensure only a specific consistency model. Thus, we adaptively select a replication method from these two replication bases. In our protocol, the functionality of McRep's replicator is taken over by the clients and replicas: each replica has new roles in serializing all transactions and managing all views of the database, and each client has a new role in managing the status of its transactions. We have implemented and evaluated the proposed protocol and compared it to McRep. The evaluation results show that the proposed protocol achieves transaction throughput comparable to McRep; in particular, it improves throughput by up to 16% for a read-heavy workload in One-Copy. Finally, we demonstrated the proposed failover mechanism: unlike in McRep, the failure of a leader replica did not affect the continuity of the entire replication system.

• Soramichi AKIYAMA, Takahiro HIROFUCHI, Ryousei TAKANO, Shinichi HONIDE ...
Article type: PAPER
Subject area: Operating system
2016 Volume E99.D Issue 12 Pages 3024-3034
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Live migration plays an important role in improving the efficiency of cloud data centers by enabling virtual machines (VMs) to be dynamically re-placed without disrupting the services running on them. Although many studies have proposed acceleration mechanisms for live migration, IO-intensive VMs still suffer from long total migration times due to their large amount of page cache. Existing approaches to this problem either force the guest OS to delete the page cache before a migration or fail to consider the dynamic characteristics of cloud data centers. We propose a parallel and adaptive transfer of the page cache for migrating IO-intensive VMs that (1) does not delete the page cache yet remains fast by utilizing the data center's storage area network, and (2) achieves the shortest total migration time without tuning hand-crafted parameters. Experiments showed that our method reduces the total migration time of IO-intensive VMs by up to 33.9%.

Regular Section
• Lixin WANG, Yutong LU, Wei ZHANG, Yan LEI
Article type: PAPER
Subject area: Fundamentals of Information Systems
2016 Volume E99.D Issue 12 Pages 3035-3046
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

File system workloads are increasingly write-heavy. The growing capacity of RAM in modern nodes allows many reads to be satisfied from memory, while writes must be persisted to disk. Today's sophisticated local file systems, such as Ext4, XFS, and Btrfs, optimize for reads but suffer under workloads dominated by microdata (metadata and tiny files). In this paper we present an LSM-tree-based file system, RFS, which aims to exploit the write optimization of the LSM-tree to provide enhanced microdata performance while offering matching performance for large files. RFS incrementally partitions the namespace into several metadata columns on a per-directory basis, preserving disk locality for directories and reducing the write amplification of LSM-trees. A write-ordered log-structured layout is used to store small files efficiently, rather than embedding their contents into inodes. We also propose an optimization of global bloom filters for efficient point lookups. Experiments show that our library version of RFS can handle microwrite-intensive workloads 2-10 times faster than existing solutions such as Ext4, Btrfs, and XFS.
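The role of a bloom filter in point lookups can be sketched minimally. The parameters and hashing scheme below are our own illustration, not RFS's design: before searching on-disk LSM-tree components for a key, a small in-memory filter rules most components out with no I/O, at the cost of occasional false positives but never false negatives.

```python
# Minimal Bloom filter sketch (illustrative parameters, not RFS internals).
import hashlib

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add("/home/user/notes.txt")
assert bf.might_contain("/home/user/notes.txt")  # no false negatives
# Unseen keys are almost always rejected; false positives are possible.
```

An LSM-tree point lookup consults such a filter per component and only reads from disk when the filter answers "maybe".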

• Chien-Min CHEN, Min-Sheng LIN
Article type: PAPER
Subject area: Fundamentals of Information Systems
2016 Volume E99.D Issue 12 Pages 3047-3052
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Let G be a graph and K be a set of target vertices of G. Assume that all vertices of G, except the vertices in K, may fail with given probabilities. The K-terminal reliability of G is the probability that all vertices in K are mutually connected. This reliability problem is known to be #P-complete for general graphs. This work develops the first polynomial-time algorithm for computing the K-terminal reliability of circular-arc graphs.
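The definition in the abstract can be made concrete with a brute-force sketch that enumerates the failure states of the non-target vertices (exponential time, so for tiny graphs only; the paper's contribution is a polynomial-time algorithm for circular-arc graphs, which this sketch does not attempt).

```python
# Brute-force K-terminal reliability by state enumeration (illustrative).
from itertools import product

def k_terminal_reliability(vertices, edges, targets, p_up):
    """p_up[v] = probability that non-target vertex v is operational."""
    others = [v for v in vertices if v not in targets]
    total = 0.0
    for states in product([True, False], repeat=len(others)):
        up = set(targets) | {v for v, s in zip(others, states) if s}
        prob = 1.0
        for v, s in zip(others, states):
            prob *= p_up[v] if s else 1.0 - p_up[v]
        if targets_connected(up, edges, targets):
            total += prob
    return total

def targets_connected(up, edges, targets):
    """Are all target vertices in one component of the surviving graph?"""
    start = next(iter(targets))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == u and y in up and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return set(targets) <= seen

# Two disjoint relay paths s-v-t and s-w-t: R = 1 - (1-0.9)(1-0.5) = 0.95.
r = k_terminal_reliability(["s", "v", "w", "t"],
                           [("s", "v"), ("v", "t"), ("s", "w"), ("w", "t")],
                           {"s", "t"}, {"v": 0.9, "w": 0.5})
assert abs(r - 0.95) < 1e-9
```

This makes the #P-hardness intuition visible: the sum ranges over all 2^|V∖K| vertex states, which a polynomial-time algorithm must avoid enumerating.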

• Shengxiao NIU, Gengsheng CHEN
Article type: PAPER
Subject area: Fundamentals of Information Systems
2016 Volume E99.D Issue 12 Pages 3053-3059
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

In this paper, an analysis of the basic process of a class of interactive-graph-cut-based image segmentation algorithms shows that it is unnecessary to construct n-links for all adjacent pixel nodes of an image before calculating the maximum flow and the minimum cut; for many pixel nodes, n-links need not be constructed at all. Based on this, we propose a new algorithm that dynamically constructs only the necessary n-links, namely those connecting the pixel nodes explored by the maximum flow algorithm. These n-links are constructed on the fly, without redundancy, during the maximum flow computation. Using the Berkeley segmentation dataset benchmark, we show that this method reduces the average running time of segmentation while preserving correct segmentation results. The improvement can be applied to any segmentation algorithm based on graph cuts.

• Yuechao LU, Fumihiko INO, Kenichi HAGIHARA
Article type: PAPER
Subject area: Computer System
2016 Volume E99.D Issue 12 Pages 3060-3071
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

This paper proposes a cache-aware optimization method to accelerate out-of-core cone-beam computed tomography reconstruction on a graphics processing unit (GPU). Our proposed method extends a previous method by increasing the cache hit rate so as to speed up the reconstruction of high-resolution volumes that exceed the capacity of device memory. More specifically, our approach accelerates the well-known Feldkamp-Davis-Kress algorithm using three strategies: (1) a loop organization strategy that identifies the best tradeoff point between the cache hit rate and the number of off-chip memory accesses; (2) a data structure that exploits high locality within a layered texture; and (3) a fully pipelined strategy for hiding file input/output (I/O) time behind GPU execution and data transfer times. We implement our proposed method on NVIDIA's latest Maxwell architecture and provide tuning guidelines for adjusting the execution parameters (the granularity and shape of thread blocks, and the granularity of I/O data streamed through the pipeline) to maximize reconstruction performance. Our experimental results show that it took less than three minutes to reconstruct a 2048³-voxel volume from 1200 projection images of 2048² pixels on a single GPU, a speedup of approximately 1.47× over the previous method.

• Yuttakon YUTTAKONKIT, Shinya TAKAMAEDA-YAMAZAKI, Yasuhiko NAKASHIMA
Article type: PAPER
Subject area: Computer System
2016 Volume E99.D Issue 12 Pages 3072-3081
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Light-field image processing has been widely employed in many areas, from mobile devices to manufacturing applications. The fundamental process of extracting usable information requires significant computation on high-resolution raw image data. A graphics processing unit (GPU) is used to exploit data parallelism, as in general image processing applications. However, the sparse memory access patterns of these applications reduce GPU performance for both systematic and algorithmic reasons. We therefore propose an optimization technique that redesigns the memory access patterns of the applications, alleviating the memory bottleneck of the rendering application and increasing data reusability in the depth extraction application. We evaluated our optimized implementations against state-of-the-art implementations on several GPUs, with all implementations optimally configured for each specific device. Our optimization increased the performance of the rendering application on a GTX-780 GPU by 30%, and of the depth extraction application on GTX-780 and GTX-980 GPUs by 82% and 18%, respectively, compared with the original implementations.

• Masakazu HIOKI, Hanpei KOIKE
Article type: PAPER
Subject area: Computer System
2016 Volume E99.D Issue 12 Pages 3082-3089
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

A Field Programmable Gate Array (FPGA) with fine-grained body biasing achieves satisfactory static power reduction. However, such an FPGA incurs high overhead, because additional body bias selectors and electrical isolation regions are needed to program the threshold voltage (Vt) of elemental circuits such as MUXes, buffers, and LUTs. In this paper, a low-overhead design of an FPGA with fine-grained body biasing is described. The FPGA is designed and fabricated in a 65-nm SOTB CMOS technology. By adopting a customized design rule whose reliability is verified by TEGs, and by downsizing the body bias selector, the FPGA tile area is reduced by 39% compared with the conventional design, resulting in 900 FPGA tiles with 44,000 programmable Vt regions. In addition, chip performance is evaluated by implementing a 32-bit binary counter over the supply voltage range from 0.5 V to 1.2 V. The counter circuit operates at 72 MHz and 14 MHz at supply voltages of 1.2 V and 0.5 V, respectively. In the best case, a static power saving of 80% is achieved in the elemental circuits of the FPGA at a 0.5-V supply voltage and a 0.5-V reverse body bias voltage. For the whole chip, including the configuration memory and body bias selectors in addition to the elemental circuits, an effective static power reduction of around 30% is maintained by applying a 0.3-V reverse body bias voltage at each supply voltage.

• Md Zia ULLAH, Masaki AONO
Article type: PAPER
Subject area: Data Engineering, Web Information Systems
2016 Volume E99.D Issue 12 Pages 3090-3100
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Web search queries are usually vague, ambiguous, or carry multiple intents; different users may issue the same query with different search intents. Understanding these intents by mining the subtopics underlying a query has gained much interest in recent years. Query suggestions provided by search engines capture some intents of the original query; however, suggested queries are often noisy and contain groups of alternative queries with similar meanings. Therefore, identifying the subtopics that cover the possible intents behind a query is a formidable task. Moreover, since both the query and its subtopics are short, it is challenging to estimate the similarity between a pair of short texts and rank them accordingly. In this paper, we propose a method for mining and ranking subtopics in which we introduce multiple semantic and content-aware features, a bipartite graph-based ranking (BGR) method, and a similarity function for short texts. Given a query, we aggregate the suggested queries from search engines as candidate subtopics and estimate their relevance to the query based on word embedding and content-aware features by modeling a bipartite graph. To estimate the similarity between two short texts, we propose a Jensen-Shannon divergence-based similarity function computed over the probability distributions of the terms in the top documents retrieved from a search engine. A diversified ranked list of subtopics covering the possible intents of a query is assembled by balancing relevance and novelty. We evaluated our method on the NTCIR-10 INTENT-2 and NTCIR-12 IMINE-2 subtopic mining test collections. Our proposed method outperforms the baselines, known related methods, and the official participants of the INTENT-2 and IMINE-2 competitions.
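The core of the short-text similarity above is the Jensen-Shannon divergence between two term distributions. A minimal sketch of that step: in the paper the distributions come from the top documents retrieved for each text; here, as a simplifying assumption, they are estimated from the texts themselves. With base-2 logarithms the divergence lies in [0, 1], so similarity can be taken as its complement.

```python
import math
from collections import Counter

def term_distribution(text):
    """Unigram term distribution of a short text (stand-in for the paper's
    distribution over terms of top retrieved documents)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def _kl(p, q):
    """KL divergence in bits; q must be nonzero wherever p is nonzero."""
    return sum(pi * math.log2(pi / q[w]) for w, pi in p.items() if pi > 0)

def js_similarity(text_a, text_b):
    """1 - JSD(p, q); JSD with log base 2 is bounded in [0, 1]."""
    p, q = term_distribution(text_a), term_distribution(text_b)
    vocab = set(p) | set(q)
    pf = {w: p.get(w, 0.0) for w in vocab}
    qf = {w: q.get(w, 0.0) for w in vocab}
    m = {w: 0.5 * (pf[w] + qf[w]) for w in vocab}  # mixture distribution
    jsd = 0.5 * _kl(pf, m) + 0.5 * _kl(qf, m)
    return 1.0 - jsd
```

Identical texts score 1.0; texts with disjoint vocabularies score 0.0, which gives a bounded, symmetric score suitable for ranking candidate subtopics.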

• Kamthorn PUNTUMAPON, Thanawin RAKTHAMAMON, Kitsana WAIYAMAI
Article type: PAPER
Subject area: Artificial Intelligence, Data Mining
2016 Volume E99.D Issue 12 Pages 3101-3109
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Synthetic over-sampling is a well-known method of addressing class imbalance by modifying the class distribution and generating synthetic samples. A large number of synthetic over-sampling techniques have been proposed; however, most of them suffer from the over-generalization problem, whereby synthetic minority class samples are generated inside the majority class region. Learning from an over-generalized dataset, a classifier could misclassify majority class members as belonging to the minority class. In this paper, a method called TRIM is proposed to overcome the over-generalization problem. The idea is to identify minority class regions that compromise between generalization and overfitting. TRIM identifies all the minority class regions in the form of clusters, then merges a large number of small minority class clusters into more generalized clusters. To enhance generalization ability, a cluster connection step is proposed that avoids over-generalization toward the majority class while increasing generalization of the minority class. As a result, the classifier is able to correctly classify more minority class samples while maintaining its precision. Compared with SMOTE and extended versions such as Borderline-SMOTE, experimental results show that TRIM exhibits significant performance improvement in terms of F-measure and AUC. TRIM can be used as a pre-processing step for synthetic over-sampling methods such as SMOTE and its extended versions.

• Hai DAI NGUYEN, Anh DUC LE, Masaki NAKAGAWA
Article type: PAPER
Subject area: Pattern Recognition
2016 Volume E99.D Issue 12 Pages 3110-3118
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

This paper presents deep learning methods for recognizing online handwritten mathematical symbols. Recently, various deep learning architectures such as Convolutional neural networks (CNNs), Deep neural networks (DNNs), Recurrent neural networks (RNNs) and Long short-term memory (LSTM) RNNs have been applied to fields such as computer vision, speech recognition and natural language processing, where they have shown performance superior to state-of-the-art methods on various tasks. In this paper, max-out-based CNNs and Bidirectional LSTM (BLSTM) networks are applied to image patterns created from online patterns and to the original online patterns, respectively, and their results are then combined. They are compared with traditional recognition methods, namely MRFs and MQDFs, through recognition experiments on the CROHME database, along with analysis and explanation.

• Kei SAWADA, Akira TAMAMORI, Kei HASHIMOTO, Yoshihiko NANKAKU, Keiichi ...
Article type: PAPER
Subject area: Pattern Recognition
2016 Volume E99.D Issue 12 Pages 3119-3131
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

This paper proposes a Bayesian approach to image recognition based on separable lattice hidden Markov models (SL-HMMs). The geometric variations of the object to be recognized, e.g., size, location, and rotation, are an essential problem in image recognition. SL-HMMs, which have been proposed to reduce the effect of geometric variations, can perform elastic matching both horizontally and vertically. This makes it possible to model not only invariances to the size and location of the object but also nonlinear warping in both dimensions. The maximum likelihood (ML) method has been used in training SL-HMMs. However, in some image recognition tasks, it is difficult to acquire sufficient training data, and the ML method suffers from the over-fitting problem when there is insufficient training data. This study aims to accurately estimate SL-HMMs using the maximum a posteriori (MAP) and variational Bayesian (VB) methods. The MAP and VB methods can utilize prior distributions representing useful prior information, and the VB method is expected to obtain high generalization ability by marginalization of model parameters. Furthermore, to overcome the local maximum problem in the MAP and VB methods, the deterministic annealing expectation maximization algorithm is applied for training SL-HMMs. Face recognition experiments performed on the XM2VTS database indicated that the proposed method offers significantly improved image recognition performance. Additionally, comparative experiment results showed that the proposed method was more robust to geometric variations than convolutional neural networks.

• Yuji OSHIMA, Shinnosuke TAKAMICHI, Tomoki TODA, Graham NEUBIG, Sakrian ...
Article type: PAPER
Subject area: Speech and Hearing
2016 Volume E99.D Issue 12 Pages 3132-3139
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

This paper presents a novel non-native speech synthesis technique that preserves the individuality of a non-native speaker. Cross-lingual speech synthesis based on voice conversion or Hidden Markov Model (HMM)-based speech synthesis is a technique to synthesize foreign language speech using a target speaker's natural speech uttered in his/her mother tongue. Although the technique holds promise to improve a wide variety of applications, it tends to degrade the target speaker's individuality in synthetic speech compared to intra-lingual speech synthesis. This paper proposes a new approach to speech synthesis that preserves speaker individuality by using non-native speech spoken by the target speaker. Although the use of non-native speech makes it possible to preserve the speaker individuality in the synthesized target speech, naturalness is significantly degraded because the synthesized speech waveform is directly affected by the unnatural prosody and pronunciation often caused by differences between the linguistic systems of the source and target languages. To improve naturalness while preserving speaker individuality, we propose (1) a prosody correction method based on model adaptation, and (2) a phonetic correction method based on spectrum replacement for unvoiced consonants. Experimental results using English speech uttered by native Japanese speakers demonstrate that (1) the proposed methods are capable of significantly improving naturalness while preserving the speaker individuality in synthetic speech, and (2) the proposed methods also improve intelligibility, as confirmed by a dictation test.

• Bei LIU, Makoto P. KATO, Katsumi TANAKA
Article type: PAPER
Subject area: Image Processing and Video Processing
2016 Volume E99.D Issue 12 Pages 3140-3153
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Photo summarization technology for summarizing a photo collection is usually oriented to the users who own the collection. However, people's interest in sharing photos with others highlights the importance of cognition-aware summarization, by which viewers can easily recognize the exact event the photos represent. In this research, we address the problem of cognition-aware summarization of photos representing events, and propose to solve it, and to improve the perceptual quality of a photo set, by proactively preventing the misrecognition that a photo set might cause. Three types of neighbor events that can cause misrecognition are discussed in this paper, namely sub-events, super-events and sibling-events. We analyze the reasons for these misrecognitions and then propose three criteria to prevent them. A combination of the criteria is used to generate a summarization that can represent an event with several photos. Our approach was empirically demonstrated with photos from Flickr, utilizing their visual features and related tags. The results indicated the effectiveness of our proposed methods in comparison with a baseline method.

• Wiennat MONGKULMANN, Takahiro OKABE, Yoichi SATO
Article type: PAPER
Subject area: Image Recognition, Computer Vision
2016 Volume E99.D Issue 12 Pages 3154-3164
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

We propose a framework to perform auto-radiometric calibration in photometric stereo methods to estimate surface orientations of an object from a sequence of images taken using a radiometrically uncalibrated camera under varying illumination conditions. Our proposed framework allows the simultaneous estimation of surface normals and radiometric responses, and as a result can avoid cumbersome and time-consuming radiometric calibration. The key idea of our framework is to use the consistency between the irradiance values converted from pixel values by using the inverse response function and those computed from the surface normals. Consequently, a linear optimization problem is formulated to estimate the surface normals and the response function simultaneously. Finally, experiments on both synthetic and real images demonstrate that our framework enables photometric stereo methods to accurately estimate surface normals even when the images are captured using cameras with unknown and nonlinear response functions.
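For background to the abstract above, the classic calibrated, linear-camera formulation that the paper builds on solves a small least-squares problem per pixel: given intensities under known light directions, recover the albedo-scaled surface normal. This sketch shows only that baseline step (the paper's contribution is jointly estimating the camera response function, which is not reproduced here); all values are synthetic.

```python
import numpy as np

def photometric_stereo(L, I):
    """Calibrated Lambertian photometric stereo for one pixel.
    L: (m, 3) matrix of unit light directions, I: (m,) intensities.
    Solves L @ g = I in the least-squares sense, where g = albedo * normal."""
    g, *_ = np.linalg.lstsq(L, I, rcond=None)
    rho = np.linalg.norm(g)        # albedo
    n = g / rho                    # unit surface normal
    return rho, n

# Three synthetic light directions (normalized rows).
L = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
L = L / np.linalg.norm(L, axis=1, keepdims=True)

# Simulate a pixel: flat surface (normal = +z), albedo 0.8, linear camera.
n_true = np.array([0.0, 0.0, 1.0])
rho_true = 0.8
I = L @ (rho_true * n_true)

rho, n = photometric_stereo(L, I)
```

When the camera response is unknown and nonlinear, the observed pixel values are no longer proportional to irradiance, which is why the paper enforces consistency between inverse-response-converted pixel values and normal-predicted irradiance in one joint linear problem.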

• Yuke LI, Weiming SHEN
Article type: PAPER
Subject area: Image Recognition, Computer Vision
2016 Volume E99.D Issue 12 Pages 3165-3171
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Inter-person occlusion handling is a critical and extensively researched issue in the field of tracking. Several state-of-the-art methods have been proposed, for example focusing on the appearance of the targets or utilizing knowledge of the scene. In contrast with the approaches proposed in the literature, we propose to address this issue using a social interaction model, which allows us to explore spatio-temporal information pertaining to the targets involved in the occlusion. Our experiments show promising results compared with those obtained using other methods.

• Liyu WANG, Qiang WANG, Lan CHEN, Xiaoran HAO
Article type: LETTER
Subject area: Computer System
2016 Volume E99.D Issue 12 Pages 3172-3176
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Many data-intensive applications need large memory to boost system performance. The expansion of DRAM is restricted by its high power consumption and price per bit. Flash, an existing Non-Volatile Memory (NVM) technology, can make up for the drawbacks of DRAM. In this paper, we propose a hybrid main memory architecture named SSDRAM that expands RAM with a flash-based SSD. SSDRAM implements a runtime library to provide several transparent interfaces for applications. Unlike using an SSD as a system swap device, which manages data at page granularity, SSDRAM works at application-object granularity to boost the efficiency of accessing data on the SSD. It provides a flexible memory partitioning and multi-mapping strategy that manages the physical memory with micro-pages. Experimental results with a number of data-intensive workloads show that SSDRAM can provide up to 3.3 times the performance of SSD-swap.

• Sang-Ho HWANG, Ju Hee CHOI, Jong Wook KWAK
Article type: LETTER
Subject area: Software System
2016 Volume E99.D Issue 12 Pages 3177-3180
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

In this letter, we propose a garbage collection technique for non-volatile memory systems, called Migration Cost Sensitive Garbage Collection (MCSGC). By considering the migration overhead when selecting victim blocks, MCSGC increases the lifetime of the memory system and improves garbage collection response time. The proposed algorithm also improves the efficiency of garbage collection by separating cold data from hot data in valid pages. In the experimental evaluation, we show that MCSGC yields up to an 82% improvement in lifetime compared with existing garbage collection, and that it also reduces erase and migration operations by up to 30% and 29%, respectively.
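The victim selection idea above can be sketched as a cost model: instead of simply picking the block with the fewest valid pages, weight each candidate by the cost of migrating its valid pages before erasure. The function names, block representation, and cost weights below are illustrative assumptions, not MCSGC's actual policy.

```python
# Relative costs (illustrative assumptions, not from the paper).
MIGRATE_COST = 2.0   # cost of copying one valid page out of the victim
ERASE_COST = 1.0     # cost of erasing the block afterwards

def select_victim(blocks):
    """blocks: dict name -> (valid_pages, total_pages).
    Pick the block whose reclaim cost per freed page is lowest,
    so a block full of valid data is never an attractive victim."""
    def cost_per_freed_page(item):
        _, (valid, total) = item
        freed = total - valid
        if freed == 0:
            return float("inf")   # erasing would free nothing
        return (valid * MIGRATE_COST + ERASE_COST) / freed
    return min(blocks.items(), key=cost_per_freed_page)[0]

# Block "B" has few valid pages, so it is cheap to migrate and erase.
blocks = {"A": (60, 64), "B": (10, 64), "C": (30, 64)}
```

Folding migration cost into the score is what lets such a policy reduce both erase counts and page copies, the two quantities the letter reports improving.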

• Peixin CHEN, Yilun WU, Jinshu SU, Xiaofeng WANG
Article type: LETTER
Subject area: Information Network
2016 Volume E99.D Issue 12 Pages 3181-3184
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

The key escrow problem and high computational cost are the two major problems that hinder the wider adoption of hierarchical identity-based signature (HIBS) schemes. HIBS schemes with either an escrow-free (EF) or an online/offline (OO) model have been proved secure in our previous work. However, few EF or OO schemes have been evaluated experimentally. In this letter, several EF/OO HIBS schemes are considered. We study the algorithmic complexity of the schemes both theoretically and experimentally, and discuss scheme performance and the practicability of the EF and OO models.

• Jaehwan LEE, Min Jae JO, Ji Sun SHIN
Article type: LETTER
Subject area: Information Network
2016 Volume E99.D Issue 12 Pages 3185-3187
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Current signature-based antivirus solutions have three limitations: the large volume of the signature database, privacy preservation, and the computational overhead of signature matching. In this paper, we propose LigeroAV, a light-weight, performance-enhanced antivirus suitable for pervasive environments such as mobile phones. LigeroAV focuses on detecting MD5 signatures, which account for more than 90% of signatures. LigeroAV offloads the matching computation to a cloud server with an up-to-date signature database, while preserving privacy by using a Bloom filter.
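A minimal sketch of the Bloom-filter idea above: the server publishes (or the client queries) only bit positions derived from a signature, never the signature itself, and membership tests can yield false positives but never false negatives. The filter size, hash count, and position-derivation scheme here are illustrative assumptions, not LigeroAV's actual parameters.

```python
import hashlib

M = 1 << 16   # bits in the filter (illustrative)
K = 4         # number of hash functions (illustrative)

def positions(md5_hex):
    """Derive K bit positions from an MD5 signature string by salting
    and rehashing it (one simple way to get K independent hashes)."""
    out = []
    for i in range(K):
        h = hashlib.sha256(f"{i}:{md5_hex}".encode()).hexdigest()
        out.append(int(h, 16) % M)
    return out

class Bloom:
    def __init__(self):
        self.bits = bytearray(M // 8)

    def add(self, md5_hex):
        for p in positions(md5_hex):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, md5_hex):
        """True means "possibly in the set" (false positives possible);
        False is definitive: the signature is certainly absent."""
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in positions(md5_hex))
```

The one-sided error is what makes this usable for antivirus prefiltering: a negative answer needs no further work, while a positive answer can be confirmed by an exact (server-side) lookup.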

• Hao LIU, Hideaki GOTO
Article type: LETTER
Subject area: Information Network
2016 Volume E99.D Issue 12 Pages 3188-3191
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

The privacy of users' data has become a big issue for cloud services. This research focuses on image cloud databases and the similarity search function. To enhance the security of such databases, we propose a framework for a privacy-enhanced search scheme in which all the images in the database are encrypted while similarity image search is still supported.

• Hong LIU, Mengdi YUE, Jie ZHANG
Article type: LETTER
Subject area: Speech and Hearing
2016 Volume E99.D Issue 12 Pages 3192-3196
Published: December 01, 2016
Released on J-STAGE: December 01, 2016
JOURNAL FREE ACCESS

Sound source localization is an essential technique in many applications, e.g., speech enhancement, speech capturing and human-robot interaction. However, the performance of traditional methods degrades in noisy or reverberant environments and is sensitive to the spatial location of the sound source. To solve these problems, we propose a sound source localization framework based on a bi-directional interaural matching filter (IMF) and decision-weighted fusion. First, the bi-directional IMF is put forward to describe the difference between the binaural signals in the forward and backward directions, respectively. Then, a hybrid interaural matching filter (HIMF), obtained from the bi-directional IMF through decision-weighted fusion, is used to alleviate the effect of source location on localization accuracy. Finally, the cosine similarity between the HIMFs computed from the binaural audio and the transfer functions is employed to measure the probability of each source location. Arranging the similarities for all spatial directions in a matrix, we determine the source location by Maximum A Posteriori (MAP) estimation. Compared with several state-of-the-art methods, experimental results indicate that HIMF is more robust in noisy environments.
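The final decision step described above reduces to scoring every candidate direction by cosine similarity and taking the argmax, which equals MAP estimation under a uniform prior over directions. The sketch below uses toy feature vectors in place of the paper's HIMFs and per-direction transfer-function templates; the numbers are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def localize(measured, templates):
    """MAP direction estimate under a uniform prior: the direction whose
    stored template is most similar to the measured feature."""
    return max(templates, key=lambda d: cosine(measured, templates[d]))

# Toy stand-ins: one template vector per candidate direction (degrees).
templates = {0: [1.0, 0.1, 0.0],
             90: [0.1, 1.0, 0.2],
             180: [0.0, 0.2, 1.0]}
measured = [0.12, 0.95, 0.22]   # closest to the 90-degree template
```

In the actual framework, each template would be the HIMF derived from the head-related transfer functions of that direction, and the similarity matrix over all directions is what the MAP step maximizes.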