The Third International Symposium on Computing and Networking (CANDAR 2015) was held in Sapporo, Japan, from December 8th to 11th, 2015. The organizers of CANDAR 2015 invited authors to submit extended versions of their presented papers. As a result, 22 articles were submitted to this special issue, and this issue includes the extended versions of the 14 papers that were accepted.
This issue owes a great deal to a number of people who devoted their time and expertise to handling the submitted papers. In particular, I would like to thank the guest editors for the excellent review process: Professor Satoshi Fujita, Professor Akihiro Fujiwara, Professor Nobuo Funabiki, Professor Shuichi Ichikawa, Professor Katsunobu Imai, Professor Yoshiaki Kakuda, Professor Michihiro Koibuchi, Professor Kazuhiko Komatsu, Professor Shirley Moore, Professor Toru Nakanishi, Professor Yasuyuki Nogami, Professor Tomoyuki Ohta, Professor Ferdinand Peper, Professor Hiroyuki Sato, Professor Reiji Suda, Professor Hiroyuki Takizawa, and Professor Takashi Yokota.
Words of gratitude are also due to the anonymous reviewers who carefully read the papers and provided detailed comments and suggestions to improve the quality of the submitted papers. This special issue would not have been possible without their efforts.
Embedding graphs on the torus is a problem of both theoretical and practical importance. Embedding a graph on the torus is required to solve many application problems, such as VLSI design and graph drawing. Both polynomial time and exponential time algorithms for embedding graphs on the torus are known. However, the polynomial time algorithms are very complex, and their implementation has been a challenge for a long time. On the other hand, implementations of some exponential time algorithms are known, but they are not efficient for large graphs in practice. To develop an efficient practical tool for embedding graphs on the torus, we propose a new exponential time algorithm. Compared with a widely used previous exponential time algorithm, our algorithm has a better practical running time.
As the diversity of high-performance computing (HPC) systems increases, even legacy HPC applications often need to use accelerators for higher performance. To migrate large-scale legacy HPC applications to modern HPC systems equipped with accelerators, a promising way is to use OpenACC, because its directive-based approach can prevent drastic code modifications. This paper shows the translation of a large-scale simulation code for an OpenACC platform while keeping the maintainability of the original code. Although OpenACC enables an application to use accelerators by adding a small number of directives, in most cases it requires modifying the original code to achieve high performance, which tends to degrade the code maintainability and performance portability. To avoid such code modifications, this paper adopts a code translation framework, Xevolver. Instead of directly modifying a code, a pair consisting of a custom code translation rule and a custom directive is defined and applied to the original code using the Xevolver framework. This paper first shows that simply inserting OpenACC directives does not lead to high performance, and that non-trivial code modifications are required in practice. In addition, the code modifications sometimes decrease the performance when migrating a code to other platforms, which leads to low performance portability. The direct code modifications can be avoided by using pairs of an externally-defined translation rule and a custom directive to keep the original code unchanged as much as possible. Finally, the performance evaluation shows that the performance portability can be improved by selectively applying translation with the Xevolver framework, compared with directly modifying a code.
We propose a static random access memory (SRAM) based complementary metal-oxide-semiconductor (CMOS) LSI chip that accelerates ground-state searches of an Ising model. Escaping local minima is a key feature in creating such a chip. We describe a method for escaping local minima by asynchronously distributing random pulses. The random pulses are input from outside the chip and propagated through two asynchronous paths. In an experiment using a prototype of our chip, our method achieved the same solution accuracy as the conventional method. The solution accuracy is further improved by dividing the random pulse distribution paths and increasing the number of pseudo-random number generators.
GPU compute devices have become very popular for general purpose computations. However, the SIMD-like hardware of graphics processors is currently not well suited for irregular workloads, like searching unbalanced trees. In order to mitigate this drawback, NVIDIA introduced an extension to GPU programming models called Dynamic Parallelism. This extension enables GPU programs to spawn new units of work directly on the GPU, allowing the refinement of subsequent work items based on intermediate results without any involvement of the main CPU.
This work investigates methods for employing Dynamic Parallelism with the goal of improved workload distribution for tree search algorithms on modern GPU hardware. For the evaluation of the proposed approaches, a case study is conducted on the N-Queens problem. Extensive benchmarks indicate that the benefits of improved resource utilization fail to outweigh high management overhead and runtime limitations due to the very fine level of granularity of the investigated problem. However, novel memory management concepts for passing parameters to child grids are presented. These general concepts are applicable to other, more coarse-grained problems that benefit from the use of Dynamic Parallelism.
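The case study above searches the highly unbalanced N-Queens tree. As a point of reference, the underlying search can be sketched sequentially with the standard bitmask backtracking technique (this is a plain CPU sketch for illustration, not the paper's GPU implementation; on a GPU with Dynamic Parallelism, each recursive call would correspond to a candidate child-grid launch):

```python
def count_solutions(n, row=0, cols=0, diag1=0, diag2=0):
    """Count N-Queens solutions by bitmask backtracking.

    Each recursive call is one node of the unbalanced search tree;
    Dynamic Parallelism would spawn a child grid per subtree instead
    of recursing on the host.
    """
    if row == n:
        return 1
    total = 0
    free = ~(cols | diag1 | diag2) & ((1 << n) - 1)  # columns still safe
    while free:
        bit = free & -free  # pick the lowest free column
        free ^= bit
        total += count_solutions(
            n, row + 1,
            cols | bit,            # occupied columns
            (diag1 | bit) << 1,    # "/" diagonals shift left each row
            (diag2 | bit) >> 1,    # "\" diagonals shift right each row
        )
    return total
```

For example, `count_solutions(8)` returns the well-known 92 solutions; the branching factor varies wildly between subtrees, which is exactly the irregularity that makes this workload hard for SIMD-like GPU hardware.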
A method of constructing a cellular automaton (CA) from numerical solutions of a given partial differential equation (PDE) is considered. It consists of two parts, namely, collecting spatiotemporal data numerically and finding the local rules of a CA that appear most frequently. In this paper, we analyze the method mathematically to examine the selectivity and robustness of the derived local rules, so that we can ensure the validity of the resultant CA model. In particular, we investigate two limit cases: (a) the number of states of the CA goes to infinity, and (b) the number of spatiotemporal data goes to infinity. In the former case, we prove that the resultant CA converges to the difference equation from which the numerical solutions of the PDE were collected. In the latter case, through mathematical analysis, we derive conditions under which the resultant CA is uniquely determined when the method of constructing a CA is applied to the diffusion equation. Our study can be a theoretical foundation of empirical CA modeling methods to create a reasonable CA which can reproduce the original behavior of the datasets under consideration.
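The two-part procedure described above (collect spatiotemporal data, then keep the most frequent local transition per neighborhood) can be sketched for the diffusion equation as follows. This is an illustrative toy, not the paper's analysis; the state count `n_states`, grid sizes, and the diffusion number `r` are assumptions chosen for stability:

```python
from collections import Counter, defaultdict
import math

def extract_ca_rules(n_states=4, nx=50, nt=200, r=0.4):
    """Build CA local rules from numerical solutions of u_t = u_xx.

    1. Solve the diffusion equation with an explicit finite-difference
       scheme on a periodic grid (stable for r <= 0.5).
    2. Quantize the solution into n_states levels.
    3. For each observed neighborhood (left, center, right), record the
       next state and keep the most frequent one as the local rule.
    """
    u = [0.5 + 0.5 * math.sin(2 * math.pi * i / nx) for i in range(nx)]
    q = lambda x: min(n_states - 1, max(0, int(x * n_states)))  # quantizer
    counts = defaultdict(Counter)
    for _ in range(nt):
        nxt = [u[i] + r * (u[(i - 1) % nx] - 2 * u[i] + u[(i + 1) % nx])
               for i in range(nx)]
        for i in range(nx):
            key = (q(u[(i - 1) % nx]), q(u[i]), q(u[(i + 1) % nx]))
            counts[key][q(nxt[i])] += 1
        u = nxt
    # local rule = most frequently observed transition per neighborhood
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}
```

The paper's limit case (a) corresponds to letting `n_states` grow: the quantizer becomes lossless and the extracted rule table approaches the finite-difference scheme itself.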
Since recent scientific and engineering simulations require heavy computations with large volumes of data, High-Performance Computing (HPC) systems need a high computational capability with a large memory capacity. Most recent HPC systems adopt a parallel processing architecture, where the computational capability of the processors is increasing while the performance of the memory system remains constrained. The bytes per flop (B/F) ratio, which is the ratio of the memory bandwidth to the flop/s of an HPC system, has been decreasing with the evolution of HPC systems. To fully exploit the potential of recent HPC systems, and to meet the increasing demand for large memory, it is necessary to optimize practical scientific and engineering applications, considering not only the parallelism of the applications but also the limitations of the memory subsystems of the HPC systems. In this paper, we discuss a set of approaches to optimizing the memory access behavior of applications, which enable their execution with improved performance on recent HPC systems. Our approaches include memory optimizations through memory footprint control, restructuring of data structures for active elements, redundant data structure elimination through combined calculations, and optimized re-calculation of data. To validate the effectiveness of our approaches, a plasmonics simulation application is evaluated on the vector platforms NEC SX-ACE and NEC SX-9, and the Intel Xeon based platform NEC LX 406Re-2. By applying our approaches to the implementation, the memory usage of the plasmonics simulation application is reduced to nearly 1/71 of the original, enabling its execution on a single node of a distributed parallel system with a smaller memory capacity. The optimization results in 1.14 times faster execution on SX-ACE and 1.81 times faster execution on LX 406Re-2.
High performance scientific codes are written to achieve high performance on modern HPC (High Performance Computing) platforms, and are less readable and less manageable because of complex hand optimization, which is often platform-dependent. We are developing a toolset to mitigate this maintainability problem through user-defined, easy-to-use code transformation: the science code is written in a simpler form, and coding techniques for high performance are introduced by code transformations. In this paper, we present xevtgen, a code transformer generator in our toolset. Transformation rules are defined using dummy Fortran codes with some directives, and we expect this design to make our tool easier to learn for Fortran programmers. Some examples of code transformations, as well as an application to a real scientific application, are shown to discuss the practicality of the proposed approach. Xevtgen uses XSLT as a backend and generates an XSLT template from the dummy Fortran code. This design exploits the power of XSLT, but also inherits some of its limitations. We plan to mitigate those limitations with additional tools in our toolset.
The computational power and the physical memory size of a single GPU device are often insufficient for large-scale problems. Using CUDA, the user must explicitly partition such problems into several tasks, repeating data transfers and kernel executions. To use multiple GPUs, explicit device switching is also needed. Furthermore, low-level hand optimizations such as load balancing and determining task granularity are required to achieve high performance. To handle large-scale problems without any additional user code, we introduce an implicit dynamic task scheduling scheme into our CUDA variant, MESI-CUDA. MESI-CUDA is designed to abstract the low-level GPU features: virtual shared variables and logical thread mappings hide the complex memory hierarchy and physical characteristics. On the other hand, explicit parallel execution using kernel functions is the same as in CUDA. In our scheme, each kernel invocation in the user code is translated into a job submission to the runtime scheduler. The scheduler partitions a job into tasks considering the device memory size and dynamically schedules them to the available GPU devices. Thus, the user can simply specify kernel invocations independently of the execution environment. The evaluation result shows that our scheme can automatically utilize heterogeneous GPU devices with small overhead.
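The core scheduling idea (split a kernel invocation into tasks whose data fit in each device's memory) can be illustrated with a toy partitioner. This is a hypothetical sketch for intuition only, not MESI-CUDA's actual scheduler, and it ignores the dynamic load balancing the real runtime performs:

```python
def partition_job(total_elems, elem_bytes, devices):
    """Split a data-parallel job into per-device tasks.

    Each task's data must fit in the chosen device's memory; devices
    are assigned chunks round-robin (a real scheduler would also
    balance load dynamically based on device speed and availability).
    """
    tasks, start, i = [], 0, 0
    while start < total_elems:
        dev = devices[i % len(devices)]
        cap = dev["mem_bytes"] // elem_bytes       # elements that fit at once
        end = min(total_elems, start + cap)
        tasks.append({"device": dev["id"], "range": (start, end)})
        start, i = end, i + 1
    return tasks
```

For example, a 200-element job with 4-byte elements on two devices with 400 and 800 bytes of memory is split into two tasks covering elements 0-100 and 100-200; the user's code sees only a single kernel invocation.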
During natural disasters, a significant amount of information is shared over the Internet. Therefore, it is desirable to provide disaster information based on information about individual users. However, there is a trade-off between the protection of user information and the quality of services that should be considered when providing disaster information. We propose a method that rationally determines the extent of user information to be disclosed. The effectiveness of the proposed method was evaluated experimentally. The experiments were conducted using the proposed method and a simple determination method wherein both the utility and intention of the user were considered relative to the extent of user information disclosure. In addition, the extent to which the trade-off was considered for each user was evaluated quantitatively.
Ad hoc networks are autonomously distributed wireless networks which consist of wireless terminals (hereinafter referred to as nodes). They do not rely on wireless network infrastructures such as base stations. Relaying nodes and their surrounding nodes are susceptible to data theft and eavesdropping because nodes communicate via radio waves. Previously, we proposed the secure dispersed data transfer method for encryption, decryption, and transfer of the original data packets. To use the secure dispersed data transfer method securely, we proposed the node-disjoint multipath routing method, in which multiple versions of encrypted data packets are transferred along disjoint paths to counter data packet theft. We also proposed an enhanced version of this routing method that reduces radio area overlap by rebroadcasting control packets to counter eavesdropping attacks. In this paper, we propose a multipath routing method that reduces radio area overlap through the introduction of control packet overhearing. We introduce control packet overhearing mechanisms to eliminate excess control packets and latency in the pathfinding process. Our main contributions are as follows: (1) our proposed method can reduce radio area overlap without each node's geographical location information (e.g., GPS information); (2) our proposed method can also eliminate excess control packets and latency without degrading security. Furthermore, we conducted simulation experiments to evaluate our proposed method. We observed that our proposed method can construct the desired paths with fewer control packets and a shorter latency in the pathfinding process. We also conducted additional experiments to discuss the applicable scope of our proposed method. As a result, we confirmed that our proposed method became more effective as the average number of adjacent nodes increased.
In this paper, another version of the star-cube called the generalized-star cube, GSC(n, k, m), is presented as a three-level interconnection topology. GSC(n, k, m) is a product graph of the (n, k)-star graph and the m-dimensional hypercube (m-cube). It can be constructed in one of two ways: by replacing each node in an m-cube with an (n, k)-star graph, or by replacing each node in an (n, k)-star graph with an m-cube. Because there are three parameters m, n, and k, the network size of GSC(n, k, m) can be changed more flexibly than those of the star graph, the star-cube, and the (n, k)-star graph. We first investigate the topological properties of the GSC(n, k, m), such as the node degree, diameter, average distance, and cost. The regularity and node symmetry of the GSC(n, k, m) are also derived. Next, we present a formal shortest-path routing algorithm. Then, we give broadcasting algorithms for both the single-port and all-port models. To develop these algorithms, we use the spanning binomial tree, the neighborhood broadcasting algorithm, and the minimum dominating set. The complexities of the routing and broadcasting algorithms are also examined.
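Because GSC(n, k, m) is a product of two regular graphs, its basic parameters follow directly from standard facts: the (n, k)-star graph has n!/(n-k)! nodes and is (n-1)-regular, the m-cube has 2^m nodes and is m-regular, and degrees add in a Cartesian product. A minimal sketch under those assumptions (the function name is ours, not the paper's):

```python
from math import factorial

def gsc_parameters(n, k, m):
    """Node count and degree of GSC(n, k, m).

    GSC(n, k, m) is the product of the (n, k)-star graph
    (n!/(n-k)! nodes, (n-1)-regular) and the m-dimensional
    hypercube (2**m nodes, m-regular).
    """
    star_nodes = factorial(n) // factorial(n - k)
    nodes = star_nodes * 2 ** m          # product graph: sizes multiply
    degree = (n - 1) + m                 # product graph: degrees add
    return nodes, degree
```

For instance, GSC(4, 2, 3) has 12 * 8 = 96 nodes of degree 6; varying any of the three parameters tunes the network size, which is the flexibility advantage noted above.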
An Elastic Wireless Local-Area Network (WLAN) system provides reliable, flexible, and efficient Internet access to users through installations of heterogeneous access points (APs), including dedicated APs (DAPs), virtual APs (VAPs), and mobile APs (MAPs). The number of APs should be carefully selected to optimize the network performance. Specifically, a large number of APs are required for heavy traffic. However, a dense deployment of APs introduces inter-AP interference, which may degrade the communication quality when the number of users is small. In this paper, we propose an active access-point configuration algorithm that activates or deactivates APs according to changes in network topology and user demand for the elastic WLAN system. The algorithm considers the bandwidth differences among heterogeneous AP devices and the total available bandwidth in the network. The number of active APs is minimized to ensure the minimum inter-AP interference subject to the constraints. The host locations can be candidate positions for the MAPs, because host owners may use them for Internet access. The effectiveness of the proposed algorithm is demonstrated using the WIMNET simulator.
The FDFM (Few DSP slices and Few block Memories) approach is an efficient approach that implements a processor core executing a particular algorithm using few DSP slices and few block RAMs in a single FPGA. Since a processor core based on the FDFM approach uses few hardware resources, hundreds of processor cores working in parallel can be implemented in an FPGA. The main contribution of this paper is to develop a processor core that executes the Euclidean algorithm to compute the GCD (Greatest Common Divisor) of two large numbers in an FPGA. This processor core, which we call the GCD processor core, uses only one DSP slice and one block RAM, and 1280 GCD processor cores can be implemented in a Xilinx Virtex-7 family FPGA XC7VX485T-2. The experimental results show that the performance of this FPGA implementation using 1280 GCD processor cores is 0.0904 μs per GCD computation for two 1024-bit integers. Quite surprisingly, it is 3.8 times faster than the best GPU implementation and 316 times faster than a sequential implementation on an Intel Xeon CPU.
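For reference, the computation that each GCD processor core performs is the classic Euclidean algorithm; a minimal software sketch (Python's arbitrary-precision integers handle the 1024-bit operands used in the experiments, though of course this says nothing about the FPGA core's internal design):

```python
import random

def gcd_euclid(a, b):
    """Classic Euclidean algorithm: repeatedly replace the pair (a, b)
    with (b, a mod b) until the second operand becomes zero."""
    while b:
        a, b = b, a % b
    return a

# two 1024-bit operands, matching the size used in the FPGA experiment
x = random.getrandbits(1024) | 1
y = random.getrandbits(1024)
g = gcd_euclid(x, y)
assert x % g == 0 and y % g == 0
```

Each remainder step on 1024-bit operands is itself a multi-word operation, which is why packing 1280 such cores into one FPGA, each using a single DSP slice and block RAM, yields the large speedup reported above.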