The Second International Symposium on Computing and Networking (CANDAR 2014) was held in Shizuoka, Japan, from December 10th to 12th, 2014. The organizers of CANDAR 2014 invited authors to submit extended versions of the presented papers. As a result, 14 articles were submitted to this special issue, which includes the extended versions of the 7 papers that were accepted.

This issue owes a great deal to a number of people who devoted their time and expertise to handling the submitted papers. In particular, I would like to thank the guest editors for the excellent review process: Professor Shuichi Ichikawa, Professor Katsunobu Imai, Professor Hidetsugu Irie, Professor Yoshiaki Kakuda, Professor Susumu Matsumae, Professor Toru Nakanishi, Professor Hiroyuki Sato, Professor Chisa Takano, and Professor Takashi Yokota. Words of gratitude are also due to the anonymous reviewers who carefully read the papers and provided detailed comments and suggestions to improve their quality. This special issue would not have been possible without their efforts.
Although General Purpose computation on Graphics Processing Units (GPGPU) is widely used for high-performance computing, standard programming frameworks such as CUDA and OpenCL are still difficult to use. They require low-level specifications, and hand-optimization is a large burden. Therefore, we are developing an easier framework named MESI-CUDA. Based on a virtual shared memory model, MESI-CUDA hides low-level memory management and data transfer from the user. The compiler generates low-level code and also optimizes memory accesses by applying conventional hand-optimization techniques. However, creating GPU threads is the same as in CUDA: the user specifies the thread mapping, i.e., the thread indexing and the size of the thread blocks run on each streaming multiprocessor (SM). The mapping largely affects the execution performance and may obstruct automatic optimization by the MESI-CUDA compiler. Therefore, the user must find an optimal specification considering physical parameters. In this paper, we propose a new thread mapping scheme. We introduce new thread creation syntax specifying a hardware-independent logical mapping, which is converted into an optimized physical mapping at compile time. Through static analysis of array index expressions, we obtain groups of threads accessing the same or neighboring array elements. By mapping such threads into the same thread block and assigning them consecutive thread indices, the physical mapping is determined so as to maximize the effect of memory access optimization. In our evaluation, the scheme found optimal mapping strategies for five benchmark programs. Memory access transactions were reduced to approximately 1/4, and a 1.4-76 times speedup was achieved compared with the worst mapping.
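The effect of thread mapping on memory access transactions described in this abstract can be illustrated with a minimal sketch (not the MESI-CUDA compiler itself): it counts how many memory transactions one warp would issue under a coalesced mapping versus a strided one, assuming 32-thread warps, 4-byte elements, and 128-byte transaction granularity.

```python
# Illustrative sketch, not MESI-CUDA: estimate the number of memory
# transactions a warp issues under two thread-to-element mappings.
# Assumed hardware model: 32-thread warps, 4-byte elements,
# 128-byte memory transactions.

WARP_SIZE = 32
ELEM_BYTES = 4
TRANSACTION_BYTES = 128

def transactions(addresses):
    """Number of distinct 128-byte segments touched by one warp's accesses."""
    return len({addr // TRANSACTION_BYTES for addr in addresses})

def coalesced_mapping(warp_id):
    # Consecutive thread indices access consecutive array elements.
    return [(warp_id * WARP_SIZE + t) * ELEM_BYTES for t in range(WARP_SIZE)]

def strided_mapping(warp_id, stride=32):
    # Threads access elements 'stride' apart (e.g., a column-major walk).
    return [(warp_id + t * stride) * ELEM_BYTES for t in range(WARP_SIZE)]

print(transactions(coalesced_mapping(0)))  # 1 transaction for the warp
print(transactions(strided_mapping(0)))    # 32 transactions for the warp
```

Assigning consecutive indices to threads that touch neighboring elements, as the proposed scheme does, is exactly what turns the strided case into the coalesced one.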
Thanks to its simple definition, the hypercube topology is very popular as an interconnection network for parallel systems. Several routing algorithms have been described for the hypercube topology; in this paper, we focus on hypercube routing extended with an additional restriction: a bit constraint. Concretely, path selection is performed on a particular subset of nodes: the nodes are required to satisfy a condition regarding their bit weights (a.k.a. Hamming weights). There are several applications of such restricted routing, including the simplification of disjoint-path routing. We propose in this paper two hypercube routing algorithms enforcing such a node restriction: first, a shortest-path routing algorithm, and second, a fault-tolerant point-to-point routing algorithm. Formal proofs of correctness and complexity analyses for the described algorithms are conducted. We show that the proposed shortest-path routing algorithm is time optimal. Finally, we perform an empirical evaluation of the proposed fault-tolerant point-to-point routing algorithm so as to inspect its practical behaviour. Along with this experimentation, we further analyse the average performance of the proposed algorithm by discussing the average Hamming distance in a hypercube under a bit constraint.
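The idea of routing under a bit constraint can be sketched as follows. This is a minimal greedy illustration, not the paper's algorithm: it assumes the constraint takes the form of lower and upper bounds [lo, hi] on the Hamming weight of every node on the path, and it flips one differing bit at a time while keeping the weight in range.

```python
def bit_weight(x):
    """Hamming weight (number of 1 bits) of x."""
    return bin(x).count("1")

def constrained_route(src, dst, n, lo, hi):
    """Greedy shortest-path routing in an n-dimensional hypercube where
    every node on the path must have Hamming weight in [lo, hi].
    A sketch of the idea, not the paper's algorithm; assumes both
    endpoints already satisfy the constraint."""
    assert lo <= bit_weight(src) <= hi and lo <= bit_weight(dst) <= hi
    path, cur = [src], src
    while cur != dst:
        for i in range(n):
            if not ((cur ^ dst) >> i) & 1:
                continue                     # bit i already agrees with dst
            nxt = cur ^ (1 << i)             # flip one differing bit
            if lo <= bit_weight(nxt) <= hi:  # stay inside the weight band
                cur = nxt
                path.append(cur)
                break
        else:
            return None  # no feasible flip; cannot occur for feasible endpoints
    return path
```

Each step reduces the Hamming distance to the destination by one, so the route found is a shortest path; when the current weight sits at a boundary, a flip in the other direction always exists because the destination's weight is itself within the band.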
In this paper, we present a parallel algorithm for enumerating the joint weight of a binary linear $(n,k)$ code, aiming at accelerating the assessment of its decoding error probability for network coding. Our algorithm is implemented on a multi-core CPU system and an NVIDIA graphics processing unit (GPU) system using OpenMP and the compute unified device architecture (CUDA), respectively. To reduce the number of pairs of codewords to be investigated, our parallel algorithm reduces the dimension k by focusing on the all-one vector included in many practical codes. We also employ a population count instruction to compute the joint weight of codewords with fewer instructions. Furthermore, an efficient atomic vote-and-reduce scheme is deployed in our GPU-based implementation. We apply our CPU- and GPU-based implementations to a subcode of a (127,22) BCH code to evaluate the impact of the acceleration.
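A sequential sketch of the enumeration (without the paper's parallelization or dimension reduction) shows how a population count maps to joint weight computation. The toy (3,2) generator matrix below is an assumed example, not the BCH subcode from the paper.

```python
from itertools import product
from collections import Counter

def popcount(x):
    # Stands in for the hardware population-count instruction.
    return bin(x).count("1")

def codewords(gen_rows):
    """All 2^k codewords of the binary linear code spanned by gen_rows
    (each row given as a length-n bitmask)."""
    words = []
    for coeffs in product((0, 1), repeat=len(gen_rows)):
        w = 0
        for c, row in zip(coeffs, gen_rows):
            if c:
                w ^= row
        words.append(w)
    return words

def joint_weight(u, v, n):
    """Joint weight of the pair (u, v): counts of positions where
    (u_i, v_i) equals (0,0), (0,1), (1,0), (1,1). Each count is a
    single popcount over a bitwise combination of the codewords."""
    mask = (1 << n) - 1
    w11 = popcount(u & v)
    w10 = popcount(u & (v ^ mask))
    w01 = popcount((u ^ mask) & v)
    return (n - w11 - w10 - w01, w01, w10, w11)

# Toy (3,2) code for illustration:
gen = [0b110, 0b011]
words = codewords(gen)
dist = Counter(joint_weight(u, v, 3) for u in words for v in words)
```

The parallel versions in the paper distribute the pair enumeration (the double loop building `dist`) across CPU cores or GPU threads, which is where the atomic vote-and-reduce scheme comes in.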
The energy consumption of server farms is steadily increasing. This is mainly due to an increasing number of servers which are underutilized most of the time. In this paper, we discuss various strategies to improve the energy efficiency of a datacenter, measured by the average number of operations executed per Joule. We assume a collection of heterogeneous server nodes that are characterized by their SPECpower benchmarks. If a time-variable divisible (work)load is to be executed on such a datacenter, the energy efficiency can be improved by a smart decomposition of this load into appropriate chunks. In the paper, we discuss a sophisticated load distribution strategy and extend it with an adaptive power management scheme for dynamically switching underutilized servers to performance states with lower energy consumption. Of course, transitions to higher performance/energy states are also possible if required by the current load. We introduce a new time slice model that allows a reduction of the switching overhead by means of a few merge and adjust cycles. The resulting ALD+ strategy was evaluated in a webserver environment with real Wikipedia traces. It achieved significant reductions of the energy consumption through the combination of load distribution and server switching by means of the time slice model. Moreover, ALD+ can easily be integrated into any parallel webserver setup.
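The core intuition behind such a decomposition can be sketched as a greedy split: fill the most energy-efficient servers first, so the rest can be switched to a low-power state. This is a simplified stand-in for the paper's ALD+ strategy, and the server figures below are assumed SPECpower-style numbers, not data from the paper.

```python
def distribute(load, servers):
    """Greedy split of a divisible load across heterogeneous servers,
    filling the most energy-efficient (ops/Joule) first. A simplified
    stand-in for the paper's ALD+ strategy, ignoring its time slice
    model and switching overhead."""
    # servers: list of (name, capacity_ops, ops_per_joule)
    plan = {}
    remaining = load
    for name, cap, _eff in sorted(servers, key=lambda s: -s[2]):
        share = min(cap, remaining)
        if share > 0:
            plan[name] = share
        remaining -= share
        if remaining <= 0:
            break
    return plan, remaining  # remaining > 0 means the load exceeds total capacity

# Assumed example figures (name, capacity in ops, ops per Joule):
servers = [("A", 100, 8.0), ("B", 150, 5.0), ("C", 200, 3.0)]
plan, left = distribute(220, servers)
# A takes 100, B takes 120; C stays idle and can be powered down.
```

The adaptive power management described in the abstract then decides when an idle server like C should actually change performance state, amortizing the switching cost over the time slices.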
The hierarchical dual-net (HDN) is an interconnection network for building ultra-scale parallel systems. The HDN is constructed based on a symmetric product graph (called the base network), such as a three-dimensional torus or an n-dimensional hypercube. A k-level hierarchical dual-net, HDN(B,k,S), is obtained by applying the dual construction k times to the base network B. S defines a super-node set that adjusts the scale of the system. The node degree of an HDN(B,k,S) is d0+k, where d0 is the node degree of B. The HDN is node and edge symmetric and can contain a huge number of nodes with small node degree and short diameter. In this paper, we propose two efficient algorithms for finding a fault-free path on the HDN. The first algorithm can always find a fault-free path in O(2^k F(B)) time if the number of faulty nodes on the HDN is less than d0+k, where F(B) is the time complexity of fault-tolerant routing in the base network. The second, more practical algorithm can find a fault-free path on an HDN with an arbitrary number of faulty nodes. The simulation results show that the second algorithm can find a fault-free path with high probability.
The program monitoring and control mechanisms of virtualization tools are becoming increasingly standardized and advanced. Together with checkpointing, they can be used to build general program analysis tools. We explore this idea with an architecture we call Checkpoint-based Fault Injection (CFI) and two concrete implementations using different existing virtualization tools: DMTCP and SBUML. The implementations show interesting trade-offs in versatility and performance, as well as the generality of the architecture.
An extension to the software model checker Java Pathfinder for verifying networked applications using the User Datagram Protocol (UDP) is presented. UDP maximizes performance by omitting flow control and connection handling. For instance, media-streaming services often use UDP to reduce delay and jitter. However, because UDP is unreliable (packets are subject to loss, duplication, and reordering), verification of UDP-based applications becomes an issue. Even though unreliable behavior occurs only rarely during testing, it often appears in a production environment due to a larger number of concurrent network accesses. Our tool systematically tests UDP-based applications by producing packet loss, duplication, and reordering for each packet. We have evaluated the performance of our tool on a multi-threaded client/server application and detected incorrectly handled packet duplicates in a file transfer client.
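The state space such a tool must cover can be illustrated with a toy enumerator (this is an illustration of the idea, not the Java Pathfinder extension itself): for a short packet sequence, each packet may be dropped or duplicated, and the surviving copies may arrive in any order.

```python
from itertools import permutations

def delivery_scenarios(packets, max_dup=1):
    """Enumerate every sequence a receiver may observe for a short UDP
    packet sequence: each packet is delivered 0 times (lost), once, or
    up to 1 + max_dup times (duplicated), and the surviving copies may
    arrive in any order. A toy version of the state space a model
    checker explores per packet."""
    scenarios = set()

    def choose(i, multiset):
        if i == len(packets):
            scenarios.update(permutations(multiset))  # all arrival orders
            return
        for copies in range(0, 2 + max_dup):  # 0 = lost, 1 = normal, 2 = duplicated
            choose(i + 1, multiset + [packets[i]] * copies)

    choose(0, [])
    return scenarios

obs = delivery_scenarios(["p1", "p2"])
# includes (), ("p1",), ("p2", "p1"), ("p1", "p1", "p2"), ...
```

Even for two packets this yields 19 distinct observable sequences, which is why bugs like the mishandled duplicates mentioned above rarely surface under ordinary testing but are found by systematic exploration.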