The first International Conference Networking and Computing (ICNC) on November 17-19, 2010, in Higashi-Hiroshima, Japan, - aims to provide a timely forum for exchange and discussion of the latest research findings in all aspects of networking and computing including parallel and distributed systems, architectures, and applications.
Also, four workshops, 2nd Workshop on Ultra Performance and Dependable Acceleration Systems (UPDAS), 2nd International Workshop on Parallel and Distributed Algorithms and Applications (PDAA), International Workshop on Advances in Networking and Computing (WANC), and Workshop on Dependability of Network Software Applications (DNSA), were held in conjunction with ICNC.
The program committee has encouraged the authors of selected papers including the workshops to submit full-versions of their manuscripts to the International Journal on Networking and Computing (IJNC) after the conference. After a thorough reviewing process, with extensive discussions, ten articles on various topics have been selected for publication on the IJNC special issue on ICNC.
On behalf of the ICNC, we would like to express our appreciation for the large efforts of reviewers who reviewed papers submitted to the special issue. Likewise, we thank all the authors for submitting their excellent manuscripts to this special issue. We also express our sincere thanks to the editorial board of the International Journal on Networking and Computing, in particular, to the Editor-in-chief Professor Koji Nakano. This special issue would not have been possible without his support.
Solution of large-scale dense nonsymmetric eigenvalue problem is required in many areas of scientific and engineering computing, such as vibration analysis of automobiles and analysis of electronic diffraction patterns. In this study, we focus on the Hessenberg reduction step and consider accelerating it in a hybrid CPU-GPU computing environment. Considering that the Hessenberg reduction algorithm consists almost entirely of BLAS (Basic Linear Algebra Subprograms) operations, we propose three approaches for distributing the BLAS operations between CPU and GPU. Among them, the third approach, which assigns small-size BLAS operations to CPU and distributes large-size BLAS operations between CPU and GPU in some optimal manner, was found to be consistently faster than the other two approaches. On a machine with an Intel Core i7 processor and an NVIDIA Tesla C1060 GPU, this approach achieved 3.2 times speedup over the CPU-only case when computing the Hessenberg form of a 8,192 × 8,192 real matrix.
Modern multiprocessor architectures have exacerbated problems of coordinating access to shared data, in particular as regards to the possibility of deadlock. For example semaphores, one of the most basic synchronization primitives, present difficulties. Djikstra defined semaphores to solve the problem of mutual exclusion. Practical implementation of the concept has, however, produced semaphores that are prone to deadlock, even while the original definition is theoretically free of it. This is not simply due to bad programming, but we have lacked a theory that allows us to understand the problem. We introduce a formal definition and new general theory of synchronization. We illustrate its applicability by deriving basic deadlock properties, to show where the problem lies with semaphores and also to guide us in finding some simple modifications to semaphores that greatly ameliorate the problem. We suggest some future directions for deadlock resolution that also avoid resource starvation.
Recursive dual-net (RDN) is a newly proposed interconnection network for massive parallel computers. The RDN is based on recursive dual-construction of a symmetric base-network B. A k-level dual-construction for k > 0 creates a network RDNk(B) containing N = (2n0)2k/2 nodes with node-degree d0 + k, where n0 and d0 are the number of nodes and the node-degree of the base network, respectively. The RDN is a symmetric graph and can contain huge number of nodes with small node-degree and short diameter. Node-to-set disjoint-paths routing is fundamental and has many applications for fault-tolerant and secure communications in a network. In this paper, we propose an efficient algorithm for node-to-set disjoint-paths routing in RDN. We show that, given a node s and a set of d0 + k nodes T in RDNk(B), d0 + k disjoint paths, each connecting s to a node in T, can be found in O(((d0 + k)D0/ lg n0) lg N) time, and the length of the paths is at most 3(D0/2+1)(lg N +1)/(lg n0+1), where N is the number of nodes in RDNk(B), d0 , D0, and n0 are the node-degree, the diameter, and the number of nodes of base-network B, respectively.
Two new multiprocessor architectures to accelerate the simulation of multi-agent systems based on the massively parallel GCA (Global Cellular Automata) model are presented. The GCA model is suited to describe and simulate different multi-agent systems. The designed and implemented architectures mainly consist of a set of processors (NIOS II) and a network. The multiprocessor systems allow the implementation in a flexible way through programming, thus simulating different behaviors on the same architecture. Two architectures, one with up to 16 processors, were implemented on an FPGA. The first architecture uses hardware hash functions in order to reduce the overall simulation time, but lacks scalability. The second architecture uses an agent memory and a cell field memory. This improves the scalability and further increases the performance.
Photomosaic generation is a popular non-photorealistic rendering technique, where a single image is assembled from several smaller ones. Visual responses change depending on the proximity to the photomosaic, leading to many creative prospects for publicity and art. Synthesizing photomosaics typically requires very large image databases in order to produce pleasing results. Moreover, repetitions are allowed to occur which may locally bias the mosaic. This paper provides alternatives to prevent repetitions while still being robust enough to work with coarse image subsets. Three approaches were considered for the matching stage of photomosaics: a greedy-based procedural algorithm, simulated annealing and SoftAssign. It was found that the latter delivers adequate arrangements in cases where only a restricted number of images is available. This paper introduces a novel GPU-accelerated SoftAssign implementation that outperforms an optimized CPU implementation by a factor of 60 times in the tested hardware.
GPU-based computing has become one of the popular high performance computing fields. The field is called GPGPU. This paper is focused on design and implementation of a uniform GPGPU application that is optimized for both the legacy and the recent GPU architectures. As a typical example of such the GPGPU application, this paper will discuss the uniform implementation of the Caravela platform. Especially the flow-model execution mechanism will be considered referring the recent GPU architectures. To verify the design and the implementation on CUDA and OpenCL platform, this paper will evaluate the compatibility among the architectures, and also test measurements of performance.
Electrical network-on-chip (NoC) faces critical challenges in meeting the high performance and low power consumption requirements for future multicore processors interconnection. Recent tremendous advances in CMOS compatible optical components give the potential for photonics to deliver an efficient NoC performance at an acceptable energy cost. However, the lack of in flight processing and buffering of optical data made the realization of a fully optical NoC complicated. A hybrid architecture which uses optical high bandwidth transfer and an electrical control network can take advantage of both interconnection methods to offer an efficient performance-per-watt infrastructure to connect multicore processors and system-on-chip (SoC). In this paper, we propose a predictive switching and a reservation based path setup techniques to reduce the path setup latency of such hybrid photonic network-on-chip (HPNoC). By using these techniques, it is possible to reduce the latency for end-to-end communication in a HPNoC improving its overall performance. In the simulation, we use a cycle accurate simulator under uniform, neighbor, and bitreversal traffic patterns for a 64-node torus topology. The results show that the proposed techniques considerably improve the overall latency of HPNoC.
Given a 2-D binary image of size n×n, Euclidean Distance Map (EDM) is a 2-D array of the same size such that each element is storing the Euclidean distance to the nearest black pixel. It is known that a sequential algorithm can compute the EDM in O(n2) and thus this algorithm is optimal. Also, work-time optimal parallel algorithms for shared memory model have been presented. However, the presented parallel algorithms are too complicated to implement in existing shared memory parallel machines. The main contribution of this paper is to develop a simple parallel algorithm for the EDM and implement it in two different parallel platforms: multicore processors and Graphics Processing Units (GPUs). We have implemented our parallel algorithm in a Linux server with four Intel hexad-core processors (Intel Xeon X7460 2.66GHz). We have also implemented it in the following two modern GPU systems, Tesla C1060 and GTX 480, respectively. The experimental results have shown that, for an input binary image with size of 9216 × 9216, our implementation in the multicore system achieves a speedup factor of 18 over the performance of a sequential algorithm using a single processor in the same system. Meanwhile, for the same input binary image, our implementation on the GPU achieves a speedup factor of 26 over the sequential algorithm implementation.
The main contribution of this paper is to present an efficient hardware algorithm for RSA encryption/decryption based on Montgomery multiplication. Modern FPGAs have a number of embedded DSP blocks (DSP48E1) and embedded memory blocks (BRAM). Our hardware algorithm supporting 2048-bit RSA encryption/decryption is designed to be implemented using one DSP48E1, one BRAM and few logic blocks (slices) in the Xilinx Virtex-6 family FPGA. The implementation results showed that our RSA module for 2048-bit RSA encryption/decryption runs in 277.26ms. Quite surprisingly, the multiplier in DSP48E1 used to compute Montgomery multiplication works in more than 97% clock cycles over all clock cycles. Hence, our implementation is close to optimal in the sense that it has only less than 3% overhead in multiplication and no further improvement is possible as long as Montgomery multiplication based algorithm is used. Also, since our circuit uses only one DSP48E1 block and one Block RAM, we can implement a number of RSA modules in an FPGA that can work in parallel to attain high throughput RSA encryption/decryption.