International Journal of Networking and Computing

Special Issue on Selected Papers from the First International Conference on Networking and Computing

Yasuaki Ito, Sayaka Kamei

2011, Volume 1, Issue 2, p. 131
Published: 2011
Released: 2017/03/23

DOI: https://doi.org/10.15803/ijnc.1.2_131

Free access

The first International Conference on Networking and Computing (ICNC), held on November 17-19, 2010, in Higashi-Hiroshima, Japan, aimed to provide a timely forum for the exchange and discussion of the latest research findings in all aspects of networking and computing, including parallel and distributed systems, architectures, and applications.
In addition, four workshops were held in conjunction with ICNC: the 2nd Workshop on Ultra Performance and Dependable Acceleration Systems (UPDAS), the 2nd International Workshop on Parallel and Distributed Algorithms and Applications (PDAA), the International Workshop on Advances in Networking and Computing (WANC), and the Workshop on Dependability of Network Software Applications (DNSA).
The program committee encouraged the authors of selected papers, including workshop papers, to submit full versions of their manuscripts to the International Journal of Networking and Computing (IJNC) after the conference. After a thorough review process with extensive discussions, ten articles on various topics were selected for publication in the IJNC special issue on ICNC.
On behalf of ICNC, we would like to express our appreciation for the considerable efforts of the reviewers who reviewed the papers submitted to the special issue. Likewise, we thank all the authors for submitting their excellent manuscripts to this special issue. We also express our sincere thanks to the editorial board of the International Journal of Networking and Computing, in particular to the Editor-in-Chief, Professor Koji Nakano. This special issue would not have been possible without his support.

Acceleration of Hessenberg Reduction for Nonsymmetric Eigenvalue Problems in a Hybrid CPU-GPU Computing Environment

Jun-ichi Muramatsu, Takeshi Fukaya, Shao-Liang Zhang, Kinji Kimura, Yu ...

2011, Volume 1, Issue 2, p. 132-143
Published: 2011
Released: 2017/03/23

DOI: https://doi.org/10.15803/ijnc.1.2_132

Free access

Solving large-scale dense nonsymmetric eigenvalue problems is required in many areas of scientific and engineering computing, such as vibration analysis of automobiles and analysis of electronic diffraction patterns. In this study, we focus on the Hessenberg reduction step and consider accelerating it in a hybrid CPU-GPU computing environment. Because the Hessenberg reduction algorithm consists almost entirely of BLAS (Basic Linear Algebra Subprograms) operations, we propose three approaches for distributing the BLAS operations between the CPU and the GPU. Among them, the third approach, which assigns small BLAS operations to the CPU and distributes large BLAS operations between the CPU and the GPU in an optimal manner, was found to be consistently faster than the other two. On a machine with an Intel Core i7 processor and an NVIDIA Tesla C1060 GPU, this approach achieved a 3.2-fold speedup over the CPU-only case when computing the Hessenberg form of an 8,192 × 8,192 real matrix.
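
For readers unfamiliar with the Hessenberg reduction step named in this abstract, the following is a minimal sketch of the textbook Householder-based reduction in plain Python. It is not the authors' blocked or hybrid CPU-GPU implementation; the function name `hessenberg` and the list-of-lists matrix layout are assumptions made for illustration. The inner dot products and rank-one updates correspond to the kind of matrix-vector operations that a real implementation would delegate to BLAS.

```python
import math


def hessenberg(a):
    """Reduce a square matrix (list of row lists) to upper Hessenberg form.

    Textbook unblocked Householder reduction. Returns a new matrix H that is
    orthogonally similar to `a`, with zeros below the first subdiagonal.
    """
    n = len(a)
    h = [row[:] for row in a]
    for k in range(n - 2):
        # Column k below the diagonal; the reflection zeroes all but its first entry.
        x = [h[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # column is already in the required form
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        norm_v = math.sqrt(sum(t * t for t in v))
        v = [t / norm_v for t in v]
        m = len(v)
        # Left update: rows k+1.. of H become (I - 2 v v^T) times those rows.
        for j in range(n):
            dot = sum(v[i] * h[k + 1 + i][j] for i in range(m))
            for i in range(m):
                h[k + 1 + i][j] -= 2.0 * v[i] * dot
        # Right update: columns k+1.. of H become those columns times (I - 2 v v^T).
        for i in range(n):
            dot = sum(h[i][k + 1 + j] * v[j] for j in range(m))
            for j in range(m):
                h[i][k + 1 + j] -= 2.0 * dot * v[j]
    return h
```

Because the transformation is an orthogonal similarity, the trace and eigenvalues of the input are preserved, which gives a cheap sanity check on small test matrices.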

Algebra of Synchronization with Application to Deadlock and Semaphores

Ernesto Gomez, Keith Schubert

2011, Volume 1, Issue 2, p. 144-156
Published: 2011
Released: 2017/03/23

DOI: https://doi.org/10.15803/ijnc.1.2_144

Free access

Modern multiprocessor architectures have exacerbated the problems of coordinating access to shared data, particularly with regard to the possibility of deadlock. Semaphores, for example, one of the most basic synchronization primitives, present difficulties. Dijkstra defined semaphores to solve the problem of mutual exclusion. Practical implementations of the concept have, however, produced semaphores that are prone to deadlock, even though the original definition is theoretically free of it. This is not simply due to bad programming; rather, we have lacked a theory that allows us to understand the problem. We introduce a formal definition and a new general theory of synchronization. We illustrate its applicability by deriving basic deadlock properties, both to show where the problem lies with semaphores and to guide us in finding simple modifications to semaphores that greatly ameliorate it. We also suggest some future directions for deadlock resolution that avoid resource starvation.

OpenWeb: Seamless Proxy Interconnection at the Switching Layer

Yoshio Sakurauchi, Rick McGeer, Hideyuki Takada

2011, Volume 1, Issue 2, p. 157-177
Published: 2011
Released: 2017/03/23

DOI: https://doi.org/10.15803/ijnc.1.2_157

Free access

In recent years, Internet traffic has been growing faster than network capacity has been enhanced. Moreover, the amount of published information is growing at an exponential rate. Consequently, a worldwide Internet population demands performance, robustness, and low latency. Traditional solutions to these problems have led to web proxy cache systems. However, to use such systems, administrators and/or clients must perform tedious and error-prone operations, because cache systems generally need to be accessed through layer 4-7 scripts and commands, such as the route command on POSIX systems, and usually require manual configuration or JavaScript code for a web proxy. If cache systems work at the switching layer (layer 2), administrators can introduce the system just by inserting it into the network, and clients can use it transparently. This paper describes OpenWeb, a layer-2 redirection engine implemented as an application of the OpenFlow switch architecture. New open protocols at the switching layer now enable far more robust and seamless packet redirection, without user configuration or unreliable scripts. In addition, performance evaluations against traditional systems and simulations run on random networks show that OpenWeb is clearly beneficial.

Node-to-Set Disjoint-Paths Routing in Recursive Dual-Net

Yamin Li, Shietung Peng, Wanming Chu

2011, Volume 1, Issue 2, p. 178-190
Published: 2011
Released: 2017/03/23

DOI: https://doi.org/10.15803/ijnc.1.2_178

Free access

Recursive dual-net (RDN) is a newly proposed interconnection network for massively parallel computers. The RDN is based on the recursive dual-construction of a symmetric base network B. A k-level dual-construction for k > 0 creates a network RDN^k(B) containing N = (2n₀)^(2^k)/2 nodes with node-degree d₀ + k, where n₀ and d₀ are the number of nodes and the node-degree of the base network, respectively. The RDN is a symmetric graph and can contain a huge number of nodes with small node-degree and short diameter. Node-to-set disjoint-paths routing is fundamental and has many applications in fault-tolerant and secure communication in a network. In this paper, we propose an efficient algorithm for node-to-set disjoint-paths routing in RDN. We show that, given a node s and a set T of d₀ + k nodes in RDN^k(B), d₀ + k disjoint paths, each connecting s to a node in T, can be found in O(((d₀ + k)D₀/lg n₀) lg N) time, and that the length of each path is at most 3(D₀/2 + 1)(lg N + 1)/(lg n₀ + 1), where N is the number of nodes in RDN^k(B), and d₀, D₀, and n₀ are the node-degree, the diameter, and the number of nodes of the base network B, respectively.
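
To make the formulas in this abstract concrete, the sketch below evaluates the node count, the node-degree, and the stated path-length bound for a given base network. The function names and the example base network (a 4-node ring with n₀ = 4, d₀ = 2, D₀ = 2) are assumptions chosen for illustration; the code only evaluates the expressions quoted above and does not implement the routing algorithm.

```python
import math


def rdn_size(n0, d0, k):
    """Return (node count, node-degree) of RDN^k(B) from the abstract's formulas."""
    nodes = (2 * n0) ** (2 ** k) // 2  # N = (2*n0)^(2^k) / 2, always an integer
    degree = d0 + k
    return nodes, degree


def path_length_bound(n0, diameter0, k):
    """Evaluate 3 * (D0/2 + 1) * (lg N + 1) / (lg n0 + 1) for RDN^k(B)."""
    nodes, _ = rdn_size(n0, 0, k)
    return 3 * (diameter0 / 2 + 1) * (math.log2(nodes) + 1) / (math.log2(n0) + 1)


# Hypothetical base network: a ring of 4 nodes (n0 = 4, d0 = 2, D0 = 2).
print(rdn_size(4, 2, 1))           # (32, 3)
print(path_length_bound(4, 2, 1))  # 12.0
```

Since lg N + 1 = 2^k (lg n₀ + 1), the bound simplifies to 3(D₀/2 + 1)·2^k, which is why the example evaluates to exactly 12 for k = 1.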

Specialized Multicore Architectures Supporting Efficient Multi-Agent Simulations

Christian Schäck, Rolf Hoffmann, Wolfgang Heenes

2011, Volume 1, Issue 2, p. 191-210
Published: 2011
Released: 2017/03/23

DOI: https://doi.org/10.15803/ijnc.1.2_191

Free access

Two new multiprocessor architectures are presented that accelerate the simulation of multi-agent systems based on the massively parallel GCA (Global Cellular Automata) model. The GCA model is suited to describing and simulating different multi-agent systems. The designed and implemented architectures mainly consist of a set of processors (NIOS II) and a network. The multiprocessor systems allow flexible implementation through programming, so that different behaviors can be simulated on the same architecture. Two architectures, one with up to 16 processors, were implemented on an FPGA. The first architecture uses hardware hash functions to reduce the overall simulation time, but lacks scalability. The second architecture uses an agent memory and a cell field memory, which improves scalability and further increases performance.

GPU-based SoftAssign for Maximizing Image Utilization in Photomosaics

Marcos Slomp, Michihiro Mikamo, Bisser Raytchev, Toru Tamaki, Kazufumi ...

2011, Volume 1, Issue 2, p. 211-229
Published: 2011
Released: 2017/03/23

DOI: https://doi.org/10.15803/ijnc.1.2_211

Free access

Photomosaic generation is a popular non-photorealistic rendering technique in which a single image is assembled from several smaller ones. Visual responses change depending on the viewer's proximity to the photomosaic, leading to many creative prospects for publicity and art. Synthesizing photomosaics typically requires very large image databases in order to produce pleasing results. Moreover, repetitions are allowed to occur, which may locally bias the mosaic. This paper provides alternatives that prevent repetitions while still being robust enough to work with coarse image subsets. Three approaches were considered for the matching stage of photomosaics: a greedy procedural algorithm, simulated annealing, and SoftAssign. It was found that SoftAssign delivers adequate arrangements in cases where only a restricted number of images is available. This paper introduces a novel GPU-accelerated SoftAssign implementation that outperforms an optimized CPU implementation by a factor of 60 on the tested hardware.

A Uniform Platform to Support Multigenerational GPUs for High Performance Stream-based Computing

Pablo Lamilla Álvarez, Shinichi Yamagiwa, Masahiro Arai, Koichi Wada

2011, Volume 1, Issue 2, p. 230-243
Published: 2011
Released: 2017/03/23

DOI: https://doi.org/10.15803/ijnc.1.2_230

Free access

GPU-based computing, known as GPGPU, has become one of the most popular fields in high performance computing. This paper focuses on the design and implementation of a uniform GPGPU application that is optimized for both legacy and recent GPU architectures. As a typical example of such a GPGPU application, this paper discusses a uniform implementation of the Caravela platform. In particular, the flow-model execution mechanism is considered with reference to recent GPU architectures. To verify the design and implementation on the CUDA and OpenCL platforms, this paper evaluates compatibility among the architectures and reports performance measurements.

An Efficient Path Setup for a Hybrid Photonic Network-on-Chip

Cisse Ahmadou Dit ADI, Hiroki Matsutani, Michihiro Koibuchi, Hidetsugu ...

2011, Volume 1, Issue 2, p. 244-259
Published: 2011
Released: 2017/03/23

DOI: https://doi.org/10.15803/ijnc.1.2_244

Free access

Electrical network-on-chip (NoC) faces critical challenges in meeting the high performance and low power consumption requirements of future multicore processor interconnects. Recent advances in CMOS-compatible optical components give photonics the potential to deliver efficient NoC performance at an acceptable energy cost. However, the lack of in-flight processing and buffering of optical data makes the realization of a fully optical NoC complicated. A hybrid architecture that uses high-bandwidth optical transfer and an electrical control network can take advantage of both interconnection methods to offer an efficient performance-per-watt infrastructure for connecting multicore processors and systems-on-chip (SoC). In this paper, we propose predictive switching and reservation-based path setup techniques to reduce the path setup latency of such a hybrid photonic network-on-chip (HPNoC). With these techniques, it is possible to reduce the latency of end-to-end communication in an HPNoC, improving its overall performance. In the simulation, we use a cycle-accurate simulator under uniform, neighbor, and bit-reversal traffic patterns for a 64-node torus topology. The results show that the proposed techniques considerably improve the overall latency of the HPNoC.
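
As an illustration of one of the synthetic traffic patterns named in this abstract, the sketch below computes the destination of each source node under the bit-reversal pattern as it is commonly defined in NoC evaluation: the destination address is the source address with its bits in reverse order. The abstract does not give the exact definition used by the authors' simulator, so this definition, the function name, and the default of 64 nodes are assumptions.

```python
def bit_reversal_destination(src, num_nodes=64):
    """Return the destination node for `src` under bit-reversal traffic.

    Assumes num_nodes is a power of two, so every address has a fixed
    width of log2(num_nodes) bits (6 bits for 64 nodes).
    """
    bits = num_nodes.bit_length() - 1
    if num_nodes < 1 or (1 << bits) != num_nodes:
        raise ValueError("num_nodes must be a power of two")
    if not 0 <= src < num_nodes:
        raise ValueError("src out of range")
    dst = 0
    for i in range(bits):
        if src & (1 << i):
            dst |= 1 << (bits - 1 - i)  # bit i moves to the mirrored position
    return dst


# Node 1 (000001) sends to node 32 (100000) in a 64-node network.
print(bit_reversal_destination(1))  # 32
```

Applying the mapping twice returns the original node, so the pattern pairs nodes up, apart from the nodes whose address is a bit palindrome, which map to themselves.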

Implementations of a Parallel Algorithm for Computing Euclidean Distance Map in Multicore Processors and GPUs

Duhu Man, Kenji Uda, Hironobu Ueyama, Yasuaki Ito, Koji Nakano

2011, Volume 1, Issue 2, p. 260-276
Published: 2011
Released: 2017/03/23

DOI: https://doi.org/10.15803/ijnc.1.2_260

Free access

Given a 2-D binary image of size n × n, the Euclidean Distance Map (EDM) is a 2-D array of the same size in which each element stores the Euclidean distance to the nearest black pixel. It is known that a sequential algorithm can compute the EDM in O(n²) time, and thus this algorithm is optimal. Work-time optimal parallel algorithms for the shared memory model have also been presented. However, these parallel algorithms are too complicated to implement on existing shared memory parallel machines. The main contribution of this paper is to develop a simple parallel algorithm for the EDM and to implement it on two different parallel platforms: multicore processors and Graphics Processing Units (GPUs). We have implemented our parallel algorithm on a Linux server with four Intel hexa-core processors (Intel Xeon X7460, 2.66 GHz). We have also implemented it on two modern GPU systems, the Tesla C1060 and the GTX 480. The experimental results show that, for an input binary image of size 9216 × 9216, our implementation on the multicore system achieves a speedup factor of 18 over a sequential algorithm using a single processor in the same system. For the same input binary image, our implementation on the GPU achieves a speedup factor of 26 over the sequential implementation.
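
The definition of the EDM given in this abstract can be written down directly as a brute-force reference implementation, sketched below. It is a literal reading of the definition, not the O(n²) sequential algorithm or the parallel algorithm the paper develops; the function name and the encoding of black pixels as 1 are assumptions for illustration.

```python
import math


def euclidean_distance_map(image):
    """Brute-force EDM of a binary image given as a list of rows.

    Assumes 1 marks a black pixel and 0 a white pixel. Every output element
    is the Euclidean distance to the nearest black pixel. The cost is
    proportional to (number of pixels) * (number of black pixels).
    """
    black = [(r, c)
             for r, row in enumerate(image)
             for c, value in enumerate(row)
             if value == 1]
    if not black:
        raise ValueError("image contains no black pixel")
    return [[min(math.hypot(r - br, c - bc) for br, bc in black)
             for c in range(len(row))]
            for r, row in enumerate(image)]


edm = euclidean_distance_map([[1, 0, 0],
                              [0, 0, 0],
                              [0, 0, 0]])
print(edm[0][2])  # 2.0
```

A reference of this kind is mainly useful for checking the output of a faster implementation on small inputs.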

An RSA Encryption Hardware Algorithm using a Single DSP Block and a Single Block RAM on the FPGA

Song Bo, Kensuke Kawakami, Koji Nakano, Yasuaki Ito

2011, Volume 1, Issue 2, p. 277-289
Published: 2011
Released: 2017/03/23

DOI: https://doi.org/10.15803/ijnc.1.2_277

Free access

The main contribution of this paper is an efficient hardware algorithm for RSA encryption/decryption based on Montgomery multiplication. Modern FPGAs have a number of embedded DSP blocks (DSP48E1) and embedded memory blocks (BRAM). Our hardware algorithm, which supports 2048-bit RSA encryption/decryption, is designed to be implemented using one DSP48E1, one BRAM, and a few logic blocks (slices) in a Xilinx Virtex-6 family FPGA. The implementation results show that our RSA module performs 2048-bit RSA encryption/decryption in 277.26 ms. Quite surprisingly, the multiplier in the DSP48E1 used to compute Montgomery multiplication is active in more than 97% of all clock cycles. Hence, our implementation is close to optimal in the sense that it has less than 3% overhead in multiplication, and almost no further improvement is possible as long as a Montgomery multiplication based algorithm is used. Also, since our circuit uses only one DSP48E1 block and one Block RAM, a number of RSA modules can be implemented in a single FPGA and run in parallel to attain high-throughput RSA encryption/decryption.
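
For context on the Montgomery multiplication that this abstract builds on, the following is a minimal software sketch of the standard Montgomery reduction and a square-and-multiply exponentiation on top of it. It works on whole Python integers and says nothing about the paper's word-level hardware design for the DSP48E1 and BRAM; all function names are assumptions, and the three-argument `pow` with exponent -1 requires Python 3.8 or later.

```python
def montgomery_setup(n, bits):
    """Return (R, n_prime) with R = 2**bits and n * n_prime = -1 (mod R).

    Requires an odd modulus n with n < R.
    """
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r
    return r, n_prime


def montgomery_multiply(a, b, n, bits, n_prime):
    """Return a * b * R^{-1} mod n for 0 <= a, b < n, without dividing by n."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask  # chosen so that t + m*n is divisible by R
    u = (t + m * n) >> bits            # exact division by R
    return u - n if u >= n else u


def montgomery_pow(base, exponent, n):
    """Compute base**exponent mod n (n odd) using Montgomery multiplication."""
    bits = n.bit_length()
    r, n_prime = montgomery_setup(n, bits)
    r2 = (r * r) % n
    x = montgomery_multiply(base % n, r2, n, bits, n_prime)  # base in Montgomery form
    acc = r % n                                              # 1 in Montgomery form
    for bit in bin(exponent)[2:]:  # left-to-right square-and-multiply
        acc = montgomery_multiply(acc, acc, n, bits, n_prime)
        if bit == "1":
            acc = montgomery_multiply(acc, x, n, bits, n_prime)
    return montgomery_multiply(acc, 1, n, bits, n_prime)     # leave Montgomery form


# Toy RSA parameters for illustration only: n = 61 * 53, public exponent 17.
assert montgomery_pow(65, 17, 3233) == pow(65, 17, 3233)
```

The point of the technique is that the reduction step uses only multiplications, masks, and shifts, which is what makes it attractive for a multiplier-centric hardware block.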
