The 12th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), held in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS) on April 19-23, 2010, in Atlanta, USA, aims to provide a timely forum for the exchange and dissemination of new ideas, techniques and research in the field of parallel and distributed computational models. The APDCM workshop has a history of attracting participation from reputed researchers worldwide. The program committee encouraged the authors of accepted papers to submit full versions of their manuscripts to the International Journal on Networking and Computing (IJNC) after the workshop. After a thorough reviewing process with extensive discussions, eight articles on various topics have been selected for publication in the IJNC special issue on APDCM. On behalf of the APDCM workshop, we would like to express our appreciation for the considerable efforts of the reviewers who evaluated the papers submitted to the special issue. Likewise, we thank all the authors for submitting their excellent manuscripts to this special issue. We also express our sincere thanks to the editorial board of the International Journal on Networking and Computing, in particular to the Editor-in-Chief, Professor Koji Nakano. This special issue would not have been possible without his support.
We present a mapping of the well-known Nagel-Schreckenberg algorithm for traffic simulation onto the Global Cellular Automata (GCA) model using agents. The GCA model consists of a collection of cells that change their states synchronously depending on the states of their neighbors, as in the classical Cellular Automata (CA) model. In contrast to the CA model, the neighbors can be freely and dynamically selected at runtime. The vehicles are considered as agents that are modeled as GCA cells. An agent is linked to the agent in front of it, and an empty cell is linked to the agent behind it. In the current generation t, the position of an agent is already computed for the generation t+2. Thereby the agents' movements and all cell updates can be calculated directly as defined by the cell rule. No searching for specific cells is necessary during the computation. Compared to an optimized CA algorithm (with searching for agents), the GCA algorithm executes significantly faster, especially for low traffic densities and high vehicle speeds. Simulating one lane with a density of 10% on an FPGA multiprocessor system resulted in a speed-up (measured in clock ticks) of 14.75 for a system with 16 NIOS II processors.
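For orientation, the following is a minimal sketch of one generation of the classical single-lane Nagel-Schreckenberg rule on a circular road, i.e. the CA-style baseline with searching, not the GCA mapping of the paper. The names `nasch_step`, `EMPTY`, `v_max` and `p_brake`, the ring topology and the list encoding of the road are assumptions made for illustration only. The inner `while` loop is the search for the vehicle ahead that the linked GCA cells are meant to avoid.

```python
import random

EMPTY = -1  # hypothetical marker for a cell without a vehicle


def nasch_step(road, v_max=5, p_brake=0.0, rng=None):
    """Advance a circular single-lane road by one generation.

    road[i] is EMPTY or the current speed of the vehicle in cell i.
    All vehicles are updated synchronously from the old state.
    """
    rng = rng or random.Random()
    n = len(road)
    new_road = [EMPTY] * n
    for i, v in enumerate(road):
        if v == EMPTY:
            continue
        # Search ahead for free cells (at most v_max are relevant).
        gap = 0
        while gap < v_max and road[(i + gap + 1) % n] == EMPTY:
            gap += 1
        v = min(v + 1, v_max)          # 1. accelerate
        v = min(v, gap)                # 2. brake to avoid a collision
        if v > 0 and rng.random() < p_brake:
            v -= 1                     # 3. random dawdling
        new_road[(i + v) % n] = v      # 4. move forward by v cells
    return new_road
```

With `p_brake=0.0` the update is deterministic, which makes the rule easy to check by hand on a short road; a nonzero `p_brake` together with a seeded `random.Random` gives reproducible stochastic runs.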
It is possible to implement the parallel random access machine (PRAM) on a chip multiprocessor (CMP) efficiently with an emulated shared memory (ESM) architecture to gain the easy parallel programmability crucial to wider penetration of CMPs into general-purpose computing. This implementation relies on exploiting the slack of parallel applications, instead of caches, to hide the latency of the memory system, on sufficient bisection bandwidth to guarantee high throughput, and on hashing to avoid hot spots in intercommunication. Unfortunately, this solution cannot handle workloads with low thread-level parallelism (TLP) efficiently, because then there is not enough parallel slackness available for hiding the latency. In this paper we show that integrating non-uniform memory access (NUMA) support into the PRAM implementation architecture can solve this problem and provide a natural way to migrate legacy code written for a sequential or multi-core NUMA machine. The resulting PRAM-NUMA hybrid model is defined, and its architectural implementation is outlined on our ECLIPSE ESM CMP framework. A high-level programming language example is given.
This paper studies the difference in computational power between mesh-connected parallel computers equipped with dynamically reconfigurable bus systems and those with static ones. The mesh with separable buses (MSB) is a mesh-connected computer with dynamically reconfigurable row/column buses. The broadcasting buses of the MSB can be dynamically sectioned into smaller bus segments under program control. We examine the impact of the reconfiguration capability on the computational power of the MSB model, and investigate how much the computing power of the MSB decreases when we deprive it of its reconfigurability. We show that any single step of the MSB of size n × n can be simulated in O(log n) time by the MSB without its reconfigurable function, which means that the MSB of size n × n can work with an O(log n)-step slowdown even if its dynamic reconfiguration function is disabled.
Consider the following operation on an arbitrary positive integer: if the number is even, divide it by two, and if the number is odd, triple it and add one. The Collatz conjecture asserts that, starting from any positive integer m, repeated iteration of this operation eventually produces the value 1. The main contribution of this paper is to present an efficient implementation of a coprocessor that performs the exhaustive search to verify the Collatz conjecture using a Xilinx Virtex-6 FPGA with DSP blocks, each of which contains one multiplier and one adder. The experimental results show that our coprocessor can verify 4.99 × 10^8 64-bit numbers per second. We have also implemented a multi-coprocessor system that has 380 coprocessors on the FPGA. The experimental results show that our multi-coprocessor system can verify 1.64 × 10^11 64-bit numbers per second.
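The operation and the exhaustive check described above can be sketched in software as follows. This is only an illustrative reference model, not the FPGA coprocessor design; the function names and the `max_steps` safety bound are assumptions introduced here.

```python
def collatz_step(n: int) -> int:
    """Apply one Collatz operation: halve an even n, map an odd n to 3n + 1."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


def reaches_one(m: int, max_steps: int = 10_000) -> bool:
    """Return True if iterating from m hits 1 within max_steps operations.

    max_steps is an illustrative safety bound so the loop always terminates.
    """
    if m < 1:
        raise ValueError("m must be a positive integer")
    n = m
    for _ in range(max_steps):
        if n == 1:
            return True
        n = collatz_step(n)
    return n == 1


def verify_range(start: int, stop: int) -> bool:
    """Exhaustively check every m in the half-open range [start, stop)."""
    return all(reaches_one(m) for m in range(start, stop))
```

Python integers are unbounded, so intermediate values that exceed 64 bits need no special handling in this sketch; a fixed-width hardware implementation would have to account for them explicitly.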
In virtual MIMO technology, distributed single-antenna radio systems cooperate on information transmission and reception as a multiple-antenna MIMO radio system. In this paper, a cooperative transmission scheme, virtual MIMO network formation and reconfiguration algorithms, and a cooperative routing backbone are designed in a cross-layer fashion for wireless sensor networks (WSNs) to jointly achieve the required reliability, energy efficiency and delay reduction. The proposed cooperative transmission scheme minimizes the number of intra-communications among cooperative nodes. It can save energy and reduce latency at transmission links even when the distance between cooperative nodes is large. In the proposed routing backbone, energy consumption and latency are optimized simultaneously along the route, which extends the MIMO advantages from local transmission links to the whole network. In order to apply the virtual MIMO technology to a general WSN, the number of cooperative nodes and the length of transmission links are allowed to be heterogeneous. The proposed virtual MIMO radio network can be formed for any underlying WSN with low reconfiguration cost. The performance evaluation shows that the proposed design can fully realize the potential of the virtual MIMO technology and substantially improve reliability, latency and energy consumption in a WSN.
Detecting critical paths in traditional message-passing parallel programs can be useful for post-mortem performance analysis. This paper presents an efficient online algorithm for detecting critical paths in message-driven parallel programs. Initial implementations of the algorithm have been created in three message-driven parallel languages: Charm++, Charisma, and Structured Dagger. Not only does this work describe a novel implementation of critical path detection for message-driven programs, but the resulting critical paths are also used successfully, as the program runs, for automatic performance tuning. The actionable information provided by the critical path is shown to be useful for online performance tuning within the context of the message-driven parallel model, whereas it has never been used for online purposes within the traditional message-passing model.
Modern commodity desktop computers equipped with multi-core Central Processing Units (CPUs) and specialized but programmable co-processors are capable of providing remarkable computational performance. However, approaching this performance is not a trivial task, as it requires the coordination of architecturally different devices for cooperative execution. Coordinating the use of the full set of processing units demands careful coalescing of diverse programming models and addressing the challenges imposed by the overall system complexity. In order to exploit the computational power of a heterogeneous desktop system, such as a platform consisting of a multi-core CPU and a Graphics Processing Unit (GPU), we propose herein a collaborative execution environment that allows a single application to be executed cooperatively by exploiting both task and data parallelism. In particular, the proposed environment is able to use the different native programming models according to the device type, e.g., application programming interfaces such as OpenMP for the CPU and Compute Unified Device Architecture (CUDA) for the GPU devices. Data- and task-level parallelism is exploited for both types of processors by relying on the task description scheme defined by the proposed environment. The relevance of the proposed approach is demonstrated in a heterogeneous system with a quad-core CPU and a GPU for linear algebra and digital signal processing applications. We obtain significant performance gains in comparison to both single-core and multi-core executions when computing matrix multiplication and the Fast Fourier Transform (FFT).
Current processor architectures are diverse and heterogeneous. Examples include multicore chips, GPUs and the Cell Broadband Engine (CBE). The recent Open Computing Language (OpenCL) standard aims at efficiency and portability. This paper explores its efficiency when implemented on the CBE, without using CBE-specific features such as explicit asynchronous memory transfers. We based our experiments on two applications: matrix multiplication, and the client side of the Einstein@Home distributed computing project. Both were programmed in OpenCL, and then translated to the CBE. For matrix multiplication, we deployed different levels of OpenCL performance optimization, and observed that they pay off on the CBE. For Einstein@Home, our translated OpenCL version achieves almost the same speed as a native CBE version. We experimented with two versions of the OpenCL-to-CBE mapping, in which the PPE component of the CBE does or does not take the role of a compute unit. Another major contribution of the paper is a proposal for two OpenCL extensions that we analyzed for both the CBE and NVIDIA GPUs. First, we suggest an additional memory level in OpenCL, called static local memory. With little programming expense, it can lead to significant speedups, such as a factor of seven on the CBE and about 20% on NVIDIA GPUs for reduction. Second, we introduce static work-groups to support user-defined mappings of tasks. Static work-groups may simplify programming and lead to speedups of 35% (CBE) and 100% (GPU) for all-parallel-prefix-sums.