The 13th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), held in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS) on May 16-20, 2011, in Anchorage, USA, aims to provide a timely forum for the exchange and dissemination of new ideas, techniques and research in the field of parallel and distributed computational models.
The APDCM workshop has a history of attracting participation from reputable researchers worldwide. The program committee has encouraged the authors of accepted papers to submit full versions of their manuscripts to the International Journal on Networking and Computing (IJNC) after the workshop. After a thorough reviewing process, with extensive discussions, seven articles on various topics have been selected for publication in the IJNC special issue on APDCM.
On behalf of the APDCM workshop, we would like to express our appreciation for the considerable efforts of the reviewers who reviewed the papers submitted to the special issue. Likewise, we thank all the authors for submitting their excellent manuscripts to this special issue. We also express our sincere thanks to the editorial board of the International Journal on Networking and Computing, and in particular to the Editor-in-Chief, Professor Koji Nakano. This special issue would not have been possible without his support.
Sensor-actuator networks and interactive ubiquitous environments are distributed systems in which the sensor-actuators communicate with each other by message-passing. This paper makes three contributions. First, it gives a general system and execution model for such sensor-actuator networks in pervasive environments. Second, it examines the range of time models that are useful for specifying properties, and for implementation, in such distributed networks, and places approaches and limitations in perspective. Third, it shows that although the partial order time model has not been seen as a useful specification tool in real sensornet applications, it is nevertheless useful for real applications in pervasive sensornets because, under certain conditions, it can serve as a viable alternative to physically synchronized clocks, which provide the linear order time model.
This work presents a packet aggregation technique named Holding Time Aggregation (HTA). HTA is tailored for real-time applications whose data is carried over wireless network environments. At the center of HTA lies an elaborate packet holding time estimation, which makes HTA highly adaptable to the diverse link conditions of a wireless setting. Contrary to other proposals that consider a fixed packet retention time, the proposed HTA uses an adaptable packet retention time to allow relay nodes to explore aggregation opportunities on a multi-hop path. The proposed mechanism was evaluated and compared to another prominent packet aggregation scheme. Simulation results show that the proposed mechanism is capable of keeping jitter and total delay within application limits. Furthermore, HTA is shown to allow a substantial reduction in the number of packet transmissions as well as in the overall packet overhead. Savings in terms of packet transmissions reached nearly 80% in the evaluated scenarios. These results show that the proposed scheme is able to cope with varying network link capacity and strict application timing requirements. The empirical results are consistent with the analytical results.
The Kernel Polynomial Method (KPM) is one of the fast diagonalization methods used for simulations of quantum systems in condensed matter physics and chemistry. The algorithm is difficult to parallelize on a cluster computer or a supercomputer because of its fine-grain recursive calculations. This paper proposes an implementation of the KPM on recent graphics processing units (GPUs), where the recursive calculations can be parallelized in a massively parallel environment. This paper also describes performance evaluations with actual simulation parameters, where one parameter increases the computational intensity and another increases the amount of memory used. Moreover, the impact of applying the Compressed Row Storage (CRS) format to the KPM algorithm is discussed. Finally, the paper concludes that the GPU implementation achieves much higher performance than the CPU implementation and reduces the overall simulation time.
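The recursion that makes the KPM hard to parallelize at coarse grain can be illustrated with a minimal, CPU-only sketch. It assumes the standard Chebyshev three-term recurrence for the KPM moments and a Hamiltonian already rescaled so that its spectrum lies in [-1, 1]; the function names (`crs_from_dense`, `crs_matvec`, `chebyshev_moments`) are hypothetical and are not taken from the paper, whose GPU implementation is not described in the abstract.

```python
def crs_from_dense(dense):
    """Convert a dense matrix (list of rows) into CRS arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0.0:
                values.append(x)      # nonzero entries, row by row
                col_idx.append(j)     # column of each stored entry
        row_ptr.append(len(values))   # where the next row starts
    return values, col_idx, row_ptr


def crs_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = H x using CRS arrays."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y[i] = s
    return y


def chebyshev_moments(crs, r0, num_moments):
    """Return mu_n = <r0| T_n(H) |r0> for n = 0 .. num_moments - 1.

    Assumes the spectrum of H lies in [-1, 1] (an assumption of this sketch).
    """
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    prev = list(r0)                  # T_0(H) r0
    curr = crs_matvec(*crs, r0)      # T_1(H) r0
    moments = [dot(r0, prev), dot(r0, curr)]
    for _ in range(2, num_moments):
        hv = crs_matvec(*crs, curr)
        # Three-term recurrence: T_{n+1}(H) r0 = 2 H T_n(H) r0 - T_{n-1}(H) r0
        nxt = [2.0 * h - p for h, p in zip(hv, prev)]
        moments.append(dot(r0, nxt))
        prev, curr = curr, nxt
    return moments[:num_moments]
```

Each step depends on the two previous vectors, so the available parallelism lies inside each matrix-vector product (one row per thread is a natural mapping), which is the fine-grain structure the abstract refers to.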
The past two decades have witnessed a revolution in the use of electronic devices in our daily activities. Increasingly, such activities involve the exchange of personal and sensitive data by means of portable and lightweight devices. This implies the use of security applications in devices with limited processing capability and a low power budget. Current architectures for processors that run security applications are optimized for either high performance or low energy consumption. We propose an implementation of an architecture that not only provides high performance and low energy consumption but also mitigates security attacks on the cryptographic algorithms running on it. The proposed architecture of the Globally-Asynchronous Locally-Synchronous-based Low Power Security Processor (GALS-based LPSP) inherits scheduling freedom and high performance from dataflow architectures, and low energy consumption and flexibility from GALS systems. In this paper, a prototype of the GALS-based LPSP is implemented as a soft core on the Virtex-5 (xc5-vlx155t) FPGA. The architectural features that allow the processor to mitigate side-channel attacks are explained in detail and tested on the current encryption standard, the AES. The performance analysis reveals that the GALS-based LPSP achieves two times higher throughput with one and a half times less energy consumption than currently used embedded processors.
One of the keys to success in high performance computation using an FPGA is the efficient usage of its DSP slices and block RAMs. This paper presents an FDFM (Few DSP slices and Few block RAMs) processor core approach for implementing RSA encryption. In our approach, an efficient hardware algorithm for Chinese Remainder Theorem (CRT) based RSA decryption using the Montgomery multiplication algorithm is implemented. Our hardware algorithm, which supports up to 2048-bit RSA decryption, is designed to be implemented using one DSP slice, one block RAM and a few logic blocks in the Xilinx Virtex-6 FPGA. The implementation results show that our RSA core performs 1024-bit RSA decryption in 13.74 ms. Quite surprisingly, the multiplier in the DSP slice used to compute Montgomery multiplication is active in more than 95% of the clock cycles during processing. Hence, our implementation is close to optimal in the sense that it has less than 5% overhead in multiplication, and no substantial further improvement is possible as long as a CRT-based algorithm using Montgomery multiplication is applied. We have also succeeded in implementing 320 RSA decryption cores that work in parallel in one Xilinx Virtex-6 FPGA XC6VLX240T-1. The implemented parallel 320 RSA cores achieve a throughput of 23.03 Mbits/s for 1024-bit RSA decryption.
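For readers unfamiliar with the combination of CRT and Montgomery multiplication, the following software sketch shows the arithmetic only. It is not the authors' hardware algorithm: the function names are hypothetical, the exponentiation is plain left-to-right square-and-multiply, and Python's arbitrary-precision integers stand in for the word-level operations a DSP slice would perform. It requires Python 3.8 or later for `pow(x, -1, m)`.

```python
def mont_mul(a, b, n, n_prime, bits):
    """Montgomery product a * b * R^-1 mod n, with R = 2**bits and a, b < n."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask   # m = t * (-n^-1) mod R
    u = (t + m * n) >> bits             # exact division by R
    return u - n if u >= n else u       # single conditional subtraction


def mont_pow(base, exp, n):
    """Compute base**exp mod n for odd n using Montgomery multiplication."""
    bits = n.bit_length()               # R = 2**bits > n
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r      # n * n_prime = -1 (mod R)
    r2 = (r * r) % n
    x = mont_mul(base % n, r2, n, n_prime, bits)   # base in Montgomery form
    acc = r % n                                    # 1 in Montgomery form
    for bit in bin(exp)[2:]:                       # left-to-right scan
        acc = mont_mul(acc, acc, n, n_prime, bits)
        if bit == "1":
            acc = mont_mul(acc, x, n, n_prime, bits)
    return mont_mul(acc, 1, n, n_prime, bits)      # leave Montgomery form


def rsa_crt_decrypt(c, p, q, d):
    """RSA decryption via the CRT (Garner recombination), p and q odd primes."""
    dp, dq = d % (p - 1), d % (q - 1)
    q_inv = pow(q, -1, p)
    mp = mont_pow(c % p, dp, p)         # half-size exponentiation mod p
    mq = mont_pow(c % q, dq, q)         # half-size exponentiation mod q
    h = (q_inv * (mp - mq)) % p
    return mq + h * q


# Toy parameters, far too small for real use: n = 61 * 53, e = 17, d = 2753.
ciphertext = pow(65, 17, 61 * 53)
plaintext = rsa_crt_decrypt(ciphertext, 61, 53, 2753)
```

The CRT replaces one full-size exponentiation with two half-size ones, and Montgomery multiplication replaces division by the modulus with shifts and masks, which is why the multiplier can dominate the cycle count.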
In parallel programs, the threads of a given application must cooperate in order to accomplish the required computation. However, the communication time between tasks may differ depending on which cores they are executing on and on how the memory hierarchy and interconnection are used. The problem is even more important in multi-core machines with NUMA characteristics, since remote access imposes a high overhead, making them more sensitive to thread and data mapping. In this context, thread and data mapping are techniques that provide performance gains by improving the use of resources such as interconnections, main memory and cache memory. The problem of finding the best mapping is considered NP-hard. Furthermore, in shared memory environments, there is the additional difficulty of finding the communication pattern, which is implicit and occurs through memory accesses. Our mechanism provides static mapping on NUMA architectures and does not require any prior knowledge of the application by the programmer. To obtain the mapping, different metrics were adopted and a heuristic method based on the Edmonds matching algorithm was used. To evaluate our proposal, we use the NAS Parallel Benchmarks (NPB) running on two modern multi-core NUMA machines. Results show performance gains of up to 75% compared to the native Linux scheduler and memory allocator.
We present a new model for distributed shared memory systems, based on remote data accesses. Such features are offered by network interface cards that allow one-sided operations, remote direct memory access and OS bypass. This model leads to new interpretations of distributed algorithms, allowing us to propose an innovative technique for detecting race conditions based only on logical clocks. Indeed, the presence of (data) races makes a parallel program hard to reason about and is usually considered a bug.
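The abstract does not say which logical clocks the detection technique uses. As an illustration of the general idea only, the sketch below assumes vector clocks and flags a write-write race whenever a one-sided put is not ordered after the previous write to the same cell. The `Process` and `SharedCell` classes and their methods are hypothetical and do not reproduce the paper's model.

```python
def happens_before(a, b):
    """True if vector timestamp a is strictly before b in the partial order."""
    return all(x <= y for x, y in zip(a, b)) and a != b


class Process:
    """A process carrying a vector clock with one entry per process."""

    def __init__(self, pid, num_procs):
        self.pid = pid
        self.clock = [0] * num_procs

    def tick(self):
        self.clock[self.pid] += 1

    def send(self):
        """Return the timestamp attached to an outgoing message."""
        self.tick()
        return list(self.clock)

    def receive(self, stamp):
        """Merge the sender's timestamp, then count the receive event."""
        self.clock = [max(a, b) for a, b in zip(self.clock, stamp)]
        self.tick()


class SharedCell:
    """A remotely writable memory cell that remembers its last write's clock."""

    def __init__(self):
        self.value = None
        self.last_write = None

    def put(self, proc, value):
        """One-sided write; returns True if it races with the previous write."""
        proc.tick()
        stamp = list(proc.clock)
        race = (self.last_write is not None
                and not happens_before(self.last_write, stamp))
        self.value, self.last_write = value, stamp
        return race


# Two unsynchronized puts to the same cell are concurrent, hence a race.
p0, p1 = Process(0, 2), Process(1, 2)
cell = SharedCell()
first = cell.put(p0, "a")
second = cell.put(p1, "b")

# With a message from q0 to q1 between the puts, the writes are ordered.
q0, q1 = Process(0, 2), Process(1, 2)
ordered_cell = SharedCell()
ordered_cell.put(q0, "a")
q1.receive(q0.send())
third = ordered_cell.put(q1, "b")
```

This sketch tracks only the most recent write, so it covers write-write conflicts; handling reads would need an additional per-cell read timestamp.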
GPUs offer high capability for applications with a high level of parallelism despite their low cost. The support of integer and logical instructions by the latest generation of GPUs enables us to implement cipher algorithms more easily. However, decisions such as parallel processing granularity and memory allocation impose a heavy burden on programmers. Therefore, this paper presents the results of several experiments that were conducted to elucidate the relation between the memory allocation styles of AES variables and the granularity of the parallelism exploited from AES encoding processes, using CUDA with an NVIDIA GeForce GTX285 (Nvidia Corp.). The results of these experiments showed that a granularity of 16 bytes/thread had the highest performance, achieving approximately 35 Gbps throughput. Differences in memory allocation and granularity changed performance by approximately 2%–30% in the standard implementation. This shows that the choice of granularity and memory allocation is the most important factor for effective AES encryption on a GPU. Moreover, an implementation that overlaps processing with data transfer yielded 22.5 Gbps throughput including the data transfer time.