The 15th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), held in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS) on May 20-24, 2013, in Boston, USA, aims to provide a timely forum for the exchange and dissemination of new ideas, techniques and research in the field of parallel and distributed computational models.
The APDCM workshop has a history of attracting participation from reputed researchers worldwide. The program committee encouraged the authors of accepted papers to submit full versions of their manuscripts to the International Journal of Networking and Computing (IJNC) after the workshop. After a thorough reviewing process, with extensive discussions, eleven articles on various topics have been selected for publication in the IJNC special issue on APDCM.
On behalf of the APDCM workshop, we would like to express our appreciation for the considerable efforts of the reviewers who reviewed the papers submitted to the special issue. Likewise, we thank all the authors for submitting their excellent manuscripts to this special issue. We also express our sincere thanks to the editorial board of the International Journal of Networking and Computing, and in particular to the Editor-in-Chief, Professor Koji Nakano. This special issue would not have been possible without his support.
In the present paper, we consider fully asynchronous parallelism in membrane computing and propose asynchronous P systems for the following four graph problems: minimum coloring, maximum independent set, minimum vertex cover, and maximum clique. We first propose an asynchronous P system that solves the minimum graph coloring for a graph with $n$ nodes and show that the proposed P system works in $O(n^{n+2})$ sequential steps or $O(n^2)$ parallel steps by using $O(n^2)$ kinds of objects. Second, we propose an asynchronous P system that solves the maximum independent set for a graph with $n$ nodes and show that the proposed P system works in $O(n^2 \cdot 2^n)$ sequential steps or $O(n^2)$ parallel steps by using $O(n^2)$ kinds of objects. We next propose two asynchronous P systems that solve the minimum vertex cover and the maximum clique for the same input graph by reduction to the maximum independent set and show that the proposed P system works in $O(n^2 \cdot 2^n)$ sequential steps or $O(n^2)$ parallel steps by using $O(n^2)$ kinds of objects.
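The two reductions mentioned in the abstract above are standard and can be illustrated outside the P system setting. The following is a minimal sequential sketch, not the asynchronous P system itself: it assumes a graph given as a vertex count and an edge list, and the function names and the brute-force search are illustrative choices rather than anything taken from the paper. A minimum vertex cover is the complement of a maximum independent set, and a maximum clique is a maximum independent set of the complement graph.

```python
from itertools import combinations


def max_independent_set(n, edges):
    """Brute force: largest set of pairwise non-adjacent vertices in 0..n-1."""
    edge_set = {frozenset(e) for e in edges}
    # Try candidate sizes from largest to smallest; the first hit is maximum.
    for size in range(n, -1, -1):
        for cand in combinations(range(n), size):
            if all(frozenset(p) not in edge_set for p in combinations(cand, 2)):
                return set(cand)
    return set()


def min_vertex_cover(n, edges):
    """Reduction: the complement of a maximum independent set is a minimum cover."""
    return set(range(n)) - max_independent_set(n, edges)


def max_clique(n, edges):
    """Reduction: a clique in G is an independent set in the complement of G."""
    edge_set = {frozenset(e) for e in edges}
    complement = [
        (u, v)
        for u, v in combinations(range(n), 2)
        if frozenset((u, v)) not in edge_set
    ]
    return max_independent_set(n, complement)


# Example graph (hypothetical): a triangle 0-1-2 with a pendant vertex 3 on 2.
example_edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
```

The brute-force search examines up to $2^n$ vertex subsets, which is consistent with the exponential factor in the sequential bound quoted above, although the step counts of the P system are not modeled here.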
Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé with the non-blocking algorithm of Ni, Meneses and Kalé in terms of both performance and risk. We also extend the model proposed in [23, 15] to assess the impact of the overhead associated with non-blocking communications. In addition, we deal with arbitrary failure distributions (as opposed to uniform distributions in ). We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work without additional memory, and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations.
Token circulation is a fundamental task in distributed systems. In this paper, we propose a constant-space randomized self-stabilizing master-slave token circulation algorithm that works for undirected rings and undirected unicyclic graphs of arbitrary size. We consider the recently introduced and studied master-slave model, where a single node is designated to be a master node and the other nodes are anonymous slave nodes. The expected stabilization time is O(n log n) steps, and the space requirement at each node is 4 bits for any undirected ring (or unicyclic graph) of size n under an unfair distributed daemon; the nodes do not need to know the size of the ring, and hence the protocol is suited for dynamic graphs. The proposed token circulation algorithm is further extended to achieve orientation in the ring and the unicyclic graph. Disregarding the time for stabilization, the orientation can be done in at most O(n) steps with 1 bit of extra storage at each node for the ring and the unicyclic graph.
This paper investigates the benefits that cooperative communication brings to cognitive radio networks. We focus on cooperative Multiple Input Multiple Output (MIMO) technology, where multiple distributed single-antenna secondary users cooperate on data transmission and reception. Three cooperative MIMO paradigms are proposed to maximize the diversity gain and significantly improve the performance of overlay, underlay and interweave systems. In the paradigm for overlay systems the secondary users can assist (relay) the primary transmissions even when they are far away from the primary users. In the paradigm for underlay systems the secondary users can share the primary users' frequency resources without any knowledge about the primary users' signals. The transmitted spectral density of the secondary users falls below the noise floor at the primary receivers to meet the strict interference constraint in cognitive radio networks. In the paradigm for interweave systems, secondary users can use cooperative beamforming to avoid the interference at the primary users while still achieving high diversity gain for improved system performance. Numerical and experimental results are provided in order to discuss the advantages and limits of the proposed paradigms.
Energy consumption has become a critical factor constraining the design of massively parallel computers, necessitating the development of new models and energy-efficient algorithms. The primary component of on-chip energy consumption is data movement, and the mesh computer is a natural model of this, explicitly taking distance into account. Unfortunately, the dark silicon problem increasingly constrains the number of bits which can be moved simultaneously. For sorting, standard mesh algorithms minimize time and total data movement, and hence constraining the mesh to use only half its processors at any instant must double the time. It is anticipated that on-chip optics will be used to minimize the energy needed to move bits, but they have constraints on their layout. In an abstract model, we show that a pyramidal layout and a new power-aware algorithm allow one to sort with only a square root increase in time as the fraction of processors simultaneously powered decreases. Furthermore, this layout is shown to be optimal in terms of the time-power tradeoff required for sorting. Previous algorithms assumed fully powered systems, hence pyramid sorting was of no interest, since fully powered pyramids are no faster than the base mesh. Our results show asymptotic theoretical limits of computation and energy usage on a model which takes physical constraints and developing interconnection technology into account.
Sharing the Semantic Web data encoded in Resource Description Framework (RDF) triples from proprietary datasets scattered around the Internet calls for efficient support from distributed computing technologies. The highly dynamic ad-hoc settings that would be pervasive for Semantic Web data sharing among personal users in the future, however, pose even more demanding challenges for the enabling technologies. We extend previous work on a hybrid peer-to-peer (P2P) architecture for an ad-hoc Semantic Web data sharing system, which better models the data sharing scenario by allowing data to be maintained by its own providers and exhibits satisfactory scalability owing to the adoption of a two-level distributed index and hashing techniques. Additionally, we propose efficient, scalable decentralized processing of SPARQL Protocol and RDF Query Language (SPARQL) queries in such a context and explore optimization techniques that build upon distributed query processing for database systems and relational algebra optimization. The effectiveness and efficiency of the SPARQL query processing mechanism we propose for decentralized settings were verified through a series of experiments. We anticipate that our work will become an indispensable, complementary approach to making the Semantic Web a reality by delivering efficient data sharing and reuse in an ad-hoc environment.
In this paper we discuss energetic complexity aspects of k-Selection protocols for a single-hop radio network (which is equivalent to the Multiple Access Channel model). The aim is to grant each of k activated stations exclusive access to the communication channel. We consider both deterministic and randomized models. Our main goal is to investigate relations between the minimal time of execution (time complexity) and energy consumption (energetic complexity). We present lower bounds on energetic complexity for some classes of protocols for k-Selection (both deterministic and randomized). We also present and analyse several randomized protocols that are efficient in terms of both time and energetic complexity.
In the last few years, the development of programming languages for general purpose computing on Graphics Processing Units (GPUs) has led to the design and implementation of fast parallel algorithms for this architecture for a large spectrum of applications. Given the streaming-processing characteristics of GPUs, most practical applications consist of tasks that admit highly data-parallel algorithms. Many problems, however, allow for task-parallel solutions or a combination of task and data-parallel algorithms. For these, a hybrid CPU-GPU parallel algorithm that combines the highly parallel stream-processing power of GPUs with the higher scalar power of multi-cores is likely to be superior. In this paper we describe a generic translation of any recursive sequential implementation of a divide-and-conquer algorithm into an implementation that benefits from running in parallel on both multi-cores and GPUs. This translation is generic in the sense that it requires little knowledge of the particular algorithm. We then present a schedule and work division scheme that adapts to the characteristics of each algorithm and the underlying architecture, efficiently balancing the workload between GPU and CPU. Our experiments show a 4.5x speedup over a single-core recursive implementation, while demonstrating the accuracy and practicality of the approach.
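One common way to expose data parallelism in a recursive divide-and-conquer algorithm is to unroll the recursion level by level, so that all subproblems at the same depth can be processed together. The sketch below illustrates that general idea only; it is not the translation or the scheduling scheme described in the abstract above, and every name in it (`dc_levelwise`, `pmap`, the mergesort helpers) is a hypothetical choice made for illustration. The `pmap` parameter stands in for whatever parallel map a real system would supply; here it defaults to the built-in sequential `map`.

```python
import heapq


def dc_levelwise(problem, is_base, solve_base, divide, combine, pmap=map):
    """Run a divide-and-conquer algorithm level by level instead of recursively."""
    frontier = [problem]
    shapes = []  # per level: child count of each node, or None for a leaf
    # Divide phase: expand one level at a time until only base cases remain.
    while any(not is_base(q) for q in frontier):
        next_frontier, shape = [], []
        for q in frontier:
            if is_base(q):
                shape.append(None)  # leaf is carried down unchanged
                next_frontier.append(q)
            else:
                subs = divide(q)
                shape.append(len(subs))
                next_frontier.extend(subs)
        shapes.append(shape)
        frontier = next_frontier
    # Base phase: all leaves are independent, so they can be mapped in bulk.
    results = list(pmap(solve_base, frontier))
    # Combine phase: walk the recorded levels back up to the root.
    for shape in reversed(shapes):
        it = iter(results)
        merged = []
        for k in shape:
            if k is None:
                merged.append(next(it))
            else:
                merged.append(combine([next(it) for _ in range(k)]))
        results = merged
    return results[0]


def mergesort_levelwise(values):
    """Mergesort expressed through the generic level-wise driver."""
    return dc_levelwise(
        list(values),
        is_base=lambda xs: len(xs) <= 1,
        solve_base=lambda xs: list(xs),
        divide=lambda xs: [xs[: len(xs) // 2], xs[len(xs) // 2:]],
        combine=lambda parts: list(heapq.merge(*parts)),
    )
```

The driver needs only the four callbacks a recursive implementation already contains, which matches the abstract's remark that little knowledge of the particular algorithm is required. How levels would be split between CPU and GPU is left open here.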
In this paper, we consider the problem of decontaminating a network from a black virus (BV) using a team of mobile system agents. The BV is a harmful process which, like the extensively studied black hole (BH), destroys any agent arriving at the network site where it resides; when that occurs, unlike a black hole, which is static by definition, a BV moves, spreading to all the neighbouring sites, thus increasing its presence in the network. If, however, one of these sites contains a system agent, that clone of the BV is destroyed (i.e., removed permanently from the system). The initial location of the BV is unknown a priori. The objective is to permanently remove any presence of the BV from the network with the minimum number of site infections (and thus casualties). The main cost measure is the total number of agents needed to solve the problem.
This problem integrates in its definition both the harmful aspects of the classical black hole search problem (where, however, the dangerous elements are static) and the mobility aspects of the classical intruder capture or network decontamination problem (where, however, there is no danger for the agents). Thus, it is the first attempt to model mobile intruders that are harmful not only for the sites but also for the agents.
We start the study of this problem by focusing on some important classes of interconnection networks: grids, tori, and hypercubes. For each class we present solution protocols and strategies for the team of agents, analyze their worst case complexity, and prove their optimality.
Embedded multicore processors represented by FPGAs and GPUs have lately attracted considerable attention for their potential computation ability and power consumption. Recent FPGAs have hundreds of embedded DSP slices and block RAMs. For example, Xilinx Virtex-6 Family FPGAs have a DSP48E1 slice, which is a configurable logic block equipped with fast multipliers, adders, pipeline registers, and so on. They also have an 18 Kbit dual-port memory as a block RAM. Meanwhile, recent GPUs can be used for general purpose computation. Users can develop parallel programs running on GPUs using the programming architecture called CUDA provided by NVIDIA. The main contribution of this paper is to present two implementations of the Hough transform on the FPGA and the GPU. The first idea of the implementations is an efficient usage of DSP slices and block RAMs for FPGAs, and of the shared memory for GPUs. The second idea is to partition the voting space in the Hough transform so that the voting operation is performed in parallel. The implementation results show that the Hough transform for a 512×512 image with 33232 edge points can be done in 135.75μs and 637.88μs on the FPGA and the GPU, respectively. On the other hand, a conventional CPU implementation runs in 37.10ms. Thus, both implementations achieve a sufficient speed-up.
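The idea of partitioning the voting space can be illustrated with a small sequential sketch. This is not the FPGA or GPU implementation from the abstract above: it assumes the standard line parameterization rho = x cos(theta) + y sin(theta), and the choice to partition by angle index, along with all function and parameter names, is an assumption made for illustration. Each slice of angles writes to its own rows of the accumulator, so the slices never touch the same cell and could be processed independently.

```python
import math


def vote_slice(points, theta_indices, n_theta, rho_max):
    """Vote for one slice of the (theta, rho) space.

    Returns {theta_index: row}, where row[rho + rho_max] counts the votes.
    rho_max is assumed to be an integer no smaller than the largest
    distance of any point from the origin.
    """
    rows = {}
    for t in theta_indices:
        theta = math.pi * t / n_theta
        c, s = math.cos(theta), math.sin(theta)
        row = [0] * (2 * rho_max + 1)
        for x, y in points:
            rho = round(x * c + y * s)
            row[rho + rho_max] += 1
        rows[t] = row
    return rows


def hough_partitioned(points, n_theta, rho_max, n_parts):
    """Build the full accumulator from n_parts independent angle slices."""
    acc = {}
    for part in range(n_parts):
        # Slice `part` owns angle indices part, part + n_parts, ...
        # In a parallel setting each slice would run on its own unit.
        acc.update(vote_slice(points, range(part, n_theta, n_parts), n_theta, rho_max))
    return [acc[t] for t in range(n_theta)]


# Example input (hypothetical): ten edge points on the line y = x.
diagonal_points = [(i, i) for i in range(10)]
accumulator = hough_partitioned(diagonal_points, n_theta=180, rho_max=15, n_parts=4)
```

Because every point casts exactly one vote per angle, each accumulator row sums to the number of edge points regardless of how the angles are partitioned.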
The emulated shared memory (ESM) architectures are good candidates for future general purpose parallel computers due to their ability to provide an easy-to-use, explicitly parallel synchronous model of computation to programmers as well as to avoid most performance bottlenecks present in current multicore architectures. In order to achieve full performance, however, the applications must have enough thread-level parallelism (TLP). To solve this problem, in our earlier work we introduced a class of configurable emulated shared memory (CESM) machines that provides a special non-uniform memory access (NUMA) mode for situations where TLP is limited or for direct compatibility for legacy code sequential computing and NUMA mechanism. Unfortunately, the earlier proposed CESM architecture does not integrate its different modes well: for example, the memories for the different modes are left isolated, and therefore the programming interface is non-integrated. In this paper we propose a number of hardware and software techniques to support NUMA computing in CESM architectures in a seamless way. The hardware techniques include three different NUMA shared memory access mechanisms, and the software ones provide a mechanism to integrate and optimize NUMA computation into the standard parallel random access machine (PRAM) operation of the CESM. The hardware techniques are evaluated on our REPLICA CESM architecture and compared to an ideal CESM machine making use of the proposed software techniques.