The 14th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), held in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS) on May 21-25, 2012, in Shanghai, China, aims to provide a timely forum for the exchange and dissemination of new ideas, techniques, and research in the field of parallel and distributed computational models.
The APDCM workshop has a history of attracting participation from renowned researchers worldwide. The program committee encouraged the authors of accepted papers to submit full versions of their manuscripts to the International Journal on Networking and Computing (IJNC) after the workshop. After a thorough reviewing process with extensive discussions, six articles on various topics were selected for publication in the IJNC special issue on APDCM.
On behalf of the APDCM workshop, we would like to express our appreciation for the great efforts of the reviewers who reviewed the papers submitted to this special issue. Likewise, we thank all the authors for submitting their excellent manuscripts. We also express our sincere thanks to the editorial board of the International Journal on Networking and Computing, in particular to the Editor-in-Chief, Professor Koji Nakano. This special issue would not have been possible without his support.
In the near future, improvements in semiconductor technology will allow thousands of resources to be implemented on a chip. However, limitations remain for both single large-scale processors and many-core processors. For single processors, the limitation arises from their design complexity; for many-core processors, an application must be partitioned into several tasks and the partitioned tasks mapped onto the cores. In this article, we propose a dynamic chip multiprocessor (CMP) model that consists of simple modules (realizing a low design complexity) and does not require application partitioning, since the scale of the processor is dynamically variable, scaling up or down on demand. This model is based on prior work on adaptive processors that can gather and release resources on a chip to dynamically form a processor. The adaptive processor takes a linear topology that realizes locality-based placement and replacement using the processing elements themselves, through a stack shift of information along the linear topology of the processing element array. For scaling the processor, the linear topology of the interconnection network therefore has to support the stack shift before and after the up- or down-scaling. To this end, we propose an interconnection network architecture called a dynamic channel segmentation distribution (dynamic CSD) network. In addition, the linear topology must be folded on-chip into a two-dimensional plane. We also propose a new conceptual topology and its cluster, which is a unit of the new topology and is replicated on the chip. We analyzed the cost in terms of the available number of clusters (adaptive processors of minimum scale) and the delay in Manhattan distance across the chip, as well as the peak Giga-Operations per Second (GOPS) across process technology scaling.
It is expected that the first exascale supercomputer will be deployed within the next 10 years, but neither its CPU architecture nor its programming model is known yet. Multicore CPUs are not expected to scale to the required number of cores per node, but hybrid multicore CPUs consisting of different kinds of processing elements are expected to solve this issue. They come at the cost of increased software development complexity, e.g., due to missing cache coherency and on-chip NUMA effects. It is unclear whether MPI and OpenMP will scale to exascale systems while supporting the easy development of scalable and efficient programs. One of the programming models considered as an alternative is the so-called partitioned global address space (PGAS) model, which targets easy development by providing one common memory address space across all cluster nodes. In this paper we first outline current and possible future hardware and introduce a new abstract hardware model able to describe hybrid clusters. We discuss how current shared memory, GPU and PGAS programming models can deal with the upcoming hardware challenges, and describe how synchronization can generate unneeded inter- and intra-node transfers when the memory consistency model is not optimal. As a major contribution, we introduce our variation of the PGAS model, which allows implicit fine-grained pairwise synchronization among the nodes and the different kinds of processors. We furthermore offer easy deployment of RDMA transfers and provide communication algorithms commonly used in MPI collective operations, but lift the requirement that the operations be collective. Our model is based on single-assignment variables and uses a data-flow-like synchronization mechanism. Reading an uninitialized variable blocks the reading thread until the data are made available by another thread. That way, synchronization is done implicitly when data are read.
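The data-flow synchronization described above can be illustrated with a minimal sketch: a write-once variable whose readers block until a value is assigned. This is an illustrative shared-memory emulation, not the paper's actual API or distributed implementation.

```python
import threading

class SingleAssignment:
    """A write-once variable with data-flow synchronization:
    readers block until some thread assigns a value."""

    def __init__(self):
        self._event = threading.Event()
        self._value = None

    def write(self, value):
        # Not fully race-free (check-then-set); sufficient for a sketch.
        if self._event.is_set():
            raise RuntimeError("single-assignment variable written twice")
        self._value = value
        self._event.set()           # wake up all blocked readers

    def read(self):
        self._event.wait()          # block until the value is available
        return self._value

# A reader thread blocks on read() until the writer assigns the value.
x = SingleAssignment()
result = []

reader = threading.Thread(target=lambda: result.append(x.read()))
reader.start()
x.write(42)                         # implicitly synchronizes with the reader
reader.join()
print(result[0])                    # -> 42
```

In this style, no explicit barrier or lock appears in user code; reading the data is itself the synchronization point.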
Explicit tiling is used to reduce synchronization overhead and to increase cache and network utilization. Broadcast, scatter and gather are modeled based on data distribution among the nodes, whereas reduction and scan follow a combining PRAM approach of having multiple threads write to the same memory location. We discuss the Gauß-Seidel stencil, bitonic sort, FFT and a manual scan implementation in our model. We implemented a proof-of-concept library showing the usability and scalability of the model. With this library the Gauß-Seidel stencil scaled well in initial experiments on an 8-node machine and we show that it is easy to keep two GPUs and multiple cores busy when computing a scan.
Chemical computing was initially proposed as a paradigm capturing the essence of parallel programs. Within such a model, a program is envisioned as a solution of information-carrying molecules that, at run time, collide non-deterministically to produce new information. Such a paradigm allows programmers to focus on the logic of the problem to be solved in parallel, without having to worry about implementation considerations. Throughout the years, the model has been enriched with various features related to program structure, control, and practicability. More importantly, the model has recently been raised to the higher order, further increasing its expressiveness. With the rise of service-oriented computing, such models have recently regained a lot of interest. They have been shown to provide adequate abstractions to enhance service-oriented architectures with autonomic properties such as self-adaptation, self-healing, or self-organisation.
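The chemical model can be conveyed by a small sequential emulation, sketched here under our own naming (the paper itself targets distributed execution): molecules react pairwise and non-deterministically until the solution is inert.

```python
import random

def react_until_inert(solution, reaction):
    """Sequential emulation of a chemical program: repeatedly pick a
    random pair of molecules; if they can react, replace them with the
    reaction's products, until no pair can react (the solution is inert)."""
    solution = list(solution)
    while True:
        pairs = [(i, j) for i in range(len(solution))
                        for j in range(i + 1, len(solution))
                 if reaction(solution[i], solution[j]) is not None]
        if not pairs:
            return solution          # inert: no pair can react
        i, j = random.choice(pairs)  # non-deterministic collision
        products = reaction(solution[i], solution[j])
        solution = [m for k, m in enumerate(solution) if k not in (i, j)]
        solution.extend(products)

# Classic example: two numbers collide and only the larger survives,
# so the inert solution contains the maximum.
keep_max = lambda a, b: [max(a, b)]

print(react_until_inert([3, 1, 4, 1, 5, 9, 2, 6], keep_max))  # -> [9]
```

Whatever order the collisions happen in, the result is the same; this order-independence is what makes the paradigm naturally parallel.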
However, the deployment of chemical programs over large-scale distributed platforms is still a widely open problem, hindering the model from being leveraged in practice. This paper studies the possibility of building a distributed execution environment for chemical programs, and to this end different approaches are discussed. Firstly, the paper envisions such a platform based on the distributed shared memory model. This approach raises several issues that prevented us from finalising it and led to the exploration of a message-passing-based solution. Thus, secondly, and more importantly, a generic peer-to-peer-based runtime model is proposed.
To complete this study, a software prototype of the second approach was developed and evaluated on the Grid’5000 test-bed. Experimental performance results are detailed, allowing for a discussion of the feasibility and performance of such a runtime, paving the way for future work and lifting a barrier towards the enactment of the chemical programming model.
The paper is devoted to Time Division Multiple Access link scheduling protocols in wireless sensor networks for full duplex (two-way) communication, where each sensor is scheduled on an incident link as a transmitter and as a receiver in two different time slots. We formulate the full duplex link scheduling problem (FDLSP) as distance-2 edge coloring in bi-directed graphs and prove tighter lower and upper bounds for the FDLSP problem. We formulate the FDLSP problem as an integer linear program (ILP). Then, we present two distributed Δ-approximation algorithms for growth-bounded graphs (GBGs), which model sensor networks, and for general graphs, Δ being the maximum node degree in the network. The first algorithm is a synchronous Δ-approximation algorithm based on finding maximal independent sets. The second is an asynchronous Δ-approximation algorithm based on depth-first search (DFS). The maximal independent set based algorithm requires only O(Δ log* n) communication rounds (where n is the number of processors in the network) in growth-bounded graphs. For general graphs, the maximal independent set based algorithm requires O(Δ⁴ + Δ³ log* n) communication rounds, improving upon the previous best known algorithm with O(nΔ² + n²m) communication rounds (where m is the number of links in the network). The asynchronous DFS-based algorithm requires only O(n) communication rounds for both general and growth-bounded graphs. Simulations show that the proposed algorithms assign on average an equal or smaller number of time slots compared to the best known distributed algorithm while being significantly faster.
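The distance-2 edge coloring constraint behind FDLSP can be sketched with a centralized greedy coloring: each directed link gets the smallest time slot not used by any link within distance two. The paper's algorithms are distributed and carry approximation guarantees; this sketch only illustrates the conflict rule.

```python
def fdlsp_greedy(adj):
    """Greedy sketch of FDLSP as distance-2 edge coloring: each directed
    link (u, v) gets the smallest slot not used by any conflicting link,
    i.e., any link that shares a node with it or whose endpoints are
    joined to its endpoints by another edge."""
    links = [(u, v) for u in adj for v in adj[u]]

    def conflict(e, f):
        (a, b), (c, d) = e, f
        if {a, b} & {c, d}:                 # share an endpoint
            return True
        # some endpoint of e is adjacent to some endpoint of f
        return any(y in adj[x] for x in (a, b) for y in (c, d))

    slot = {}
    for e in links:
        used = {slot[f] for f in slot if conflict(e, f)}
        slot[e] = min(s for s in range(len(links) + 1) if s not in used)
    return slot

# A bi-directed 3-node path: all four directed links are within
# distance two of each other, so each needs its own time slot.
path = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
slots = fdlsp_greedy(path)
print(sorted(slots.values()))   # -> [0, 1, 2, 3]
```

The two slots per sensor (one as transmitter, one as receiver on the same physical link) fall out of treating each direction of a link as a separate edge to color.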
Many image processing operations compute an individual pixel's value from the values of other pixels in its neighborhood. Such operations are called windowed operations. The size of a windowed operation is a measure of the size of the given pixel's neighborhood. A windowed computation applies a windowed operation to all pixels of the image. An image processing application is typically a sequence of windowed computations. While windowed computations admit high parallelism, the cost of inputting and outputting the image often restricts the computation to a few computational units.
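A windowed computation as described above can be sketched in a few lines; the border handling (clamping) and the box-blur operation here are illustrative choices, not the paper's.

```python
def windowed(image, w, op):
    """Apply a windowed operation of size w (odd) to every pixel of a
    square image: each output pixel is op() of the w x w neighborhood,
    clamped at the borders."""
    N = len(image)
    r = w // 2
    out = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            neighborhood = [image[x][y]
                            for x in range(max(0, i - r), min(N, i + r + 1))
                            for y in range(max(0, j - r), min(N, j + r + 1))]
            out[i][j] = op(neighborhood)
    return out

# A 3x3 box blur is a windowed operation of size 3.
blur = lambda px: sum(px) // len(px)
img = [[0, 0, 0],
       [0, 9, 0],
       [0, 0, 0]]
print(windowed(img, 3, blur))   # -> [[2, 1, 2], [1, 1, 1], [2, 1, 2]]
```

Every output pixel depends only on a small neighborhood, which is why a sequence of such computations maps naturally onto pipeline stages.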
In this paper we analytically study the running of a sequence of z windowed computations, each of size w, on a z-stage pipelined computational model. For an N × N image and n × n input/output bandwidth per stage, we show that the sequence of windowed computations can be run in at most (N²/n²)(1 + δ) steps, where δ = n/N + 6n³/(wN²) + zw/N + zn²/N². This produces a speed-up of z/(1 + δ) over a single stage; the overhead δ is quite small. We also show that the memory requirement per stage is O(wN + n²). With values of N, n and w that reflect the current state of the art, over 20 pipeline stages can be sustained with less than 5% overhead for a 10M-pixel image. Each of these stages would require less than 128 Kbytes of storage.
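The overhead bound is easy to evaluate numerically. The parameter values below (n = 32, w = 5) are our own illustrative assumptions, not the paper's; only the image size (10M pixels) and z = 20 come from the text.

```python
# Evaluate the overhead delta = n/N + 6n^3/(wN^2) + zw/N + zn^2/N^2
# for a 10M-pixel image (N^2 = 10^7) with assumed parameters:
# window size w = 5, per-stage I/O bandwidth n = 32, z = 20 stages.
N = 10_000_000 ** 0.5
n, w, z = 32, 5, 20

delta = n / N + 6 * n**3 / (w * N**2) + z * w / N + z * n**2 / N**2
print(f"overhead = {delta:.3f}")   # about 0.048, i.e. under 5%
```

Under these assumptions the dominant term is zw/N, the per-stage window start-up cost, which is consistent with the claim that 20 stages stay under 5% overhead.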
The main problems with current multicore architectures are that they are difficult to program, due to the asynchrony of the underlying model of computation, and that their performance is weak on many parallel workloads due to architectural limitations. To address these problems we have introduced the Parallel Random Access Machine - Non-Uniform Memory Access (PRAM-NUMA) model of computation, which can be used to implement efficient shared memory computers for general purpose parallel applications with enough parallelism, yet supports sequential and NUMA legacy code and avoids loss of performance in applications with low parallelism. While programming computers that make use of the PRAM-NUMA model is provably easy, there is still room for improvement, since they make time-shared multitasking expensive to implement, sometimes replicate much of the execution unnecessarily, and force the programmer to use looping and conditional control primitives when the application parallelism does not match the hardware parallelism. Thick Control Flow (TCF) is a parallel programming model that provides not a fixed number of threads, as PRAM-NUMA does, but a number of control flows, each with a certain thickness that can vary according to the needs of the application, capturing the best parts of the dynamism and generality of the original unbounded PRAM model and the simplicity of the Single Instruction Stream Multiple Data Streams (SIMD) model. In this paper we study the possibility of implementing the TCF model on top of the PRAM-NUMA model and propose an extended PRAM-NUMA model that makes this straightforward. A number of variants of the extended model are identified and tied to existing execution models. Architectural implementation techniques and programming of the extended model and its variants are outlined and discussed with short examples.
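The notion of thickness can be illustrated by a toy sequential emulation, sketched here under assumed names: one flow of control whose every statement executes across all of the flow's parallel fibers.

```python
def run_tcf(thickness, program, data):
    """Toy emulation of a thick control flow: a single instruction
    stream whose statements execute across `thickness` parallel
    fibers, each fiber seeing its own index and data element."""
    for step in program:                  # one control flow ...
        for fiber in range(thickness):    # ... replicated across the thickness
            data[fiber] = step(fiber, data[fiber])
    return data

# A flow of thickness 4: square each element, then add the fiber index.
out = run_tcf(4, [lambda i, x: x * x,
                  lambda i, x: x + i],
              [1, 2, 3, 4])
print(out)   # -> [1, 5, 11, 19]
```

In the actual model the thickness can vary at run time to match the application's parallelism, rather than being fixed like the hardware thread count of PRAM-NUMA.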
A self-stabilizing algorithm, after transient faults hit the system and place it in some arbitrary global state, causes the system to recover in finite time without external (e.g., human) intervention. In this paper, we give a distributed asynchronous silent self-stabilizing algorithm for finding a minimal k-dominating set of at most ⌈n/(k+1)⌉ processes in an arbitrary identified network of size n. We give a transformer that allows our algorithm to work under an unfair daemon, the weakest scheduling assumption. The complexity of our solution is O(n) rounds and O(Dn³) steps using O(log k + log n + k log(N/k)) bits per process, where D is the diameter of the network and N is an upper bound on n.
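The defining property of a k-dominating set (every process within distance k of the set) can be checked with a short multi-source BFS. This is only a verification sketch, not the self-stabilizing algorithm itself.

```python
from collections import deque

def is_k_dominating(adj, dset, k):
    """Check that every node of the graph `adj` is within distance k
    of some node in `dset`, via BFS from all set members at once."""
    frontier = deque((v, 0) for v in dset)
    dist = {v: 0 for v in dset}
    while frontier:
        v, d = frontier.popleft()
        if d == k:                       # do not expand past distance k
            continue
        for u in adj[v]:
            if u not in dist:
                dist[u] = d + 1
                frontier.append((u, d + 1))
    return all(v in dist for v in adj)

# A path of 5 nodes: {2} is 2-dominating, matching the size bound
# ceil(n/(k+1)) = ceil(5/3) = 2 with room to spare.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(is_k_dominating(path, {2}, 2))   # -> True
print(is_k_dominating(path, {0}, 2))   # -> False (node 4 is 4 hops away)
```

Minimality additionally requires that no member can be removed while preserving this property, which is what the algorithm maintains in its stabilized state.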
Power consumption has become a critical issue in designing computer systems. Dynamic power management is an approach that aims to reduce power consumption at the system level by selectively placing components into low power states. Time-out and prediction based policies are often adopted in practical systems. However, they must accurately determine the time spent in the low power state; otherwise, the power saved is not worth the loss of performance. In this paper, a power management scheme for multiprocessor systems is proposed to optimally reduce the power consumption of multiple processors. The key feature of the proposed power management is that how long a processor stays in the low power state is determined in advance, not decided when the processor becomes idle. Thus, many off-time quanta are pre-determined beforehand. The proposed power management schedules the off-time quanta to processors, and a processor is placed into the low power state when an off-time quantum is assigned to it. In effect, processors execute special tasks that simply reduce the power supplied to them. Hence, the off-time quanta are also called sleep tasks, which are virtual and injected into the original task traffic. By doing so, an inaccurate length of a sleep task hardly impacts performance, because if a processor is occupied by a sleep task, another processor is available unless all the others are occupied at the same time. A probabilistic policy is then proposed to optimally assign sleep tasks from the waiting queue to the processors with minimum loss of performance. In the proposed policy, high priority is given to real tasks, and sleep tasks are serviced only when necessary. The analysis of the probabilistic policy is performed on a queueing model and shows that the policy is asymptotically optimal. The proposed power management and policy are further examined in empirical studies.
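The priority rule (real tasks first, sleep tasks only when necessary) can be conveyed by a toy dispatcher sketch; the actual policy in the paper is probabilistic and analyzed on a queueing model, which this does not reproduce.

```python
from collections import deque

def dispatch(real_tasks, num_procs, sleep_quanta):
    """Toy dispatcher: pre-planned off-time quanta are queued as
    virtual 'sleep tasks' and assigned to a processor only when no
    real task is waiting, so real tasks keep high priority."""
    real = deque(real_tasks)
    sleep = deque(f"sleep{i}" for i in range(sleep_quanta))
    log = []
    for proc in range(num_procs):        # one decision per idle processor
        if real:
            log.append((proc, real.popleft()))   # real tasks first
        elif sleep:
            log.append((proc, sleep.popleft()))  # sleep only on an empty queue
        else:
            log.append((proc, "idle"))
    return log

print(dispatch(["t0", "t1"], 4, 1))
# -> [(0, 't0'), (1, 't1'), (2, 'sleep0'), (3, 'idle')]
```

Because the sleep quanta are planned in advance rather than triggered by idleness, a mispredicted sleep length ties up only one processor while the others continue serving real tasks.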
Since embedded systems are used for various purposes in a variety of places, their security, reliability, and availability are important concerns. Virtualization technology has long been used to improve the security of systems, and other advanced features have recently become available that also improve reliability and/or availability. In order to apply virtualization technology to a system, a detailed analysis of the target processor architecture is required. Although such an analysis has been performed for the x86 architecture, it is not available for the ARM architecture, the most widely used processor architecture for embedded systems. This paper focuses on such an analysis of the ARM processor architecture. We also present the implementation details of a virtual machine monitor (VMM) based on the analysis. We implemented a VMM, called SIVARM, for the ARM architecture to perform and verify the analysis of its ability to support a VMM. We successfully booted Linux on SIVARM and performed an evaluation by executing several benchmark programs on the ARM 1136JF-S processor. This verifies the analysis of the sensitive instructions and also enables an analysis of the performance impact of virtualization.
To communicate over an ad hoc sensor network, many routing protocols collect information from the whole network. Thus, for transmitting a message, they consume an amount of power that is proportional to the size of the network.
Now, we restrict the problem so that messages can be transferred only in a certain direction on a two-dimensional surface. If we solve this problem with a protocol that uses only local information, the protocol consumes an amount of power proportional not to the size of the network but to the length of the transmission path, since only the nodes near the path consume power.
Inspired by the glider of the Game of Life cellular automaton, we propose a protocol designed to obtain information on the shape and the direction of movement of each group of nodes using only local information. Our protocol achieves limited straightness for randomly distributed arrangements of nodes.