The 17th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), held in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS) on May 25-29, 2015, in Hyderabad, India, aims to provide a timely forum for the exchange and dissemination of new ideas, techniques, and research in the field of parallel and distributed computational models.
The APDCM workshop has a history of attracting participation from renowned researchers worldwide. The program committee encouraged the authors of accepted papers to submit full versions of their manuscripts to the International Journal of Networking and Computing (IJNC) after the workshop. After a thorough reviewing process with extensive discussions, five articles on various topics were selected for publication in the IJNC special issue on APDCM.
On behalf of the APDCM workshop, we would like to express our appreciation for the considerable efforts of the reviewers who evaluated the papers submitted to this special issue. Likewise, we thank all the authors for submitting their excellent manuscripts. We also express our sincere thanks to the editorial board of the International Journal of Networking and Computing, in particular to the Editor-in-Chief, Professor Koji Nakano. This special issue would not have been possible without his support.
We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery are used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and, for each task, whether to checkpoint it after it completes. We give a polynomial-time optimal algorithm for fork DAGs (Directed Acyclic Graphs) and show that the problem is NP-complete for join DAGs. We also investigate the complexity of the simple case in which no task is checkpointed. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow, given a task execution order and a set of to-be-checkpointed tasks. Using this algorithm as a basis, we propose several heuristics for solving the scheduling problem and evaluate them on representative workflow configurations.
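For a single checkpointed task, the expected completion time under exponentially distributed failures has a well-known closed form. The sketch below is a generic first-order model of this setting, not the paper's workflow algorithm; the names `work`, `ckpt`, `rec`, and `lam` are ours.

```python
import math

def expected_time(work, ckpt, rec, lam):
    """Expected time to complete `work` seconds of computation followed by
    a checkpoint of cost `ckpt`, under exponential failures of rate `lam`,
    paying a recovery cost `rec` after each failure (downtime ignored)."""
    return math.exp(lam * rec) * (math.exp(lam * (work + ckpt)) - 1.0) / lam
```

As `lam` tends to 0 this tends to `work + ckpt`, the failure-free cost, and it grows with the failure rate, which is what makes the order/checkpoint decisions in the abstract non-trivial.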
We consider highly dynamic distributed systems modelled by time-varying graphs (TVGs). We first address impossibility results, whose proofs often rely on informal arguments about convergence. We provide a general framework that formally proves the convergence of the sequence of executions of any deterministic algorithm over any convergent sequence of TVGs. Next, we focus on the weakest class of long-lived TVGs, i.e., the class of TVGs in which every node can communicate with every other node infinitely often. We illustrate the relevance of our result by showing that no deterministic algorithm can compute various distributed covering structures on any TVG of this class. Namely, our impossibility results concern the eventual footprint, minimal dominating set, and maximal matching problems.
The bulk execution of a sequential algorithm is to execute it for many different inputs in turn or at the same time. A sequential algorithm is oblivious if the address accessed at each time unit is independent of the input. It is known that the bulk execution of an oblivious sequential algorithm can be implemented to run very efficiently on a GPU. The main purpose of our work is to implement, on a GPU, the bulk execution of a Euclidean algorithm that computes the GCD (Greatest Common Divisor) of two large numbers. We first present a new, efficient Euclidean algorithm that we call the Approximate Euclidean algorithm. Its idea is to compute an approximation of the quotient by just one 64-bit division and to use it to reduce the number of iterations of the Euclidean algorithm. Unfortunately, the Approximate Euclidean algorithm is not oblivious. To show that its bulk execution can nevertheless be implemented efficiently on a GPU, we introduce the notion of a semi-oblivious sequential algorithm, which is almost oblivious, and show that the Approximate Euclidean algorithm can be implemented as a semi-oblivious algorithm. The experimental results show that our parallel implementation of the Approximate Euclidean algorithm for 1024-bit integers running on a GeForce GTX Titan X GPU is 90 times faster than a sequential implementation on an Intel Xeon CPU.
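The quotient-approximation idea can be sketched in a few lines. This is only our illustrative reading of the idea, not the authors' GPU code: Python's arbitrary-precision integers stand in for the multi-word arithmetic, and the shift keeps both division operands within 64 bits so the approximate quotient comes from a single 64-bit division.

```python
def gcd_approx(a, b):
    """GCD via Euclid's algorithm, approximating each quotient with one
    division on the top (at most) 64 bits of the operands."""
    while a and b:
        if a < b:
            a, b = b, a
        s = max(a.bit_length() - 64, 0)
        # (a >> s) <= a / 2**s and (b >> s) + 1 > b / 2**s, so q never
        # exceeds floor(a / b) and a - q * b stays non-negative.
        q = (a >> s) // ((b >> s) + 1)
        a -= max(q, 1) * b
    return a | b
```

Because the approximate quotient only underestimates, the subtraction never overshoots; the occasional extra iteration is the price of using a single machine-word division.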
Multi-tiered transactional web applications are frequently used in enterprise IT systems. Due to their inherently distributed nature, pre-deployment testing for high availability and varying concurrency is important for post-deployment performance. Accurate performance modeling of such applications can help estimate values for future deployment variations as well as validate experimental results. To model the performance of multi-tiered applications theoretically, we use queuing networks and Mean Value Analysis (MVA) models. While MVA has been shown to work well with closed queuing networks, it has particular limitations in cases where the service demands vary with concurrency. This variation of the service demands on various resources (CPU, disk, network) is demonstrated through multiple experiments. The problem is further complicated by the use of multi-server queues in multi-core CPUs, which are not traditionally captured by MVA. We compare the performance of a multi-server MVA model with actual performance testing measurements and demonstrate this deviation. Using spline interpolation of the collected service demands, we show that a modified version of the MVA algorithm (called MVASD) that accepts an array of service demands can provide superior estimates of maximum throughput and response time. Results are demonstrated on multi-tier vehicle insurance registration and e-commerce web applications. The mean deviations of predicted throughput and response time are shown to be less than 3% and 9%, respectively. Additionally, we analyze the effect of spline interpolation of service demands as a function of throughput on the prediction results. Using Chebyshev nodes, the tradeoff between the number of test points and the spline interpolation/prediction accuracy is also studied.
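The classical exact MVA recursion that this line of work builds on can be written compactly. The sketch below is the textbook single-class version with fixed service demands; the paper's MVASD additionally lets demands vary with concurrency and handles multi-server queues, which this sketch does not attempt.

```python
def mva(demands, n_users, think_time=0.0):
    """Exact Mean Value Analysis for a closed, single-class queueing
    network of load-independent queueing stations.

    demands[k] is the total service demand (seconds) at station k.
    Returns (throughput, response_time) at population n_users."""
    queue = [0.0] * len(demands)          # mean queue length per station
    throughput, response = 0.0, 0.0
    for n in range(1, n_users + 1):
        # Residence time: service demand inflated by the queue an arriving
        # customer sees at population n - 1 (arrival theorem).
        resid = [d * (1.0 + q) for d, q in zip(demands, queue)]
        response = sum(resid)
        throughput = n / (think_time + response)
        queue = [throughput * r for r in resid]
    return throughput, response
```

At high populations the throughput saturates at 1 over the largest service demand, the bottleneck law; the deviations the abstract reports arise precisely when the fixed `demands[k]` assumption breaks down.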
In this paper, we design and implement an algorithm for finding the biconnected components of a given graph. Our algorithm is based on experimental evidence that finding the bridges of a graph is usually easier and faster in a parallel setting. We use this property to first decompose the graph into independent and maximal 2-edge-connected subgraphs. To identify the articulation points in these 2-edge-connected subgraphs, we again convert the problem into one of finding the bridges of an auxiliary graph.
It is interesting to note that the size of the graph may increase during this conversion. However, we show that this small increase in size and run time is offset by the fact that finding bridges is easier in a parallel setting. We implement our algorithm on an Intel i7 980X CPU running 12 threads and show that it is on average 2.45x faster than the best currently known algorithms implemented on the same platform. Finally, we extend our approach to dense graphs by applying the sparsification technique suggested by Cong and Bader.
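Bridge finding itself is classical; a sequential Tarjan-style low-link DFS, sketched below, shows the primitive the decomposition is built around. This is our own illustration of the standard sequential algorithm, not the paper's parallel variant or its auxiliary-graph reduction.

```python
from collections import defaultdict

def find_bridges(n, edges):
    """Return the bridges of an undirected graph with vertices 0..n-1,
    given as a list of (u, v) pairs, via a low-link depth-first search."""
    adj = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
        adj[v].append((u, i))
    disc = [-1] * n        # DFS discovery time per vertex
    low = [0] * n          # lowest discovery time reachable via one back edge
    bridges, timer = [], [0]

    def dfs(u, parent_edge):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        for v, i in adj[u]:
            if i == parent_edge:
                continue
            if disc[v] == -1:
                dfs(v, i)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:      # no back edge bypasses (u, v)
                    bridges.append(edges[i])
            else:
                low[u] = min(low[u], disc[v])

    for u in range(n):
        if disc[u] == -1:
            dfs(u, -1)
    return bridges
```

Removing the bridges leaves exactly the maximal 2-edge-connected subgraphs the abstract decomposes into, which is why a fast bridge primitive carries the whole pipeline.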
We investigated energy efficient and fault tolerant topologies for wireless sensor networks (WSNs), addressing the need to minimize communication distances, because the energy used for communication is proportional to the 2nd to 6th power of the distance. We also investigated the energy hole phenomenon, in which non-uniform energy usage among nodes causes non-uniform lifetimes. This, in turn, increases the communication distances and results in a premature shutdown of the entire network. Because some sensor nodes in a WSN may be unreliable, the network must be tolerant to faults. A routing algorithm called the “energy hole aware energy efficient communication routing algorithm” (EHAEC) was previously proposed. It mitigates the energy hole problem as far as possible while minimizing the amount of energy used for communication, by generating an energy efficient spanning tree. In this paper, we propose two provisioned fault tolerance algorithms: EHAEC for one-fault tolerance (EHAEC-1FT) and the active spare selecting algorithm (ASSA). EHAEC-1FT is a variation of EHAEC. It identifies redundant communication routes using the EHAEC tree and guarantees 2-connectivity (i.e., it tolerates the failure of one node). The ASSA attempts to find active spare nodes for critical nodes. It uses two impact factors, α and β, which can be adjusted so that the result is either more fault tolerant or more energy efficient. The spare nodes fix failures by taking over from the failed nodes. In our simulations, EHAEC was 3.4 to 4.8 times more energy efficient than direct data transmission, and thus extended the WSN lifetime. EHAEC-1FT outperformed EHAEC in terms of energy efficiency when fault tolerance was the primary concern and fault tolerant redundancy had to be created at or before the time a failure occurred. Moreover, we demonstrated that the ASSA was more energy efficient than EHAEC-1FT, and showed the effect of using different values of α and β.
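The distance-power energy model explains why a spanning tree helps: when transmission energy grows as distance to the power α, relaying over several short hops can cost less than one long hop (with α = 2, one 2-unit hop costs 4 while two 1-unit hops cost 2). The sketch below is a generic Prim-style tree with that edge weight, our illustration only; EHAEC's actual tree construction and its energy-hole awareness are not reproduced here.

```python
import heapq
import math

def energy_spanning_tree(coords, path_loss_exp=2.0):
    """Minimum spanning tree over nodes at the given 2-D coordinates,
    with edge weight distance ** path_loss_exp (a simple radio energy
    model), built with Prim's algorithm. Returns a list of (i, j) edges."""
    n = len(coords)

    def weight(i, j):
        return math.dist(coords[i], coords[j]) ** path_loss_exp

    in_tree = [False] * n
    in_tree[0] = True
    heap = [(weight(0, j), 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    tree = []
    while heap and len(tree) < n - 1:
        _, i, j = heapq.heappop(heap)
        if in_tree[j]:
            continue
        in_tree[j] = True
        tree.append((i, j))
        for k in range(n):
            if not in_tree[k]:
                heapq.heappush(heap, (weight(j, k), j, k))
    return tree
```

Larger values of `path_loss_exp` (the abstract's 2nd to 6th power) penalize long links more heavily, pushing the tree toward chains of short hops.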
Performance analysis and troubleshooting of cloud applications are challenging. In particular, identifying the root causes of performance problems is quite difficult, because profiling tools based on processor performance counters do not yet work well across an entire virtualized environment, the underlying infrastructure of cloud computing. In this work, we explore an approach for unified performance profiling of an entire virtualized environment by sampling only at the virtual machine monitor (VMM) level and applying common-time-based analysis across the entire virtualized environment, from the VMM to all guests on a host machine. Our approach involves three parts, each with novel techniques: centralized data sampling at the VMM level, generation of symbol maps for programs running in both the guests and the VMM, and unified analysis of the entire virtualized environment on a common time base, the host time axis. We also describe the design of unified profiling for an entire virtual machine (VM) environment, and we implement a unified VM profiler based on hardware performance counters. Finally, our results demonstrate accurate profiling. In addition, we achieved a lower overhead than a previous study, because our measurement injects no virtual interrupts into the guests and thus causes no additional context switches.