The Fourth International Symposium on Computing and Networking (CANDAR 2016) was held in Hiroshima, Japan, from November 22nd to 25th, 2016. The organizers of CANDAR 2016 invited authors to submit extended versions of their presented papers. As a result, 28 articles were submitted to this special issue, of which 17 extended papers were accepted and are included here.

This issue owes a great deal to the many people who devoted their time and expertise to handling the submitted papers. In particular, I would like to thank the guest editors for the excellent review process: Professor Ryusuke Egawa, Professor Akihiro Fujiwara, Professor Jose Gracia, Professor Katsunobu Imai, Professor Yasuaki Ito, Professor Yoshiaki Kakuda, Professor Michihiro Koibuchi, Professor Susumu Matsumae, Professor Toru Nakanishi, Professor Yasuyuki Nogami, Professor Satoshi Ohzahata, and Professor Tomoaki Tsumura. Words of gratitude are also due to the anonymous reviewers who carefully read the papers and provided detailed comments and suggestions to improve their quality. This special issue would not have been possible without their efforts.
Oblivious RAM (ORAM) is a technique to hide the access pattern to untrusted memory along with the data contents. Path ORAM is a recent lightweight ORAM protocol whose derived access pattern involves some redundancy that can be removed without loss of security. This paper presents last path caching, which removes this redundancy of Path ORAM with a simpler protocol than an existing method called Fork Path ORAM. By combining the Delay and Reuse schemes, the performance of our technique is comparable with that of Fork Path ORAM. According to our evaluation with a prototype FPGA implementation, the number of LUTs used by last path caching was 1.4%-7.8% smaller than that of Fork Path ORAM.
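As a rough illustration of the redundancy that such caching schemes target: in a tree-based ORAM such as Path ORAM, every access reads one root-to-leaf path of buckets, and consecutive paths always share the buckets near the root. The sketch below (function names are ours, not from the paper) computes the shared buckets that a cache of the previously read path would make re-reading unnecessary.

```python
def path_nodes(leaf, levels):
    """Return the node indices on the path from the given leaf up to the
    root (index 1) in a complete binary tree with `levels` levels
    (heap-style numbering: the children of node i are 2i and 2i + 1)."""
    node = (1 << (levels - 1)) + leaf   # the leaf's node index at the bottom level
    path = []
    while node >= 1:
        path.append(node)
        node //= 2
    return path                          # ordered leaf ... root

def redundant_buckets(prev_leaf, next_leaf, levels):
    """Buckets on the next path that were already read on the previous one;
    a cached copy of the last path lets these be skipped."""
    return set(path_nodes(prev_leaf, levels)) & set(path_nodes(next_leaf, levels))
```

Two paths to neighboring leaves share every level but the last, while paths to distant leaves share only the root, which is why the exploitable redundancy varies per access.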
Linear cellular automata have many invariant measures in general. There are several studies on their rigidity: the unique invariant measure satisfying a suitable non-degeneracy condition (such as positive entropy or the mixing property for the shift map) is the uniform measure, the most natural one. This is related to the study of the asymptotic randomization property: iterates starting from a large class of initial measures converge to the uniform measure (in the Cesàro sense). In this paper we consider one-dimensional linear cellular automata with a neighborhood of size two, and study limiting distributions starting from a class of shift-invariant probability measures. In the two-state case, we characterize when iterates by the addition-modulo-2 cellular automaton starting from a convex combination of strong mixing probability measures can converge. This also yields all invariant measures inside the class of those probability measures. We obtain a similar result for iterates by the addition-modulo-p cellular automaton, for an odd prime p, starting from strong mixing probability measures.
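For concreteness, the two-state, neighborhood-size-two case studied here is the cellular automaton that adds each cell to its right neighbor modulo 2; iterating it from a single 1 traces out Pascal's triangle mod 2. A minimal sketch:

```python
def step_mod2(config):
    """One step of the addition-modulo-2 CA with neighborhood {0, 1}:
    the new state of cell i is (x_i + x_{i+1}) mod 2, on a circular array."""
    n = len(config)
    return [(config[i] + config[(i + 1) % n]) % 2 for i in range(n)]
```

After t steps, cell i holds the mod-2 sum of binomial(t, k) copies of the initial cell i + k, which is the linearity that the measure-theoretic analysis exploits.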
Nowadays, the individual nodes of a distributed parallel computer consist of multi- or many-core processors, allowing more than one process to be executed per node. The large difference in communication speed within a node (through shared memory) versus across nodes (through the network interconnect) requires locality-aware communication schemes for any efficient distributed application. However, writing efficient locality-aware MPI code is complex and error-prone, because the developer has to use very different APIs for communication within and across nodes, and must manage inter-process synchronization. In this paper, we analyze and enhance a recent one-sided communication model, namely DART-MPI, which is implemented on top of MPI-3. In this runtime system, the complexities of handling the locality of MPI memory access operations, whether remote or local, and the related synchronization calls are hidden inside the corresponding DART-MPI interfaces, resulting in concise code and improved application and developer productivity. We have carried out an in-depth evaluation of our DART-MPI system. First, a micro-benchmark is conducted to help understand the main performance overhead of the DART-MPI APIs, which is small and becomes negligible as message sizes grow. We then compare the performance of DART-MPI and flat MPI without locality awareness, in particular for blocking and non-blocking memory operations, using a realistic scientific application on a large-scale supercomputer. The comparison demonstrates that in most cases the DART-MPI version of this application outperforms the flat MPI version. Further, we compare the DART-MPI version to a functionally equivalent MPI version, which thus includes code to deal with data locality, and show that DART-MPI realizes almost the full potential of highly optimized MPI while maintaining high productivity for non-expert programmers.
Non-von Neumann computer architectures have been widely studied in preparation for the post-Moore era. The authors previously implemented such an architecture, which finds low-energy states of the Ising model using circuit operations inspired by simulated annealing, in SRAM-based integrated circuits. Our previous prototype was suited only to Ising models with a simple, typical structure such as a three-dimensional lattice topology, and could not be used in real-world applications. A reconfigurable prototyping environment is needed to develop the architecture further and make it suitable for such applications.
Here, we describe an FPGA-based prototyping environment for developing the annealing processor architecture for the Ising model. We implemented the new architecture using this prototyping environment. The new architecture performs approximate simulated annealing for the Ising model and supports a highly complex topology. It consists of units, each holding multiple fully connected spins. The units are placed in a two-dimensional lattice topology, and neighboring units are connected to realize interactions between spins. The number of logic elements was reduced by sharing the operator among the multiple spins within a unit. Furthermore, a pseudo-random number generator, which produces random pulse sequences for annealing, is shared among all the units. As a result, the number of logic elements was reduced to less than 1/10, while the solution accuracy remained comparable to that of simulated annealing on a conventional computer.
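The hardware approximates simulated annealing for the Ising energy function; in software, the textbook single-spin-flip version of that baseline looks as follows (a generic sketch, not the authors' circuit algorithm; the 1D chain and the linear cooling schedule are illustrative choices):

```python
import math
import random

def ising_energy(spins, J=1.0):
    """Energy of a 1D ferromagnetic Ising chain: E = -J * sum_i s_i * s_{i+1}."""
    return -J * sum(spins[i] * spins[i + 1] for i in range(len(spins) - 1))

def anneal(spins, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    """Single-spin-flip simulated annealing with a linear cooling schedule;
    returns the best configuration and best energy seen."""
    rng = random.Random(seed)
    e = ising_energy(spins)
    best, best_e = list(spins), e
    for step in range(steps):
        t = t_start + (t_end - t_start) * step / steps
        i = rng.randrange(len(spins))
        # Energy change of flipping spin i: only its two chain bonds change.
        de = 0.0
        if i > 0:
            de += 2.0 * spins[i] * spins[i - 1]
        if i < len(spins) - 1:
            de += 2.0 * spins[i] * spins[i + 1]
        # Metropolis rule: always accept downhill, accept uphill with prob e^(-dE/T).
        if de <= 0 or rng.random() < math.exp(-de / t):
            spins[i] = -spins[i]
            e += de
            if e < best_e:
                best_e, best = e, list(spins)
    return best, best_e
```

The hardware replaces the Metropolis acceptance test with shared random pulse sequences, which is where the logic-element savings described above come from.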
Numerous TOP500 supercomputers are based on a torus interconnection network. The torus topology is one of the most popular interconnection networks for massively parallel systems due to its attractive topological properties, such as symmetry and simplicity. For instance, the world-famous supercomputers Fujitsu K, IBM Blue Gene/L, IBM Blue Gene/P, and Cray XT3 are all torus-based. In this paper, we propose an algorithm that constructs 2n mutually node-disjoint paths from a set S of 2n source nodes to a set D of 2n destination nodes in an n-dimensional k-ary torus Tn,k (n ≥ 1, k ≥ 3). This algorithm is then formally evaluated. We prove that the paths selected by the proposed algorithm have lengths at most 2(k+1)n and can be obtained with a time complexity of O(kn³ + n³ log n).
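The per-dimension structure that torus routing algorithms exploit is easy to state: each dimension of a k-ary torus is a ring, so the hop distance between two nodes is the sum, over dimensions, of the shorter of the two ring directions. A small sketch:

```python
def torus_distance(a, b, k):
    """Shortest-path (hop) distance between nodes a and b of an n-dimensional
    k-ary torus, given as coordinate tuples; in each dimension the shorter
    of the two ring directions (clockwise or counterclockwise) is taken."""
    return sum(min(abs(x - y), k - abs(x - y)) for x, y in zip(a, b))
```

The wrap-around links are what give the torus its symmetry and halve the worst-case distance per dimension compared with a mesh.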
Recently, IEEE 802.11n access points (APs) have prevailed in wireless local area networks (WLANs) due to their high-speed data transmission using multiple-input multiple-output (MIMO) technology. Unfortunately, the signal propagation from an 802.11n AP is not uniform in the circumferential and height directions because of the multiple antennas used for MIMO. As a result, the data transmission speed between the AP and a host can be significantly affected by their relative setup conditions. In this paper, we propose a minimax approach for optimizing the 802.11n AP setup in terms of the angles and the height in an indoor environment using throughput measurements. First, we detect the bottleneck host, the one that receives the weakest signal from the AP in the field, using a throughput estimation model. To explore optimal parameter values for this model, we adopt a versatile parameter optimization tool. Then, we optimize the AP setup by changing the angles and the height while measuring throughput. For evaluation, we verify the accuracy of the model against measurement results and confirm the throughput improvements our approach provides for hosts in the field.
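The minimax criterion itself is simple to state: among the candidate setups (angle/height combinations), choose the one that maximizes the throughput of its worst-off, bottleneck host. A toy sketch with a hypothetical measurement matrix (the numbers are invented, not the paper's data):

```python
def best_setup(throughput):
    """Minimax setup selection: throughput[s][h] is the measured or estimated
    throughput of host h under candidate setup s; return the index of the
    setup whose weakest (bottleneck) host does best."""
    return max(range(len(throughput)), key=lambda s: min(throughput[s]))

# Three candidate AP setups, three hosts (Mbps values are made up).
measurements = [[10, 2, 8],
                [7, 6, 6],
                [9, 1, 9]]
```

Here setup 1 wins: its bottleneck host gets 6 Mbps, versus 2 and 1 for the alternatives, even though the other setups have higher peaks.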
Stream compaction, also known as stream filtering or selection, produces a smaller output array that contains only the wanted elements (or their indices) from the input array for further processing. With the tremendous number of data elements to be filtered, the performance of selection is of great concern. Recently, modern Graphics Processing Units (GPUs) have been increasingly used to accelerate the execution of massively large, data-parallel applications. In this paper, we design and implement two new algorithms for stream compaction on the GPU. The first algorithm, which preserves the relative order of the input elements, uses a multi-level prefix-sum approach. The second algorithm, which is non-order-preserving, is based on the hybrid use of the prefix-sum and atomics approaches. We compared their performance with other parallel selection algorithms on the current generation of NVIDIA GPUs. The experimental results show that both algorithms run faster than Thrust, an open-source parallel algorithms library. Furthermore, the hybrid method performs the best among all existing selection algorithms on the GPU and can be two orders of magnitude faster than sequential selection on the CPU, especially when the data size is large.
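The order-preserving idea is worth sketching sequentially: an exclusive prefix sum over the predicate flags yields each kept element's output position, and the elements are then scattered to those positions; on a GPU both steps run in parallel. A minimal Python sketch of this standard scheme (not the paper's multi-level implementation):

```python
def exclusive_prefix_sum(flags):
    """Exclusive prefix sum: out[i] = flags[0] + ... + flags[i-1];
    also returns the total, which is the size of the compacted output."""
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out, total

def compact(data, pred):
    """Order-preserving stream compaction via prefix sum + scatter."""
    flags = [1 if pred(x) else 0 for x in data]
    offsets, total = exclusive_prefix_sum(flags)
    out = [None] * total
    for i, x in enumerate(data):
        if flags[i]:
            out[offsets[i]] = x     # scatter each kept element to its slot
    return out
```

The non-order-preserving hybrid variant instead lets blocks claim output ranges with atomic counters, trading element order for less synchronization.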
The main contribution of this paper is to present an efficient GPU implementation of the bulk computation of eigenvalues for many small, non-symmetric, real matrices. This work is motivated by the necessity of such bulk computation in the design of control systems, which requires computing the eigenvalues of hundreds of thousands of non-symmetric real matrices of size up to 30x30. Several efforts have been devoted to accelerating eigenvalue computation, including computer languages, systems, and environments that support matrix manipulation through specific libraries/function calls. Some of them are optimized for computing the eigenvalues of a single very large matrix by parallel processing. However, such libraries/function calls are not aimed at accelerating the eigenvalue computation for many small matrices. In our GPU implementation, we considered programming issues of the GPU architecture, including warp divergence, coalesced access to the global memory, utilization of the shared memory, and so forth. In particular, we present two types of assignments of GPU threads to matrices and introduce three memory arrangements in the global memory. Furthermore, to hide CPU-GPU data transfer latency, computation on the GPU is overlapped with the transfer. Experimental results on the NVIDIA TITAN X show that our GPU implementation attains speed-up factors of up to 83.50 and 17.67 over the sequential CPU implementation and the parallel CPU implementation with eight threads on an Intel Core i7-6700K, respectively.
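The workload shape is worth illustrating: each matrix is a tiny independent eigenproblem, so the parallelism comes from the number of matrices rather than the size of any one of them. A pure-Python sketch for the 2x2 case (the paper handles matrices up to 30x30 with proper iterative methods; this closed form only shows why non-symmetric input yields complex eigenvalues):

```python
import cmath

def eig2x2(m):
    """Eigenvalues of one 2x2 real matrix [[a, b], [c, d]] via the roots of
    the characteristic polynomial x^2 - tr*x + det; complex-conjugate pairs
    arise whenever the discriminant is negative (non-symmetric input)."""
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    s = cmath.sqrt(tr * tr - 4 * det)
    return ((tr + s) / 2, (tr - s) / 2)

def bulk_eig2x2(mats):
    """Bulk version: one small, independent problem per matrix -- the shape
    of work that maps naturally to one GPU thread (or group) per matrix."""
    return [eig2x2(m) for m in mats]
```

On a GPU, the design questions are exactly those the abstract lists: how many threads per matrix, and how to lay the matrices out so that neighboring threads touch neighboring global-memory words.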
Recently, wireless local-area networks (WLANs) have become prevalent, as they provide flexible Internet access to users at low cost through the installation of several types of access points (APs). Previously, we proposed the active AP configuration algorithm for the elastic WLAN system using heterogeneous APs, which dynamically optimizes the configuration by activating or deactivating APs based on traffic demands. However, this algorithm assumes that any active AP may use a different channel from the others to avoid interference, although the number of non-interfering channels in the IEEE 802.11 protocols is limited. In this paper, we propose an extension of the AP configuration algorithm that considers the channel assignment to the active APs under this limitation. Besides, the AP associations of the hosts are modified to improve network performance by averaging loads among channels. The effectiveness of our proposal is evaluated using the WIMNET simulator in two topologies. Finally, the elastic WLAN system including this proposal is implemented using a Raspberry Pi as the AP. The feasibility and performance of the implementation are verified through experiments on the testbed.
Many text mining tools cannot be applied directly to documents available on web pages. There are tools for fetching and preprocessing textual data, but combining them with a data processing tool into one working tool chain can be time-consuming. The preprocessing task is even more labor-intensive if the documents are located on multiple remote sources with different storage formats.
In this paper, we propose a simplification of the data preparation process for cases in which data come from a wide range of web resources. We developed an open-source tool, called Kayur, that greatly reduces the time and effort required for routine data preprocessing steps, allowing the user to proceed quickly to the main task of data analysis. The datasets generated by the tool are ready to be loaded into a data mining workbench, such as WEKA or Carrot2, to perform classification, feature prediction, and other data mining tasks.
Because of the widespread adoption of mobile devices, many applications have provided support for wireless LAN (WLAN). Under these circumstances, one of the important issues is to provide good quality of service (QoS) in WLAN. For this purpose, Dhurandher et al. improved the distributed coordination function (DCF). In their method, the contention window (CW) is divided into multiple ranges; each range is independent of all other ranges and is assigned to a different priority. Although the highest-priority throughput increased with this method, throughput for the other priorities decreased significantly. To overcome this problem, this paper proposes a minimum contention window control method for two priorities (high and low). In the method, all nodes are assumed to use real-time applications or data transmission. The former, real-time frames, have high priority and are sent by UDP; the latter, data frames, have low priority and are sent by TCP. The purpose of the proposed method is not only to provide good QoS for the highest priority but also to prevent deterioration in the QoS for the other priorities in WLAN. To this end, the proposed method keeps the CW for the high priority at a low value and controls the CW for the low priority based on the collision history. Finally, network simulations demonstrated that the proposed method reduces the decrease in the average total throughput of the low-priority frames as well as reducing the packet drop rate of both priorities, compared with DCF and Dhurandher's method. In a simulation scenario with only low-priority flows in wider bandwidths, all methods give almost the same average total throughput and packet drop rate, but the results also suggest that the CW range in the proposed method should be reduced to improve the average total throughput when no congestion occurs.
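The control idea can be sketched abstractly: the high-priority CW stays small and fixed, while the low-priority CW grows and shrinks with the collision history. The doubling/halving rule below is our illustrative stand-in (with invented bounds), not the paper's exact control law:

```python
CW_HIGH = 16   # high-priority window kept small and fixed (illustrative value)

def next_cw_low(cw, collided, cw_min=64, cw_max=1024):
    """Low-priority contention window update: double the window after a
    collision, halve it after a success, clamped to [cw_min, cw_max].
    A larger CW means longer average backoff and thus less pressure on
    the channel from low-priority traffic."""
    cw = cw * 2 if collided else cw // 2
    return max(cw_min, min(cw_max, cw))
```

Keeping cw_min well above CW_HIGH is what protects the real-time frames, while the collision-driven decay lets low-priority throughput recover when the channel is idle.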
What is computable with limited resources? How can we verify the correctness of computations? How can we measure computational power with precision? Despite the immense scientific and engineering progress in computing, we still have only partial answers to these questions. To make these problems more precise and easier to tackle, we describe an abstract algebraic definition of classical computation by generalizing traditional models to semigroups. In this view, implementations are morphic relations between semigroups. The mathematical abstraction also allows the investigation of different computing paradigms (e.g. cellular automata, reversible computing) within the same framework. While semigroup theory helps in clarifying foundational issues about computation, it also has several open problems that require extensive computational effort. This mutually beneficial relationship is the central tenet of the described research.
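A minimal concrete instance of the semigroup view: the transformations of a finite set, composed as functions, form a semigroup, and it is objects like these whose structure demands the computational effort mentioned above. A brute-force sketch for a 2-element set (our illustration, not an example from the paper):

```python
from itertools import product

def compose(f, g):
    """Compose two transformations of {0, ..., n-1}, each written as a tuple
    t with t[i] = image of i; (f o g)[i] = f[g[i]]."""
    return tuple(f[g[i]] for i in range(len(g)))

# The full transformation semigroup on a 2-element set: all 4 self-maps.
T2 = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Semigroup axioms, checked by brute force: closure and associativity.
closed = all(compose(f, g) in T2 for f, g in product(T2, repeat=2))
associative = all(compose(compose(f, g), h) == compose(f, compose(g, h))
                  for f, g, h in product(T2, repeat=3))
```

Already here one can see computational content: (0, 1) is the identity, (1, 0) is the reversible bit-flip, and the two constant maps destroy information, the algebraic counterpart of irreversible computation.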
Often, in a distributed system, a task must be performed in which all entities must be involved; however, only some of them are active, while the others are inactive, unaware of the new computation that has to take place. In these situations, all entities must become active, a task known as Wake-Up. It is not difficult to see that Broadcast is just the special case of the Wake-Up problem in which there is only one initially active entity. Both problems can be solved with the same trivial but expensive solution: Flooding. More efficient broadcast protocols exist for some classes of dense interconnection networks. The research question we examine is whether wake-up can also be performed significantly better in three classes of regular interconnection networks: hypercubes, complete networks, and regular complete bipartite graphs.
In a d-dimensional hypercube network of n nodes, the cost of broadcasting is Θ(n), even if the edge labeling is arbitrary and the network is asynchronous. We show that, instead, wake-up requires Ω(n log n) message transmissions in the worst case, even if the network is synchronous and has sense of direction. Similarly, in a regular complete bipartite network Kp,p of n = 2p anonymous entities, the cost of broadcasting is Θ(n) even if the edge labeling is arbitrary and the network is asynchronous; instead, we show that wake-up requires Θ(n²) message transmissions in the worst case, even if the network is synchronous and has sense of direction.
In a complete network Kn of n entities, the cost of broadcasting is minimal: n - 1 message transmissions suffice even if the entities are anonymous. In this paper we prove that the cost of wake-up is an order of magnitude higher. In the case of anonymous entities, Ω(n²) message transmissions are needed in the worst case, even if the network is fully synchronous and has sense of direction. In the case of entities with distinct IDs, Ω(n log n) transmissions need to be performed, and this bound is tight. This shows that, when the entities have IDs, Wake-Up is computationally as costly as the apparently more complex Election problem.
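The trivial flooding solution mentioned above is easy to make concrete: once an entity becomes awake, it transmits on every incident link. Simulated on the complete network Kn, this costs n(n-1) messages, matching the Θ(n²) anonymous bound. The sketch below is a generic simulation (ours, not from the paper):

```python
def flood_wakeup(adj, initiators):
    """Simulate synchronous flooding wake-up on an arbitrary network:
    on becoming awake, an entity sends one message over every incident
    link; return the total number of messages sent."""
    awake, frontier, messages = set(initiators), list(initiators), 0
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                messages += 1
                if v not in awake:
                    awake.add(v)
                    nxt.append(v)
        frontier = nxt
    return messages

# The complete network K_n as an adjacency map.
n = 6
k_n = {u: [v for v in range(n) if v != u] for u in range(n)}
```

Since every entity eventually wakes and sends on all its links, flooding always costs the sum of the degrees; the interesting question settled in the paper is which networks admit anything cheaper.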
In this paper, we present a self-optimizing routing algorithm that uses only local information in a three-dimensional (3D) virtual grid network. A virtual grid network is a well-known network model, valued for the ease of designing algorithms on it and for saving energy consumption. We consider a 3D virtual grid network obtained by virtually dividing a network into a set of unit cubes called cells. One specific node, named the router, is selected in each cell, and each router is connected with the routers of adjacent cells. This implies that each router can communicate with 6 routers.
We consider the maintenance of an inter-cell communication path from a source node to a destination node and propose a distributed self-optimizing routing algorithm that transforms an arbitrary given path into an optimal (shortest) one from the source node to the destination node. Our algorithm is executed at each router and uses only local information (6 hops: 3 hops backward and 3 hops forward along the given path). Our algorithm can work in asynchronous networks without any global coordination among routers.
We prove that our algorithm transforms any arbitrary path into a shortest path in O(|P|) synchronous rounds, where |P| is the length of the initial path, when it works in synchronous networks. Moreover, our experiments show that our algorithm converges in about |P|/2 synchronous rounds, and that this ratio becomes lower as |P| becomes larger.
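The optimality target of such a transformation is easy to state: a grid path is shortest exactly when, on each axis, it moves in only one direction, so shortening amounts to cancelling pairs of opposite moves. The sketch below performs this cancellation globally for illustration only; the point of the paper's algorithm is to achieve the same result distributedly, with each router seeing just a 6-hop window:

```python
def shorten(moves):
    """Reduce a 3D grid path, given as a list of axis moves such as '+x' or
    '-y', by cancelling every pair of opposite moves on the same axis.
    The result has the minimum (Manhattan) length for the same endpoints."""
    count = {}
    for m in moves:
        count[m] = count.get(m, 0) + 1
    out = []
    for axis in ('x', 'y', 'z'):
        net = count.get('+' + axis, 0) - count.get('-' + axis, 0)
        out += ['+' + axis] * max(net, 0) + ['-' + axis] * max(-net, 0)
    return out
```

Each cancelled pair corresponds to a local detour that the self-optimizing rule can cut out; once no opposite pair remains, the path length equals the Manhattan distance between the endpoints.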
This paper presents a real-time FPGA implementation of posterior state estimation in dynamic models based on the particle filter algorithm. Specifically, our system implements a parallel resampling (FO-resampling) algorithm on a stream-based architecture. The resampling is accomplished within the valid pixel area of an input image frame, while prediction and update of particles are performed in a synchronization region; thus our approach achieves real-time performance of 60 fps for VGA images, synchronized with the camera pixel throughput, without using any external memory devices. Through evaluation with an object tracking benchmark video, the trade-off between tracking quality and the number of particles is analyzed to find appropriate hardware parameters. In addition, we address the improvement of resource utilization in our particle filter architecture, in particular by using a higher clock frequency to reuse hardware resources in a time-sharing manner. The implementation experiments reveal that the proposed approach allows the original design to fit in a smaller FPGA chip. However, we also demonstrate that this size-reduction approach incurs an overhead of 2.7 to 3.0 times the power consumption compared with the original designs at a slow clock frequency.
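Resampling is the particle filter stage that is hardest to stream in hardware, which is why it gets a dedicated parallel algorithm here. A common software baseline is systematic resampling, in which a single uniform draw generates N evenly spaced positions that are mapped through the cumulative weights; the sketch below shows that textbook baseline, not the FO-resampling variant itself:

```python
def systematic_resample(weights, u0):
    """Systematic resampling: one uniform draw u0 in [0, 1) generates the N
    evenly spaced positions (u0 + i) / N; each position is mapped to a
    particle index through the cumulative weight sum. Weights are assumed
    normalized to sum to 1. Heavy particles are duplicated, light ones die."""
    n = len(weights)
    cum, c = [], 0.0
    for w in weights:
        c += w
        cum.append(c)
    indices, j = [], 0
    for i in range(n):
        p = (u0 + i) / n
        while cum[j] < p:
            j += 1
        indices.append(j)
    return indices
```

Because the positions are monotonically increasing, the index pointer j only moves forward, a single-pass access pattern that is the usual starting point for stream-oriented hardware resamplers.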
In the field of high performance computing, massively parallel many-core processors such as the Intel Xeon Phi coprocessor are becoming popular because they can significantly accelerate various applications. To efficiently parallelize applications for such many-core processors, several high-level programming models have been proposed. The de facto standard programming model, mainly for shared-memory parallel processing, is OpenMP. For hierarchical parallel processing, OpenMP version 4.0 and later allow programmers to create multiple thread teams, each containing a group of newly created threads that can synchronize with one another. When multiple thread teams are used to execute an application, it is important to have dynamic load balancing across thread teams, since static load balancing easily leads to load imbalance across teams and thus degrades performance. In this paper, we first motivate our work by clarifying the benefit of using multiple thread teams to execute an irregular workload on a many-core processor. Then, we demonstrate that dynamic load balancing across those thread teams can significantly improve the performance of irregular workloads on a many-core processor, even when the scheduling overhead is taken into account. Although such a dynamic load balancing mechanism is not provided by the current OpenMP specification, its benefits are discussed through experiments using the Intel Xeon Phi coprocessor. We evaluate the performance gain of dynamic load balancing across thread teams using a ray tracing code. The results show that such a dynamic load balancing mechanism can improve performance by up to 14% compared with static load balancing across teams, including the scheduling overhead.
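The gap between static and dynamic scheduling on irregular work is easy to reproduce in miniature: when tasks are pre-split into fixed chunks, one team can inherit all the expensive tasks, whereas a shared queue lets whichever team is idle pull the next task. A small simulation (the task costs are made up; this models the scheduling policies, not OpenMP itself):

```python
import heapq

def static_makespan(tasks, teams):
    """Static policy: tasks are pre-split into contiguous equal-count chunks,
    one chunk per team; the makespan is the heaviest chunk."""
    chunk = (len(tasks) + teams - 1) // teams
    return max(sum(tasks[i:i + chunk]) for i in range(0, len(tasks), chunk))

def dynamic_makespan(tasks, teams):
    """Dynamic policy: each task is pulled by whichever team becomes idle
    first, simulated with a min-heap of per-team finish times."""
    heap = [0.0] * teams
    heapq.heapify(heap)
    for t in tasks:
        heapq.heappush(heap, heapq.heappop(heap) + t)
    return max(heap)

# An irregular workload: the expensive tasks are clustered together,
# so a contiguous static split assigns them all to one team.
tasks = [1, 1, 1, 1, 20, 20, 20, 20]
```

With two teams, the static split finishes in 80 time units while the dynamic queue finishes in 42; the real-system question, addressed in the paper, is whether this gain survives the scheduling overhead.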
The emergence of various high-performance computing (HPC) systems compels users to write code that accounts for the characteristics of each HPC system. To describe system-dependent information without drastic code modifications, directive sets such as OpenMP and OpenACC have proved useful. However, the code becomes complex when high performance must be achieved on various HPC systems, because different directive sets are required for different systems; code maintainability and readability are thus degraded. This paper proposes a directive generation approach that generates various kinds of directive sets from user-defined rules. Instead of using several kinds of directive sets, users only have to write special placeholders that specify the unique code patterns where directives are to be inserted. These placeholders then trigger the generation of directives appropriate for each system, using user-defined rules with the code transformation framework Xevolver. Because only special placeholders are inserted in the code, the proposed approach preserves code maintainability and readability. Performance evaluations of directive-based implementations on various HPC systems show that the best implementation differs among the systems. Through a demonstration of transformation into multiple kinds of implementations, we then show that the proposed approach can generate the required directives from a small number of special placeholders. Therefore, the proposed directive generation approach is effective for maintaining a code to be executed on various HPC systems.
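The placeholder idea can be mimicked with plain text substitution: a single neutral marker in the source expands to the directive chosen for the target system by a rule table. The toy below is only a caricature of the approach (the placeholder syntax and rule table are invented; the paper drives this through Xevolver's rule-based code transformations, not string replacement):

```python
# Hypothetical rule table: target system -> directive to emit for the
# "parallel_loop" placeholder (directive texts are real OpenMP/OpenACC
# syntax, but the mapping itself is our invention).
RULES = {
    "openmp":  "!$omp parallel do",
    "openacc": "!$acc parallel loop",
}

def generate(source, system):
    """Expand the invented '!$ph parallel_loop' placeholder into the
    directive selected for the target system."""
    return source.replace("!$ph parallel_loop", RULES[system])

code = "!$ph parallel_loop\ndo i = 1, n\n  a(i) = b(i) + c(i)\nend do"
```

The maintainability argument is visible even in the toy: the source carries one system-neutral marker, and every system-specific choice lives in the rule table instead of the code.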
Most high-frequency (HF) communication systems deployed in the field today implement Automatic Link Establishment (ALE) techniques to help HF stations automatically set up a link with good properties. Two generations (the so-called 2G and 3G ALE) have been standardized since the 1990s, and are today being revisited due to the emergence of wideband HF waveforms. In this paper, we develop Markovian models of the 2G ALE procedure, which is nowadays the most widely used as it can operate completely asynchronously. Our models are "channel oriented", i.e., they observe the system from the channel-occupation perspective regardless of node status. We show, by comparison with high-level OMNeT++ simulations, that our models provide fast and accurate estimates of all performance parameters of interest and capture the main characteristics of the ALE process and the interactions between its numerous parameters. We believe that our work constitutes a useful tool to help operators plan and dimension HF networks. We also exploit the model to give some insight into the limitations of current 2G ALE, helping the design of future ALE strategies.
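A channel-oriented Markov model ultimately reduces to a transition matrix over channel states, whose stationary distribution yields the long-run performance parameters. A generic sketch of that final computation (the two-state idle/busy matrix is invented and far simpler than the paper's models):

```python
def stationary(P, iters=200):
    """Stationary distribution of a discrete-time Markov chain by power
    iteration on the row-stochastic transition matrix P: repeatedly apply
    pi_{t+1}[j] = sum_i pi_t[i] * P[i][j] until it stops changing."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# A toy two-state channel (state 0 = idle, state 1 = occupied);
# the transition probabilities are made up for illustration.
P = [[0.9, 0.1],
     [0.5, 0.5]]
```

For this toy matrix the chain spends 5/6 of the time idle and 1/6 occupied in the long run; in a full channel-oriented ALE model, the analogous stationary probabilities directly give occupation rates and link-establishment delays.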