The Fourth International Symposium on Computing and Networking (CANDAR 2016) was held in Hiroshima, Japan, from November 22nd to 25th, 2016. The organizers of CANDAR 2016 invited authors to submit extended versions of their presented papers. As a result, 28 articles were submitted to this special issue, of which 17 extended papers were accepted and are included here.

This issue owes a great deal to the many people who devoted their time and expertise to handling the submitted papers. In particular, I would like to thank the guest editors for the excellent review process: Professor Ryusuke Egawa, Professor Akihiro Fujiwara, Professor Jose Gracia, Professor Katsunobu Imai, Professor Yasuaki Ito, Professor Yoshiaki Kakuda, Professor Michihiro Koibuchi, Professor Susumu Matsumae, Professor Toru Nakanishi, Professor Yasuyuki Nogami, Professor Satoshi Ohzahata, and Professor Tomoaki Tsumura. Words of gratitude are also due to the anonymous reviewers who carefully read the papers and provided detailed comments and suggestions to improve their quality. This special issue would not have been possible without their efforts.
Oblivious RAM (ORAM) is a technique to hide the access pattern to untrusted memory along with the data contents. Path ORAM is a recent lightweight ORAM protocol whose derived access pattern involves some redundancy that can be removed without loss of security. This paper presents last path caching, which removes this redundancy of Path ORAM with a simpler protocol than an existing method called Fork Path ORAM. By combining the Delay and Reuse schemes, the performance of our technique is comparable with that of Fork Path ORAM. According to our evaluation with a prototype FPGA implementation, the number of LUTs used by last path caching was 1.4%-7.8% smaller than that of Fork Path ORAM.
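As a rough illustration of the redundancy that such caching schemes target: in a tree-based ORAM such as Path ORAM, every access reads one root-to-leaf path of buckets, and consecutive paths always share the buckets near the root. The sketch below (function names are ours, not from the paper) computes the shared buckets that a cache of the previously read path would make re-reading unnecessary.

```python
def path_nodes(leaf, levels):
    """Return the node indices on the path from the given leaf up to the
    root (index 1) in a complete binary tree with `levels` levels
    (heap-style numbering: the children of node i are 2i and 2i + 1)."""
    node = (1 << (levels - 1)) + leaf   # the leaf's node index at the bottom level
    path = []
    while node >= 1:
        path.append(node)
        node //= 2
    return path                          # ordered leaf ... root

def redundant_buckets(prev_leaf, next_leaf, levels):
    """Buckets on the next path that were already read on the previous one;
    a cached copy of the last path lets these be skipped."""
    return set(path_nodes(prev_leaf, levels)) & set(path_nodes(next_leaf, levels))
```

Two paths to neighboring leaves share every level but the last, while paths to distant leaves share only the root, which is why the exploitable redundancy varies per access.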
Linear cellular automata have many invariant measures in general. There are several studies on their rigidity: the unique invariant measure satisfying a suitable non-degeneracy condition (such as positive entropy or the mixing property for the shift map) is the uniform measure, the most natural one. This is related to the study of the asymptotic randomization property: iterates starting from a large class of initial measures converge to the uniform measure (in the Cesàro sense). In this paper we consider one-dimensional linear cellular automata with a neighborhood of size two, and study limiting distributions starting from a class of shift-invariant probability measures. In the two-state case, we characterize when iterates by the addition-modulo-2 cellular automaton starting from a convex combination of strong mixing probability measures can converge. This also yields all invariant measures inside the class of those probability measures. We obtain a similar result for iterates by the addition-modulo-p cellular automaton, for an odd prime p, starting from strong mixing probability measures.
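For concreteness, the two-state, neighborhood-size-two case studied here is the cellular automaton that adds each cell to its right neighbor modulo 2; iterating it from a single 1 traces out Pascal's triangle mod 2. A minimal sketch:

```python
def step_mod2(config):
    """One step of the addition-modulo-2 CA with neighborhood {0, 1}:
    the new state of cell i is (x_i + x_{i+1}) mod 2, on a circular array."""
    n = len(config)
    return [(config[i] + config[(i + 1) % n]) % 2 for i in range(n)]
```

After t steps, cell i holds the mod-2 sum of binomial(t, k) copies of the initial cell i + k, which is the linearity that the measure-theoretic analysis exploits.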
Nowadays, the individual nodes of a distributed parallel computer consist of multi- or many-core processors, allowing more than one process to be executed per node. The large difference in communication speed within a node (through shared memory) versus across nodes (through the network interconnect) requires locality-aware communication schemes for any efficient distributed application. However, writing efficient locality-aware MPI code is complex and error-prone, because the developer has to use very different APIs for communication within and across nodes, and must manage inter-process synchronization. In this paper, we analyze and enhance a recent one-sided communication model, namely DART-MPI, which is implemented on top of MPI-3. In this runtime system, the complexities of handling the locality of MPI memory access operations, whether remote or local, and the related synchronization calls are hidden inside the corresponding DART-MPI interfaces, resulting in concise code and improved application and developer productivity. We have carried out an in-depth evaluation of our DART-MPI system. First, a micro-benchmark is conducted to help understand the main performance overhead of the DART-MPI APIs, which is small and becomes negligible as message sizes grow. We then compare the performance of DART-MPI and flat MPI without locality awareness, in particular for blocking and non-blocking memory operations, using a realistic scientific application on a large-scale supercomputer. The comparison demonstrates that in most cases the DART-MPI version of this application outperforms the flat MPI version. Further, we compare the DART-MPI version to a functionally equivalent MPI version, which thus includes code to deal with data locality, and show that DART-MPI realizes almost the full potential of highly optimized MPI while maintaining high productivity for non-expert programmers.
Non-von Neumann computer architectures have been widely studied in preparation for the post-Moore era. The authors previously implemented such an architecture, which finds low-energy states of the Ising model using circuit operations inspired by simulated annealing, in SRAM-based integrated circuits. Our previous prototype was suited only to Ising models with a simple, typical structure such as a three-dimensional lattice topology, and could not be used in real-world applications. A reconfigurable prototyping environment is needed to develop the architecture further and make it suitable for such applications.
Here, we describe an FPGA-based prototyping environment for developing the annealing processor architecture for the Ising model. We implemented the new architecture using this prototyping environment. The new architecture performs approximate simulated annealing for the Ising model and supports a highly complex topology. It consists of units, each holding multiple fully connected spins. The units are placed in a two-dimensional lattice topology, and neighboring units are connected to realize interactions between spins. The number of logic elements was reduced by sharing the operator among the multiple spins within a unit. Furthermore, a pseudo-random number generator, which produces random pulse sequences for annealing, is shared among all the units. As a result, the number of logic elements was reduced to less than 1/10, while the solution accuracy remained comparable to that of simulated annealing on a conventional computer.
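The hardware approximates simulated annealing for the Ising energy function; in software, the textbook single-spin-flip version of that baseline looks as follows (a generic sketch, not the authors' circuit algorithm; the 1D chain and the linear cooling schedule are illustrative choices):

```python
import math
import random

def ising_energy(spins, J=1.0):
    """Energy of a 1D ferromagnetic Ising chain: E = -J * sum_i s_i * s_{i+1}."""
    return -J * sum(spins[i] * spins[i + 1] for i in range(len(spins) - 1))

def anneal(spins, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    """Single-spin-flip simulated annealing with a linear cooling schedule;
    returns the best configuration and best energy seen."""
    rng = random.Random(seed)
    e = ising_energy(spins)
    best, best_e = list(spins), e
    for step in range(steps):
        t = t_start + (t_end - t_start) * step / steps
        i = rng.randrange(len(spins))
        # Energy change of flipping spin i: only its two chain bonds change.
        de = 0.0
        if i > 0:
            de += 2.0 * spins[i] * spins[i - 1]
        if i < len(spins) - 1:
            de += 2.0 * spins[i] * spins[i + 1]
        # Metropolis rule: always accept downhill, accept uphill with prob e^(-dE/T).
        if de <= 0 or rng.random() < math.exp(-de / t):
            spins[i] = -spins[i]
            e += de
            if e < best_e:
                best_e, best = e, list(spins)
    return best, best_e
```

The hardware replaces the Metropolis acceptance test with shared random pulse sequences, which is where the logic-element savings described above come from.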
Numerous TOP500 supercomputers are based on a torus interconnection network. The torus topology is one of the most popular interconnection networks for massively parallel systems due to its attractive topological properties, such as symmetry and simplicity. For instance, the world-famous supercomputers Fujitsu K, IBM Blue Gene/L, IBM Blue Gene/P, and Cray XT3 are all torus-based. In this paper, we propose an algorithm that constructs 2n mutually node-disjoint paths from a set S of 2n source nodes to a set D of 2n destination nodes in an n-dimensional k-ary torus Tn,k (n ≥ 1, k ≥ 3). This algorithm is then formally evaluated. We prove that the paths selected by the proposed algorithm have lengths at most 2(k+1)n and can be obtained with a time complexity of O(kn³ + n³ log n).
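The per-dimension structure that torus routing algorithms exploit is easy to state: each dimension of a k-ary torus is a ring, so the hop distance between two nodes is the sum, over dimensions, of the shorter of the two ring directions. A small sketch:

```python
def torus_distance(a, b, k):
    """Shortest-path (hop) distance between nodes a and b of an n-dimensional
    k-ary torus, given as coordinate tuples; in each dimension the shorter
    of the two ring directions (clockwise or counterclockwise) is taken."""
    return sum(min(abs(x - y), k - abs(x - y)) for x, y in zip(a, b))
```

The wrap-around links are what give the torus its symmetry and halve the worst-case distance per dimension compared with a mesh.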
Recently, IEEE 802.11n access points (APs) have prevailed in wireless local area networks (WLANs) due to their high-speed data transmission using multiple-input multiple-output (MIMO) technology. Unfortunately, the signal propagation from an 802.11n AP is not uniform in the circumferential and height directions because of the multiple antennas used for MIMO. As a result, the data transmission speed between the AP and a host can be significantly affected by their relative setup conditions. In this paper, we propose a minimax approach for optimizing the 802.11n AP setup in terms of the angles and the height in an indoor environment using throughput measurements. First, we detect the bottleneck host, the one that receives the weakest signal from the AP in the field, using a throughput estimation model. To explore optimal parameter values for this model, we adopt a versatile parameter optimization tool. Then, we optimize the AP setup by changing the angles and the height while measuring throughput. For evaluation, we verify the accuracy of the model against measurement results and confirm the throughput improvements our approach provides for hosts in the field.
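The minimax criterion itself is simple to state: among the candidate setups (angle/height combinations), choose the one that maximizes the throughput of its worst-off, bottleneck host. A toy sketch with a hypothetical measurement matrix (the numbers are invented, not the paper's data):

```python
def best_setup(throughput):
    """Minimax setup selection: throughput[s][h] is the measured or estimated
    throughput of host h under candidate setup s; return the index of the
    setup whose weakest (bottleneck) host does best."""
    return max(range(len(throughput)), key=lambda s: min(throughput[s]))

# Three candidate AP setups, three hosts (Mbps values are made up).
measurements = [[10, 2, 8],
                [7, 6, 6],
                [9, 1, 9]]
```

Here setup 1 wins: its bottleneck host gets 6 Mbps, versus 2 and 1 for the alternatives, even though the other setups have higher peaks.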
Stream compaction, also known as stream filtering or selection, produces a smaller output array that contains only the wanted elements (or their indices) from the input array for further processing. With the tremendous number of data elements to be filtered, the performance of selection is of great concern. Recently, modern Graphics Processing Units (GPUs) have been increasingly used to accelerate the execution of massively large, data-parallel applications. In this paper, we design and implement two new algorithms for stream compaction on the GPU. The first algorithm, which preserves the relative order of the input elements, uses a multi-level prefix-sum approach. The second algorithm, which is non-order-preserving, is based on the hybrid use of the prefix-sum and atomics approaches. We compared their performance with other parallel selection algorithms on the current generation of NVIDIA GPUs. The experimental results show that both algorithms run faster than Thrust, an open-source parallel algorithms library. Furthermore, the hybrid method performs the best among all existing selection algorithms on the GPU and can be two orders of magnitude faster than sequential selection on the CPU, especially when the data size is large.
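The order-preserving idea is worth sketching sequentially: an exclusive prefix sum over the predicate flags yields each kept element's output position, and the elements are then scattered to those positions; on a GPU both steps run in parallel. A minimal Python sketch of this standard scheme (not the paper's multi-level implementation):

```python
def exclusive_prefix_sum(flags):
    """Exclusive prefix sum: out[i] = flags[0] + ... + flags[i-1];
    also returns the total, which is the size of the compacted output."""
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out, total

def compact(data, pred):
    """Order-preserving stream compaction via prefix sum + scatter."""
    flags = [1 if pred(x) else 0 for x in data]
    offsets, total = exclusive_prefix_sum(flags)
    out = [None] * total
    for i, x in enumerate(data):
        if flags[i]:
            out[offsets[i]] = x     # scatter each kept element to its slot
    return out
```

The non-order-preserving hybrid variant instead lets blocks claim output ranges with atomic counters, trading element order for less synchronization.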
The main contribution of this paper is to present an efficient GPU implementation of the bulk computation of eigenvalues for many small, non-symmetric, real matrices. This work is motivated by the necessity of such bulk computation in the design of control systems, which requires computing the eigenvalues of hundreds of thousands of non-symmetric real matrices of size up to 30x30. Several efforts have been devoted to accelerating eigenvalue computation, including computer languages, systems, and environments that support matrix manipulation through specific libraries/function calls. Some of them are optimized for computing the eigenvalues of a single very large matrix by parallel processing. However, such libraries/function calls are not aimed at accelerating the eigenvalue computation for many small matrices. In our GPU implementation, we considered programming issues of the GPU architecture, including warp divergence, coalesced access to the global memory, utilization of the shared memory, and so forth. In particular, we present two types of assignments of GPU threads to matrices and introduce three memory arrangements in the global memory. Furthermore, to hide CPU-GPU data transfer latency, computation on the GPU is overlapped with the transfer. Experimental results on the NVIDIA TITAN X show that our GPU implementation attains speed-up factors of up to 83.50 and 17.67 over the sequential CPU implementation and the parallel CPU implementation with eight threads on an Intel Core i7-6700K, respectively.
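The workload shape is worth illustrating: each matrix is a tiny independent eigenproblem, so the parallelism comes from the number of matrices rather than the size of any one of them. A pure-Python sketch for the 2x2 case (the paper handles matrices up to 30x30 with proper iterative methods; this closed form only shows why non-symmetric input yields complex eigenvalues):

```python
import cmath

def eig2x2(m):
    """Eigenvalues of one 2x2 real matrix [[a, b], [c, d]] via the roots of
    the characteristic polynomial x^2 - tr*x + det; complex-conjugate pairs
    arise whenever the discriminant is negative (non-symmetric input)."""
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    s = cmath.sqrt(tr * tr - 4 * det)
    return ((tr + s) / 2, (tr - s) / 2)

def bulk_eig2x2(mats):
    """Bulk version: one small, independent problem per matrix -- the shape
    of work that maps naturally to one GPU thread (or group) per matrix."""
    return [eig2x2(m) for m in mats]
```

On a GPU, the design questions are exactly those the abstract lists: how many threads per matrix, and how to lay the matrices out so that neighboring threads touch neighboring global-memory words.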
Recently, wireless local-area networks (WLANs) have become prevalent, as they provide flexible Internet access to users at low cost through the installation of several types of access points (APs). Previously, we proposed the active AP configuration algorithm for the elastic WLAN system using heterogeneous APs, which dynamically optimizes the configuration by activating or deactivating APs based on traffic demands. However, this algorithm assumes that any active AP may use a different channel from the others to avoid interference, although the number of non-interfering channels in the IEEE 802.11 protocols is limited. In this paper, we propose an extension of the AP configuration algorithm that considers the channel assignment to the active APs under this limitation. Besides, the AP associations of the hosts are modified to improve network performance by averaging loads among channels. The effectiveness of our proposal is evaluated using the WIMNET simulator in two topologies. Finally, the elastic WLAN system including this proposal is implemented using a Raspberry Pi as the AP. The feasibility and performance of the implementation are verified through experiments on the testbed.
Many text mining tools cannot be applied directly to documents available on web pages. There are tools for fetching and preprocessing textual data, but combining them with a data processing tool into one working tool chain can be time-consuming. The preprocessing task is even more labor-intensive if the documents are located on multiple remote sources with different storage formats.
In this paper, we propose a simplification of the data preparation process for cases in which data come from a wide range of web resources. We developed an open-source tool, called Kayur, that greatly reduces the time and effort required for routine data preprocessing steps, allowing the user to proceed quickly to the main task of data analysis. The datasets generated by the tool are ready to be loaded into a data mining workbench, such as WEKA or Carrot2, to perform classification, feature prediction, and other data mining tasks.
Because of the widespread adoption of mobile devices, many applications have provided support for wireless LAN (WLAN). Under these circumstances, one of the important issues is to provide good quality of service (QoS) in WLAN. For this purpose, Dhurandher et al. improved the distributed coordination function (DCF). In their method, the contention window (CW) is divided into multiple ranges; each range is independent of all other ranges and is assigned to a different priority. Although the highest-priority throughput increased with this method, throughput for the other priorities decreased significantly. To overcome this problem, this paper proposes a minimum contention window control method for two priorities (high and low). In the method, all nodes are assumed to use real-time applications or data transmission. The former, real-time frames, have high priority and are sent by UDP; the latter, data frames, have low priority and are sent by TCP. The purpose of the proposed method is not only to provide good QoS for the highest priority but also to prevent deterioration in the QoS for the other priorities in WLAN. To this end, the proposed method keeps the CW for the high priority at a low value and controls the CW for the low priority based on the collision history. Finally, network simulations demonstrated that the proposed method reduces the decrease in the average total throughput of the low-priority frames as well as reducing the packet drop rate of both priorities, compared with DCF and Dhurandher's method. In a simulation scenario with only low-priority flows in wider bandwidths, all methods give almost the same average total throughput and packet drop rate, but the results also suggest that the CW range in the proposed method should be reduced to improve the average total throughput when no congestion occurs.
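The control idea can be sketched abstractly: the high-priority CW stays small and fixed, while the low-priority CW grows and shrinks with the collision history. The doubling/halving rule below is our illustrative stand-in (with invented bounds), not the paper's exact control law:

```python
CW_HIGH = 16   # high-priority window kept small and fixed (illustrative value)

def next_cw_low(cw, collided, cw_min=64, cw_max=1024):
    """Low-priority contention window update: double the window after a
    collision, halve it after a success, clamped to [cw_min, cw_max].
    A larger CW means longer average backoff and thus less pressure on
    the channel from low-priority traffic."""
    cw = cw * 2 if collided else cw // 2
    return max(cw_min, min(cw_max, cw))
```

Keeping cw_min well above CW_HIGH is what protects the real-time frames, while the collision-driven decay lets low-priority throughput recover when the channel is idle.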
What is computable with limited resources? How can we verify the correctness of computations? How can we measure computational power with precision? Despite the immense scientific and engineering progress in computing, we still have only partial answers to these questions. To make these problems more precise and easier to tackle, we describe an abstract algebraic definition of classical computation by generalizing traditional models to semigroups. In this view, implementations are morphic relations between semigroups. The mathematical abstraction also allows the investigation of different computing paradigms (e.g. cellular automata, reversible computing) within the same framework. While semigroup theory helps in clarifying foundational issues about computation, it also has several open problems that require extensive computational effort. This mutually beneficial relationship is the central tenet of the described research.
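A minimal concrete instance of the semigroup view: the transformations of a finite set, composed as functions, form a semigroup, and it is objects like these whose structure demands the computational effort mentioned above. A brute-force sketch for a 2-element set (our illustration, not an example from the paper):

```python
from itertools import product

def compose(f, g):
    """Compose two transformations of {0, ..., n-1}, each written as a tuple
    t with t[i] = image of i; (f o g)[i] = f[g[i]]."""
    return tuple(f[g[i]] for i in range(len(g)))

# The full transformation semigroup on a 2-element set: all 4 self-maps.
T2 = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Semigroup axioms, checked by brute force: closure and associativity.
closed = all(compose(f, g) in T2 for f, g in product(T2, repeat=2))
associative = all(compose(compose(f, g), h) == compose(f, compose(g, h))
                  for f, g, h in product(T2, repeat=3))
```

Already here one can see computational content: (0, 1) is the identity, (1, 0) is the reversible bit-flip, and the two constant maps destroy information, the algebraic counterpart of irreversible computation.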
Often, in a distributed system, a task must be performed in which all entities must be involved; however, only some of them are active, while the others are inactive, unaware of the new computation that has to take place. In these situations, all entities must become active, a task known as Wake-Up. It is not difficult to see that Broadcast is just the special case of the Wake-Up problem in which there is only one initially active entity. Both problems can be solved with the same trivial but expensive solution: Flooding. More efficient broadcast protocols exist for some classes of dense interconnection networks. The research question we examine is whether wake-up can also be performed significantly better in three classes of regular interconnection networks: hypercubes, complete networks, and regular complete bipartite graphs.
In a d-dimensional hypercube network of n nodes, the cost of broadcasting is Θ(n), even if the edge labeling is arbitrary and the network is asynchronous. We show that, instead, wake-up requires Ω(n log n) message transmissions in the worst case, even if the network is synchronous and has sense of direction. Similarly, in a regular complete bipartite network Kp,p of n = 2p anonymous entities, the cost of broadcasting is Θ(n) even if the edge labeling is arbitrary and the network is asynchronous; instead, we show that wake-up requires Θ(n²) message transmissions in the worst case, even if the network is synchronous and has sense of direction.
In a complete network Kn of n entities, the cost of broadcasting is minimal: n - 1 message transmissions suffice even if the entities are anonymous. In this paper we prove that the cost of wake-up is an order of magnitude higher. In the case of anonymous entities, Ω(n²) message transmissions are needed in the worst case, even if the network is fully synchronous and has sense of direction. In the case of entities with distinct IDs, Ω(n log n) transmissions need to be performed, and this bound is tight. This shows that, when the entities have IDs, Wake-Up is computationally as costly as the apparently more complex Election problem.
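The trivial flooding solution mentioned above is easy to make concrete: once an entity becomes awake, it transmits on every incident link. Simulated on the complete network Kn, this costs n(n-1) messages, matching the Θ(n²) anonymous bound. The sketch below is a generic simulation (ours, not from the paper):

```python
def flood_wakeup(adj, initiators):
    """Simulate synchronous flooding wake-up on an arbitrary network:
    on becoming awake, an entity sends one message over every incident
    link; return the total number of messages sent."""
    awake, frontier, messages = set(initiators), list(initiators), 0
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                messages += 1
                if v not in awake:
                    awake.add(v)
                    nxt.append(v)
        frontier = nxt
    return messages

# The complete network K_n as an adjacency map.
n = 6
k_n = {u: [v for v in range(n) if v != u] for u in range(n)}
```

Since every entity eventually wakes and sends on all its links, flooding always costs the sum of the degrees; the interesting question settled in the paper is which networks admit anything cheaper.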
In this paper, we present a self-optimizing routing algorithm that uses only local information in a three-dimensional (3D) virtual grid network. A virtual grid network is a well-known network model, valued for the ease of designing algorithms on it and for saving energy consumption. We consider a 3D virtual grid network obtained by virtually dividing a network into a set of unit cubes called cells. One specific node, named the router, is selected in each cell, and each router is connected with the routers of adjacent cells. This implies that each router can communicate with 6 routers.
We consider the maintenance of an inter-cell communication path from a source node to a destination node and propose a distributed self-optimizing routing algorithm that transforms an arbitrary given path into an optimal (shortest) one from the source node to the destination node. Our algorithm is executed at each router and uses only local information (6 hops: 3 hops backward and 3 hops forward along the given path). Our algorithm can work in asynchronous networks without any global coordination among routers.
We prove that our algorithm transforms any arbitrary path into a shortest path in O(|P|) synchronous rounds, where |P| is the length of the initial path, when it works in synchronous networks. Moreover, our experiments show that our algorithm converges in about |P|/2 synchronous rounds, and that this ratio becomes lower as |P| becomes larger.
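The optimality target of such a transformation is easy to state: a grid path is shortest exactly when, on each axis, it moves in only one direction, so shortening amounts to cancelling pairs of opposite moves. The sketch below performs this cancellation globally for illustration only; the point of the paper's algorithm is to achieve the same result distributedly, with each router seeing just a 6-hop window:

```python
def shorten(moves):
    """Reduce a 3D grid path, given as a list of axis moves such as '+x' or
    '-y', by cancelling every pair of opposite moves on the same axis.
    The result has the minimum (Manhattan) length for the same endpoints."""
    count = {}
    for m in moves:
        count[m] = count.get(m, 0) + 1
    out = []
    for axis in ('x', 'y', 'z'):
        net = count.get('+' + axis, 0) - count.get('-' + axis, 0)
        out += ['+' + axis] * max(net, 0) + ['-' + axis] * max(-net, 0)
    return out
```

Each cancelled pair corresponds to a local detour that the self-optimizing rule can cut out; once no opposite pair remains, the path length equals the Manhattan distance between the endpoints.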
This paper presents a real-time FPGA implementation of posterior state estimation in dynamic models based on the particle filter algorithm. Specifically, our system implements a parallel resampling (FO-resampling) algorithm on a stream-based architecture. The resampling is accomplished within the valid pixel area of an input image frame, while prediction and update of particles are performed in a synchronization region; thus our approach achieves real-time performance of 60 fps for VGA images, synchronized with the camera pixel throughput, without using any external memory devices. Through evaluation with an object tracking benchmark video, the trade-off between tracking quality and the number of particles is analyzed to find appropriate hardware parameters. In addition, we address the improvement of resource utilization in our particle filter architecture, in particular by using a higher clock frequency to reuse hardware resources in a time-sharing manner. The implementation experiments reveal that the proposed approach allows the original design to fit in a smaller FPGA chip. However, we also demonstrate that this size-reduction approach incurs an overhead of 2.7 to 3.0 times the power consumption compared with the original designs at a slow clock frequency.
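Resampling is the particle filter stage that is hardest to stream in hardware, which is why it gets a dedicated parallel algorithm here. A common software baseline is systematic resampling, in which a single uniform draw generates N evenly spaced positions that are mapped through the cumulative weights; the sketch below shows that textbook baseline, not the FO-resampling variant itself:

```python
def systematic_resample(weights, u0):
    """Systematic resampling: one uniform draw u0 in [0, 1) generates the N
    evenly spaced positions (u0 + i) / N; each position is mapped to a
    particle index through the cumulative weight sum. Weights are assumed
    normalized to sum to 1. Heavy particles are duplicated, light ones die."""
    n = len(weights)
    cum, c = [], 0.0
    for w in weights:
        c += w
        cum.append(c)
    indices, j = [], 0
    for i in range(n):
        p = (u0 + i) / n
        while cum[j] < p:
            j += 1
        indices.append(j)
    return indices
```

Because the positions are monotonically increasing, the index pointer j only moves forward, a single-pass access pattern that is the usual starting point for stream-oriented hardware resamplers.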
In the field of high performance computing, massively parallel many-core processors such as the Intel Xeon Phi coprocessor are becoming popular because they can significantly accelerate various applications. To efficiently parallelize applications for such many-core processors, several high-level programming models have been proposed. The de facto standard programming model, mainly for shared-memory parallel processing, is OpenMP. For hierarchical parallel processing, OpenMP version 4.0 and later allow programmers to create multiple thread teams, each containing a group of newly created threads that can synchronize with one another. When multiple thread teams are used to execute an application, it is important to have dynamic load balancing across thread teams, since static load balancing easily leads to load imbalance across teams and thus degrades performance. In this paper, we first motivate our work by clarifying the benefit of using multiple thread teams to execute an irregular workload on a many-core processor. Then, we demonstrate that dynamic load balancing across those thread teams can significantly improve the performance of irregular workloads on a many-core processor, even when the scheduling overhead is taken into account. Although such a dynamic load balancing mechanism is not provided by the current OpenMP specification, its benefits are discussed through experiments using the Intel Xeon Phi coprocessor. We evaluate the performance gain of dynamic load balancing across thread teams using a ray tracing code. The results show that such a dynamic load balancing mechanism can improve performance by up to 14% compared with static load balancing across teams, including the scheduling overhead.
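The gap between static and dynamic scheduling on irregular work is easy to reproduce in miniature: when tasks are pre-split into fixed chunks, one team can inherit all the expensive tasks, whereas a shared queue lets whichever team is idle pull the next task. A small simulation (the task costs are made up; this models the scheduling policies, not OpenMP itself):

```python
import heapq

def static_makespan(tasks, teams):
    """Static policy: tasks are pre-split into contiguous equal-count chunks,
    one chunk per team; the makespan is the heaviest chunk."""
    chunk = (len(tasks) + teams - 1) // teams
    return max(sum(tasks[i:i + chunk]) for i in range(0, len(tasks), chunk))

def dynamic_makespan(tasks, teams):
    """Dynamic policy: each task is pulled by whichever team becomes idle
    first, simulated with a min-heap of per-team finish times."""
    heap = [0.0] * teams
    heapq.heapify(heap)
    for t in tasks:
        heapq.heappush(heap, heapq.heappop(heap) + t)
    return max(heap)

# An irregular workload: the expensive tasks are clustered together,
# so a contiguous static split assigns them all to one team.
tasks = [1, 1, 1, 1, 20, 20, 20, 20]
```

With two teams, the static split finishes in 80 time units while the dynamic queue finishes in 42; the real-system question, addressed in the paper, is whether this gain survives the scheduling overhead.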
The emergence of various high-performance computing (HPC) systems compels users to write code that accounts for the characteristics of each HPC system. To describe system-dependent information without drastic code modifications, directive sets such as OpenMP and OpenACC have proved useful. However, the code becomes complex when high performance must be achieved on various HPC systems, because different directive sets are required for different systems; code maintainability and readability are thus degraded. This paper proposes a directive generation approach that generates various kinds of directive sets from user-defined rules. Instead of using several kinds of directive sets, users only have to write special placeholders that specify the unique code patterns where directives are to be inserted. These placeholders then trigger the generation of directives appropriate for each system, using user-defined rules with the code transformation framework Xevolver. Because only special placeholders are inserted in the code, the proposed approach preserves code maintainability and readability. Performance evaluations of directive-based implementations on various HPC systems show that the best implementation differs among the systems. Through a demonstration of transformation into multiple kinds of implementations, we then show that the proposed approach can generate the required directives from a small number of special placeholders. Therefore, the proposed directive generation approach is effective for maintaining a code to be executed on various HPC systems.
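The placeholder idea can be mimicked with plain text substitution: a single neutral marker in the source expands to the directive chosen for the target system by a rule table. The toy below is only a caricature of the approach (the placeholder syntax and rule table are invented; the paper drives this through Xevolver's rule-based code transformations, not string replacement):

```python
# Hypothetical rule table: target system -> directive to emit for the
# "parallel_loop" placeholder (directive texts are real OpenMP/OpenACC
# syntax, but the mapping itself is our invention).
RULES = {
    "openmp":  "!$omp parallel do",
    "openacc": "!$acc parallel loop",
}

def generate(source, system):
    """Expand the invented '!$ph parallel_loop' placeholder into the
    directive selected for the target system."""
    return source.replace("!$ph parallel_loop", RULES[system])

code = "!$ph parallel_loop\ndo i = 1, n\n  a(i) = b(i) + c(i)\nend do"
```

The maintainability argument is visible even in the toy: the source carries one system-neutral marker, and every system-specific choice lives in the rule table instead of the code.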
Most high-frequency (HF) communication systems deployed in the field today implement Automatic Link Establishment (ALE) techniques to help HF stations automatically set up a link with good properties. Two generations (the so-called 2G and 3G ALE) have been standardized since the 1990s, and are today being revisited due to the emergence of wideband HF waveforms. In this paper, we develop Markovian models of the 2G ALE procedure, which is nowadays the most widely used as it can operate completely asynchronously. Our models are "channel oriented", i.e., they observe the system from the channel-occupation perspective regardless of node status. We show, by comparison with high-level OMNeT++ simulations, that our models provide fast and accurate estimates of all performance parameters of interest and capture the main characteristics of the ALE process and the interactions between its numerous parameters. We believe that our work constitutes a useful tool to help operators plan and dimension HF networks. We also exploit the model to give some insight into the limitations of current 2G ALE, helping the design of future ALE strategies.
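A channel-oriented Markov model ultimately reduces to a transition matrix over channel states, whose stationary distribution yields the long-run performance parameters. A generic sketch of that final computation (the two-state idle/busy matrix is invented and far simpler than the paper's models):

```python
def stationary(P, iters=200):
    """Stationary distribution of a discrete-time Markov chain by power
    iteration on the row-stochastic transition matrix P: repeatedly apply
    pi_{t+1}[j] = sum_i pi_t[i] * P[i][j] until it stops changing."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# A toy two-state channel (state 0 = idle, state 1 = occupied);
# the transition probabilities are made up for illustration.
P = [[0.9, 0.1],
     [0.5, 0.5]]
```

For this toy matrix the chain spends 5/6 of the time idle and 1/6 occupied in the long run; in a full channel-oriented ALE model, the analogous stationary probabilities directly give occupation rates and link-establishment delays.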