Article ID: 2020LHS0001
The scalability of distributed DNN training can be limited by the slowdown of specific processes caused by unexpected hardware failures. We propose a dynamic process exclusion technique that maximizes training throughput by excluding slow processes at run time. Our evaluation with 32 processes training ResNet-50 shows that the proposed technique avoids slowdown by excluding the slow processes, even when they account for 12.5% to 50% of all processes, without accuracy loss.
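
A minimal sketch of what dynamic process exclusion might look like in an MPI-style data-parallel training loop, assuming mpi4py; the SLOW_FACTOR threshold and the train_step placeholder are illustrative assumptions, not details from the paper.

    # Hypothetical sketch: time one step per rank, gather the timings,
    # and rebuild the communicator without the ranks that fall behind.
    import time
    from mpi4py import MPI

    SLOW_FACTOR = 2.0  # assumed: a rank is "slow" if its step time exceeds
                       # SLOW_FACTOR times the median step time across ranks

    def train_step():
        """Placeholder for one forward/backward pass plus gradient allreduce."""
        time.sleep(0.01)

    def exclude_slow_ranks(comm):
        """Measure a step on every rank and split off the slow ones."""
        t0 = time.perf_counter()
        train_step()
        local_time = time.perf_counter() - t0

        # Every rank learns every other rank's step time.
        all_times = comm.allgather(local_time)
        median = sorted(all_times)[len(all_times) // 2]

        # Ranks with color 1 keep training; color 0 marks excluded ranks.
        keep = local_time <= SLOW_FACTOR * median
        new_comm = comm.Split(color=1 if keep else 0, key=comm.Get_rank())
        return new_comm, keep

    if __name__ == "__main__":
        comm, active = exclude_slow_ranks(MPI.COMM_WORLD)
        if active:
            # Surviving ranks continue data-parallel training; gradient
            # averaging now spans only the fast communicator.
            for _ in range(100):
                train_step()

Launched under, e.g., mpirun -np 32, the Split call leaves the fast ranks on a smaller communicator so collective gradient averaging no longer waits on the slow ones; per the abstract's evaluation, dropping the excluded ranks' share of the work does not cost accuracy.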