Article ID: 2020LHS0001
The scalability of distributed DNN training can be limited by the slowdown of specific processes caused by unexpected hardware failures. We propose a dynamic process exclusion technique that maximizes training throughput by excluding slow processes at run time. Our evaluation with 32 processes training ResNet-50 shows that the proposed technique avoids slowdown by excluding the slow processes, even when they account for 12.5% to 50% of all processes, without accuracy loss.
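
A minimal sketch of what dynamic process exclusion might look like in an MPI-style data-parallel training loop, assuming mpi4py; the SLOW_FACTOR threshold and the train_step placeholder are illustrative assumptions, not details from the paper.

    # Hypothetical sketch: time one step per rank, gather the timings,
    # and rebuild the communicator without the ranks that fall behind.
    import time
    from mpi4py import MPI

    SLOW_FACTOR = 2.0  # assumed: a rank is "slow" if its step time exceeds
                       # SLOW_FACTOR times the median step time across ranks

    def train_step():
        """Placeholder for one forward/backward pass plus gradient allreduce."""
        time.sleep(0.01)

    def exclude_slow_ranks(comm):
        """Measure a step on every rank and split off the slow ones."""
        t0 = time.perf_counter()
        train_step()
        local_time = time.perf_counter() - t0

        # Every rank learns every other rank's step time.
        all_times = comm.allgather(local_time)
        median = sorted(all_times)[len(all_times) // 2]

        # Ranks with color 1 keep training; color 0 marks excluded ranks.
        keep = local_time <= SLOW_FACTOR * median
        new_comm = comm.Split(color=1 if keep else 0, key=comm.Get_rank())
        return new_comm, keep

    if __name__ == "__main__":
        comm, active = exclude_slow_ranks(MPI.COMM_WORLD)
        if active:
            # Surviving ranks continue data-parallel training; gradient
            # averaging now spans only the fast communicator.
            for _ in range(100):
                train_step()

Launched under, e.g., mpirun -np 32, the Split call leaves the fast ranks on a smaller communicator so collective gradient averaging no longer waits on the slow ones; per the abstract's evaluation, dropping the excluded ranks' share of the work does not cost accuracy.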