Scalable and Practical Natural Gradient for Large-Scale Deep Learning


Large-scale distributed training of deep neural networks produces models with worse generalization performance because of the increase in the effective mini-batch size. Previous approaches have attempted to address this problem by varying the learning rate and batch size across epochs and layers, or by making ad hoc modifications to batch normalization. We propose scalable and practical natural gradient descent (SP-NGD), a principled approach for training models that allows them to attain generalization performance similar to that of models trained with first-order optimization methods, but with accelerated convergence. Furthermore, SP-NGD scales to large mini-batch sizes with negligible computational overhead relative to first-order methods. We evaluated SP-NGD on a benchmark task for which highly optimized first-order methods are available as references: training a ResNet-50 model for image classification on ImageNet. We demonstrate convergence to a top-1 validation accuracy of 75.4% in 5.5 minutes using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of 74.9% with an extremely large mini-batch size of 131,072 in 873 steps of SP-NGD.
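For readers unfamiliar with natural gradient descent, the following minimal sketch illustrates a generic damped natural gradient update, in which the mini-batch gradient is preconditioned by the inverse of a Fisher information estimate. It is not the SP-NGD algorithm itself: the function name `natural_gradient_step`, the use of the empirical Fisher (averaged outer products of per-example gradients), the dense linear solve, and the `lr` and `damping` values are all assumptions made for illustration.

```python
import numpy as np


def natural_gradient_step(theta, per_example_grads, lr=0.1, damping=1e-3):
    """One damped natural gradient update on a flat parameter vector.

    theta:             array of shape (dim,), current parameters.
    per_example_grads: array of shape (batch, dim), one loss gradient per example.
    lr, damping:       illustrative hyperparameters (assumed, not from the paper).
    """
    batch_size = per_example_grads.shape[0]

    # Ordinary mini-batch gradient, as used by first-order methods.
    grad = per_example_grads.mean(axis=0)

    # Empirical Fisher estimate: average outer product of per-example gradients.
    # This is an assumption of the sketch; other Fisher estimators exist.
    fisher = per_example_grads.T @ per_example_grads / batch_size

    # Damping (Tikhonov regularization) keeps the linear system well conditioned.
    damped_fisher = fisher + damping * np.eye(theta.size)

    # Solve (F + damping * I) x = grad instead of forming the inverse explicitly.
    natural_grad = np.linalg.solve(damped_fisher, grad)

    return theta - lr * natural_grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = np.zeros(4)
    grads = rng.normal(size=(32, 4))  # stand-in for real per-example gradients
    print(natural_gradient_step(theta, grads))
```

A dense solve of this kind costs cubic time in the number of parameters, so it is only feasible for toy problems; making the preconditioning step cheap enough for networks such as ResNet-50 is the kind of scalability concern the abstract refers to.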

