BERT, a pre-trained language representation model developed by Google, has achieved state-of-the-art results on a variety of Natural Language Processing tasks. However, its official TPU-friendly implementation currently supports only single-GPU training. This blog post shows how to make BERT work with multiple GPUs using Horovod, a distributed training framework for TensorFlow. The authors make several changes to the original BERT implementation: importing Horovod's TensorFlow backend, initializing the library, pinning each worker to a GPU, and adapting gradient clipping to the distributed setting. With these modifications, training throughput on sentence classification rises from 126.92 examples/sec on 2 GPUs to 231.26 examples/sec on 4 GPUs. The authors also highlight potential pitfalls, such as using unsynchronized models across workers or failing to adapt gradient clipping. By following these changes, developers can adapt BERT for multi-GPU training and speed up fine-tuning on various NLP tasks.
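
A minimal sketch of what these changes might look like in a TensorFlow 1.x training script is shown below. The Horovod calls (`hvd.init`, `hvd.local_rank`, `hvd.DistributedOptimizer`, `hvd.BroadcastGlobalVariablesHook`) are standard library API; the surrounding names such as `build_train_op`, `loss`, and `learning_rate` are illustrative and not taken from the authors' actual diff:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# 1. Initialize the Horovod library.
hvd.init()

# 2. Pin each worker process to a single GPU (one process per GPU).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# 3. Wrap the optimizer so gradients are averaged across workers,
#    and apply gradient clipping to the averaged gradients.
def build_train_op(loss, learning_rate):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    optimizer = hvd.DistributedOptimizer(optimizer)
    # compute_gradients goes through Horovod's allreduce; clipping is then
    # applied to the synchronized gradients rather than per-worker ones.
    grads_and_vars = optimizer.compute_gradients(loss)
    grads, tvars = zip(*grads_and_vars)
    clipped, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
    return optimizer.apply_gradients(
        zip(clipped, tvars),
        global_step=tf.train.get_or_create_global_step())

# 4. Broadcast initial variables from rank 0 so every worker starts
#    from the same model state (avoids the unsynchronized-model pitfall).
bcast_hook = hvd.BroadcastGlobalVariablesHook(0)
```

The broadcast hook is passed to the estimator's or session's training hooks, and the script is then launched with one process per GPU (for example via `horovodrun -np 4 python run_classifier.py ...`).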