Company:
Date Published:
Author: Chuan Li
Word count: 691
Language: English
Hacker News points: None

Summary

To run distributed training across multiple nodes with TensorFlow 2.0, a cluster must be set up in which each node carries its own TF_CONFIG environment variable describing that machine's role in the cluster: the IP address and port of every worker node, plus the node's own task type and index. The code boilerplate uses tf.distribute.experimental.MultiWorkerMirroredStrategy to create the distribution strategy, and the model is then defined inside the strategy's scope. The training script is the same on every node except for TF_CONFIG, which must be set differently on each one (see the sketches below). To launch distributed training, the nodes must be able to SSH into each other without password authentication, and the script has to be started on both nodes at roughly the same time; because the mirrored strategy keeps the workers synchronized, their training output stays in lockstep.
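
A minimal sketch of what TF_CONFIG could look like for a two-node cluster, set from inside the training script. The IP addresses and port are placeholders, not values from the original article; the only difference between the two nodes is the task index.

```python
import json
import os

# Hypothetical addresses for a two-worker cluster; replace with your own machines.
# Node 0 (the chief) uses "index": 0, node 1 uses "index": 1.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["10.0.0.1:12345", "10.0.0.2:12345"]
    },
    "task": {"type": "worker", "index": 0}  # set to 1 on the second node
})
```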
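And a sketch of the strategy boilerplate, assuming a simple Keras model rather than the article's exact network: TF_CONFIG must already be set when the strategy is constructed, and the model must be built and compiled inside the strategy's scope so its variables are mirrored across workers.

```python
import tensorflow as tf

# Reads TF_CONFIG from the environment to discover the cluster.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Example model only; the original article uses its own network definition.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

# model.fit(...) is then called with identical code on every worker.
```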