This presentation provides an overview of training regimes in deep learning, from single-GPU training to multi-node distributed training. It explains how computation is performed, how gradients are transferred, and how model updates are synchronized across GPUs and nodes. The presentation also discusses hardware considerations, such as NVLink, InfiniBand networking, and GPUs that support features like GPUDirect RDMA, which enable efficient data transfer between nodes. In particular, it highlights GPUDirect RDMA for high-speed transfers, reporting up to 42 GB/s of bandwidth between nodes, which makes it well suited to training large image, language, and speech models such as NASNet, BERT, and GPT-2.
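The gradient-transfer step described above can be sketched in miniature. The following is a hedged, CPU-only illustration of synchronous data-parallel training (the function names `local_gradient`, `all_reduce_mean`, and `train_step` are invented for this sketch, and plain Python stands in for NCCL collectives over NVLink/InfiniBand): each replica computes gradients on its own data shard, an all-reduce averages them, and every replica then applies the identical update, keeping the model copies in sync.

```python
# Sketch of synchronous data-parallel training: replicas compute local
# gradients, average them via an all-reduce stand-in, and apply the same
# update. No real GPUs or interconnects are involved.

def local_gradient(w, shard):
    # Mean-squared-error gradient for the model y = w * x on this shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # Stand-in for an NCCL all-reduce over NVLink/InfiniBand:
    # every replica receives the average of all local gradients.
    return sum(grads) / len(grads)

def train_step(weights, shards, lr=0.01):
    # One synchronous step: local compute, gradient transfer, shared update.
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    g = all_reduce_mean(grads)            # gradient transfer across replicas
    return [w - lr * g for w in weights]  # identical update on every replica

# Two replicas, each holding a shard of data drawn from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = [0.0, 0.0]
for _ in range(200):
    weights = train_step(weights, shards)
print(weights)  # replicas remain identical and approach w = 3
```

Because the averaged gradient is applied everywhere, the replicas never diverge; this is the invariant that frameworks such as PyTorch DistributedDataParallel maintain at scale.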