The NVIDIA H100 Tensor Core GPU offers significant performance and scalability improvements over its predecessor, the A100 SXM GPU. Its fourth-generation Tensor Cores double the clock-for-clock matrix math rates per SM on equivalent data types, and the chip carries 22% more streaming multiprocessors (SMs) running at roughly 30% higher clock frequencies than the A100; together these factors yield about 3x the Tensor Core throughput chip-to-chip, including on FP32 and FP64. The H100's new FP8 data type delivers four times the clock-for-clock, per-SM computational rate of FP16 on the A100, and, paired with the Transformer Engine, it accelerates AI calculations for transformer-based models such as large language models.

The GPU also features updated NVIDIA NVLink and NVIDIA NVSwitch technology, which provides a 3x increase in all-reduce throughput across eight GPUs within a single node and a 4.5x increase for 256 GPUs across 32 nodes, making it particularly useful for model parallelism and large-scale distributed training. In real-world deep learning applications, the speedup varies by workload, with language models typically benefiting more than vision-based models. Overall, the H100 is optimized for the largest models, especially transformer-based architectures, whether in large language, vision, or life-sciences applications, including those that can exploit structured sparsity.
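
To make the headline figures concrete, here is a back-of-the-envelope check of how the per-SM, SM-count, and clock factors compose. The 108 and 132 SM counts are the published A100/H100 SXM figures; everything else comes from the ratios quoted above, so treat this as a sketch rather than measured data.

```python
# Back-of-the-envelope check of the headline speedups (a sketch built from
# the ratios quoted above, not measured data).

a100_sms, h100_sms = 108, 132     # published A100 / H100 SXM SM counts (+22%)
clock_ratio = 1.30                # ~30% higher clock frequency
per_sm_mma_ratio = 2.0            # 4th-gen Tensor Core rate per SM, clock-for-clock
fp8_vs_a100_fp16 = 4.0            # FP8 rate per SM vs. FP16 on A100

sm_ratio = h100_sms / a100_sms    # ~1.22
chip_ratio = per_sm_mma_ratio * sm_ratio * clock_ratio
print(f"equivalent data types: ~{chip_ratio:.1f}x")   # ~3.2x, i.e. the "3x" figure
print(f"FP8 vs A100 FP16: ~{fp8_vs_a100_fp16 * sm_ratio * clock_ratio:.1f}x")  # ~6.4x
```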
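
As a sketch of how the FP8 path is exercised in practice through NVIDIA's Transformer Engine PyTorch package (transformer_engine), a minimal forward pass might look like the following; the layer sizes are arbitrary, and the exact recipe arguments should be treated as assumptions since they vary across library versions.

```python
# Minimal FP8 forward pass through NVIDIA Transformer Engine (PyTorch
# bindings). Module and recipe names follow the transformer_engine package;
# treat the exact recipe arguments as assumptions, as they vary by version.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

layer = te.Linear(4096, 4096, bias=True).cuda()   # TE drop-in Linear layer
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

# Delayed scaling: FP8 scale factors are derived from a history of amax values.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)   # the matmul runs on FP8 Tensor Cores where supported
```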
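
The NVLink/NVSwitch numbers are claims about collective-communication throughput, which can be checked directly. A minimal all-reduce timing sketch using torch.distributed over NCCL (the path that drives NVLink within a node) might look like this; the script name in the launch command is illustrative.

```python
# All-reduce timing sketch with torch.distributed over NCCL, the path that
# exercises NVLink/NVSwitch inside a node. Launch with, e.g.:
#   torchrun --nproc_per_node=8 allreduce_bench.py   (script name illustrative)
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

x = torch.randn(64 * 1024 * 1024, device="cuda")  # 64M FP32 elements = 256 MiB

for _ in range(5):                 # warm-up iterations
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / iters

if dist.get_rank() == 0:
    print(f"all-reduce of {x.numel() * 4 / 2**30:.2f} GiB: {dt * 1e3:.2f} ms/iter")
dist.destroy_process_group()
```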