Stable Diffusion, a popular open-source text-to-image model, is expensive to pre-train because of its scale and computational intensity. To address this, an advanced pre-training solution for Stable Diffusion v2 models is introduced, built on Ray and the Anyscale Platform to improve scalability and cost efficiency.

The solution preprocesses the training data offline, which raises training throughput by 1.45x and cuts training costs by 18%, while giving fine-grained control over the concurrency and batch size of each preprocessing stage. Training is made fault tolerant with Ray Train: when a hardware or software failure occurs, the cluster is automatically rescaled, the latest checkpoint is restored from cloud storage, and training resumes from that point. On top of this, a set of optimizations, including Elastic Fabric Adapter (EFA), Fully Sharded Data Parallel (FSDP), and torch.compile, accelerates U-Net training and improves throughput by roughly 3x over a vanilla PyTorch setup. Altogether, these techniques bring the cost of pre-training a Stable Diffusion model to under $40,000, a significant improvement over traditional methods.
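As a rough illustration of the offline preprocessing idea, the sketch below uses Ray Data's `map_batches` to run a per-stage transform with its own batch size and actor-pool concurrency. It is a minimal stand-in, not the actual pipeline from this work: the `DummyEncoder` class, the synthetic `range` dataset, and the chosen batch size and concurrency values are placeholders, and the `concurrency` argument assumes a recent Ray release (2.9+).

```python
import numpy as np
import ray


class DummyEncoder:
    """Placeholder for a heavy preprocessing stage (e.g. a frozen VAE or
    text encoder) that is loaded once per actor and applied to batches."""

    def __call__(self, batch: dict) -> dict:
        # A real pipeline would encode images/captions here; we emit dummy latents.
        batch["latent"] = np.random.rand(len(batch["id"]), 4).astype("float32")
        return batch


# Stand-in for reading the raw image/caption dataset from cloud storage.
ds = ray.data.range(1_000)

# Each stage gets its own batch size and actor-pool size, which is the
# fine-grained control over concurrency and batching described above.
ds = ds.map_batches(
    DummyEncoder,
    batch_size=256,   # tuned per stage
    concurrency=4,    # number of actors running this stage
)

print(ds.take(1))
```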
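For the fault-tolerance piece, the following sketch shows how Ray Train can be configured to retry after failures and resume from the latest checkpoint in cloud storage. The S3 bucket path, the tiny `torch.nn.Linear` model standing in for the U-Net, and the epoch counts are hypothetical; the relevant parts are `FailureConfig(max_failures=...)`, `storage_path`, and restoring state from `ray.train.get_checkpoint()` inside the training loop.

```python
import os
import tempfile

import torch
import ray.train
from ray.train import Checkpoint, FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    model = torch.nn.Linear(8, 8)  # stand-in for the U-Net
    start_epoch = 0

    # After a failure and restart, Ray Train hands back the latest checkpoint
    # that was reported to storage_path, so training continues from there.
    ckpt = ray.train.get_checkpoint()
    if ckpt:
        with ckpt.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "model.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, config["epochs"]):
        # ... real training step would go here ...
        with tempfile.TemporaryDirectory() as tmp:
            torch.save(
                {"model": model.state_dict(), "epoch": epoch},
                os.path.join(tmp, "model.pt"),
            )
            ray.train.report(
                {"epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmp),
            )


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 3},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    run_config=RunConfig(
        storage_path="s3://my-bucket/sd-checkpoints",  # hypothetical bucket
        failure_config=FailureConfig(max_failures=3),  # retry on worker failures
    ),
)
trainer.fit()
```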
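Finally, a hedged fragment of how the FSDP and torch.compile optimizations might be applied to the U-Net inside each training worker. This is not the authors' exact setup: it assumes the function is called in a worker where the distributed process group is already initialized (for example by `TorchTrainer`) and each worker owns one GPU, and `wrap_unet_for_training` is a hypothetical helper name.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def wrap_unet_for_training(unet: torch.nn.Module) -> torch.nn.Module:
    """Shard the U-Net with FSDP, then compile it.

    Assumes an initialized process group and one GPU per worker.
    """
    sharded = FSDP(
        unet,
        device_id=torch.cuda.current_device(),
        use_orig_params=True,  # keeps original parameters visible for torch.compile
    )
    # torch.compile reduces Python overhead and fuses kernels in the U-Net
    # forward/backward passes, one of the optimizations behind the ~3x speedup.
    return torch.compile(sharded)
```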