Training LLMs on AMD GPUs?

Anel Music
2 min read · Sep 3, 2023

--

In case you’ve missed it: thanks to MosaicML, LLM training is now supported on AMD MI250 GPUs, which finally reduces the dependency on NVIDIA systems.

On AMD MI250 GPUs, the training stack achieved results comparable to NVIDIA A100 systems. Loss curves remained consistent (see below), even when switching between AMD and NVIDIA GPUs within a single training run.

Profiling of training throughput for MPT models ranging from 1B to 13B parameters revealed that MI250 GPUs reached approximately 80% of the per-GPU throughput of the A100-40GB and 73% of the A100-80GB. Of course, as AMD's software stack continues to improve, this performance gap is expected to narrow further.

The best part: the LLM Foundry training stack required no code modifications when transitioning from NVIDIA to AMD systems. PyTorch, FSDP, Composer, StreamingDataset, and the LLM Foundry components worked seamlessly with existing training workflows.
While the initial tests were performed on a single node with 4x MI250 GPUs, further validation is being conducted on larger AMD GPU clusters.
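The "no code changes" claim rests on the fact that PyTorch's ROCm build exposes AMD GPUs through the same `torch.cuda` API, so existing `"cuda"` device strings keep working. A minimal sketch of that device-selection logic (the `pick_device` helper is illustrative, not part of LLM Foundry):

```python
# Sketch: device selection that runs unchanged on NVIDIA (CUDA) and AMD
# (ROCm) systems. PyTorch built for ROCm reports AMD GPUs through the
# same torch.cuda namespace, so "cuda" device strings need no edits.
# pick_device is an illustrative helper, not an LLM Foundry API.

def pick_device(gpu_available: bool) -> str:
    """Return the device string a training script would pass to .to()."""
    # On a ROCm build, torch.cuda.is_available() returns True on AMD
    # GPUs, so this single branch covers both vendors.
    return "cuda" if gpu_available else "cpu"


if __name__ == "__main__":
    # In a real script: device = pick_device(torch.cuda.is_available())
    print(pick_device(True))   # "cuda" on either vendor's GPU
    print(pick_device(False))  # "cpu" fallback
```

In other words, the vendor switch happens at the PyTorch build level, not in the training script.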

AMD MI250 GPU Features:
The AMD MI250 is a datacenter GPU equipped with High Bandwidth Memory (HBM) and so-called Matrix Cores, AMD's counterpart to NVIDIA's Tensor Cores.
Key features include:
- Higher peak TFLOP/s than NVIDIA A100 in FP16 or BF16 formats
- Larger HBM capacity (128GB) than even the largest A100 (80GB)
- Comparable power consumption per GPU to NVIDIA A100, with potential for better power efficiency
- Typically bundled in smaller system configurations with 4 GPUs, while NVIDIA A100 systems traditionally feature 8 GPUs

AMD has developed the ROCm software stack as an alternative to CUDA, RCCL as an alternative to NCCL, and Infinity Fabric as a substitute for NVSwitch within a node. For inter-node communication, AMD systems support InfiniBand or RDMA over Converged Ethernet, providing networking infrastructure similar to NVIDIA's offerings.
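The RCCL swap is similarly transparent in practice: PyTorch's ROCm build backs the `"nccl"` backend name with RCCL, so `torch.distributed.init_process_group(backend="nccl")` works unchanged on AMD clusters. A hedged sketch of the backend choice (`select_backend` is an illustrative helper, not a PyTorch or LLM Foundry API):

```python
# Sketch: choosing a torch.distributed backend name. On ROCm builds of
# PyTorch, the "nccl" backend name is implemented by AMD's RCCL, so the
# same string works on NVIDIA and AMD clusters alike.
# select_backend is an illustrative helper, not a real API.

def select_backend(gpu_available: bool) -> str:
    """Backend name to pass to torch.distributed.init_process_group()."""
    # "nccl" -> NCCL on CUDA builds, RCCL on ROCm builds; "gloo" is the
    # usual CPU fallback.
    return "nccl" if gpu_available else "gloo"


if __name__ == "__main__":
    # In a real launcher: backend = select_backend(torch.cuda.is_available())
    print(select_backend(True))   # "nccl" (RCCL under the hood on AMD)
    print(select_backend(False))  # "gloo" for CPU-only runs
```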

As Károly Zsolnai-Fehér would say — What a time to be alive!
