Qwerky-72B Model Trained Efficiently on 8 AMD MI300X GPUs

April 01, 2025
Tags: Qwerky-72B, AMD MI300X GPU, AI training, large-scale models, attention-free models
The Qwerky-72B model, a large attention-free model, was successfully trained using only 8 AMD MI300X GPUs, showcasing the efficiency and scalability of AMD's MI300X accelerators for large-scale model training.

Video: AMD MI300X server review 8x GPUs | Llama 405b model tested

The Qwerky-72B model, a large attention-free model, was trained using only 8 AMD MI300X GPUs, demonstrating that AMD's MI300X accelerators can handle large-scale model training efficiently. Each MI300X pairs 192 GB of high-bandwidth memory (HBM3) with a large number of compute units (CUs), giving a single 8-GPU node enough memory capacity and compute throughput for a 72B-parameter workload.
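
The Hacker News discussion listed under Sources sketches the training recipe at a high level: take an existing transformer, freeze all of its weights, delete the attention layers, replace them with RWKV layers, and train the result. Below is a minimal PyTorch sketch of that idea; the RWKVTimeMix block is a simplified stand-in for a real RWKV time-mixing layer, and the `model.layers` / `self_attn` attribute names are illustrative assumptions, not the actual Qwerky training code.

```python
import torch
import torch.nn as nn

class RWKVTimeMix(nn.Module):
    """Simplified stand-in for an RWKV-style, attention-free time-mixing
    block. A real RWKV layer adds token-shift and other details; the key
    property shown here is the recurrent state, linear in sequence length."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.receptance = nn.Linear(hidden_size, hidden_size, bias=False)
        self.key = nn.Linear(hidden_size, hidden_size, bias=False)
        self.value = nn.Linear(hidden_size, hidden_size, bias=False)
        self.output = nn.Linear(hidden_size, hidden_size, bias=False)
        self.decay = nn.Parameter(torch.zeros(hidden_size))  # per-channel decay

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden). One recurrent step per token,
        # so there is no seq_len x seq_len attention matrix.
        r = torch.sigmoid(self.receptance(x))
        k, v = self.key(x), self.value(x)
        w = torch.exp(-torch.exp(self.decay))  # decay factor in (0, 1)
        state = torch.zeros_like(x[:, 0])      # (batch, hidden) running state
        outs = []
        for t in range(x.size(1)):
            state = w * state + k[:, t] * v[:, t]  # exponentially decayed memory
            outs.append(r[:, t] * state)           # gated readout
        return self.output(torch.stack(outs, dim=1))

def convert_to_attention_free(model: nn.Module, hidden_size: int) -> nn.Module:
    # 1) Freeze every weight of the pretrained transformer.
    for p in model.parameters():
        p.requires_grad = False
    # 2) Replace each attention block with a fresh RWKV-style mixer.
    #    `model.layers` and `.self_attn` are hypothetical attribute names;
    #    modules added after the freeze remain trainable by default.
    for layer in model.layers:
        layer.self_attn = RWKVTimeMix(hidden_size)
    return model
```

Because the pretrained weights are frozen before the swap, only the newly inserted mixers receive gradient updates, which is what makes this style of conversion comparatively cheap to train.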

For more detailed insights into the training process and the specific optimizations applied to leverage the MI300X hardware, you can refer to the following resources:

Sources

Qwerky-72B trained using only 8 AMD MI300X GPUs | Hacker News: "At a high level, you take an existing transformer model, freeze all the weights, delete the attention layer, replace it with RWKV, and train it ..."
Qwerky-72B and 32B: Training large attention free models, with only 8 GPU's: "‼️ Attention is NOT all you need ‼️"
Training Transformers and Hybrid models on AMD Instinct MI300X ...: "We explain how Zyphra harnessed the hardware advantages of the MI300X for training both dense transformers and Zyphra's hybrid models."