Fine-tuning 120b on 8 H100s: getting CUDA OOM error

#117
by jinxu88 - opened

I am using this script (https://github.com/huggingface/gpt-oss-recipes/blob/main/README.md) for tuning gpt-oss 120b. The OpenAI blog (https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers) mentions it is doable on a single H100, but I keep getting OOM. Has anyone successfully fine-tuned it on H100?
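For context, a rough back-of-envelope (my own estimate, not from the recipes): 120B parameters in bf16 are about 240 GB of weights alone, and full fine-tuning with AdamW needs roughly 16 bytes per parameter for weights, gradients, and optimizer states, i.e. on the order of 1.9 TB. That is well beyond 8 × 80 GB = 640 GB of H100 memory, so without ZeRO-3 sharding plus offload, or a PEFT method like (Q)LoRA, OOM seems expected.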

The example you linked seems to SFT 20b, not 120b...

I am interested in hearing whether anyone has managed a successful fine-tune run of 120b on H100, as the model card below mentions. The GitHub link was only provided as a reference; it did not work for the 120b model on H100.

The model card mentions:

Fine-tuning
Both gpt-oss models can be fine-tuned for a variety of specialized use cases.
This larger model gpt-oss-120b can be fine-tuned on a single H100 node, whereas the smaller gpt-oss-20b can even be fine-tuned on consumer hardware.

Pretty sure it meant QLoRA with DeepSpeed ZeRO-3 and other optimizations.
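Something along these lines is what I'd expect, i.e. 4-bit base weights with LoRA adapters on top (untested sketch; the model ID is the official one, but the `target_modules` names are my guess for this architecture):

```python
# Untested QLoRA sketch: 4-bit quantized base model + LoRA adapters.
# Assumes bitsandbytes 4-bit loading works for this architecture;
# target_modules names are a guess, not taken from the recipes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    quantization_config=bnb,
    device_map="auto",  # shard the quantized base across all 8 GPUs
)

lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```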

I'm still facing this issue.
I couldn't run fine-tuning of 120b with DeepSpeed ZeRO-3 due to OOM.
Hoping to hear that others have succeeded.
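In case anyone wants to compare notes, this is the shape of the ZeRO-3 setup I'd try next, with parameter and optimizer offload to CPU (illustrative values only, not a verified working run for gpt-oss-120b):

```python
# Illustrative ZeRO-3 config with CPU offload, passed to the HF Trainer
# as a dict. Values are guesses, not a verified setup for gpt-oss-120b.
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="gpt-oss-120b-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,  # trade compute for activation memory
    bf16=True,
    deepspeed=ds_config,  # Trainer initializes DeepSpeed from this dict
)
```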
