Fine-tuning 120b on 8 H100s: getting CUDA OOM error

#117
by jinxu88 - opened

I am using this script (https://github.com/huggingface/gpt-oss-recipes/blob/main/README.md) for tuning gpt-oss 120b. The OpenAI blog (https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers) mentions it is doable on a single H100, but I keep getting OOM. Has anyone successfully fine-tuned it on H100?
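For context, a rough back-of-envelope (my own estimate, not from the recipes): 120B parameters in bf16 are about 240 GB of weights alone, and full fine-tuning with AdamW needs roughly 16 bytes per parameter for weights, gradients, and optimizer states, i.e. on the order of 1.9 TB. That is well beyond 8 × 80 GB = 640 GB of H100 memory, so without ZeRO-3 sharding plus offload, or a PEFT method like (Q)LoRA, OOM seems expected.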

The example you linked seems to SFT 20b, not 120b...

I am interested in hearing whether anyone has managed a successful fine-tune run of 120b on H100, as the model card below mentions. The GitHub link was only provided as a reference; it did not work for the 120b model on H100.

The model card mentions:

Fine-tuning
Both gpt-oss models can be fine-tuned for a variety of specialized use cases.
This larger model gpt-oss-120b can be fine-tuned on a single H100 node, whereas the smaller gpt-oss-20b can even be fine-tuned on consumer hardware.

Pretty sure it meant QLoRA with DeepSpeed ZeRO-3 and other optimizations.
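Something along these lines is what I'd expect, i.e. 4-bit base weights with LoRA adapters on top (untested sketch; the model ID is the official one, but the `target_modules` names are my guess for this architecture):

```python
# Untested QLoRA sketch: 4-bit quantized base model + LoRA adapters.
# Assumes bitsandbytes 4-bit loading works for this architecture;
# target_modules names are a guess, not taken from the recipes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    quantization_config=bnb,
    device_map="auto",  # shard the quantized base across all 8 GPUs
)

lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```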

I'm still facing this issue.
I couldn't run fine-tuning of 120b with DeepSpeed ZeRO-3 due to OOM.
Hoping to hear that others have succeeded.
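In case anyone wants to compare notes, this is the shape of the ZeRO-3 setup I'd try next, with parameter and optimizer offload to CPU (illustrative values only, not a verified working run for gpt-oss-120b):

```python
# Illustrative ZeRO-3 config with CPU offload, passed to the HF Trainer
# as a dict. Values are guesses, not a verified setup for gpt-oss-120b.
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="gpt-oss-120b-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,  # trade compute for activation memory
    bf16=True,
    deepspeed=ds_config,  # Trainer initializes DeepSpeed from this dict
)
```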
