Can't reproduce the evaluation results on the GPQA dataset

#47
by Rinn000 - opened

I've tried zero-shot and few-shot prompts to evaluate the performance of this model. However, the result is far below the 60% accuracy shown in the linked blog post. Could you share your official benchmark process/prompt/code? By the way, the answer extraction and prompting are aligned with the format of the GPQA paper.
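For reference, this is roughly my setup. The prompt wording follows the GPQA paper's multiple-choice format, but the model name, example question, and generation settings below are my own choices, not an official recipe:

```python
# Rough sketch of my zero-shot setup; not the official harness.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B-Preview"  # placeholder for the model under discussion
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

def build_prompt(question: str, choices: list[str]) -> str:
    # Lettered options (A)-(D) plus an explicit answer-format instruction,
    # mirroring the GPQA paper's prompt.
    options = "\n".join(f"({l}) {c}" for l, c in zip("ABCD", choices))
    return (
        f"What is the correct answer to this question: {question}\n\n"
        f"Choices:\n{options}\n\n"
        'Format your response as follows: "The correct answer is (insert answer here)".'
    )

prompt = build_prompt(
    "Which particle mediates the electromagnetic interaction?",
    ["Gluon", "Photon", "W boson", "Graviton"],
)
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=2048, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```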

I have the same problem; I got 37% accuracy.
Question for the Qwen team:
What are your recommended hyperparameters for evaluation?
I got significantly lower results on almost all benchmarks mentioned in the presentation.

Same question. I would like to know whether the original article reports greedy decoding or sampling results.
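While we wait for an official answer, the two decoding modes differ as below (continuing from the snippet in the first post; the sampling values are ones commonly cited for Qwen chat models, but they are an assumption here, not a confirmed eval configuration):

```python
# Greedy decoding: deterministic, takes the argmax token at every step.
greedy = model.generate(input_ids, max_new_tokens=2048, do_sample=False)

# Sampling: stochastic. These hyperparameters are my guess at reasonable
# values, NOT confirmed settings from the Qwen team.
sampled = model.generate(
    input_ids,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
)
```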

[screenshot attached: 1735180092041.png]
I picked a GPQA example and tried it; the generated result is the same as Qwen-2.5's, and it's not even as good as a model I fine-tuned myself.

+1

Same here. Has anyone managed to reproduce the benchmarks? It seems hard to process the response properly for evaluation.
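Part of the gap may be extraction failures on long chain-of-thought outputs. Here is the rough parser I use; the regex patterns are my own guesses at the output format, not an official scorer:

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull a letter answer (A-D) out of a free-form model response.

    Tries the requested 'The correct answer is (X)' format first, then falls
    back to the last lettered option mentioned. Returns None when nothing
    matches; I count those as incorrect rather than dropping them.
    """
    m = re.search(r"correct answer is\s*\(?([A-D])\)?", response, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    matches = re.findall(r"\(([A-D])\)|\b([A-D])\b[.:]", response)
    if matches:
        first, second = matches[-1]
        return (first or second).upper()
    return None

# Example: extract_choice("...so the correct answer is (B).") -> "B"
```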
