Can't reproduce the evaluation results on the GPQA dataset

#47
by Rinn000 - opened

I've tried zero-shot and few-shot prompts to evaluate the performance of this model. However, the result is far below the 60% accuracy shown in the linked blog post. Could you share your official benchmark process/prompt/code? By the way, the answer extraction and prompting are aligned with the format of the GPQA paper.
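For reference, this is roughly my setup. The prompt wording follows the GPQA paper's multiple-choice format, but the model name, example question, and generation settings below are my own choices, not an official recipe:

```python
# Rough sketch of my zero-shot setup; not the official harness.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B-Preview"  # placeholder for the model under discussion
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

def build_prompt(question: str, choices: list[str]) -> str:
    # Lettered options (A)-(D) plus an explicit answer-format instruction,
    # mirroring the GPQA paper's prompt.
    options = "\n".join(f"({l}) {c}" for l, c in zip("ABCD", choices))
    return (
        f"What is the correct answer to this question: {question}\n\n"
        f"Choices:\n{options}\n\n"
        'Format your response as follows: "The correct answer is (insert answer here)".'
    )

prompt = build_prompt(
    "Which particle mediates the electromagnetic interaction?",
    ["Gluon", "Photon", "W boson", "Graviton"],
)
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=2048, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```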

I have the same problem; I got 37% accuracy.
Question for the Qwen team:
What are your recommended hyperparameters for evaluation?
I got significantly lower results on almost all benchmarks mentioned in the presentation.

Same question. I would like to know whether the original article reports greedy decoding or sampling results.
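While we wait for an official answer, the two decoding modes differ as below (continuing from the snippet in the first post; the sampling values are ones commonly cited for Qwen chat models, but they are an assumption here, not a confirmed eval configuration):

```python
# Greedy decoding: deterministic, takes the argmax token at every step.
greedy = model.generate(input_ids, max_new_tokens=2048, do_sample=False)

# Sampling: stochastic. These hyperparameters are my guess at reasonable
# values, NOT confirmed settings from the Qwen team.
sampled = model.generate(
    input_ids,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
)
```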

[screenshot attached: 1735180092041.png]
I picked a GPQA example and tried it; the generated result is the same as Qwen-2.5's, and it's not even as good as a model I fine-tuned myself.

+1

Same here. Has anyone managed to reproduce the benchmarks? It seems hard to process the response properly for evaluation.
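Part of the gap may be extraction failures on long chain-of-thought outputs. Here is the rough parser I use; the regex patterns are my own guesses at the output format, not an official scorer:

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull a letter answer (A-D) out of a free-form model response.

    Tries the requested 'The correct answer is (X)' format first, then falls
    back to the last lettered option mentioned. Returns None when nothing
    matches; I count those as incorrect rather than dropping them.
    """
    m = re.search(r"correct answer is\s*\(?([A-D])\)?", response, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    matches = re.findall(r"\(([A-D])\)|\b([A-D])\b[.:]", response)
    if matches:
        first, second = matches[-1]
        return (first or second).upper()
    return None

# Example: extract_choice("...so the correct answer is (B).") -> "B"
```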
