Possible issue with evaluation scores of Falcon-H1 Models

#20
by rcojocaru - opened

Hi.

We were checking the results on the leaderboard for Falcon-H1 models, which unfortunately are much lower than what we expected based on our internal benchmarks. Amongst other things, we noticed that the results don't correlate well with model size, and that many scores are 0.

I would like to discuss some potential issues that might have affected the scores, and hopefully we can debug this together. I would also like to ask if it would be possible to rerun the evaluation after understanding the cause of the low scores, as we believe the present scores are not representative of the models' true capabilities.

  1. When checking the generation files, specifically for Falcon-H1-34B-Base, for example in:
    https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_raw_results/blob/main/tiiuae/Falcon-H1-34B-Base/raw_2025-07-15T04-30-48.728728/pretrained__tiiuae__Falcon-H1-34B-Base%2Cdtype__bfloat16%2Cdevice__cuda%3A0%2Crevision__main%2Ctrust_remote_code__True%2Cmax_gen_toks__1536%2Cstarting_max_length__2560_enem_challenge.jsonl
    I saw the following parameters:
    "original_fewshots_size": 3,
    "effective_fewshots_size": 0,
    Hence it is unclear whether the evaluation for ENEM was run 3-shot or 0-shot. Also, it was my understanding that all models should be run under a few-shot scenario on the leaderboard. Could you please confirm whether the evaluations for this model were run 0-shot or few-shot? (A sketch of this kind of raw-file inspection is included after this list.)

  2. In the above case I also saw that the generated output was:
    "resps": [
    [
    "\n"
    ]
    ]
    It seems the generation was stopped before the model could actually output any useful tokens. I tried to reproduce this under different scenarios:

  • under 0-shot, my tests showed that the generation indeed starts with a "\n", followed by the useful token (a letter A/B/C/D).
  • under 3-shot, my tests showed the model produced the answer letter directly, as expected.
  • in no case did the model produce just "\n", so either the generation was interrupted or a post-processing issue cut the answer off before the useful token.
  3. In some instances, we noticed that the generation was:
    "resps": [
    [
    " N"
    ]
    ]
    instead of "Não", or
    "resps": [
    [
    " Pos/Neg/Ne"
    ]
    ]
    instead of "Positivo"/"Negativo"/"Neutro". We again suspect that maybe the generation is incomplete or truncated.
    Here are some examples where the above issues can be found:
    https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_raw_results/blob/main/tiiuae/Falcon-H1-7B-Base/raw_2025-07-09T15-41-50.552612/pretrained__tiiuae__Falcon-H1-7B-Base%2Cdtype__bfloat16%2Cdevice__cuda%3A0%2Crevision__main%2Ctrust_remote_code__True%2Cmax_gen_toks__1536%2Cstarting_max_length__2560_assin2_rte.jsonl
    https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_raw_results/blob/main/tiiuae/Falcon-H1-3B-Base/raw_2025-07-09T04-02-33.419627/pretrained__tiiuae__Falcon-H1-3B-Base%2Cdtype__bfloat16%2Cdevice__cuda%3A0%2Crevision__main%2Ctrust_remote_code__True%2Cmax_gen_toks__1536%2Cstarting_max_length__2560_tweetsentbr.jsonl
    https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_raw_results/blob/main/tiiuae/Falcon-H1-0.5B-Base/raw_2025-07-08T05-07-27.156126/pretrained__tiiuae__Falcon-H1-0.5B-Base%2Cdtype__bfloat16%2Cdevice__cuda%3A0%2Crevision__main%2Ctrust_remote_code__True%2Cmax_gen_toks__1536%2Cstarting_max_length__2560_tweetsentbr.jsonl

  4. To try to replicate the issues mentioned above, I pulled the fork of lm-eval: https://github.com/eduagarcia/lm-evaluation-harness-pt
    Then I chose a smaller model, tiiuae/Falcon-H1-0.5B-Base, which had scored 0 on tweetSentBR, and ran this command:
    lm_eval --model hf --model_args pretrained=tiiuae/Falcon-H1-0.5B-Base,trust_remote_code=True,apply_chat_template=False --tasks tweetsentbr --device cuda:0 --batch_size auto --output_path results/tiiuae/Falcon-H1-0.5B-Base --log_samples
    The results and samples are available here: https://huggingface.co/datasets/rcojocaru/eval-results-Falcon-H1-0.5B-Base-tweetsentbr/tree/main
    The score I got was higher (54) and the generations looked complete, with no truncation. This reinforces my belief that some issue affected most of the evaluations of the H1 models, perhaps related to the lm-eval branch or the version of transformers used. In my case I used the main branch of the fork and transformers version 4.55.3.
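
For reference, this kind of inspection of the raw result files can be scripted. Below is a minimal sketch (not the exact script we used), assuming the huggingface_hub library; the repository and file names are taken from the first link above (URL-decoded), and the field names come from the JSONL records themselves:

    # Download one raw results file from the leaderboard dataset and inspect
    # the few-shot fields and the raw generations.
    import json
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="eduagarcia-temp/llm_pt_leaderboard_raw_results",
        repo_type="dataset",
        filename=(
            "tiiuae/Falcon-H1-34B-Base/raw_2025-07-15T04-30-48.728728/"
            "pretrained__tiiuae__Falcon-H1-34B-Base,dtype__bfloat16,device__cuda:0,"
            "revision__main,trust_remote_code__True,max_gen_toks__1536,"
            "starting_max_length__2560_enem_challenge.jsonl"
        ),
    )

    with open(path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            # Was the prompt actually built with 3 shots, or reduced to 0?
            print(sample.get("original_fewshots_size"), sample.get("effective_fewshots_size"))
            # "resps" holds the raw generation; a bare "\n" means no useful token was produced.
            print(repr(sample["resps"][0][0]))
            break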

Please let me know your feedback on the above issues when possible. Thank you!

Best regards,
Ruxandra

Thank you for the detailed issue; it helped a lot.

I ran some tests and found that the issue was caused by a couple of bugs I introduced while trying to add support for reasoning models, combined with some behavior changes in the Hugging Face tokenizer library. The bug is triggered when max_gen_toks=X is passed via --model_args, which is why the problem did not show up in your local tests.
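
For concreteness, the buggy code path is only reached with the extra leaderboard model args. The command below is just a sketch of such a run: the model_args values are the ones encoded in the result filenames, while the task, device, batch size and output flags simply mirror the command from the report above.

    lm_eval --model hf \
      --model_args pretrained=tiiuae/Falcon-H1-0.5B-Base,dtype=bfloat16,trust_remote_code=True,max_gen_toks=1536,starting_max_length=2560 \
      --tasks tweetsentbr --device cuda:0 --batch_size auto \
      --output_path results/tiiuae/Falcon-H1-0.5B-Base --log_samples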

I made a fix in the following commit:
https://github.com/eduagarcia/lm-evaluation-harness-pt/commit/ca07215614ee28353ac61e1d745943089b3fa4f8

I will rerun the Falcon-H1 models and the other affected models over the next few days.

Here's a list of affected models: https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_requests/commit/dcd7342da2fde2d1cdd61e19a87b7df12b7ebc8a

allenai/OLMo-2-0325-32B
allenai/OLMo-2-0325-32B-Instruct
BornSaint/Dare_Angel_8B
byroneverson/gemma-2-27b-it-abliterated
CEIA-UFG/Gemma-3-Gaia-PT-BR-4b-it
deepcogito/cogito-v1-preview-llama-70B
deepseek-ai/DeepSeek-R1-Distill-Llama-8B
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
google/gemma-3n-E2B
google/gemma-3n-E2B-it
google/gemma-3n-E4B
google/gemma-3n-E4B-it
google/medgemma-27b-it
google/medgemma-27b-text-it
google/medgemma-4b-it
google/medgemma-4b-pt
HuggingFaceTB/SmolLM3-3B
HuggingFaceTB/SmolLM3-3B-Base
huihui-ai/DeepSeek-R1-Distill-Qwen-14B-abliterated-v2
langtech-languagemodeling/IberianLLM-7B-Instruct
migtissera/Tess-v2.5-Gemma-2-27B-alpha
prithivMLmods/Qwen2.5-14B-DeepSeek-R1-1M
qihoo360/TinyR1-32B-Preview
Qwen/Qwen2.5-14B
Qwen/Qwen2.5-14B-Instruct
TheDrummer/Big-Tiger-Gemma-27B-v1
THUDM/GLM-4-32B-0414
THUDM/GLM-Z1-32B-0414
THUDM/GLM-Z1-Rumination-32B-0414
TIGER-Lab/Qwen2.5-32B-Instruct-CFT
tiiuae/Falcon-H1-0.5B-Base
tiiuae/Falcon-H1-0.5B-Instruct
tiiuae/Falcon-H1-1.5B-Deep-Base
tiiuae/Falcon-H1-1.5B-Deep-Instruct
tiiuae/Falcon-H1-1.5B-Instruct
tiiuae/Falcon-H1-34B-Base
tiiuae/Falcon-H1-3B-Base
tiiuae/Falcon-H1-3B-Instruct
tiiuae/Falcon-H1-7B-Base
tiiuae/Falcon-H1-7B-Instruct
v000000/Qwen2.5-Lumen-14B
YOYO-AI/Qwen2.5-32B-YOYO-reasoning
zetasepic/Qwen2.5-32B-Instruct-abliterated-v2

It's great that you found the issue. Thank you for the update!
Yes, when I tried to reproduce the eval, I did not use the max_gen_toks flag, as I did not see it in the README. If it is normally used, may I ask which value is used for each task?

The additional arguments are --model_args max_gen_toks=1536,starting_max_length=2560
They change the results of reasoning models by increasing the output length and by parsing out the thinking part before the final answer, but non-reasoning models like Falcon-H1 shouldn't be affected by these arguments; they should give the same results regardless.
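
To illustrate what "parsing out the thinking part" means: the idea is to strip the delimited reasoning block from the generation before the final answer is scored. The snippet below is only an illustration of that idea, not the actual code in the fork, and it assumes the reasoning is wrapped in <think> tags:

    import re

    def strip_thinking(text: str) -> str:
        # Drop a <think>...</think> block if the model emitted one,
        # then keep whatever follows as the candidate answer.
        return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

    print(strip_thinking("<think>some reasoning...</think>\nPositivo"))  # -> Positivo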

It's a feature that is still in development, so there may be changes in the future.

The filename also encodes all the arguments passed to the evaluation.

Ex:
pretrained__tiiuae__Falcon-H1-0.5B-Base,dtype__bfloat16,device__cuda:0,revision__main,trust_remote_code__True,max_gen_toks__1536,starting_max_length__2560_tweetsentbr
is the same as:
--model_args pretrained=tiiuae/Falcon-H1-0.5B-Base,dtype=bfloat16,device=cuda:0,revision=main,trust_remote_code=True,max_gen_toks=1536,starting_max_length=2560
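
Judging from that example, the mapping is mechanical: "," separates arguments, the first "__" in each part separates the key from its value, and both "=" and "/" were flattened to "__" in the filename. A throwaway decoding sketch (the function name is just for illustration, it is not part of the harness):

    def decode_model_args(stem: str) -> dict:
        # stem example: "pretrained__tiiuae__Falcon-H1-0.5B-Base,dtype__bfloat16,device__cuda:0,..."
        # (the "_<task_name>" suffix at the end of the filename should be removed first)
        args = {}
        for part in stem.split(","):
            key, _, value = part.partition("__")
            args[key] = value.replace("__", "/")  # restore "/" inside values such as the model id
        return args

    print(decode_model_args("pretrained__tiiuae__Falcon-H1-0.5B-Base,dtype__bfloat16"))
    # {'pretrained': 'tiiuae/Falcon-H1-0.5B-Base', 'dtype': 'bfloat16'}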
