open-llm-leaderboard/open_llm_leaderboard
Merging the DeepSeek R1 distillation into Llama 3.1 8B (at 10% task arithmetic weight, using the Llama 3.1 8B base model rather than the Instruct model as the base) with a prior best merge resulted in a slightly lower IFEval, but a higher result on every other benchmark except MMLU-PRO, which dropped only marginally. MATH Lvl 5 and GPQA rose noticeably.
grimjim/DeepSauerHuatuoSkywork-R1-o1-Llama-3.1-8B
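For those curious about the mechanics, below is a minimal Python sketch of the task-arithmetic step, assuming `torch` and `transformers` are available and that all three models share the Llama 3.1 8B architecture. The actual merge was presumably produced with a dedicated merge tool; the prior-merge path below is a placeholder, not a real repository.

```python
# Sketch of task arithmetic at 10% weight: add 10% of the task vector
# (donor minus base) to each tensor of the prior merge.
import torch
from transformers import AutoModelForCausalLM

BASE = "meta-llama/Llama-3.1-8B"                    # common base model
DONOR = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # R1 distillation
PRIOR_MERGE = "path/to/prior-best-merge"            # placeholder path
WEIGHT = 0.1                                        # 10% task arithmetic weight

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
donor = AutoModelForCausalLM.from_pretrained(DONOR, torch_dtype=torch.bfloat16)
merged = AutoModelForCausalLM.from_pretrained(PRIOR_MERGE, torch_dtype=torch.bfloat16)

base_sd = base.state_dict()
donor_sd = donor.state_dict()

with torch.no_grad():
    for name, tensor in merged.state_dict().items():
        # Task vector for this tensor, scaled and added in place.
        tensor += WEIGHT * (donor_sd[name] - base_sd[name])

merged.save_pretrained("merged-model")
```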
This is my best Llama 3.1 8B merge result to date. The R1 distillation itself scored quite badly on its own, so this would seem to be another case of unexpected output formatting (reflected in IFEval) dragging down evaluation results and obscuring the underlying strength of a model.
This model can also be used to generate roleplay completions. In informal testing, its bias toward problem-solving subtly affects narration.