vllm (pretrained=/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-writer,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.924 | ± | 0.0168 |
| | | strict-match | 5 | exact_match | ↑ | 0.920 | ± | 0.0172 |
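The header lines in this card are lm-evaluation-harness run headers using the vLLM backend. The card does not show the actual command, so the following is only a hedged reconstruction of the GSM8K run from the parameters in the header (model path, add_bos_token, max_model_len, dtype, limit, num_fewshot, batch_size), using the harness's Python API:

```python
# Hypothetical reconstruction of the GSM8K run from the header line above.
# The exact invocation is not given in the card; treat this as a sketch.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-writer,"
        "add_bos_token=true,max_model_len=4096,dtype=bfloat16"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,      # n-shot = 5
    limit=250,          # limit = 250.0 here; 500 for the second table
    batch_size="auto",
)
print(results["results"]["gsm8k"])  # exact_match under the flexible-extract and strict-match filters
```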
vllm (pretrained=/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-writer,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.918 | ± | 0.0123 |
| | | strict-match | 5 | exact_match | ↑ | 0.908 | ± | 0.0129 |
vllm (pretrained=/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-writer,add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu | 2 | none | | acc | ↑ | 0.7977 | ± | 0.0131 |
| - humanities | 2 | none | | acc | ↑ | 0.8256 | ± | 0.0263 |
| - other | 2 | none | | acc | ↑ | 0.8205 | ± | 0.0265 |
| - social sciences | 2 | none | | acc | ↑ | 0.8556 | ± | 0.0256 |
| - stem | 2 | none | | acc | ↑ | 0.7263 | ± | 0.0249 |
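The MMLU tables come from a separate run with batch_size 1 and vLLM capped at 3 concurrent sequences (max_num_seqs=3), limited to 15 examples per subtask. A minimal sketch of that run, assuming the same harness API and default few-shot settings (again a reconstruction, not the card author's actual command):

```python
# Hypothetical reconstruction of the MMLU run; limit, max_num_seqs and batch size
# are taken from the header line, everything else is assumed defaults.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-writer,"
        "add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3"
    ),
    tasks=["mmlu"],
    limit=15,       # 15 examples per MMLU subtask
    batch_size=1,
)
# Aggregate accuracy for the "mmlu" group and its four categories; depending on the
# harness version these appear under results["groups"] and/or results["results"].
print(results.get("groups", results["results"]))
```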
vllm (pretrained=/root/autodl-tmp/70-512-df10-uc,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.904 | ± | 0.0187 |
| | | strict-match | 5 | exact_match | ↑ | 0.904 | ± | 0.0187 |
vllm (pretrained=/root/autodl-tmp/70-512-df10-uc,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.904 | ± | 0.0132 |
| | | strict-match | 5 | exact_match | ↑ | 0.900 | ± | 0.0134 |
vllm (pretrained=/root/autodl-tmp/70-512-df10-uc,add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu | 2 | none | | acc | ↑ | 0.7836 | ± | 0.0132 |
| - humanities | 2 | none | | acc | ↑ | 0.8103 | ± | 0.0261 |
| - other | 2 | none | | acc | ↑ | 0.8000 | ± | 0.0269 |
| - social sciences | 2 | none | | acc | ↑ | 0.8556 | ± | 0.0248 |
| - stem | 2 | none | | acc | ↑ | 0.7088 | ± | 0.0256 |
vllm (pretrained=/root/autodl-tmp/70-512-df8,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.896 | ± | 0.0193 |
| | | strict-match | 5 | exact_match | ↑ | 0.896 | ± | 0.0193 |
vllm (pretrained=/root/autodl-tmp/70-512-df8,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.894 | ± | 0.0138 |
| | | strict-match | 5 | exact_match | ↑ | 0.886 | ± | 0.0142 |
vllm (pretrained=/root/autodl-tmp/70-512-df8,add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu | 2 | none | | acc | ↑ | 0.7801 | ± | 0.0134 |
| - humanities | 2 | none | | acc | ↑ | 0.8154 | ± | 0.0261 |
| - other | 2 | none | | acc | ↑ | 0.7795 | ± | 0.0280 |
| - social sciences | 2 | none | | acc | ↑ | 0.8444 | ± | 0.0263 |
| - stem | 2 | none | | acc | ↑ | 0.7158 | ± | 0.0254 |
vllm (pretrained=/root/autodl-tmp/70-512-df10,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.920 | ± | 0.0172 |
| | | strict-match | 5 | exact_match | ↑ | 0.920 | ± | 0.0172 |
vllm (pretrained=/root/autodl-tmp/70-512-df10,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.910 | ± | 0.0128 |
| | | strict-match | 5 | exact_match | ↑ | 0.906 | ± | 0.0131 |
vllm (pretrained=/root/autodl-tmp/70-512-df10,add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu | 2 | none | | acc | ↑ | 0.7836 | ± | 0.0133 |
| - humanities | 2 | none | | acc | ↑ | 0.8103 | ± | 0.0267 |
| - other | 2 | none | | acc | ↑ | 0.7949 | ± | 0.0271 |
| - social sciences | 2 | none | | acc | ↑ | 0.8556 | ± | 0.0252 |
| - stem | 2 | none | | acc | ↑ | 0.7123 | ± | 0.0257 |
vllm (pretrained=/root/autodl-tmp/70-512-df11,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.912 | ± | 0.0180 |
| | | strict-match | 5 | exact_match | ↑ | 0.908 | ± | 0.0183 |
vllm (pretrained=/root/autodl-tmp/70-512-df11,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.902 | ± | 0.0133 |
| | | strict-match | 5 | exact_match | ↑ | 0.896 | ± | 0.0137 |
vllm (pretrained=/root/autodl-tmp/70-512-df11,add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu | 2 | none | | acc | ↑ | 0.7953 | ± | 0.0129 |
| - humanities | 2 | none | | acc | ↑ | 0.8308 | ± | 0.0251 |
| - other | 2 | none | | acc | ↑ | 0.7846 | ± | 0.0275 |
| - social sciences | 2 | none | | acc | ↑ | 0.8722 | ± | 0.0241 |
| - stem | 2 | none | | acc | ↑ | 0.7298 | ± | 0.0247 |
vllm (pretrained=/root/autodl-tmp/86-512-uc,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.888 | ± | 0.0200 |
| | | strict-match | 5 | exact_match | ↑ | 0.884 | ± | 0.0203 |
vllm (pretrained=/root/autodl-tmp/86-512-uc,add_bos_token=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.880 | ± | 0.0145 |
| | | strict-match | 5 | exact_match | ↑ | 0.872 | ± | 0.0150 |
vllm (pretrained=/root/autodl-tmp/86-512-uc,add_bos_token=true,max_model_len=4096,dtype=bfloat16,max_num_seqs=3), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu | 2 | none | | acc | ↑ | 0.7743 | ± | 0.0134 |
| - humanities | 2 | none | | acc | ↑ | 0.7897 | ± | 0.0280 |
| - other | 2 | none | | acc | ↑ | 0.7744 | ± | 0.0280 |
| - social sciences | 2 | none | | acc | ↑ | 0.8833 | ± | 0.0233 |
| - stem | 2 | none | | acc | ↑ | 0.6947 | ± | 0.0258 |