Update README.md
README.md
CHANGED
@@ -81,24 +81,41 @@ See the Falcon 180B model card for an example of this.

## Performance

| Model | Average | 2 LC | BBH | DROP | GSM8k | IFEval | MATH | MMLU | Safety | PopQA | TruthQA |
|-------|---------|------|-----|------|-------|--------|------|------|--------|-------|---------|
| **Closed API models** | | | | | | | | | | | |
| GPT-3.5 Turbo 0125 | 59.6 | 38.7 | 66.6 | 70.2 | 74.3 | 66.9 | 41.2 | 70.2 | 69.1 | 45.0 | 62.9 |
| GPT 4o Mini 2024-07-18 | 65.7 | 49.7 | 65.9 | 36.3 | 83.0 | 83.5 | 67.9 | 82.2 | 84.9 | 39.0 | 64.8 |
| **Open weights models** | | | | | | | | | | | |
| Mistral-Nemo-Instruct-2407 | 50.9 | 45.8 | 54.6 | 23.6 | 81.4 | 64.5 | 31.9 | 70.0 | 52.7 | 26.9 | 57.7 |
| Ministral-8B-Instruct | 52.1 | 31.4 | 56.2 | 56.2 | 80.0 | 56.4 | 40.0 | 68.5 | 56.2 | 20.2 | 55.5 |
| Gemma-2-27b-it | 61.3 | 49.0 | 72.7 | 67.5 | 80.7 | 63.2 | 35.1 | 70.7 | 75.9 | 33.9 | 64.6 |
| Qwen2.5-32B | 66.5 | 39.1 | 82.3 | 48.3 | 87.5 | 82.4 | 77.9 | 84.7 | 82.4 | 26.1 | 70.6 |
| Mistral-Small-24B | 67.6 | 43.2 | 80.1 | 78.5 | 87.2 | 77.3 | 65.9 | 83.7 | 66.5 | 24.4 | 68.1 |
| Llama-3.1-70B | 70.0 | 32.9 | 83.0 | 77.0 | 94.5 | 88.0 | 56.2 | 85.2 | 76.4 | 46.5 | 66.8 |
| Llama-3.3-70B | 73.0 | 36.5 | 85.8 | 78.0 | 93.6 | 90.8 | 71.8 | 85.9 | 70.4 | 48.2 | 66.1 |
| Gemma-3-27b-it | - | 63.4 | 83.7 | 69.2 | 91.1 | - | - | 81.8 | - | 30.9 | - |
| **Fully open models** | | | | | | | | | | | |
| OLMo-2-7B-1124-Instruct | 55.7 | 31.0 | 48.5 | 58.9 | 85.2 | 75.6 | 31.3 | 63.9 | 81.2 | 24.6 | 56.3 |
| OLMo-2-13B-1124-Instruct | 61.4 | 37.5 | 58.4 | 72.1 | 87.4 | 80.4 | 39.7 | 68.6 | 77.5 | 28.8 | 63.9 |
| **OLMo-2-32B-0325-SFT** | 61.7 | 16.9 | 69.7 | 77.2 | 78.4 | 72.4 | 35.9 | 76.1 | 93.8 | 35.4 | 61.3 |
| **OLMo-2-32B-0325-DPO** | 68.8 | 44.1 | 70.2 | 77.5 | 85.7 | 83.8 | 46.8 | 78.0 | 91.9 | 36.4 | 73.5 |
| **OLMo-2-32B-0325-Instruct** | 68.8 | 42.8 | 70.6 | 78.0 | 87.6 | 85.6 | 49.7 | 77.3 | 85.9 | 37.5 | 73.2 |
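
The **Average** column appears to be the unweighted mean of the ten benchmark columns. A minimal sketch of that arithmetic, using the OLMo-2-32B-0325-Instruct row from the table above (this equal-weighting assumption is ours; some rows may be averaged from unrounded scores, so small rounding differences are possible):

```python
# Sketch: recompute the Average column as the unweighted mean of the ten
# benchmark scores (assumption: equal weighting, values as printed in the table).
scores = {
    "2 LC": 42.8, "BBH": 70.6, "DROP": 78.0, "GSM8k": 87.6, "IFEval": 85.6,
    "MATH": 49.7, "MMLU": 77.3, "Safety": 85.9, "PopQA": 37.5, "TruthQA": 73.2,
}  # OLMo-2-32B-0325-Instruct row

average = sum(scores.values()) / len(scores)
print(f"{average:.1f}")  # 68.8, matching the reported Average for this row
```
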
## Benchmark Descriptions

- **2 LC**: AlpacaEval 2 length-controlled (LC) win rate
- **BBH**: BIG-Bench Hard tasks
- **DROP**: Discrete Reasoning Over Paragraphs
- **GSM8k**: Grade School Math 8k problems
- **IFEval**: Instruction Following Evaluation
- **MATH**: Mathematics problem-solving
- **MMLU**: Massive Multitask Language Understanding
- **Safety**: Safety and harmlessness evaluation
- **PopQA**: Popular Question Answering
- **TruthQA**: TruthfulQA (truthfulness in question answering)
## License and use