Update README.md
README.md
CHANGED
@@ -81,24 +81,41 @@ See the Falcon 180B model card for an example of this.

## Performance

| Model | Average | 2 LC | BBH | DROP | GSM8k | IFEval | MATH | MMLU | Safety | PopQA | TruthQA |
|-------|---------|------|-----|------|-------|--------|------|------|--------|-------|---------|
| **Closed API models** | | | | | | | | | | | |
| GPT-3.5 Turbo 0125 | 59.6 | 38.7 | 66.6 | 70.2 | 74.3 | 66.9 | 41.2 | 70.2 | 69.1 | 45.0 | 62.9 |
| GPT 4o Mini 2024-07-18 | 65.7 | 49.7 | 65.9 | 36.3 | 83.0 | 83.5 | 67.9 | 82.2 | 84.9 | 39.0 | 64.8 |
| **Open weights models** | | | | | | | | | | | |
| Mistral-Nemo-Instruct-2407 | 50.9 | 45.8 | 54.6 | 23.6 | 81.4 | 64.5 | 31.9 | 70.0 | 52.7 | 26.9 | 57.7 |
| Ministral-8B-Instruct | 52.1 | 31.4 | 56.2 | 56.2 | 80.0 | 56.4 | 40.0 | 68.5 | 56.2 | 20.2 | 55.5 |
| Gemma-2-27b-it | 61.3 | 49.0 | 72.7 | 67.5 | 80.7 | 63.2 | 35.1 | 70.7 | 75.9 | 33.9 | 64.6 |
| Qwen2.5-32B | 66.5 | 39.1 | 82.3 | 48.3 | 87.5 | 82.4 | 77.9 | 84.7 | 82.4 | 26.1 | 70.6 |
| Mistral-Small-24B | 67.6 | 43.2 | 80.1 | 78.5 | 87.2 | 77.3 | 65.9 | 83.7 | 66.5 | 24.4 | 68.1 |
| Llama-3.1-70B | 70.0 | 32.9 | 83.0 | 77.0 | 94.5 | 88.0 | 56.2 | 85.2 | 76.4 | 46.5 | 66.8 |
| Llama-3.3-70B | 73.0 | 36.5 | 85.8 | 78.0 | 93.6 | 90.8 | 71.8 | 85.9 | 70.4 | 48.2 | 66.1 |
| Gemma-3-27b-it | - | 63.4 | 83.7 | 69.2 | 91.1 | - | - | 81.8 | - | 30.9 | - |
| **Fully open models** | | | | | | | | | | | |
| OLMo-2-7B-1124-Instruct | 55.7 | 31.0 | 48.5 | 58.9 | 85.2 | 75.6 | 31.3 | 63.9 | 81.2 | 24.6 | 56.3 |
| OLMo-2-13B-1124-Instruct | 61.4 | 37.5 | 58.4 | 72.1 | 87.4 | 80.4 | 39.7 | 68.6 | 77.5 | 28.8 | 63.9 |
| **OLMo-2-32B-0325-SFT** | 61.7 | 16.9 | 69.7 | 77.2 | 78.4 | 72.4 | 35.9 | 76.1 | 93.8 | 35.4 | 61.3 |
| **OLMo-2-32B-0325-DPO** | 68.8 | 44.1 | 70.2 | 77.5 | 85.7 | 83.8 | 46.8 | 78.0 | 91.9 | 36.4 | 73.5 |
| **OLMo-2-32B-0325-Instruct** | 68.8 | 42.8 | 70.6 | 78.0 | 87.6 | 85.6 | 49.7 | 77.3 | 85.9 | 37.5 | 73.2 |
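
The **Average** column appears to be the unweighted mean of the ten benchmark columns. A minimal sketch of that arithmetic, using the OLMo-2-32B-0325-Instruct row from the table above (this equal-weighting assumption is ours; some rows may be averaged from unrounded scores, so small rounding differences are possible):

```python
# Sketch: recompute the Average column as the unweighted mean of the ten
# benchmark scores (assumption: equal weighting, values as printed in the table).
scores = {
    "2 LC": 42.8, "BBH": 70.6, "DROP": 78.0, "GSM8k": 87.6, "IFEval": 85.6,
    "MATH": 49.7, "MMLU": 77.3, "Safety": 85.9, "PopQA": 37.5, "TruthQA": 73.2,
}  # OLMo-2-32B-0325-Instruct row

average = sum(scores.values()) / len(scores)
print(f"{average:.1f}")  # 68.8, matching the reported Average for this row
```
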
## Benchmark Descriptions

- **2 LC**: AlpacaEval 2 length-controlled (LC) win rate
- **BBH**: BIG-Bench Hard tasks
- **DROP**: Discrete Reasoning Over Paragraphs
- **GSM8k**: Grade School Math 8k problems
- **IFEval**: Instruction Following Evaluation
- **MATH**: Mathematics problem-solving
- **MMLU**: Massive Multitask Language Understanding
- **Safety**: Safety and harmlessness evaluation
- **PopQA**: Popular Question Answering
- **TruthQA**: TruthfulQA (truthfulness in question answering)
## License and use