Update README.md

README.md CHANGED

@@ -134,17 +134,16 @@ Four evaluation metrics were employed across all subsets: language quality, over
 - **Overall score:** This metric combined the results from the previous three metrics, offering a comprehensive evaluation of the model's capabilities across all subsets.
 
-| Metric | [Vanila-Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) | [GRAG-NEMO-SFT](https://huggingface.co/avemio/GRAG-NEMO-12B-SFT-HESSIAN-AI) | **[GRAG-NEMO-ORPO](https://huggingface.co/avemio/GRAG-NEMO-12B-ORPO-HESSIAN-AI)** |
-|--------------------------------|--------|--------|--------|
-| Average Language Quality | 85.88 | 89.61 | **89.1** |
-| **OVERALL SCORES (weighted):** | | | |
-| extraction_recall | 35.2 | 52.3 | **48.8** |
-| qa_multiple_references | 65.3 | 71.0 | **74.0** |
-| qa_without_time_difference | 71.5 | 85.6 | **85.6** |
-| qa_with_time_difference | 65.3 | 87.9 | **85.4** |
-| relevant_context | 71.3 | 69.1 | **65.5** |
-| summarizations | 73.8 | 81.6 | **80.3** |
+| Metric | [Vanila-Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) | [GRAG-NEMO-SFT](https://huggingface.co/avemio/GRAG-NEMO-12B-SFT-HESSIAN-AI) | **[GRAG-NEMO-ORPO](https://huggingface.co/avemio/GRAG-NEMO-12B-ORPO-HESSIAN-AI)** | GPT-3.5-TURBO |
+|--------------------------------|--------|--------|--------|--------|
+| Average Language Quality | 85.88 | 89.61 | **89.1** | 91.86 |
+| **OVERALL SCORES (weighted):** | | | | |
+| extraction_recall | 35.2 | 52.3 | **48.8** | 87.2 |
+| qa_multiple_references | 65.3 | 71.0 | **74.0** | 77.2 |
+| qa_without_time_difference | 71.5 | 85.6 | **85.6** | 83.1 |
+| qa_with_time_difference | 65.3 | 87.9 | **85.4** | 83.2 |
+| relevant_context | 71.3 | 69.1 | **65.5** | 89.5 |
+| summarizations | 73.8 | 81.6 | **80.3** | 86.9 |
 
 ## Model Details