Update README.md

README.md CHANGED

@@ -134,17 +134,16 @@ Four evaluation metrics were employed across all subsets: language quality, over
 - **Overall score:** This metric combined the results from the previous three metrics, offering a comprehensive evaluation of the model's capabilities across all subsets.
 
-| Metric | [Vanila-Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) | [GRAG-NEMO-SFT](https://huggingface.co/avemio/GRAG-NEMO-12B-SFT-HESSIAN-AI) | **[GRAG-NEMO-ORPO](https://huggingface.co/avemio/GRAG-NEMO-12B-ORPO-HESSIAN-AI)** |
-|--------------------------------|--------|--------|--------|
-| Average Language Quality | 85.88 | 89.61 | **89.1** |
-| **OVERALL SCORES (weighted):** | | | |
-| extraction_recall | 35.2 | 52.3 | **48.8** |
-| qa_multiple_references | 65.3 | 71.0 | **74.0** |
-| qa_without_time_difference | 71.5 | 85.6 | **85.6** |
-| qa_with_time_difference | 65.3 | 87.9 | **85.4** |
-| relevant_context | 71.3 | 69.1 | **65.5** |
-| summarizations | 73.8 | 81.6 | **80.3** |
+| Metric | [Vanila-Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) | [GRAG-NEMO-SFT](https://huggingface.co/avemio/GRAG-NEMO-12B-SFT-HESSIAN-AI) | **[GRAG-NEMO-ORPO](https://huggingface.co/avemio/GRAG-NEMO-12B-ORPO-HESSIAN-AI)** | GPT-3.5-TURBO |
+|--------------------------------|--------|--------|--------|--------|
+| Average Language Quality | 85.88 | 89.61 | **89.1** | 91.86 |
+| **OVERALL SCORES (weighted):** | | | | |
+| extraction_recall | 35.2 | 52.3 | **48.8** | 87.2 |
+| qa_multiple_references | 65.3 | 71.0 | **74.0** | 77.2 |
+| qa_without_time_difference | 71.5 | 85.6 | **85.6** | 83.1 |
+| qa_with_time_difference | 65.3 | 87.9 | **85.4** | 83.2 |
+| relevant_context | 71.3 | 69.1 | **65.5** | 89.5 |
+| summarizations | 73.8 | 81.6 | **80.3** | 86.9 |
 
 ## Model Details