from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


class Tasks(Enum):
    # Arguments follow the Task fields: benchmark, metric, col_name (leaderboard column).
    task0 = Task("icelandic_winogrande_stringmatch", "exact_match,get-answer", "WinoGrande-IS (3-shot)")
    task1 = Task("icelandic_sentences_ged_stringmatch", "exact_match,get-answer", "GED")
    task2 = Task("icelandic_inflection_all", "exact_match,get-answer", "Inflection (1-shot)")
    task5 = Task("icelandic_belebele", "exact_match,get-answer", "Belebele (IS)")
    task6 = Task("icelandic_arc_challenge", "exact_match,get-answer", "ARC-Challenge-IS")
    task7 = Task("icelandic_wiki_qa", "lm_judge_score,get-answer", "WikiQA-IS")


TITLE = """<h1 align="center" id="space-title">Icelandic LLM leaderboard</h1>"""

INTRODUCTION_TEXT = """
"""

LLM_BENCHMARKS_TEXT = f"""
## New submissions
Do you want your model to be included on the leaderboard? Open a discussion on this repository with the details of your model and we will get back to you.

## Benchmark tasks
The Icelandic LLM leaderboard evaluates models on several tasks. All of them are set up as generation tasks, where the model's output is compared to the expected output.
This means that models that have not been instruction fine-tuned might perform poorly on these tasks.
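
As a rough illustration of what comparing generated output to the expected output can look like, the sketch below scores answers by exact string match after light normalization. The helper names `normalize` and `exact_match_accuracy` and the normalization steps are assumptions made for this example, not the leaderboard's actual evaluation code.

```python
def normalize(text):
    # Assumed normalization: trim surrounding whitespace and ignore case.
    return text.strip().casefold()


def exact_match_accuracy(predictions, references):
    # Fraction of predictions that equal their reference after normalization.
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)


# Hypothetical model outputs and expected answers.
print(exact_match_accuracy(["1", " Rétt ", "2"], ["1", "rétt", "1"]))
```

Under a comparison like this, a model that continues the prompt instead of giving a short answer gets no credit, which is why models without instruction fine-tuning can score poorly.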

The following tasks are evaluated:

### WinoGrande-IS
The Icelandic WinoGrande task is a human-translated and localized version of the roughly 1,000 test-set examples from the English WinoGrande task.
Each example consists of a sentence with a blank and two answer choices for the blank. The task is to choose the correct answer choice using coreference resolution.
The benchmark is designed to test the model's ability to use knowledge and common-sense reasoning in Icelandic. For this benchmark, we use 3-shot evaluation.
The Icelandic WinoGrande dataset is described in more detail in the IceBERT paper (https://aclanthology.org/2022.lrec-1.464.pdf).
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-winogrande
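
To make the example format concrete, the sketch below fills the blank of a sentence with each of the two answer choices. The `_` placeholder, the `fill_blank` helper, and the English sentence are assumptions for illustration. The sentence is invented in the WinoGrande style and is not taken from the dataset.

```python
BLANK = "_"  # assumed placeholder marking the blank in the sentence


def fill_blank(sentence, option):
    # Substitute one answer choice for the single blank in the sentence.
    if sentence.count(BLANK) != 1:
        raise ValueError("expected exactly one blank in the sentence")
    return sentence.replace(BLANK, option)


# Invented English example, used here only for readability.
sentence = "The trophy does not fit in the suitcase because _ is too large."
for option in ("the trophy", "the suitcase"):
    print(fill_blank(sentence, option))
```

Only one of the two completed sentences is coherent, and picking it requires resolving what the blank refers to.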

### GED
This is a benchmark for binary sentence-level Icelandic grammatical error detection. It is adapted from the Icelandic Error Corpus (IEC) and contains 200 examples.
Each example consists of a sentence that may contain one or more grammatical errors, and the task is to predict whether the sentence contains an error.
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-sentences-gec

### Inflection benchmark
The inflection benchmark tests models' ability to generate the inflected forms of 300 Icelandic adjective-noun pairs in all four cases, in both singular and plural.
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-inflection-all-flat

### Belebele (IS)
This is the Icelandic subset (900 examples) of the Belebele benchmark, a multiple-choice reading comprehension task. The task is to answer questions about a given passage.
- Link to dataset: https://huggingface.co/datasets/facebook/belebele

### ARC-Challenge-IS
A machine-translated version of the ARC-Challenge multiple-choice question-answering dataset. For this benchmark, we use the test set, which contains 1.23k examples.
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-arc-challenge

### WikiQA-IS
The Icelandic WikiQA dataset is a collection of 1.9k question-answer pairs from the Icelandic Wikipedia, meant to evaluate models' knowledge of Icelandic culture and history.
The pairs were collected by prompting GPT-4o to generate questions and answers given Icelandic Wikipedia articles as context. All examples were then manually verified and corrected where necessary.
For evaluation, we prompt GPT-4o to compare the generated answer to the original answer for semantic similarity and rate the answer on the following scale: (0, "poor"), (1, "fair"), (2, "excellent").
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic_wiki_qa
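
One way such per-example ratings could be combined into a single score is sketched below. It averages the 0 to 2 ratings and rescales the result to the range 0 to 1. The function name `normalized_judge_score` and this aggregation are assumptions for illustration, not the leaderboard's documented formula.

```python
# Rating scale quoted above: 0 = "poor", 1 = "fair", 2 = "excellent".
MAX_RATING = 2


def normalized_judge_score(ratings):
    # Average the per-example ratings and rescale to the range 0 to 1.
    # This aggregation is an assumption, not the leaderboard's documented formula.
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r not in (0, 1, 2) for r in ratings):
        raise ValueError("each rating must be 0, 1 or 2")
    return sum(ratings) / (MAX_RATING * len(ratings))


# Hypothetical judge ratings for four answers.
print(normalized_judge_score([2, 1, 0, 2]))  # 5 / 8 = 0.625
```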
"""