Comparative evaluation of GPT‑OSS‑20B vs GPT‑OSS‑120B on the MMLU, ArabicMMLU and ILMAAM benchmarks
Introduction
Large language models (LLMs) are typically evaluated on multi‑task multiple‑choice benchmarks to assess how well they can reason across a wide range of domains. The Measuring Massive Multitask Language Understanding (MMLU) benchmark is one of the most widely used evaluation suites for this purpose. It contains 15,908 multiple‑choice questions across 57 subjects ranging from STEM fields and international law to nutrition and religion. MMLU was released in 2020 and quickly became a standard benchmark for comparing the knowledge and reasoning capabilities of large language models. Despite its widespread use, MMLU is known to contain ground‑truth errors: a manual analysis suggested that about 6.5 % of questions contain mistakes or have multiple correct answers, meaning perfect accuracy is unattainable.
Because most benchmarks are English‑centric, researchers have started constructing multilingual alternatives. ArabicMMLU mirrors the structure of the MMLU benchmark but focuses on Modern Standard Arabic. It contains 14,575 multiple‑choice questions collected from school exams in eight Arabic‑speaking countries (Morocco, Egypt, Jordan, Palestine, Lebanon, UAE, Kuwait and Saudi Arabia). Questions span 40 tasks covering STEM, social sciences, humanities and language subjects; more than half of them involve Arabic‑specific content such as history, geography and law. Each question provides two to five answer options, and the correct answer is marked. The creators hired native Arabic speakers to collect and verify the data, achieving a reported data accuracy of 96 %.
The ILMAAM evaluation (Index for Language Models for Arabic Assessment on Multitasks) is a smaller benchmark built to assess language models’ knowledge of Arabic and Islamic topics. It includes subjects such as Islamic ethics, Islamic history, Islamic religion and old Arab history. The dataset is not yet widely documented in the literature, but the per‑subject results available for both models allow us to compare their performance on this evaluation.
This report compares the performance of two open‑source generative models—GPT‑OSS‑20B and GPT‑OSS‑120B—on the MMLU, ArabicMMLU and ILMAAM benchmarks. GPT‑OSS‑120B is a much larger model than GPT‑OSS‑20B (120 billion parameters vs. 20 billion), so we expect it to demonstrate improved knowledge and reasoning abilities. All evaluations were carried out in a zero‑shot setting using multiple‑choice accuracy as the metric.
Data and methodology
The evaluation results for each model were provided in JSON files. For the MMLU benchmark we obtained per‑subject accuracies and overall averages for each model. For ArabicMMLU we used the subset accuracies corresponding to coarse subject groups (e.g., Arabic Language (General) and Islamic Studies (High School)). The ILMAAM files contained average accuracy and per‑subject results. We parsed these JSON files, calculated differences between models, and produced visualisations illustrating performance trends. Wherever possible we computed averages across subsets to obtain an overall picture.
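To make this workflow concrete, the sketch below loads two hypothetical per‑subject accuracy files and computes the per‑subject and average improvements. It is a minimal illustration: the file names and the flat {subject: accuracy} JSON layout are assumptions, not the exact schema of the result files used for this report.

```python
# Minimal sketch of the comparison workflow. The file names and the flat
# {subject: accuracy-in-percent} JSON layout are illustrative assumptions.
import json

def load_accuracies(path: str) -> dict[str, float]:
    """Load a {subject: accuracy} mapping from a results file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

small = load_accuracies("gpt_oss_20b_mmlu.json")    # hypothetical file name
large = load_accuracies("gpt_oss_120b_mmlu.json")   # hypothetical file name

# Per-subject improvement (120B minus 20B) in percentage points.
deltas = {s: large[s] - small[s] for s in sorted(set(small) & set(large))}

avg_delta = sum(deltas.values()) / len(deltas)
print(f"Average improvement: {avg_delta:.2f} percentage points")
for subject, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{subject:40s} {delta:+.2f}")
```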
Results on the MMLU benchmark
Overall performance. GPT‑OSS‑20B achieved an average accuracy of 74.88 %, while GPT‑OSS‑120B scored 83.52 %, yielding an overall improvement of 8.64 percentage points. The distribution of improvements across all 56 subjects (excluding the overall average) is shown below. Most subjects benefit from the larger model, though a few (e.g., virology and abstract algebra) show slight declines.
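A histogram of these per‑subject deltas can be produced with a few lines of matplotlib, reusing the `deltas` mapping from the sketch above. This is again an illustrative sketch, not the exact plotting code behind the original figure.

```python
# Sketch of the improvement-distribution plot, reusing `deltas` from above.
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.hist(list(deltas.values()), bins=15, edgecolor="black")
plt.xlabel("Improvement (percentage points, 120B minus 20B)")
plt.ylabel("Number of subjects")
plt.title("Distribution of per-subject MMLU improvements")
plt.tight_layout()
plt.show()
```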
Subject‑level differences. The following chart plots the top 15 MMLU subjects with the largest improvements when moving from 20B to 120B. Each pair of bars shows the accuracy of GPT‑OSS‑20B (blue) and GPT‑OSS‑120B (orange) on a given subject.
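A paired‑bar chart of this kind can be sketched as follows, assuming the `small`, `large` and `deltas` mappings from the earlier snippet; colours and layout are illustrative rather than a reproduction of the original figure.

```python
# Sketch of the paired-bar chart for the 15 subjects with the largest gains.
import numpy as np
import matplotlib.pyplot as plt

top = sorted(deltas, key=deltas.get, reverse=True)[:15]   # 15 largest gains
x = np.arange(len(top))
width = 0.4

fig, ax = plt.subplots(figsize=(12, 5))
ax.bar(x - width / 2, [small[s] for s in top], width, label="GPT-OSS-20B")
ax.bar(x + width / 2, [large[s] for s in top], width, label="GPT-OSS-120B")
ax.set_xticks(x)
ax.set_xticklabels(top, rotation=45, ha="right")
ax.set_ylabel("Accuracy (%)")
ax.set_title("Top 15 MMLU subjects by improvement (120B vs 20B)")
ax.legend()
fig.tight_layout()
plt.show()
```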
The largest gains occur in subjects that require specialised knowledge or deeper reasoning. For example, anatomy improves by 38 points (from 46 % to 84 %), professional accounting by 30 points (58 % → 88 %), and clinical knowledge by 26 points. Improvements of 20 points or more also occur in management, marketing, high‑school physics and astronomy. Only a handful of subjects show small declines (e.g., virology drops by 6 points). These declines may reflect noise in the MMLU benchmark; recall that more than 50 % of questions in the virology subset were reported to have errors.
Results on the ArabicMMLU benchmark
The chart below compares the two models on the nine ArabicMMLU subsets. For each subset we compute the improvement as the difference between the 120B and 20B accuracies.
GPT‑OSS‑120B consistently outperforms its smaller counterpart on every subset. The average accuracy across all subsets increases from ~58 % to 74.5 %, an improvement of 16.25 percentage points. The largest gain appears in Arabic Language (Middle School), which jumps by 48 points, suggesting that the larger model handles intermediate‑level Arabic language tasks much better. Significant improvements also occur in Islamic Studies (+19.7 points) and Arabic Language (Primary School) (+16.7 points). Even the smallest improvement, about 6.7 points in Islamic Studies (Middle School), is notable. These results demonstrate that scaling up the model substantially enhances its ability to answer Arabic multiple‑choice questions across linguistic and religious domains.
Results on the ILMAAM benchmark
The ILMAAM benchmark contains five subjects centred on Islamic and Arab topics. The following chart shows the accuracies of both models and the corresponding improvements.
GPT‑OSS‑120B surpasses the 20‑billion‑parameter model across all ILMAAM subjects. The largest improvement is in Islamic History (+18.8 points), followed by Islamic Ethics (+15.5 points) and Islamic Religion (+14.4 points). Even the smallest gain—about 9 points in Educational Methodologies—is substantial. Averaging over subjects, the accuracy rises from 72.74 % to 87.38 %, an improvement of 14.64 percentage points. The near‑saturation performance (close to 100 % on Islamic Ethics) suggests that the 120B model has learned many of the factual aspects of these topics.
Discussion
The comparison between GPT‑OSS‑20B and GPT‑OSS‑120B across three benchmarks highlights the benefits of model scaling. On the large English‑centric MMLU benchmark, the 120B model delivers a relative improvement of about 11.5 % (8.64 absolute points), with the biggest gains in subjects that require specialised biomedical or professional knowledge. The histogram of improvements shows that most subjects benefit, while small declines correlate with known data quality issues in MMLU.
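For reference, the relative figure follows directly from the average accuracies reported above; a quick check:

```python
# Quick check of the relative-improvement figure, using the average
# MMLU accuracies reported in this report.
base, scaled = 74.88, 83.52            # GPT-OSS-20B vs GPT-OSS-120B averages
absolute = scaled - base               # 8.64 percentage points
relative = 100 * absolute / base       # ~11.5 % relative to the 20B baseline
print(f"absolute: {absolute:.2f} pts, relative: {relative:.1f} %")
```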
Performance gains are even more pronounced on the Arabic‑focused benchmarks. ArabicMMLU covers 40 tasks and uses authentic school exam questions from eight countries. GPT‑OSS‑120B achieves large improvements on all nine high‑level subsets, including a 48‑point jump on Arabic Language (Middle School). The improvements are likely due to the larger model’s increased capacity and more extensive exposure to Arabic data during training. The ILMAAM results mirror this trend: despite the dataset’s narrow focus, the 120B model consistently outperforms the 20B model, even approaching perfect scores on some subjects.
It is important to interpret these results in the context of each benchmark’s limitations. MMLU contains a non‑negligible fraction of erroneous questions, so raw accuracies may not reflect true reasoning ability. ArabicMMLU is comprehensive but still focuses primarily on Modern Standard Arabic and over‑represents certain countries; performance on other dialects or countries may differ. ILMAAM is still an emerging benchmark with little public documentation, so its coverage and difficulty are unclear. Nevertheless, the consistent improvements observed here suggest that scaling up model size substantially enhances multilingual and domain‑specific knowledge.
Conclusion
Across the MMLU, ArabicMMLU and ILMAAM benchmarks, GPT‑OSS‑120B outperforms GPT‑OSS‑20B by a wide margin. Average accuracy improvements range from about 8.6 percentage points on MMLU to over 16 points on ArabicMMLU. The biggest gains occur in specialised subjects like anatomy, professional accounting and middle‑school Arabic language. These results underscore two key insights: (1) increasing model size leads to broad improvements in knowledge and reasoning, and (2) larger models are particularly beneficial for non‑English evaluations, where additional capacity helps bridge linguistic and cultural gaps. Future work should continue to refine evaluation benchmarks—especially to reduce errors in MMLU and expand the coverage of Arabic evaluations—and explore ways to achieve similar performance gains without exponentially scaling model size.
By: Omer Nacar www.omarai.me