
The landscape of Arabic Large Language Model (LLM) evaluation has become increasingly complex, with a proliferation of benchmarks and leaderboards. This rapid growth has raised a critical question: are we truly measuring what we intend to measure when assessing Arabic language capabilities?
Enter QIMMA قمّة (Arabic for “summit”), a platform designed to address this challenge systematically. Unlike previous approaches that simply aggregated existing benchmarks, QIMMA runs a rigorous quality validation pipeline before any model evaluation takes place. Our findings were eye-opening: even widely used Arabic benchmarks often contain systematic quality issues that can significantly skew evaluation results.
The Challenge of Evaluating Arabic LLMs
Arabic, a language spoken by over 400 million people across diverse dialects and cultural contexts, presents unique challenges for NLP evaluation. The existing evaluation ecosystem is fragmented and suffers from several pain points that hinder accurate model assessment.
Many Arabic benchmarks are direct translations from English, which can introduce significant distributional shifts. Questions that feel natural in English might become awkward, culturally misaligned, or even nonsensical when directly translated into Arabic, making the data less representative of natural language use. Furthermore, both translated and native Arabic benchmarks frequently lack rigorous quality checks, leading to annotation inconsistencies, incorrect gold answers, encoding errors, and even cultural biases in ground-truth labels.
Reproducibility has also been a major hurdle, with evaluation scripts and per-sample outputs rarely made public. This makes it difficult for researchers to audit results, build upon prior work, or thoroughly understand model failures. Current leaderboards often cover isolated tasks and narrow domains, making a holistic assessment of an LLM’s Arabic proficiency incredibly difficult.
QIMMA stands apart by combining several crucial properties, making it a truly unique and comprehensive resource:
- Open Source: Promoting transparency and collaborative development.
- Predominantly Native Arabic Content: Ensuring cultural and linguistic authenticity.
- Systematic Quality Validation: A core principle for reliable assessment.
- Code Evaluation: Addressing a critical, often overlooked domain.
- Public Per-Sample Inference Outputs: Enhancing reproducibility and detailed analysis.
QIMMA’s Unique Quality Validation Pipeline
The methodological heart of QIMMA lies in its multi-stage quality validation pipeline, applied to every sample across all benchmarks before any model evaluation. This proactive approach ensures that the scores reflect genuine Arabic language capability, not the inherent flaws of the evaluation data.
Our process begins with Stage 1: Multi-Model Automated Assessment. Each sample is independently evaluated by two state-of-the-art LLMs known for their strong Arabic capabilities but distinct training data compositions. These models score each sample against a 10-point rubric, assessing criteria like clarity, cultural appropriateness, factual accuracy, and absence of errors. Samples scoring below 7/10 by either model are flagged for further review.
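The Stage 1 flagging rule described above can be sketched in a few lines. This is a minimal illustration, not QIMMA's actual implementation: the judge-model interface, the rubric field names, and the mean-based aggregation are all assumptions for the sake of the example.

```python
# Hypothetical sketch of the Stage 1 flagging rule: a sample is sent to
# human review if EITHER independent judge model scores it below 7/10.
from dataclasses import dataclass

THRESHOLD = 7  # samples scoring below 7/10 by either judge are flagged


@dataclass
class RubricScore:
    """Assumed rubric criteria; the real rubric may differ."""
    clarity: int
    cultural_appropriateness: int
    factual_accuracy: int
    error_free: int

    def overall(self) -> float:
        # Assumed aggregation: mean of the four rubric criteria.
        return (self.clarity + self.cultural_appropriateness
                + self.factual_accuracy + self.error_free) / 4


def needs_human_review(judge_a: RubricScore, judge_b: RubricScore) -> bool:
    """Flag a sample for Stage 2 if either judge scores it below threshold."""
    return judge_a.overall() < THRESHOLD or judge_b.overall() < THRESHOLD


# Example: judge A rates a sample low (mean 6.5), so it is flagged
# even though judge B rates it highly.
print(needs_human_review(RubricScore(4, 7, 8, 7), RubricScore(9, 9, 9, 9)))
```

The key design point is the `or`: a single dissenting judge is enough to trigger human review, which trades extra annotation work for a lower chance of a flawed sample slipping through.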
Flagged samples then proceed to Stage 2: Human Annotation and Review. Native Arabic speakers, selected for their cultural and dialectal familiarity, meticulously review these samples. They make the final call on correctness, cultural sensitivity, and overall quality, ensuring nuanced judgment, especially where “correctness” might genuinely vary across Arab regions.
This rigorous pipeline uncovered recurring, systematic quality issues across many established benchmarks, highlighting gaps in their original construction. Our analysis revealed thousands of problematic samples, with discard rates varying significantly across benchmarks. The most frequent issue categories were:
- Answer Quality: Issues like false or mismatched gold indices, factually incorrect answers, or missing/raw text answers.
- Data Quality: Corrupt or illegible text, pervasive spelling and grammar errors, or duplicate samples.
- Cultural Bias: Reinforcement of stereotypes and monolithic generalizations about diverse Arab communities.
- Protocol Mismatch: Misalignment of gold answers with the specified evaluation protocols.
For code benchmarks, a different approach was necessary. Instead of discarding samples, we refined the Arabic problem statements in 3LM’s Arabic adaptations of HumanEval+ and MBPP+. This ensured clarity and accuracy in the problem descriptions while keeping the original task identifiers, reference solutions, and test suites completely unchanged.
Unveiling the QIMMA Leaderboard: Key Insights
With a meticulously validated suite of over 52,000 samples consolidated from 14 source benchmarks across 7 domains, QIMMA provides an unprecedentedly reliable platform for evaluating Arabic LLMs. We leverage robust frameworks like LightEval, EvalPlus, and FannOrFlop, standardizing prompting and metrics to ensure consistency and reproducibility.
Our initial evaluation of 46 open-source models, spanning Arabic-specialized and multilingual architectures from 1B to 400B parameters, revealed fascinating insights into their true Arabic capabilities. The leaderboard clearly showcases which models excel in various linguistic and reasoning tasks across domains like Culture, STEM, Legal, Medical, Poetry, and Safety.
- Jais-2-70B-Chat leads overall, achieving a top score of 65.81 and ranking first in Cultural, STEM, Legal, and Safety domains. This demonstrates that dedicated Arabic training can yield significant gains across a broad multi-domain evaluation.
- Qwen2.5-72B-Instruct is a close second (65.75), showcasing strong general-purpose multilingual capability that remains highly competitive even against specialized Arabic models.
- Llama-3.3-70B-Instruct leads in the Medical domain, scoring 55.56, highlighting its robust performance in specialized factual recall despite being a general multilingual model.
- Qwen3.5-27B surprisingly leads in Coding (63.39), suggesting that strong reasoning capabilities can sometimes outweigh sheer model scale for complex tasks.
- gemma-3-27b-it excels in Poetry (59.74), demonstrating an impressive understanding of Arabic poetic language and literary structure.
A persistent challenge remains in coding, where most Arabic-specialized models score below 35, while multilingual models generally perform better. This indicates that Arabic code instruction following is still an open area for improvement. While a clear size-performance correlation emerged across the full leaderboard, demonstrating that larger models generally perform better, QIMMA’s detailed results also highlight intriguing exceptions where smaller or specialized models outperform their larger counterparts in specific domains.
QIMMA represents a significant leap forward in Arabic LLM evaluation by prioritizing data quality above all else. By providing a rigorously validated, open-source platform with public outputs, we aim to foster transparency, reproducibility, and genuinely impactful progress in Arabic AI research.
Explore the full results and dive deeper into the methodology. Your contributions and feedback are invaluable as we continue to refine and expand this critical resource for the Arabic AI community.
Source: Hugging Face Blog