L2-Bench
Evaluating AI Capabilities in Language Education
Empowering researchers, institutions, and educators to make informed decisions about AI for language education.
L2-Bench is the first comprehensive, open evaluation benchmark specifically designed for measuring LLM capabilities in second language (L2) education. The project was initiated at Oxford University Press in 2025. Our team of pedagogy experts, data scientists and AI researchers has developed L2-Bench in collaboration with researchers from the University of Oxford, and we are launching it in 2026.
Our Mission
Evaluation benchmarks shape AI: what gets built, what gets improved, and what the world adopts.
As AI adoption in education accelerates, the public capacity to assess AI model performance on educational tasks is more important than ever.
To this end, OUP is openly sharing L2-Bench, a first-of-its-kind evaluation benchmark designed to measure model capabilities across the tasks that make up quality "learning experience design in second language education". The aim is to help establish a global standard for what good, pedagogically led evaluation looks like in AI for language learning and assessment.
We have validated L2-Bench with over 200 education practitioners across 45 countries, representing the dynamics of global pedagogy. Our peer-reviewed research methodology, benchmark dataset of over 1,000 items, and L2-Bench leaderboard will be made available shortly. The dataset and leaderboard will be actively maintained to include the latest model results and to support an AI for Education evaluation ecosystem.
L2-Bench gives education stakeholders better methods for making informed decisions about AI for Education adoption, use, and governance, while advancing the maturing science of AI evaluations for education. OUP will use it to rigorously validate our own use of LLMs in our products, ensuring that the LLMs we employ are appropriately evaluated for use specifically in English language teaching contexts.
For Educators
Practical AI capability assessment for specific teaching scenarios.
For Institutions
Informed decision-making when selecting AI tools.
For the Field
Accelerated development of effective AI-powered educational systems.
L2-Bench Contributions
Competency-based
A "learning experience designer in second language education" construct spanning 12 core competencies and 31 sub-competencies, derived from established language teaching frameworks and used to create tasks.
Validated Dataset
Over 1,000 rubric-scored task-response ("Q&A") pairs curated in collaboration with pedagogical experts and validated by over 200 global practitioners to ensure alignment with authentic education contexts.
Open Leaderboard
Transparent rankings of frontier and top open-source AI models, reported with statistical uncertainty quantification, to help the community track AI capabilities in language education.
LLM-as-a-Judge Scoring
Integrates recent methodologies for state-of-the-art AI evaluations with automated scoring systems calibrated by expert practitioner scoring.
Context-specific Methodology
Reproducible methodology allowing systematic task and rubric creation for context-specific evaluations across diverse education scenarios.
Peer-reviewed Research
Papers co-authored with researchers from the University of Oxford on L2-Bench methods, validation and results submitted for conference publication.
L2-Bench Tasks
Explore a "toy example" of an L2-Bench task item: this example item demonstrates how a general-purpose AI model response is evaluated against a rubric of binary criteria (yes/no) for a task built around the "Lesson Planning" competency. Currently, L2-Bench tasks comprise UK/US L2 English learning experiences.
L2-Bench Competencies
Explore the 12 core competencies and their sub-competencies that define an effective "learning experience designer in second language education". This term encompasses the range of roles that intentionally design the conditions that shape how people learn (teachers, materials developers such as content or assessment creators, learning designers, and teacher trainers) and aims to capture what is needed to support language learning effectively.
Competency Hierarchy
L2-Bench Task Rubrics
From competency construct to task rubrics: explore a "toy demo" of how we systematically build tasks and their rubrics around a single competency, enabling granular criteria whilst maintaining a connection to the broader "consensus" in language pedagogy.
L2-Bench Scoring
Explore how we evaluate AI responses to tasks at scale. For each response to a task, a calibrated autoscorer (LLM-as-a-Judge) determines, for every criterion in the task rubric, whether that criterion is present in the response (Yes/No). Points are applied only if the criterion is present. This keeps scoring simple and consistent, and enables reliable automated scoring of open-ended responses from multiple AI models across 1,000+ tasks.
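The binary-criterion scoring described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the L2-Bench implementation: the Criterion class, the score_response function, the point values, and the example criteria are all hypothetical, and the Yes/No judgments are hard-coded here where the real pipeline would obtain them from the calibrated autoscorer. Reporting the result as a fraction of available points is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One binary rubric criterion (hypothetical structure)."""
    description: str
    points: int  # points applied only if the criterion is present


def score_response(rubric, judgments):
    """Sum the points of criteria judged present (True) in a response.

    `judgments` holds one Yes/No (True/False) decision per criterion,
    in rubric order. Returns (earned, available, fraction); normalising
    to a fraction is an assumption made for this sketch.
    """
    if len(rubric) != len(judgments):
        raise ValueError("need exactly one judgment per rubric criterion")
    available = sum(c.points for c in rubric)
    if available <= 0:
        raise ValueError("rubric must carry a positive number of points")
    earned = sum(c.points for c, present in zip(rubric, judgments) if present)
    return earned, available, earned / available


# Hypothetical rubric for a lesson-planning task.
rubric = [
    Criterion("States a measurable learning objective", 2),
    Criterion("Sequences activities from controlled to free practice", 1),
    Criterion("Includes a check for understanding", 1),
]

# Stand-in for the autoscorer's Yes/No decisions on one model response.
judgments = [True, False, True]

earned, available, fraction = score_response(rubric, judgments)
print(f"{earned}/{available} points ({fraction:.0%})")
```

Because each criterion is a single Yes/No decision, the same rubric can be applied to responses from many models without changing the scoring code; only the judgments differ.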
Benchmark Your AI for Language Education
"Register Interest" to be among the first notified about the L2-Bench dataset release, receive updates, and join a community of researchers, educators, and institutions evaluating AI for language education. We welcome general enquiries, collaboration opportunities, and feedback.