Methods

How can we decide whether an AI tool or system is effective for language learning?

Ben Knight

27 May 2026

What is this all about?

From Day 1 of ChatGPT, it was clear that generative AI was going to change the way we learn languages. Students all over the world would have free access to tools that can communicate in the target language as naturally as humans. Millions of us involved in language education have experimented with various AI-based tools over the past couple of years. But how can we reliably know how well each tool supports language learning? Are we simply the pawns of marketing hype, or blindly stumbling around the AI store trying out a bit of everything?

Oxford University Press, in collaboration with researchers from the University of Oxford, have set out to address this by creating an open-source benchmark tool to aid the pedagogy-led evaluation of AI-based tools relating to language learning and assessment.

The project is highly ambitious in that it does not limit itself to a narrow set of activities that are easy to define and evaluate for an AI tool — such as “produce a worksheet for this grammar point”, “create a test for this listening skill” or “write a lesson plan for teaching this set of vocabulary”. It aims to be able to evaluate any AI-based activity that would help language learning in the future, even if it has not been developed yet.

Human communication is complicated to start with, making it difficult to define what the ideal performance would be for a learner on a communication task, at a particular stage of their learning journey. Language use is about skill development as well as knowledge, and is not always linear in its progress. On top of that, language learning involves dimensions of social relationship and personal identity which can be difficult to identify and manage. Human teachers need to manage these social-emotional aspects of learning, as well as discerning the best way to support the cognitive demands of mastering a new language.

What is L2-Bench?

At the core of ‘L2-Bench’, as it is affectionately called, is a framework of 12 core competencies which aim to capture what is needed to support language learning effectively. The framework draws substantially on teacher competency models such as the EAQUALS Framework for Language Teacher Training and the British Council’s Continuing Professional Development Framework.

The main difference is that this model is designed to capture all the competencies an AI system would need to act as a language learning facilitator — some of which are so natural to human teachers that they don’t need to be included in a PD framework: e.g. act as a conversational exchange partner. It also draws on the Oxford Principles of Language Learning, a framework of pedagogical principles that summarises what language learning research has indicated as being most important for successful language learning. The final L2-Bench competency framework has 31 detailed sub-competencies, grouped into the 12 top-level core competencies.

The L2-Bench competency framework encompasses the range of roles that intentionally design the conditions that shape how people learn.

L2-Bench is also ambitious in that it doesn’t test knowledge about how to teach; it evaluates an AI’s ability to perform the kinds of professional tasks real practitioners carry out every day. So we designed over 1,000 authentic tasks representing real scenarios educators face.

A demo L2-Bench task item built around the “Lesson Planning” competency, including an AI response, and the rubric used to score it.

One strand of the validation has been evaluating how authentic those tasks are, and how well they capture the range of tasks that teachers around the world face. For example, we have created a framework of “teaching context factors”, so that we can systematically ensure the tasks represent different teaching contexts.

The next component of the Bench is the set of criteria used to evaluate the performance of AI tools on those tasks. Again, it was really important to take a systematic approach to defining these criteria in a transparent way. The criteria represent three different views of the task:

Universal criteria that apply to all tasks — such as whether the response is appropriate for the age of the learners.
Consensus criteria that apply to any task focused on a particular competency — for example, when an AI must provide feedback to a student, a consensus criterion is “feedback includes explanations, models or hints that the learner can use to improve”.
Task-specific criteria that define additional points the response must meet (or avoid) to complete the task successfully.

From these detailed rubrics, the responses to the tasks can be evaluated in a meaningful, consistent way.

How a task rubric is built: create a task around a single competency, take consensus criteria from its tagged sub-competencies, then complete the rubric based on task context.
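For readers who think in code, the three layers of criteria described above can be sketched as a small data structure. This is only an illustrative sketch, not the L2-Bench implementation: the names `Criterion`, `build_rubric` and `score`, the sub-competency key `giving_feedback`, the task-specific criterion, and the "fraction of criteria met" scoring rule are all assumptions made for this example. The two criterion texts for the universal and consensus layers are taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    text: str
    layer: str  # "universal", "consensus" or "task-specific"


# Universal criteria apply to every task (example taken from the article).
UNIVERSAL = ["Response is appropriate for the age of the learners"]

# Consensus criteria are looked up by sub-competency.
# The key "giving_feedback" is a hypothetical label for this sketch.
CONSENSUS = {
    "giving_feedback": [
        "Feedback includes explanations, models or hints "
        "that the learner can use to improve"
    ],
}


def build_rubric(sub_competencies, task_specific):
    """Combine the three layers into one ordered rubric."""
    rubric = [Criterion(t, "universal") for t in UNIVERSAL]
    for name in sub_competencies:
        rubric += [Criterion(t, "consensus") for t in CONSENSUS.get(name, [])]
    rubric += [Criterion(t, "task-specific") for t in task_specific]
    return rubric


def score(rubric, judgements):
    """Fraction of criteria judged as met.

    `judgements` maps criterion text to True/False; a missing entry counts
    as not met. This simple aggregation rule is an assumption of the sketch.
    """
    met = sum(1 for c in rubric if judgements.get(c.text, False))
    return met / len(rubric)


# Hypothetical task-specific criterion for a feedback task.
rubric = build_rubric(
    ["giving_feedback"],
    ["Feedback addresses the learner's past-tense errors"],
)
judgements = {
    rubric[0].text: True,
    rubric[1].text: True,
    rubric[2].text: False,
}
result = score(rubric, judgements)
```

Keeping the layer name on each criterion means a low score can be traced back to whether a response failed a universal, consensus or task-specific expectation.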

Why is this important for language education around the world?

1. Making informed decisions about AI tools

If you’re a teacher or manager wondering whether to adopt an AI assistant, L2-Bench offers evidence grounded in your professional reality — not generic metrics based on other types of activity.

2. Helping EdTech teams build better products

Developers rarely know which aspects of pedagogy are hardest to get right. This framework gives concrete targets and reveals where models genuinely struggle.

3. Supporting safe, effective experimentation

Instead of assuming AI “works” or “doesn’t work”, we can finally ask: For which teaching tasks does it work? For whom? Under what conditions?

4. Elevating professional expertise

Rather than replacing teachers, this benchmark makes their expertise the foundation of how we evaluate AI. It translates tacit professional knowledge into testable, transparent criteria.

What we have found so far

We started with a pilot study — run in partnership with the University of Birmingham, UK — to get an independent assessment of the authenticity of the tasks and the appropriacy of the evaluation criteria. The results from the 39 postgraduate evaluators pointed to improvements we needed to make:

Tasks were rated as highly authentic (average 4.24/5).
Criteria for evaluating AI responses were seen as good, but with room for refinement.
Some professional areas — especially feedback, language presentation and assessment — proved particularly demanding for consistent scoring.
Even human evaluators often disagreed, highlighting just how complex these teaching decisions really are.
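To make the idea of evaluator disagreement concrete, the sketch below computes Cohen's kappa, one widely used statistic for agreement between two raters that corrects for agreement expected by chance. The article does not say which agreement measure the pilot used, so this is an illustration under that assumption, and the two lists of met / not met judgements are invented example data, not study results.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters judging the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must judge the same non-empty item set")
    n = len(rater_a)

    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both pick the same label
    # if each rater labelled at random with their own label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented example: 1 = criterion met, 0 = criterion not met, for eight responses.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0]
rater_b = [1, 0, 0, 1, 1, 1, 1, 0]
kappa = cohens_kappa(rater_a, rater_b)
```

Here the raters agree on six of eight items (75%), but kappa is lower because some of that agreement would be expected by chance alone.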

We then set up a more extensive validation project with over 200 language education practitioners from around the world — including teachers, assessors, learning designers and advanced learners from a wide range of countries. For this study, we asked participants to rate task authenticity and criteria adequacy as before, but for a much larger set of tasks. We also asked them to compare AI-generated and expert reference responses for each task. Finally, we asked them to score the AI responses against the rubric. We’re planning to publish the results of this study in June 2026.

Read more about our methods in our paper: Towards an Evaluation Methodology for AI in Second Language Education: Lessons Learned from Developing L2-Bench.


Get involved

The L2-Bench project is designed for practitioners, not just researchers. Oxford University Press will be making the benchmark freely available as an open-source tool that can be used by anyone involved in language education, and we would love to engage with as wide a range of practitioners as possible.

If you’re curious, sceptical, excited — or all three — you’re exactly the kind of voice we want involved.

If you’d like to contribute to the next validation phase, or simply want updates as we release findings, please get in touch via the “Register Interest” form below.

AI in education is moving fast. Let’s make sure our evaluations keep up — and reflect the real work teachers do every day.