Prompt Engineering for evaluators: optimizing LLMs to judge linguistic proficiency

Authors

L. Gregori

DOI:

https://doi.org/10.62408/ai-ling.v2i2.22

Keywords:

Large Language Models, Prompt Engineering, LLM-as-a-judge, evaluation

Abstract

Prompt Engineering, the practice of optimizing the input submitted to a Large Language Model, is closely linked to evaluation procedures. Depending on the type of task performed with LLMs, the available evaluation metrics can be more or less reliable, making Prompt Engineering more or less effective. LLM-as-a-judge is a possible way to perform Prompt Engineering on tasks that are hard to evaluate, although the reliability of this practice is not guaranteed and depends on both the task and the language model. This paper presents an evaluation of general-purpose LLMs on an essay-scoring task using state-of-the-art small models. In particular, the ability of the language models to assign proficiency levels to short essays written by Italian L2 learners is evaluated. Test data with expert annotations of CEFR scores are extracted from the Kolipsi-II corpus. Several prompting techniques were used to analyze the impact of Prompt Engineering on this task. Results show wide differences in accuracy among the three LLMs and that choosing the right prompt radically changes their rating abilities.
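To illustrate the kind of LLM-as-a-judge setup described in the abstract, the sketch below shows a minimal zero-shot CEFR-scoring prompt wrapped around a generic model call. The prompt wording, the `call_model` stub, and the label-extraction step are illustrative assumptions for the sake of the example, not the prompts, models, or evaluation pipeline actually used in the study.

```python
import re
from typing import Callable, Optional

# Assumption: the standard six-level CEFR scale; the paper's exact label set may differ.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def build_prompt(essay: str) -> str:
    """Build a zero-shot LLM-as-a-judge prompt for CEFR essay scoring.

    The wording is a hypothetical example of the prompting techniques
    mentioned in the abstract, not the prompt used in the paper.
    """
    return (
        "You are an expert examiner of Italian as a second language.\n"
        "Rate the CEFR proficiency level of the following learner essay.\n"
        f"Answer with exactly one label among: {', '.join(CEFR_LEVELS)}.\n\n"
        f"Essay:\n{essay}\n\nLevel:"
    )


def score_essay(essay: str, call_model: Callable[[str], str]) -> Optional[str]:
    """Send the prompt to a model and extract the first CEFR label it returns.

    `call_model` is any function mapping a prompt string to the model's raw
    text output (e.g. a wrapper around a local small LLM or a hosted API).
    """
    response = call_model(build_prompt(essay))
    match = re.search(r"\b(A1|A2|B1|B2|C1|C2)\b", response)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Usage example with a dummy model that always answers "B1".
    print(score_essay("Ieri sono andato al mercato con mia sorella...", lambda p: "B1"))
```

Comparing labels produced this way against the expert CEFR annotations of a test set (here, essays from the Kolipsi-II corpus) is what allows the accuracy of different prompts and models to be measured.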

Published

2025-07-25

How to Cite

Gregori, L. (2025). Prompt Engineering for evaluators: optimizing LLMs to judge linguistic proficiency. AI-Linguistica. Linguistic Studies on AI-Generated Texts and Discourses, 2(2). https://doi.org/10.62408/ai-ling.v2i2.22