Evaluation

Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance

Expert persona prompting, which assigns roles such as "expert in math" to language models, is widely used to improve task performance. However, prior work reports mixed results on its effectiveness and does not consider when and why personas should improve …

My Answer is C: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions to limit the response space. The model is then evaluated by …