Evaluation
My Answer is C: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions to limit the response space. The model is then evaluated by …