Do LLMs Outperform Fine-tuned Transformers in Emotion Classification?: A Case Study of Llama and RoBERTa on an Emotion Benchmark
Journal article   Open access   Peer reviewed

Timothy Meinert and Anna Koufakou
The International FLAIRS Conference Proceedings, Vol. 39(1)
05-06-2026

Abstract

Generative large language models (LLMs) are often assumed to outperform earlier transformer-based encoders across NLP tasks, yet this assumption has not been adequately tested for emotion classification. Using a recently introduced multi-dataset emotion benchmark, we compare a Llama-based generative model with previously reported results from a fine-tuned RoBERTa classifier. The zero-shot LLM consistently underperforms the fine-tuned classifier, while few-shot prompting substantially improves LLM performance on several datasets. These findings challenge the assumption that LLMs universally surpass older transformers and highlight the continued relevance of fine-tuned models for emotion classification. At the same time, they show that few-shot prompting can unlock competitive LLM performance without task-specific training, though not on all datasets.
Open Access, CC BY-NC 4.0
Published (Version of record) Open
