Abstract
Emotion recognition in text, the task of identifying emotions such as joy or
anger, is a challenging problem in NLP with many applications. One of the
challenges is the shortage of available datasets that have been annotated with
emotions. Certain existing datasets are small, follow different emotion
taxonomies, and have imbalanced emotion distributions. In this work,
we studied the impact of data augmentation techniques applied specifically to
small, imbalanced datasets, on which current state-of-the-art models (such as
RoBERTa) underperform. In particular, we used four data augmentation
methods (Easy Data Augmentation (EDA), static and contextual embedding-based
augmentation, and ProtAugment) on three datasets that come from different sources and vary in
size, emotion categories and distributions. Our experimental results show that
using the augmented data when training the classifier model leads to
significant improvements. Finally, we conducted two case studies: a) directly
using the popular ChatGPT API to paraphrase text with different prompts, and
b) using external data to augment the training set. The results indicate that
both methods are promising.