Abstract
Transformer-based architectures have become the dominant approach for a wide array of machine learning tasks, including those in computer vision. Consequently, the prevalence of purely convolutional networks, particularly shallow architectures for classification, has declined. In this work, we revisit Convolutional Neural Networks (CNNs) and propose a modern hybrid architecture that integrates Transformer-inspired components. Specifically, we introduce MLP Fusion, a model that incorporates Multi-Layer Perceptron (MLP) blocks, similar to those used in Vision Transformers, into CNN backbones prior to the classification stage. Additionally, we include intermediate $1 \times 1$ convolutional layers within the backbone. This fusion is intended to enhance the representational capacity of CNNs by enriching their embedding space. Experimental evaluations on the CIFAR-10 and CIFAR-100 datasets show that MLP Fusion outperforms compact CNN models reported in the literature.
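As a rough illustration of the idea described above, a minimal PyTorch sketch of such a hybrid follows. The layer widths, block count, and placement of the $1 \times 1$ convolutions are assumptions for illustration only, not the paper's reported configuration.

```python
# Hypothetical sketch of the MLP Fusion idea: a small CNN backbone with
# intermediate 1x1 convolutions, followed by a ViT-style MLP block before
# the classification head. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """ViT-style MLP block: LayerNorm -> Linear -> GELU -> Linear, with residual."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fc2(self.act(self.fc1(self.norm(x))))

class MLPFusion(nn.Module):
    """Illustrative CNN backbone with 1x1 convs and an MLP block before the head."""
    def __init__(self, num_classes: int = 10, dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=1), nn.ReLU(),    # intermediate 1x1 conv
            nn.MaxPool2d(2),
            nn.Conv2d(64, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=1), nn.ReLU(),  # intermediate 1x1 conv
            nn.AdaptiveAvgPool2d(1),
        )
        self.mlp = MLPBlock(dim, hidden_dim=4 * dim)  # ViT-like 4x expansion
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x).flatten(1)  # (B, dim) embedding
        return self.head(self.mlp(feats))

# Example: classify a batch of CIFAR-sized (32x32 RGB) images.
logits = MLPFusion(num_classes=10)(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```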