Abstract
Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder that affects social interaction, communication, and behavior. Recent studies have shown that vocal and auditory features can be used to identify people with ASD. However, existing approaches are limited by their subjective nature, reliance on expert interpretation, and the laborious process of data gathering. This study presents a deep learning-based technique for ASD detection that uses a hybrid architecture combining a Swin vision transformer with a convolutional neural network (CNN). The Swin transformer extracts high-level features and attention maps from audio samples, and the CNN takes these as input. We trained and evaluated our model on audio samples from children with ASD and typically developing (TD) children, achieving competitive accuracy in distinguishing between the two groups. Our findings suggest that the proposed method has the potential to complement existing diagnostic tools and improve early ASD detection in children. Moreover, our results indicate that deep learning-based techniques, used alongside standard diagnostic tools, can offer a reliable and objective approach to detecting ASD. The proposed model could be deployed in clinical settings as a screening tool to aid early diagnosis of ASD.