Scholarship list
Journal article
End-to-end human parsing and detection optimized for resource-constrained devices
Published 12-10-2025
Scientific Reports, 16, 1, 943
Human parsing, a vital task in human-centric analysis, involves segmenting clothing and body parts for individual association. Existing methods often rely on auxiliary inputs like detection and edge prediction, limiting their suitability for resource-constrained devices. To address this, we propose an end-to-end framework that integrates a transformer-based self-attention module to enhance contextual understanding while being optimized for low-resource environments. We also introduce bounding-polygon annotations to facilitate simultaneous detection and parsing. Our method achieves fine-grained results in a single pass, significantly improving inference speed without sacrificing accuracy. Real-world validation on Raspberry Pi demonstrates its effectiveness and efficiency in resource-constrained scenarios.
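The abstract does not detail the internal design of the self-attention module; as a rough illustration of the mechanism such a module builds on, here is a minimal numpy sketch of scaled dot-product self-attention over flattened feature-map positions (the dimensions and random projections are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of feature vectors.

    x: (n, d) array of n tokens (e.g. flattened feature-map positions).
    w_q, w_k, w_v: (d, d) query/key/value projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])       # (n, n) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # each row is a distribution
    return attn @ v                              # context-weighted values

rng = np.random.default_rng(0)
n, d = 16, 8                                     # 16 positions, 8 channels
x = rng.standard_normal((n, d))
out = self_attention(x, *(rng.standard_normal((d, d)) for _ in range(3)))
```

Each output position is a weighted mix of all positions, which is what gives the module its global contextual view at modest cost on small feature maps.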
Conference proceeding
Real-Time Animal Pose Estimation Using Computer Vision Techniques
Published 10-29-2025
International Symposium on Innovations in Intelligent Systems and Applications (Online), 1 - 6
Real-time estimation of animal posture is an area of increasing interest in computer vision, with applications in animal monitoring, veterinary diagnostics, behavioral analysis, and robotics. In this work, we propose a deep learning-based method to estimate animal poses in real time. The study centres on four animal classes selected for their applicability in both home and agricultural settings: chicken, dog, horse, and cow. An important output of this work is a bespoke dataset of annotated videos created especially for pose estimation tasks. Every video in the dataset runs for fifteen seconds and records continuous movement under different lighting and environmental conditions. Keypoints marking important body joints were annotated on the extracted frames for all animal classes. We performed pose estimation using the YOLOv8n-Pose architecture, which provides a balanced trade-off between speed and accuracy. Although YOLO models are usually optimized for object recognition, we fine-tuned YOLOv8n-Pose to predict both bounding boxes and body keypoints, enabling real-time identification of intricate postural information. Trained on the annotated dataset using supervised learning, the model was evaluated on a held-out test set from the same distribution. Experimental results show that the proposed model achieves a PosePR mAP of 99.5% at an IoU threshold of 0.5 across all classes. The dog class showed lower precision and F1 scores, and the dog and horse classes showed decreased recall. The model maintains strong real-time performance even though interclass posture variability and occlusion in video frames present natural difficulties, handling video input at an average frame rate sufficient for live monitoring systems. This study emphasizes the viability of YOLOv8n-Pose for keypoint-based animal posture estimation and the need for custom datasets tailored to real-world activities.
Future directions include growing the dataset, improving keypoint accuracy, and incorporating temporal consistency across frames. The dataset is available at https://drive.google.com/drive/folders/1xci52bt9IxcYQrq36r2fQBaLvx3cSGHH#
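Both this entry and several below report mAP at an IoU threshold of 0.5. As a minimal sketch of the box-overlap test underlying that metric (the box coordinates are illustrative):

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A predicted box counts as a true positive at mAP@0.5 when its IoU with a
# same-class ground-truth box is at least 0.5.
pred, gt = (0, 0, 10, 10), (2, 0, 10, 10)
matched = box_iou(pred, gt) >= 0.5   # IoU = 80/100 = 0.8 here
```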
Conference proceeding
Automatic Insect Pest Identification and Recognition for Paddy Crops Pest Control
Published 10-13-2025
International Workshops on Image Processing Theory, Tools, and Applications, 1 - 6
Agriculture is one of the main economic pursuits across Bangladesh's 64 districts: about seventy percent of the country's workforce relies on agriculture for its living. Rice farming contributes substantially to Bangladesh's gross national revenue, but insect pest attacks have a great impact on rice harvests. Different insect pests require different management measures, so accurate identification of paddy field insects is a crucial task: it allows, e.g., the application of the appropriate poison for a specific insect pest and prevents the wasteful use of ineffective insecticides. The main challenge addressed in this work is to detect and instantly segment small harmful insects in paddy fields. To address this problem, we use a deep convolutional neural network (DCNN) based on Mask R-CNN, enabling a technique for visual localisation and classification of agricultural pest insects. We have also developed our own dataset of annotated images of harmful insects. The proposed Mask R-CNN model uses a ResNet101 backbone and can detect and segment at the same time. The model achieves an AP@0.5 of 85.7%, a mAP of 63.8%, and an AR@10 of 68.5, yielding an anticipated accuracy of 75%. ResNet101 performs better on all measures. The suggested approach should be able to identify and classify small harmful insects with suitable accuracy in real-world deployments.
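The AP@0.5 and mAP figures above are areas under per-class precision-recall curves. A minimal sketch of how per-class average precision can be computed once detections have been matched to ground truth at IoU 0.5 (the matching step and the example detections are assumptions for illustration):

```python
def average_precision(detections, n_gt):
    """Average precision for one class. `detections` is a list of
    (confidence, is_true_positive) pairs; `n_gt` is the number of
    ground-truth instances. TP/FP matching (e.g. at IoU 0.5) is
    assumed to have been done upstream."""
    detections = sorted(detections, key=lambda d: -d[0])  # high confidence first
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in detections:                 # sweep the confidence threshold
        tp += 1 if is_tp else 0
        fp += 0 if is_tp else 1
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))
    # make precision monotonically non-increasing (the "precision envelope")
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # area under the envelope over recall
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# two true positives out of two ground truths, one false positive in between
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], n_gt=2)
```

mAP is then simply this value averaged over classes (and, for COCO-style mAP, over IoU thresholds as well).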
Conference proceeding
MLP Fusion: Revisiting Convolutional Networks with Transformer-Based Insights
Published 10-13-2025
International Workshops on Image Processing Theory, Tools, and Applications, 1 - 6
Transformer-based architectures have become the dominant approach for a wide array of machine learning tasks, including those in computer vision. Consequently, the prevalence of purely convolutional networks, particularly shallow architectures for classification, has been in decline. In this work, we revisit Convolutional Neural Networks (CNNs) and propose a modern hybrid architecture that integrates Transformer-inspired components. Specifically, we introduce MLP Fusion, a model that incorporates Multi-Layer Perceptron (MLP) blocks, similar to those used in Vision Transformers, into CNN backbones prior to the classification stage. Additionally, we include intermediate 1×1 convolutional layers within the backbone. This fusion is intended to enhance the representational capacity of CNNs by enriching their embedding space. Experimental evaluations on the CIFAR-10 and CIFAR-100 datasets show that MLP Fusion achieves better performance compared to compact CNN models reported in the literature.
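The abstract does not give the exact layer dimensions, so as a hedged sketch of the two ingredients it names: a 1×1 convolution reduces to per-position channel mixing, and a ViT-style MLP block expands and re-projects an embedding (the shapes and the ReLU activation here are illustrative assumptions):

```python
import numpy as np

def conv1x1(x, w, b):
    """1x1 convolution: an (h, w, c_in) feature map mixed per position into
    c_out channels -- equivalent to a matmul over the channel axis."""
    return x @ w + b                     # w: (c_in, c_out), b: (c_out,)

def mlp_block(x, w1, b1, w2, b2):
    """ViT-style MLP block on a pooled embedding: expand, nonlinearity
    (ReLU here for simplicity), project back."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

rng = np.random.default_rng(1)
fmap = rng.standard_normal((8, 8, 32))                    # backbone feature map
mixed = conv1x1(fmap, rng.standard_normal((32, 64)),
                rng.standard_normal(64))                  # (8, 8, 64)
pooled = mixed.mean(axis=(0, 1))                          # global average pool
emb = mlp_block(pooled,
                rng.standard_normal((64, 128)), rng.standard_normal(128),
                rng.standard_normal((128, 64)), rng.standard_normal(64))
```

The enriched embedding `emb` would then feed the classification head; placing the MLP block before that head is the fusion point the abstract describes.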
Conference proceeding
Published 09-10-2025
Innovations in Intelligent Systems and Applications Conference (Online), 1 - 6
2025 Innovations in Intelligent Systems and Applications Conference, 09-10-2025–09-12-2025, Bursa, Turkiye
This paper presents a real-time, bidirectional American Sign Language (ASL) communication system that enables translation between ASL gestures and spoken English. The system integrates computer vision and deep learning to recognize hand signs and utilizes a Unity-based avatar to render spoken input as animated ASL gestures. Designed for accessibility and low-cost deployment, it runs on consumer-grade hardware (a standard webcam, microphone, and mid-range laptop) without requiring specialized equipment such as gloves or depth sensors. A convolutional neural network (CNN) trained on a curated ASL alphabet dataset achieves 92% accuracy in letter recognition, with average response latency below 300 milliseconds. Spoken language is transcribed using the Google Web Speech API and visualized in near real-time. The system supports adaptive retraining through a user feedback loop to enable personalization. Emphasis is placed on inclusive design, practical usability, and potential deployment in VR/AR environments. This paper details the system architecture, methodology, dataset, evaluation metrics, and broader implications, highlighting a real-time, low-cost foundation for scalable and inclusive communication, with current support for alphabet-level gesture recognition and phrase-based ASL avatar responses.
Conference proceeding
Optimized Real-Time Bimodal Sign Language Recognition and Translation System
Published 09-10-2025
Innovations in Intelligent Systems and Applications Conference (Online), 1 - 6
We develop a real-time sign language recognition and translation (SLRT) system to address the communication barriers faced by the Deaf and Hard-of-Hearing (DHH) community in the workplace. The system combines deep learning models, including CNNs and RNNs, with tools such as MediaPipe, gTTS, MarianMT, and FastText to improve translation efficiency and SLRT accuracy. Our dataset comprises two ASL image datasets with diverse backgrounds, ensuring that the resulting model can operate effectively in various environments. The ResNet-LSTM model implemented in our SLRT system achieves an accuracy of 99.95%, proving its robustness in SLRT. Beyond the quantitative assessment of the deep learning model, we also conduct explainability (XAI) analyses, including LIME, t-SNE, and saliency maps. The proposed SLRT system handles real-time translation with minimal computational resources, ensuring high practicality and scalability. At the same time, with the integration of NLP features and a speech-to-text function, our SLRT system makes daily interaction highly efficient and easy to use.
Conference proceeding
Published 09-10-2025
Innovations in Intelligent Systems and Applications Conference (Online), 1 - 6
Brain tumors are among the most critical neurological disorders, significantly affecting global health due to their complex pathology and often late diagnosis. Magnetic Resonance Imaging (MRI) is a key diagnostic tool, yet its manual interpretation remains time-consuming and prone to human error. To overcome these issues, we propose NeuroVision-Lite, a lightweight and efficient deep learning framework utilizing MobileNetV3 with transfer learning to detect brain tumors from MRI images. Our approach leverages three sophisticated convolutional neural network architectures, ConvNeXt-Tiny, MobileNetV3, and EfficientNetB0, carefully selected and fine-tuned for both performance and efficiency. Extensive experiments demonstrate that our proposed models surpass current state-of-the-art (SOTA) techniques in both accuracy and deployment readiness. Moreover, the proposed model strikes a well-calibrated trade-off between detection performance and edge-device compatibility. The framework is quantized and exported in deployment-ready formats such as ONNX and PyTorch (*.pt). Furthermore, visual interpretability is enhanced via Grad-CAM-based heatmaps, supporting clinical transparency. The approach aims to support early detection in low-resource and mobile healthcare settings, contributing to improved clinical outcomes through accessible and scalable AI-driven diagnostics.
Conference proceeding
Massive Crowd Pose Estimation Using Deep Learning-Based Techniques
Published 07-21-2025
2025 Multimedia University Engineering Conference (MECON), 1 - 5
This work contributes to the recent advancements in crowd management research, specifically concerning high-density settings, in the context of large gatherings. Video analysis and visual monitoring have become essential for enhancing the safety and security of pilgrimages worldwide. Multi-person posture estimation is crucial for several computer vision applications and has significantly progressed in recent years. Nonetheless, only a few methods have tackled the challenge of pose estimation in congested settings, which remains difficult and unavoidable in many scenarios. Moreover, current approaches do not provide adequate evaluation criteria for such situations. This study introduces a novel and effective method for tackling the challenge of posture estimation in extensive crowds, accompanied by a new dataset for enhanced algorithm assessment. Our methodology combines several computer vision methods supported by a Mask R-CNN model to precisely separate and evaluate multi-person postures, facilitating the automatic detection of behavioural patterns in large crowds. Our proposed method, with a ResNet101 backbone, achieved 70.0 mAP on our HAJJ-Crowd video dataset. The HAJJ-Crowd dataset can be used for assessment and testing purposes, as it includes instance segmentation and prediction outcomes for several common methodologies.
Book chapter
Assistive Visual Tool: Enhancing Safe Navigation with Video Remapping in AR Headsets
Published 05-12-2025
Computer Vision – ECCV 2024 Workshops, 15634, 356 - 371
Visual Field Loss (VFL) is characterized by blind spots or scotomas that pose a detrimental impact on individuals' fundamental movement activities. Addressing the challenges (e.g., low video quality, content loss, high levels of contradiction, and limited mobility assessment) faced by existing Extended Reality (XR) systems as vision aids, we introduce a groundbreaking method that enriches real-time navigation using Augmented Reality (AR) glasses. Our novel vision aid employs advanced video processing techniques to enhance visual perception in individuals with moderate to severe VFL, bridging the gap to healthy vision. A unique optimal video remapping function, tailored to the characteristics of our selected AR glasses, dynamically maps live video content to the largest intact region of the Visual Field (VF) map. Our method preserves video quality, minimizing blurriness and distortion. Through a comprehensive empirical user study involving 29 subjects with artificially induced scotomas, statistical analyses of object counting and multi-tasking walking track tests demonstrate the promising performance of our method in enhancing visual awareness and navigation capability in real-time.
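The remapping function itself is not reproduced in the abstract; one plausible building block, assumed here purely for illustration, is locating the largest intact rectangle in a binarized VF map, to which live video could then be scaled:

```python
def largest_intact_rect(vf):
    """Largest axis-aligned rectangle of 1s (intact vision) in a binary
    visual-field map, returned as (top, left, height, width).

    Classic histogram-plus-stack algorithm: each row is treated as the base
    of a histogram of consecutive intact cells above it.
    """
    best, best_area = (0, 0, 0, 0), 0
    heights = [0] * len(vf[0])
    for row_idx, row in enumerate(vf):
        # histogram of consecutive intact cells ending at this row
        heights = [h + 1 if cell else 0 for h, cell in zip(heights, row)]
        stack = []                                 # column indices, heights increasing
        for col, h in enumerate(heights + [0]):    # trailing 0 flushes the stack
            while stack and heights[stack[-1]] >= h:
                top_h = heights[stack.pop()]
                left = stack[-1] + 1 if stack else 0
                area = top_h * (col - left)
                if area > best_area:
                    best_area = area
                    best = (row_idx - top_h + 1, left, top_h, col - left)
            stack.append(col)
    return best

# 0 = scotoma, 1 = intact; the 2x2 intact block starting at (0, 1) wins
vf_map = [[0, 1, 1],
          [1, 1, 1],
          [1, 1, 0]]
region = largest_intact_rect(vf_map)
```

In the actual system the VF map would come from perimetry data at the AR display's resolution, and the remapping would additionally optimize for quality and distortion, as the abstract states.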
Journal article
Published 05-12-2025
European Archives of Paediatric Dentistry, 26, 5
An effective deep learning (DL)-based prediction model is crucial for early detection of Early Childhood Caries (ECC). This study aims to develop and evaluate a DL-based hybrid statistical model for ECC prediction. The study employed a computational cross-sectional design, conducted over a three-year period from March 2021 to March 2024. Data analysis was carried out using a hybrid statistical approach that integrated bootstrap methods, Logistic Regression Modelling (LRM), and Multilayer Feed-Forward Neural Networks (MLFFNN). The sample comprised 157 parent-child pairs, providing a robust dataset for examining the research questions. In the current study, the predictors "mother's education" (β = 0.423; p < 0.25), "parent's knowledge that a bottle-feeding habit during sleep can cause tooth decay" (β = -1.264; p < 0.25), "attitude towards the importance of oral health as part of general health" (β = -1.052; p < 0.25), and "parent's self-reported oral pain among their children" (β = -2.107; p < 0.25) showed significant association with ECC. For this model, the Mean Absolute Deviation (MAD) was 0.02211, the Predictive Mean Squared Error (PMSE) was 0.07909, and the accuracy was 99.98%. A t-test found no significant difference between the actual values and the values predicted by the model (p > 0.05). This deep learning-based ECC prediction model appears to be an effective tool for ECC prediction, with high accuracy and interpretability. After implementing an oral health intervention program focused on the potential ECC predictors identified by this model, policymakers will be able to evaluate their own prediction models by comparing their results with the findings of the current study. This comparison will guide them in understanding, designing, and implementing a more effective intervention program for ECC prevention.
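The MAD and PMSE reported above have standard definitions; a minimal sketch of both metrics follows (the example outcome and probability values are illustrative, not from the study):

```python
def mad(actual, predicted):
    """Mean absolute deviation between observed and predicted outcomes."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def pmse(actual, predicted):
    """Predictive mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

y_true = [1, 0, 1, 1, 0]                 # observed ECC status (hypothetical)
y_prob = [0.97, 0.02, 0.97, 0.99, 0.01]  # model probabilities (hypothetical)
err_mad = mad(y_true, y_prob)
err_pmse = pmse(y_true, y_prob)
```

Lower values of both indicate predictions closer to the observed outcomes, which is how the 0.02211 MAD and 0.07909 PMSE above should be read.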