Md Baharul Islam

Associate Professor, Software Engineering, Department of Computing and Software Engineering, U.A. Whitaker College of Engineering, Florida Gulf Coast University

Stereoscopic 3D Processing

Computer Vision

AR/VR Technology for Vision Rehabilitation

Pattern Recognition

Human-Centered Computing

Journal article Peer reviewed

End-to-end human parsing and detection optimized for resource-constrained devices

by Md Imran Hosen, Tarkan Aydin and Md Baharul Islam

Published 12-10-2025

Scientific reports, 16, 1, 943

Human parsing, a vital task in human-centric analysis, involves segmenting clothing and body parts for individual association. Existing methods often rely on auxiliary inputs like detection and edge prediction, limiting their suitability for resource-constrained devices. To address this, we propose an end-to-end framework that integrates a transformer based self-attention module to enhance contextual understanding while being optimized for low-resource environments. We also introduce bounding-polygon annotations to facilitate simultaneous detection and parsing. Our method achieves fine-grained results in a single pass, significantly improving inference speed without sacrificing accuracy. Real-world validation on Raspberry Pi demonstrates its effectiveness and efficiency in resource-constrained scenarios.

Journal article Peer reviewed

Development and evaluation of an early childhood caries prediction model: a deep learning-based hybrid statistical modelling approach

by S Z Eusufzai, N B Jamayet, S Ahmed, M B Islam, W M A W Ahmad and M K Alam

Published 05-12-2025

European archives of paediatric dentistry, 26, 5

An effective Deep learning (DL) based Early Childhood Caries (ECC) prediction model is crucial for early detection of ECC. This study aims to develop and evaluate a deep learning (DL) based hybrid statistical model for ECC prediction. The study employed a computational cross-sectional design, conducted over a three-year period from March 2021 to March 2024. Data analysis was carried out using a hybrid statistical approach that integrated bootstrap methods, Logistic Regression Modelling (LRM), and Multilayer Feed-Forward Neural Networks (MLFFNN). The sample comprised 157 parent-child pairs, providing a robust dataset for examining the research questions. In the current study, the predictors named, "mother's education" (β : 0.423; p < 0.25), "parent's knowledge of bottle-feeding habit during sleep can cause tooth decay" (β : -1.264; p < 0.25), "attitude towards the importance of oral health as general health" (β : -1.052; p < 0.25) and "parent's self-reported oral pain among their children" (β : -2.107; p < 0.25) showed significant association with ECC. For this model, the Mean Absolute Deviation (MAD) was 0.02211, Predictive Mean Squared Error (PMSE) was 0.07909, and the accuracy level was 99.98%. No significant difference was observed from the t-test between the actual values and the predicted values of the model (p > 0.05). It has been shown that this unique deep learning-based ECC prediction model appears an effective tool with high accuracy and interpretability for ECC prediction. After implementing the oral health intervention program, focusing on the potential predictors of ECC obtained from this innovative model, policymakers could be able to evaluate their prediction models comparing their results with the findings of the current study. This comparison will guide them in understanding, designing, and implementing a more effective intervention program for ECC prevention.

Journal article Open access Peer reviewed

Hybrid Model for 6G Network Traffic Prediction and Wireless Resource Optimisation

by Mohammed Anis Oukebdane, A. F. M. Shahen Shah, Md Baharul Islam, John Ekoru and Milka Madahana

Published 01-01-2025

IEEE access, 13, 1 - 1

The fast change from 5G to 6G networks calls for extremely accurate network traffic prediction and effective resource allocation to meet rising data volumes and ultra-low latency requirements. To deal with the complicated time and space based aspects of 6G network traffic, an AI based hybrid model is developed that combines random forest (RF), gated recurrent units (GRU), and a mechanism for paying attention is proposed. Large-scale 6G traffic data with varied channel conditions and user scenarios was used to validate the model. An algorithm is presented to describe the training process of the proposed hybrid model. The results of the proposed hybrid model are presented and compared with baseline methods, including LSTM, GRU, random forest, and XGBoost. Our model obtains a Root Mean Squared Error (RMSE) of 0.0049, an Mean Absolute Error (MAE) of 0.0034, a mean absolute percentage error (MAPE) of 0.46%, and a coefficient of determination R 2 of 0.9970 according to experimental findings on a whole dataset. The suggested technique lowers the RMSE by over 69% and increases R 2 by up to 2.88% compared to baseline GRU and LSTM respectively. These results highlight how well combining deep sequence modelling with ensemble learning works. In next-generation wireless systems, the framework opens the path for proactive resource allocation, strong security, and real-time optimisation outside of improving forecast accuracy. Moreover, this paper provides a critical review of open research directions including the scalability of hybrid AI models, edge intelligence integration, and the evolution of standardised protocols for safe and smooth AI deployment in 6G networks.

Journal article Peer reviewed

Objective Quality Assessment of Stereoscopic Video Using Inflated 3D Features

by Hassan Imani and Md Baharul Islam

Published 08-15-2024

SN computer science, 5, 6, 799

Convolutional Neural Networks (CNNs) have been receiving research attention for Stereoscopic Video Quality Assessment (SVQA) in recent years. Recently, researchers have used 3D CNNs for extracting useful spatial and temporal features from stereo videos and have used them for detecting the reduction in the quality of the stereoscopic videos. To our best knowledge, the concept of transfer learning (TL) has not been well-examined in SVQA. Pretraining and fine-tuning are approaches used in deep neural networks to transform the knowledge learned from other general fields. The previous methods that utilized TL used very heavy 3D ResNet architectures with several layers; therefore, they are very time-consuming. In this paper, we develop a new model for SVQA and use the Inflated 3-Dimensional ConvNet (I3D) network as the backbone feature extractor for our model. We first apply left and right videos to I3D models to extract their features. Then, we apply 3D CNNs to learn quality-aware features from stereo videos. We evaluate our proposed method using LFOVIAS3DPh2 and NAMA3DS1- COSPAD1 SVQA datasets. Extensive experimental studies on two datasets prove that the proposed method correlates with the subjective results. The Root-Mean-Square Error (RMSE) for the NAMA3DS1-COSPAD1 dataset is 0.2454, and the high amount of Linear Correlation Coefficient (LCC) and Spearmen Rank Order Correlation Coefficient (SROCC) values (0.895 and 0.901 respectively) for LFOVIAS3DPh2 dataset show the compatibility of the results with human visual system (HVS). Despite having lighter architecture than the best performing method, the proposed method outperforms most of the methods and overall it is the second best performing method available.

Journal article Open access Peer reviewed

WNet: A dual‐encoded multi‐human parsing network

by Md Imran Hosen, Tarkan Aydin and Md Baharul Islam

Published 07-10-2024

IET image processing, 18, 12

Abstract In recent years, multi‐human parsing has become a focal point in research, yet prevailing methods often rely on intermediate stages and lacking pixel‐level analysis. Moreover, their high computational demands limit real‐world efficiency. To address these challenges and enable real‐time performance, low‐latency end‐to‐end network is proposed. This approach leverages vision transformer and convolutional neural network in a dual‐encoded network, featuring a lightweight Transformer‐based vision encoder) and a convolution encoder based on Darknet. This combination adeptly captures long‐range dependencies and spatial relationships. Incorporating a fuse block enables the seamless merging of features from the encoders. Residual connections in the decoder design amplify information flow. Experimental validation on crowd instance‐level human parsing and look into person datasets showcases the WNet's effectiveness, achieving high‐speed multi‐human parsing at 26.7 frames per second. Ablation studies further underscore WNet's capabilities, emphasizing its efficiency and accuracy in complex multi‐human parsing tasks.

Journal article Open access Peer reviewed

Stereoscopic video deblurring transformer

by Hassan Imani, Md Baharul Islam, Masum Shah Junayed and Md Atiqur Rahman Ahad

Published 06-21-2024

Scientific reports, 14, 1, 14342 - 14

Stereoscopic cameras, such as those in mobile phones and various recent intelligent systems, are becoming increasingly common. Multiple variables can impact the stereo video quality, e.g., blur distortion due to camera/object movement. Monocular image/video deblurring is a mature research field, while there is limited research on stereoscopic content deblurring. This paper introduces a new Transformer-based stereo video deblurring framework with two crucial new parts: a self-attention layer and a feed-forward layer that realizes and aligns the correlation among various video frames. The traditional fully connected (FC) self-attention layer fails to utilize data locality effectively, as it depends on linear layers for calculating attention maps The Vision Transformer, on the other hand, also has this limitation, as it takes image patches as inputs to model global spatial information. 3D convolutional neural networks (3D CNNs) process successive frames to correct motion blur in the stereo video. Besides, our method uses other stereo-viewpoint information to assist deblurring. The parallax attention module (PAM) is significantly improved to combine the stereo and cross-view information for more deblurring. An extensive ablation study validates that our method efficiently deblurs the stereo videos based on the experiments on two publicly available stereo video datasets. Experimental results of our approach demonstrate state-of-the-art performance compared to the image and video deblurring techniques by a large margin.

Journal article Open access Peer reviewed

MLMSign: Multi-lingual multi-modal illumination-invariant sign language recognition

by Arezoo Sadeghzadeh, A.F.M. Shahen Shah and Md Baharul Islam

Published 06-01-2024

Intelligent systems with applications, 22, 200384

Sign language (SL) serves as a visual communication tool bearing great significance for deaf people to interact with others and facilitate their daily life. Wide varieties of SLs and the lack of interpretation knowledge necessitate developing automated sign language recognition (SLR) systems to attenuate the communication gap between the deaf and hearing communities. Despite numerous advanced static SLR systems, they are not practical and favorable enough for real-life scenarios once assessed simultaneously from different critical aspects: accuracy in dealing with high intra- and slight inter-class variations, robustness, computational complexity, and generalization ability. To this end, we propose a novel multi-lingual multi-modal SLR system, namely MLMSign, by taking full strengths of hand-crafted features and deep learning models to enhance the performance and the robustness of the system against illumination changes while minimizing computational cost. The RGB sign images and 2D visualizations of their hand-crafted features, i.e., Histogram of Oriented Gradients (HOG) features and a∗ channel of L∗a∗b∗ color space, are employed as three input modalities to train a novel Convolutional Neural Network (CNN). The number of layers, filters, kernel size, learning rate, and optimization technique are carefully selected through an extensive parametric study to minimize the computational cost without compromising accuracy. The system’s performance and robustness are significantly enhanced by jointly deploying the models of these three modalities through ensemble learning. The impact of each modality is optimized based on their impact coefficient determined by grid search. In addition to the comprehensive quantitative assessment, the capabilities of our proposed model and the effectiveness of ensembling over three modalities are evaluated qualitatively using the Grad-CAM visualization model. Experimental results on the test data with additional illumination changes verify the high robustness of our system in dealing with overexposed and underexposed lighting conditions. Achieving a high accuracy (>99.33%) on six benchmark datasets (i.e., Massey, Static ASL, NUS II, TSL Fingerspelling, BdSL36v1, and PSL) demonstrates that our system notably outperforms the recent state-of-the-art approaches with a minimum number of parameters and high generalization ability over complex datasets. Its promising performance for four different sign languages makes it a feasible system for multi-lingual applications.
[Display omitted]
•Propose multi-lingual sign language recognition using handcrafted and deep features.•Extract HOG and L∗a∗b∗ features to generate robust and representative modalities.•Offer a parametric study to optimize a CNN for high performance with minimized cost.•Apply weighted ensemble on CNNs of 3 modalities to improve accuracy and robustness.•Evaluate performance and lighting-invariance on 6 datasets for multi-lingual apps.

Journal article Peer reviewed

Spatial-Temporal Coherence in Extreme Video Retargeting for Consumer Screening Devices

by Hassan Imani and Md Baharul Islam

Published 2024

IEEE transactions on consumer electronics, 71, 2, 1 - 1

The accessibility of diverse display devices and their aspect ratios has drawn much research attention to video retargeting. Non-consistent video retargeting can significantly affect a video's spatial and temporal quality, particularly in extreme retargeting cases. Since there are no perfectly annotated datasets for video retargeting, deep learning-based techniques are rarely utilized. This paper proposes a method that learns to retarget videos by detecting the salient areas and shifting them to the appropriate location. First, we segment the salient objects using a unified Transformer model. Using convolutional layers and a shifting strategy, we shift and warp objects to the appropriate size and location in the frame. We use 1D convolution to move the salient items in the scene. Additionally, we employ a frame interpolation technique to preserve temporal information. To train the network, we feed the retargeted frames to a variational auto-encoder network to map the retargeted frames back to the input frames. Furthermore, we design perceptual and wavelet-based loss functions to train our model. Thus, we train the network unsupervised. Extensive qualitative and quantitative experiments on the DAVIS dataset show the superiority of the proposed method over existing image and video-based methods.

Journal article Open access Peer reviewed

ARVA: An Augmented Reality-based Visual Aid for Mobility Enhancement Through Real-time Video Stream Transformation

by Arezoo Sadeghzadeh, Md Baharul Islam, Md Nur Uddin and Tarkan Aydin

Published 01-01-2024

IEEE access, 12, 1 - 1

Visual field loss (VFL) is a persistent visual impairment characterized by blind spots (scotoma) within the normal visual field, significantly impacting daily activities for affected individuals. CurrentVirtual Reality (VR) and Augmented Reality (AR)-based visual aids suffer from low video quality, content loss, high levels of contradiction, and limited mobility assessment. To address these issues, we propose an innovative vision aid utilizing AR headset and integrating advanced video processing techniques to elevate the visual perception of individuals with moderate to severe VFL to levels comparable to those with unimpaired vision. Our approach introduces a pioneering optimal video remapping function tailored to the characteristics of AR glasses. This function strategically maps the content of live video captures to the largest intact region of the visual field map, preserving quality while minimizing blurriness and content distortion. To evaluate the performance of our proposed method, a comprehensive empirical user study is conducted including object counting and multi-tasking walking track tests and involving 15 subjects with artificially induced scotomas in their normal visual fields. The proposed vision aid achieves 41.56% enhancement (from 57.31% to 98.87%) in the mean value of the average object recognition rates for all subjects in object counting test. In walking track test, the average mean scores for obstacle avoidance, detected signs, recognized signs, and grasped objects are significantly enhanced after applying the remapping function, with improvements of 7.56% (91.10% to 98.66%), 51.81% (44.85% to 96.66%), 49.31% (43.18% to 92.49%), and 77.77% (13.33% to 91.10%), respectively. Statistical analysis of data before and after applying the remapping function demonstrates the promising performance of our method in enhancing visual awareness and mobility for individuals with VFL.

Journal article

Generative AI for Recognizing Nurse Training Activities in Skeleton-Based Video Data

by Md Ibrahim Mamun, Shahera Hossain, Md Baharul Islam and Md Atiqur Rahman Ahad

Published 2024

International Journal of Activity and Behavior Computing, 2024, 3, 1 - 20

Endotracheal suctioning (ES) is a complex procedure associated with a series of actions and inherent risks, particularly in the intensive care unit (ICU). Given the importance of precise execution, it is preferable to have skilled nurses perform ES tasks. To facilitate nurse training and ensure proficiency in ES procedures, automated nursing activity recognition presents a promising solution, offering benefits in terms of cost, time, and effort. In this paper, we propose a novel approach to nurse training activity recognition for ES tasks, leveraging the capabilities of Generative Artificial Intelligence (GenAI). Specifically, we demonstrate how Large Language Models (LLMs), a subset of GenAI, can enhance the efficiency of nursing activity recognition. By employing LLMs such as OpenAI's Generative Pre-trained Transformer (ChatGPT), Google's Gemini, and Microsoft's Copilot, we aim to improve the accuracy and efficiency of our methodology. Our study identifies a clear gap in the utilization of LLMs for more accurate determination of nursing activities related to ES, with reduced human interaction. Through the integration of approaches and data features suggested by LLMs, we achieve a notable increase in accuracy from baseline 0.51 to 0.58, along with an elevated F1 score from 0.31 to 0.46. These results underscore the potential of LLMs, as a subset of GenAI, to enhance traditional problem-solving efficiency by offering robust solutions and procedures.

Md Baharul Islam

Associate Professor, Software Engineering, Department of Computing and Software Engineering, U.A. Whitaker College of Engineering, Florida Gulf Coast University

Scholarship list