Abstract
This paper presents a pneumonia detection approach based on Multimodal Federated Learning (MMFL) that combines real-world chest X-ray images and blood test results. The work compares late and intermediate (hybrid) fusion architectures for integrating image and tabular data to improve pneumonia identification. The anonymized dataset contains 2,343 entries from 2,201 patients after data cleaning. For image classification, Vision Transformer (ViT), EfficientNetV2, and Xception were evaluated; for tabular data, XGBoost was employed. For the federated setting, the study proposes a federated late fusion multimodal model in which each client trains a ViT model on images and an XGBoost model on tabular data, and the client models are subsequently aggregated on the server. Federated models were trained across multiple institutions, each with its own data partition. The centrally trained multimodal late fusion model using ViT and XGBoost achieved the best overall performance, with an accuracy of 95.40% and an AUC of 98.79%. The federated multimodal model proved to be a viable alternative when data is decentralized, with an accuracy of 90.33% and an AUC of 96.67% when two clients are in the federation.
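
As an illustration of the late fusion idea, the minimal sketch below combines per-sample pneumonia probabilities from an image model and a tabular model after each has made its own prediction. The abstract does not specify the fusion rule, so the weighted average, the function names `late_fusion` and `predict`, the weight `w_image`, and the decision threshold are all assumptions made for illustration; the probabilities are placeholder values, not results from the study.

```python
def late_fusion(p_image, p_tabular, w_image=0.5):
    """Fuse two lists of per-sample probabilities by weighted averaging.

    Assumption: a simple weighted average stands in for the fusion rule,
    which the abstract does not specify.
    """
    if not 0.0 <= w_image <= 1.0:
        raise ValueError("w_image must lie in [0, 1]")
    if len(p_image) != len(p_tabular):
        raise ValueError("both models must score the same samples")
    return [
        w_image * pi + (1.0 - w_image) * pt
        for pi, pt in zip(p_image, p_tabular)
    ]


def predict(p_fused, threshold=0.5):
    """Turn fused probabilities into binary labels (1 = pneumonia)."""
    return [int(p >= threshold) for p in p_fused]


# Placeholder scores, e.g. from an image classifier and a tabular classifier.
p_img = [0.9, 0.2]
p_tab = [0.7, 0.4]

fused = late_fusion(p_img, p_tab)
labels = predict(fused)
```

In late fusion each modality keeps its own model and only the outputs are merged, which is what allows the image and tabular models to be trained separately on each client. Intermediate fusion would instead merge learned feature representations before the final classifier.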