The Problem
Breast cancer remains one of the leading causes of death among women globally. In Sub-Saharan Africa, the burden is disproportionately high — late-stage diagnosis is common, trained radiologists are scarce, and advanced imaging infrastructure is expensive. The modalities that are accessible — mammography and ultrasonography — are typically used in isolation, limiting their diagnostic potential.
The AI systems that have shown promise in breast cancer screening are largely unimodal and suffer from a critical limitation: they are black boxes. A model that achieves expert-level accuracy but cannot explain its reasoning erodes clinical trust. Practitioners understandably resist basing patient care decisions on systems they cannot understand or verify.
The research gap: no prior framework has fused strictly non-invasive, accessible modalities (mammography + ultrasound) with first-class explainability designed specifically for resource-limited settings in Africa — without requiring biopsy data, genomics, or paired multimodal datasets.
Approach
MMIBC (Multimodal Medical Imaging for Breast Cancer) proposes an explainable multimodal Vision Transformer framework that fuses mammography and ultrasonography for unified breast cancer diagnosis. The design addresses three constraints simultaneously: low-resource deployment, unpaired dataset availability, and clinical interpretability.
*Architecture overview: X-ray (mammography) images and ultrasound images each pass through a modality-specific feature-extraction branch; the extracted features are concatenated and fed to a classifier that outputs Malignant / Benign / Normal.*
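The two-branch pipeline — separate feature extractors per modality, concatenation fusion, and a shared three-class head — can be sketched at the shape level as follows. The random-projection "encoders", the 224×224 input size, and the 256-d embedding width are illustrative assumptions standing in for the actual Vision Transformer branches, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's configuration):
# each branch yields a 256-d embedding; three output classes.
EMB_DIM, NUM_CLASSES = 256, 3

# Stand-ins for the two ViT feature extractors: fixed random projections
# mapping a flattened 224x224 grayscale image to an embedding.
W_mammo = rng.standard_normal((224 * 224, EMB_DIM)) * 0.01
W_us = rng.standard_normal((224 * 224, EMB_DIM)) * 0.01
W_clf = rng.standard_normal((2 * EMB_DIM, NUM_CLASSES)) * 0.01

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict(mammo_img, us_img):
    """Late-fusion pipeline: extract per-modality features,
    concatenate them, then classify into normal/benign/malignant."""
    f_m = mammo_img.ravel() @ W_mammo   # mammography branch features
    f_u = us_img.ravel() @ W_us         # ultrasound branch features
    fused = np.concatenate([f_m, f_u])  # concatenation fusion
    return softmax(fused @ W_clf)       # class probabilities

probs = predict(rng.random((224, 224)), rng.random((224, 224)))
```

Concatenation keeps each modality's features intact before the classifier, which is what lets a single head weigh evidence from both images.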
A key challenge was the absence of natively paired multimodal datasets — mammography and ultrasound images of the same patient taken simultaneously. To address this, MMIBC introduces a label-consistent programmatic pairing strategy: samples from the mammography and ultrasound datasets are paired by diagnostic label during training, enabling multimodal learning under realistic data constraints without requiring simultaneous dual-modality imaging.
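A minimal sketch of such a label-consistent pairing step; the tuple format, helper name, and toy samples are illustrative assumptions, not the paper's implementation:

```python
import random
from collections import defaultdict

def pair_by_label(mammo_samples, us_samples, seed=0):
    """Label-consistent programmatic pairing: for each mammogram,
    draw an ultrasound image that shares its diagnostic label
    ('normal' / 'benign' / 'malignant'). Samples are
    (image_id, label) tuples."""
    rng = random.Random(seed)
    us_by_label = defaultdict(list)
    for img, label in us_samples:
        us_by_label[label].append(img)
    pairs = []
    for img, label in mammo_samples:
        if us_by_label[label]:  # skip labels absent from the ultrasound set
            pairs.append((img, rng.choice(us_by_label[label]), label))
    return pairs

# Toy example: pair a mammography set with an ultrasound set by label.
mammo = [("m0", "benign"), ("m1", "malignant"), ("m2", "normal")]
us = [("u0", "benign"), ("u1", "benign"), ("u2", "malignant"), ("u3", "normal")]
pairs = pair_by_label(mammo, us)
```

Because pairing only requires matching labels, the two datasets never need to come from the same patients — the constraint the strategy is designed around.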
Datasets
MMIBC uses two publicly available datasets — VinDr-Mammo (digital mammography, Vietnam) and BUSI (breast ultrasound, Egypt) — representing diverse African and Asian populations, deliberately chosen to reflect real-world deployment contexts where proprietary hospital datasets are unavailable.
Key Contributions
Results
MMIBC achieved 84% overall classification accuracy across three diagnostic classes (normal, benign, malignant) on the VinDr-Mammo and BUSI test sets. Qualitative explainability results confirmed that Grad-CAM attention maps align with clinically relevant anatomical regions — microcalcification clusters and lesion margins — rather than image artifacts or background features.
| Class | Precision | Recall | F1 Score | Notes |
|---|---|---|---|---|
| Normal | High | High | High | Well-represented class |
| Benign | Moderate | Moderate | Moderate | Boundary cases with malignant |
| Malignant | Variable | Variable | Variable | Impacted by class imbalance |
| Overall | — | — | — | 84% weighted accuracy |
Explainability finding: Grad-CAM heatmaps consistently highlighted regions corresponding to known clinical indicators — microcalcifications in mammograms, hypoechoic lesion margins in ultrasound — confirming that the model's predictions are grounded in clinically relevant evidence, not spurious correlations.
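The Grad-CAM computation behind these heatmaps can be sketched with plain arrays: channel weights are the spatially averaged gradients of the target class score, and the map is the ReLU of their weighted sum over activations. For a ViT the activations would come from patch tokens reshaped into a spatial grid; the shapes below are toy assumptions:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Minimal Grad-CAM. `activations` and `gradients` (of the target
    class score w.r.t. a chosen layer) are both shaped (C, H, W)."""
    weights = gradients.mean(axis=(1, 2))             # alpha_c per channel
    cam = np.tensordot(weights, activations, axes=1)  # sum_c alpha_c * A_c
    cam = np.maximum(cam, 0)                          # keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                              # normalize to [0, 1]
    return cam

# Toy activations/gradients standing in for a real layer's outputs.
rng = np.random.default_rng(1)
A = rng.random((8, 14, 14))
G = rng.standard_normal((8, 14, 14))
heatmap = grad_cam(A, G)
```

Upsampled to the input resolution, such a map is what gets overlaid on the mammogram or ultrasound image to check that attention falls on lesions rather than artifacts.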
Dataset imbalance — particularly for malignant cases — highlighted a critical real-world constraint: deployment in low-resource settings means working with skewed class distributions. The paper analyzes these trade-offs explicitly, which influenced the design of the evaluation protocol to go beyond aggregate accuracy.
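Going beyond aggregate accuracy reduces, in practice, to reporting per-class precision, recall, and F1 so minority-class (malignant) performance stays visible even when the overall number looks healthy. A self-contained sketch with toy labels (not the paper's results):

```python
import numpy as np

def per_class_metrics(y_true, y_pred, num_classes=3):
    """Per-class precision, recall, and F1 from a confusion matrix."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)  # column sums = predicted
    recall = tp / np.maximum(cm.sum(axis=1), 1)     # row sums = actual
    f1 = np.where(precision + recall > 0,
                  2 * precision * recall / np.maximum(precision + recall, 1e-12),
                  0.0)
    return precision, recall, f1

# Toy labels: 0=normal, 1=benign, 2=malignant.
p, r, f = per_class_metrics([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 1])
```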
Limitations & Future Work
The programmatic pairing strategy, while necessary and effective, is a proxy for true paired multimodal imaging. Future work should incorporate natively paired datasets as they become available. The class imbalance problem in malignant case detection warrants continued attention — oversampling strategies, focal loss formulations, and synthetic augmentation using diffusion models are all candidate approaches.
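Of those candidate approaches, focal loss is the most compact to illustrate. This is the standard formulation (down-weighting easy examples by (1 − p_t)^γ so training focuses on hard minority-class samples), not code from the paper:

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=None):
    """Focal loss sketch. `probs` is an (N, C) array of softmax outputs,
    `targets` an (N,) array of integer class ids. With gamma=0 and
    alpha=None this reduces to plain cross-entropy."""
    p_t = probs[np.arange(len(targets)), targets]  # prob of the true class
    weight = (1.0 - p_t) ** gamma                  # focusing term
    if alpha is not None:                          # optional class weighting
        weight = weight * np.asarray(alpha)[targets]
    return float(np.mean(-weight * np.log(np.clip(p_t, 1e-12, 1.0))))

probs = np.array([[0.9, 0.05, 0.05],   # easy, confidently correct example
                  [0.3, 0.2, 0.5]])    # harder malignant example
loss = focal_loss(probs, np.array([0, 2]))
```

The focusing term shrinks the contribution of the confident example far more than the hard one, which is exactly the behavior wanted for under-represented malignant cases.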
The framework was validated on African and Asian population data (Egypt, Vietnam). Extension to broader demographic datasets, particularly with West African imaging data where breast tissue density patterns differ, would strengthen clinical generalizability. Finally, prospective clinical validation with radiologist feedback loops is the essential next step before real-world deployment.