Case Study · Research · Medical AI

MMIBC: Explainable Multimodal Vision Transformer for Breast Cancer Diagnosis

Testimony Adekoya · Submitted MIWAI 2026 · DSN AI+ Bootcamp 2025
PyTorch · Vision Transformers · Multimodal ML · XAI · Medical Imaging · Best Poster Award
Best Poster Award — DSN AI+ Bootcamp 2025 · Under review at MIWAI 2026
84% · Overall Classification Accuracy
2 · Imaging Modalities Fused
ViT · Parallel Transformer Backbones
Grad-CAM · Explainability Method
01 /

The Problem

Breast cancer remains one of the leading causes of death among women globally. In Sub-Saharan Africa, the burden is disproportionately high — late-stage diagnosis is common, trained radiologists are scarce, and advanced imaging infrastructure is expensive. The modalities that are accessible — mammography and ultrasonography — are typically used in isolation, limiting their diagnostic potential.

The AI systems that have shown promise in breast cancer screening are largely unimodal and suffer from a critical limitation: they are black boxes. A model that achieves expert-level accuracy but cannot explain its reasoning erodes clinical trust. Practitioners understandably resist basing patient care decisions on systems they cannot understand or verify.

The research gap: No prior framework fused strictly non-invasive, accessible modalities (mammography + ultrasound) with first-class explainability designed specifically for resource-limited settings in Africa — without requiring biopsy data, genomics, or paired multimodal datasets.

02 /

Approach

MMIBC (Multimodal Medical Imaging for Breast Cancer) is an explainable multimodal Vision Transformer framework that fuses mammography and ultrasonography for unified breast cancer diagnosis. The design addresses three constraints simultaneously: low-resource deployment, unpaired dataset availability, and clinical interpretability.
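In sketch form, the fusion design looks like the following. The toy backbones, layer sizes, and class names here are illustrative stand-ins, not the paper's exact configuration; in practice each backbone would be a pretrained ViT.

```python
import torch
import torch.nn as nn

class MMIBCFusion(nn.Module):
    """Sketch of the MMIBC design: two modality-specific backbones,
    feature-level concatenation, and a lightweight MLP classifier."""
    def __init__(self, mammo_backbone, us_backbone, embed_dim=768, n_classes=3):
        super().__init__()
        self.mammo_backbone = mammo_backbone   # pretrained ViT in practice
        self.us_backbone = us_backbone         # pretrained ViT in practice
        self.classifier = nn.Sequential(       # lightweight MLP head
            nn.Linear(2 * embed_dim, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, n_classes),
        )

    def forward(self, mammo_img, us_img):
        f_m = self.mammo_backbone(mammo_img)   # (B, embed_dim)
        f_u = self.us_backbone(us_img)         # (B, embed_dim)
        fused = torch.cat([f_m, f_u], dim=-1)  # feature-level fusion
        return self.classifier(fused)          # (B, n_classes) logits

# Toy stand-in encoders: flatten a small image and project to embed_dim.
def toy_backbone():
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 768))

model = MMIBCFusion(toy_backbone(), toy_backbone())
logits = model(torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32))
```

Concatenation keeps each modality's representation intact before the head, which is what lets the later Grad-CAM analysis attribute evidence to anatomical regions in either input.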

A key challenge was the absence of natively paired multimodal datasets — mammography and ultrasound images of the same patient taken simultaneously. To address this, MMIBC introduces a label-consistent programmatic pairing strategy: samples from the mammography and ultrasound datasets are paired by diagnostic label during training, enabling multimodal learning under realistic data constraints without requiring simultaneous dual-modality imaging.
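The pairing step reduces to a simple procedure: index one dataset by diagnostic label, then draw a label-matched partner for each sample of the other. A minimal sketch, with hypothetical file names standing in for real dataset entries:

```python
import random
from collections import defaultdict

def pair_by_label(mammo_samples, us_samples, seed=0):
    """Label-consistent programmatic pairing: for each mammogram,
    draw an ultrasound image that shares its diagnostic label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in us_samples:
        by_label[label].append(path)
    pairs = []
    for path, label in mammo_samples:
        if by_label[label]:                      # skip labels with no partner
            partner = rng.choice(by_label[label])
            pairs.append((path, partner, label))
    return pairs

# Hypothetical entries; real inputs would come from VinDr-Mammo and BUSI.
mammo = [("m0.png", "benign"), ("m1.png", "malignant"), ("m2.png", "normal")]
us = [("u0.png", "benign"), ("u1.png", "malignant"),
      ("u2.png", "normal"), ("u3.png", "benign")]
pairs = pair_by_label(mammo, us)
```

Re-drawing partners each epoch (rather than fixing them once) would additionally act as a form of cross-modal augmentation.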

03 /

Datasets

MMIBC uses two publicly available datasets representing diverse African and Asian populations — deliberately chosen to reflect real-world deployment contexts where proprietary hospital datasets are unavailable.

VinDr-Mammo
Mammography — X-ray
Large-scale mammography dataset from Vietnamese hospitals. Provides X-ray images with radiologist-annotated BI-RADS assessments across benign and malignant findings.
Modality: Mammography
Origin: Vietnam
BUSI
Breast Ultrasound Images
Breast Ultrasound Images dataset collected from Egyptian women aged 25–75. Covers normal, benign, and malignant cases with expert segmentation masks.
Modality: Ultrasonography
Origin: Egypt
04 /

Key Contributions

I
Explainable Multimodal Architecture
Parallel pretrained ViT backbones for modality-specific representation learning, followed by feature-level fusion and a lightweight MLP classification head. Designed to run on accessible imaging hardware without high-compute infrastructure.
II
Programmatic Pairing Strategy
Label-consistent pairing of mammography and ultrasound samples enables multimodal training from publicly available but natively unpaired datasets — removing the dependency on simultaneous dual-modality imaging sessions.
III
Grad-CAM Clinical Transparency
Integration of Gradient-weighted Class Activation Mapping produces visual heatmaps highlighting the anatomical regions that drove each diagnostic decision — providing interpretable evidence that frontline clinicians can evaluate and trust.
IV
Context-Specific Benchmarking
Detailed evaluation of performance trade-offs arising from dataset class imbalance in low-resource settings — surfacing real deployment constraints rather than reporting only best-case accuracy.
05 /

Results

MMIBC achieved 84% overall classification accuracy across three diagnostic classes (normal, benign, malignant) on the VinDr-Mammo and BUSI test sets. Qualitative explainability results confirmed that Grad-CAM attention maps align with clinically relevant anatomical regions — microcalcification clusters and lesion margins — rather than image artifacts or background features.

Class      Precision  Recall     F1 Score   Notes
Normal     High       High       High       Well-represented class
Benign     Moderate   Moderate   Moderate   Boundary cases with malignant
Malignant  Variable   Variable   Variable   Impacted by class imbalance
Overall    84% weighted accuracy across all classes

Explainability finding: Grad-CAM heatmaps consistently highlighted regions corresponding to known clinical indicators — microcalcifications in mammograms, hypoechoic lesion margins in ultrasound — confirming that the model's predictions are grounded in clinically relevant evidence, not spurious correlations.
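The Grad-CAM computation behind those heatmaps is compact: each feature map is weighted by its spatially averaged gradient, the weighted maps are summed, and only positive evidence is kept. A numpy sketch on synthetic arrays (in the real pipeline both activations and gradients come from backpropagating the class score through the model, and ViT token maps are first reshaped to a spatial grid):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM: weight each feature map by its spatially averaged
    gradient, sum, then keep positive evidence (ReLU) and normalize.
    activations, gradients: arrays of shape (C, H, W)."""
    weights = gradients.mean(axis=(1, 2))               # alpha_c, shape (C,)
    cam = np.einsum("c,chw->hw", weights, activations)  # weighted sum
    cam = np.maximum(cam, 0.0)                          # ReLU
    if cam.max() > 0:
        cam = cam / cam.max()                           # scale to [0, 1]
    return cam

rng = np.random.default_rng(0)
A = rng.random((8, 14, 14))            # stand-in feature maps
G = rng.standard_normal((8, 14, 14))   # stand-in gradients
heatmap = grad_cam(A, G)               # (14, 14), upsampled for overlay
```

The normalized map is then upsampled to the input resolution and overlaid on the mammogram or ultrasound image for clinician review.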

Dataset imbalance — particularly for malignant cases — highlighted a critical real-world constraint: deployment in low-resource settings means working with skewed class distributions. The paper analyzes these trade-offs explicitly, which influenced the design of the evaluation protocol to go beyond aggregate accuracy.
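Why aggregate accuracy alone is insufficient can be shown with a small worked example. The confusion matrix below is hypothetical, not the paper's results; it illustrates how a skewed class distribution lets overall accuracy stay high while minority-class (malignant) recall collapses:

```python
def per_class_recall(confusion):
    """Recall per class from a confusion matrix (rows = true class)."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]

# Hypothetical skewed test split: many normal/benign, few malignant.
#              predicted: normal  benign  malignant
confusion = [ [90,  8,  2],    # true normal
              [10, 70, 10],    # true benign
              [ 5, 10, 15] ]   # true malignant

recalls = per_class_recall(confusion)
accuracy = sum(confusion[i][i] for i in range(3)) / sum(map(sum, confusion))
macro_recall = sum(recalls) / len(recalls)
```

Here overall accuracy is about 0.80 while malignant recall is only 0.50, so the macro average drops well below the aggregate figure; reporting both is what surfaces the deployment constraint.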

06 /

Limitations & Future Work

The programmatic pairing strategy, while necessary and effective, is a proxy for true paired multimodal imaging. Future work should incorporate natively paired datasets as they become available. The class imbalance problem in malignant case detection warrants continued attention — oversampling strategies, focal loss formulations, and synthetic augmentation using diffusion models are all candidate approaches.
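Of those candidates, the focal loss is the most self-contained to illustrate: it multiplies cross-entropy by (1 - p)^gamma so that confidently classified (usually majority-class) examples contribute almost nothing, shifting gradient signal toward hard minority cases. A per-sample sketch with the commonly used defaults (gamma = 2, alpha = 0.25):

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=0.25):
    """Focal loss for one sample, where p_true is the predicted
    probability of the correct class. The (1 - p)^gamma factor
    down-weights easy examples so training focuses on hard ones."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

easy = focal_loss(0.95)  # confident correct prediction: near-zero loss
hard = focal_loss(0.30)  # hard (often minority-class) case: much larger
```

With gamma = 0 this reduces to alpha-weighted cross-entropy, so the modulation strength is tunable against the severity of the imbalance.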

The framework was validated on African and Asian population data (Egypt, Vietnam). Extension to broader demographic datasets, particularly with West African imaging data where breast tissue density patterns differ, would strengthen clinical generalizability. Finally, prospective clinical validation with radiologist feedback loops is the essential next step before real-world deployment.