Systematic evaluation of harmonisation methods reduces variability in multisite MRI volumetric imaging-derived phenotypes

Key Takeaway
Harmonisation benefits in multisite MRI studies are context dependent, and the effectiveness of harmonisation in small, heterogeneous datasets remains insufficiently understood.

This systematic evaluation examines image-based and statistical harmonisation methods within a clinically realistic, multisite, multiscanner structural T1-weighted (T1w) MRI test-retest dataset. The scope covers variability in volumetric imaging-derived phenotypes (IDPs) and rank consistency under repeatability, intra-scanner, and inter-scanner reproducibility scenarios. No specific sample size was reported for this dataset.

Key findings show that, before harmonisation, variability was lowest in the repeatability scenario, with median variability ranging from 0.6% to 2.7% and rank consistency (rho) of at least 0.9. The intra-scanner reproducibility scenario showed modest increases, with variability of 0.5% to 3.2% and rho between 0.5 and 1.0. The inter-scanner reproducibility scenario showed substantially greater variability, ranging from 1.7% to 19.2%, with rho between -0.1 and 0.9. Harmonisation benefits were clearest in inter-scanner scenarios, where both variability reduction and improved subject-level consistency were observed. In pooled data, approaches that modelled site as batch and accounted for repeated-measure structure showed greater consistency across IDPs and more accurately reflected underlying biological variation.
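
To make the two reported metrics concrete, here is a minimal Python sketch of how a median test-retest variability and a rank consistency value could be computed for one IDP. The definitions are assumptions, since the summary does not spell them out: variability is taken as the absolute difference between two sessions as a percentage of their mean, and rank consistency is taken as Spearman's rho between sessions. The volumes and the function names are hypothetical and are not drawn from the study.

```python
from statistics import mean, median


def percent_variability(test, retest):
    """Absolute session difference as a percentage of the pair mean (assumed definition)."""
    return [abs(a - b) / ((a + b) / 2) * 100 for a, b in zip(test, retest)]


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical volumes (mm^3) for five subjects scanned in two sessions.
test_vol = [3500.0, 4100.0, 3800.0, 4400.0, 3650.0]
retest_vol = [3540.0, 4060.0, 3830.0, 4380.0, 3700.0]

median_var = median(percent_variability(test_vol, retest_vol))
rho = spearman_rho(test_vol, retest_vol)
print(f"median variability = {median_var:.2f}%, rho = {rho:.2f}")
```

In this toy example the subject ordering is identical across sessions, so rho is 1 even though individual volumes differ. This is why the two metrics are reported together: low variability and preserved ranking are separate properties. The sketch does not handle the degenerate case where all values in a session are equal, which would make rho undefined.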

The authors note that the effectiveness of harmonisation in small, heterogeneous clinical datasets remains insufficiently understood and that performance was strongly context dependent. Consequently, harmonisation cannot be treated as a one-size-fits-all solution. This information is important to consider for multisite study design, including sample size calculation in clinical trials.

Study Details

Evidence Level: 5
Published: Apr 2026
Original Abstract
Harmonisation is widely used to mitigate site- and scanner-related batch variability in multisite neuroimaging studies and is particularly critical in longitudinal clinical trials, where detection of subtle biological or treatment-related changes depends on reliable measurement across scanners and timepoints. However, the effectiveness of harmonisation in small, heterogeneous clinical datasets remains insufficiently understood, particularly in relation to subject-level variability and consistency across acquisition settings, and its impact on both removal of technical variability and preservation of biological variation in pooled multisite analyses. We systematically evaluated a range of image-based and statistical harmonisation methods using a clinically realistic multisite, multiscanner structural T1-weighted (T1w) MRI test-retest dataset comprising three controlled acquisition scenarios: repeatability, intra-scanner reproducibility and inter-scanner reproducibility. Methods were applied under different batch specifications (site, scanner, or both) and performance was assessed within each scenario and in pooled data using a multi-metric framework capturing both technical and biological variability in volumetric imaging-derived phenotypes (IDPs) relevant to aging and dementia research. Across IDPs, before harmonisation variability was lowest in the repeatability scenario (median variability = 0.6 to 2.7%, rank consistency ρ ≥ 0.9), with modest increases under intra-scanner reproducibility (0.5 to 3.2%, ρ = 0.5 to 1.0) and substantially greater variability under inter-scanner reproducibility conditions (1.7 to 19.2%, ρ = -0.1 to 0.9). These results offer important information to consider for multisite study design, including sample size calculation in clinical trials.
Harmonisation performance was strongly context dependent, with clearer benefits emerging in inter-scanner scenarios where both variability reduction and improvements in subject-level consistency were observed. In pooled data, approaches that explicitly modelled site as batch and accounted for repeated-measure structure showed greater consistency across IDPs in batch effect mitigation and more accurately reflected underlying biological variation. Our evaluation metrics enabled disentangling the removal of global batch effect while highlighting residual variability at the phenotype-specific or multivariate levels. These findings demonstrate that harmonisation cannot be treated as a one-size-fits-all solution and must be interpreted relative to the acquisition context, dataset structure, and downstream analytic goals. Multi-metric evaluation under realistic clinical constraints is essential to support reliable and translatable neuroimaging inference by ensuring appropriate correction of batch effects while preserving longitudinal biological signals and sensitivity to clinically meaningful change in multisite studies.