We propose MaRS, a multi-modality very-high-resolution (VHR) remote sensing foundation model designed for cross-modality, cross-granularity interpretation of complex scenes. We construct MaRS-16M, a large-scale VHR SAR-optical paired dataset (16M+ pairs) built via collection and semi-automated processing. MaRS tackles two core challenges of VHR SAR-optical self-supervised learning: the imaging discrepancy between sensors and the modal representation gap. To this end, we introduce Cross-Granularity Contrastive Learning (CGCL), which alleviates alignment inconsistencies by linking patch- and image-level semantics, and design Meta-Modality Attention (MMA), which unifies heterogeneous physical characteristics across modalities. Compared with existing remote sensing foundation models (RSFMs) and general vision foundation models (VFMs), MaRS serves as a strong pretrained backbone across nine multi-modality VHR downstream tasks. The dataset and code are available at the project page.
Architecture. MaRS adopts dual encoders (E_RGB and E_SAR, both SwinV2-based) to extract modality-specific tokens, followed by a Meta-Modality Attention (MMA) Transformer that alternates intra-modality and cross-modality attention to form a unified representation. Lightweight decoders handle dense prediction.
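The alternating attention pattern described above can be sketched in a few lines. This is a minimal, hypothetical single-head NumPy illustration of one MMA-style step (self-attention within each modality, then each branch querying the other's tokens), not the paper's actual multi-head implementation or layer layout:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention over token matrices."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def mma_block(rgb, sar):
    """One illustrative Meta-Modality Attention step:
    intra-modality self-attention, then cross-modality attention,
    each with a residual connection."""
    # Intra-modality: each branch attends within its own tokens.
    rgb = rgb + attention(rgb, rgb, rgb)
    sar = sar + attention(sar, sar, sar)
    # Cross-modality: each branch queries the other branch's tokens.
    rgb_out = rgb + attention(rgb, sar, sar)
    sar_out = sar + attention(sar, rgb, rgb)
    return rgb_out, sar_out

rng = np.random.default_rng(0)
rgb_tokens = rng.standard_normal((16, 64))  # 16 tokens, dim 64
sar_tokens = rng.standard_normal((16, 64))
r, s = mma_block(rgb_tokens, sar_tokens)
```

Stacking such blocks lets the two token streams converge toward the unified representation the MMA Transformer produces; projection matrices, normalization, and multi-head splitting are omitted for brevity.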
Pretraining. We combine three self-supervised strategies: (1) CGCL at the patch, image, and patch-to-global levels, mitigating local distortions while preserving global semantics; (2) masked image modeling within each modality branch; (3) continued pretraining on VHR optical imagery to further strengthen representation quality. Training uses 512×512 inputs with a 60% mask ratio and runs on 8×A800 GPUs.
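As a rough illustration of the contrastive component, the sketch below implements a generic symmetric InfoNCE term over paired RGB/SAR embeddings. It is a stand-in for the kind of image-level alignment objective CGCL builds on, not the paper's exact loss; the batch size, dimension, and temperature are arbitrary:

```python
import numpy as np

def info_nce(z_rgb, z_sar, tau=0.07):
    """Symmetric InfoNCE over paired embeddings (row i of each matrix
    is a positive pair); a generic illustration, not the CGCL formula."""
    # L2-normalize so the dot product is cosine similarity.
    z_rgb = z_rgb / np.linalg.norm(z_rgb, axis=1, keepdims=True)
    z_sar = z_sar / np.linalg.norm(z_sar, axis=1, keepdims=True)
    logits = z_rgb @ z_sar.T / tau          # (N, N); positives on diagonal
    idx = np.arange(len(logits))
    # Cross-entropy in both directions (RGB->SAR and SAR->RGB).
    lse_r = np.log(np.exp(logits).sum(axis=1))
    lse_s = np.log(np.exp(logits.T).sum(axis=1))
    loss_r2s = (lse_r - logits[idx, idx]).mean()
    loss_s2r = (lse_s - logits.T[idx, idx]).mean()
    return (loss_r2s + loss_s2r) / 2

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 32))
aligned_loss = info_nce(z, z)  # perfectly aligned pairs -> small loss
```

Patch-level and patch-to-global terms would apply the same idea to finer token pools; the masked image modeling branch is a standard reconstruction loss on the ~60% of patches that are hidden.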
MaRS is evaluated across nine representative VHR tasks and achieves strong results compared with RSFMs/VFMs and task-specific methods.
MaRS produces sharper activations along object boundaries and shows more consistent activation regions across RGB/SAR, indicating improved modality-invariant representation and fine-grained detail modeling.
@inproceedings{Yang2026MaRS,
title={MaRS: A Multi-modality Very-high-resolution Remote Sensing Foundation Model with Cross-Granularity Meta-Modality Learning},
author={Yang, Ruoyu and Liu, Yinhe and Yan, Heng and Zhou, Yiheng and Fu, Yihan and Luo, Han and Zhong, Yanfei},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={40},
number={14},
pages={11685--11693},
year={2026},
doi={10.1609/aaai.v40i14.38153}
}