We propose MaRS, a multi-modality very-high-resolution (VHR) remote sensing foundation model designed for cross-modality, cross-granularity interpretation of complex scenes. We construct MaRS-16M, a large-scale VHR SAR-optical paired dataset (16M+ pairs), through data collection and semi-automated processing. MaRS tackles two core challenges of VHR SAR-optical self-supervised learning (SSL): imaging discrepancy and the modal representation gap. To this end, we introduce Cross-Granularity Contrastive Learning (CGCL), which alleviates alignment inconsistencies by linking patch- and image-level semantics, and Meta-Modality Attention (MMA), which unifies heterogeneous physical characteristics across modalities. Compared with existing remote sensing foundation models (RSFMs) and general vision foundation models (VFMs), MaRS serves as a strong pretrained backbone across nine multi-modality VHR downstream tasks. Dataset and code are available at the project page.
Architecture. MaRS adopts dual encoders (E_RGB and E_SAR, both SwinV2 backbones) to extract modality-specific tokens, followed by an MMA Transformer that alternates intra-modality and cross-modality attention to form a unified representation. Lightweight decoders handle dense prediction. A minimal sketch of this design is given below.
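The layer-level details are not published in this summary, so the following is a hedged PyTorch sketch of the described design: tokens from the two SwinV2 encoders feed a block that alternates intra-modality self-attention with cross-modality attention. The class name `MetaModalityBlock`, the head/width settings, and the final concatenation step are illustrative assumptions, not the official implementation.

```python
# Minimal sketch (assumed, not the official code): an MMA-style block that
# alternates intra-modality and cross-modality attention over paired token streams.
import torch
import torch.nn as nn

class MetaModalityBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.self_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_sar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_sar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn_rgb = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                     nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_sar = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                     nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, rgb: torch.Tensor, sar: torch.Tensor):
        # Intra-modality self-attention keeps modality-specific structure.
        rgb = rgb + self.self_rgb(rgb, rgb, rgb)[0]
        sar = sar + self.self_sar(sar, sar, sar)[0]
        # Cross-modality attention: each stream queries the other modality's tokens.
        rgb = rgb + self.cross_rgb(rgb, sar, sar)[0]
        sar = sar + self.cross_sar(sar, rgb, rgb)[0]
        # Per-stream feed-forward refinement.
        return rgb + self.ffn_rgb(rgb), sar + self.ffn_sar(sar)

# Usage: placeholder tokens standing in for E_RGB / E_SAR outputs (B, N, C).
blocks = nn.ModuleList(MetaModalityBlock() for _ in range(4))
rgb = torch.randn(2, 256, 768)
sar = torch.randn(2, 256, 768)
for blk in blocks:
    rgb, sar = blk(rgb, sar)
unified = torch.cat([rgb, sar], dim=1)  # unified sequence passed to the light decoders
```

Alternating the two attention types lets each stream first consolidate its own physical characteristics before exchanging information, which is the stated goal of forming a meta-modality representation.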
Pretraining. We combine three self-supervised strategies: (1) CGCL applied at the patch, image, and patch-to-global levels to mitigate local distortions while preserving global semantics; (2) masked image modeling within each modality branch; (3) continued pretraining on VHR optical imagery to further strengthen representation quality. Training uses 512×512 inputs with a 60% masking ratio and runs on 8×A800 GPUs.
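The exact loss formulation is not given here, so the sketch below only illustrates how a cross-granularity contrastive objective of this kind could combine patch-level, image-level, and patch-to-global InfoNCE terms on paired RGB/SAR features. The `info_nce` and `patch_to_global` helpers, the temperature, the mean-pooled global feature, and the term weights are assumptions for illustration, not the paper's hyperparameters.

```python
# Hedged sketch of a CGCL-style objective: three contrastive terms at
# different granularities over paired RGB/SAR patch tokens.
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, k: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Standard InfoNCE: the i-th query should match the i-th key among all keys."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature                      # (N, N) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)    # positives on the diagonal
    return F.cross_entropy(logits, targets)

def patch_to_global(patches: torch.Tensor, globals_: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Each patch should match its own image's global feature among all images."""
    B, N, C = patches.shape
    q = F.normalize(patches.reshape(B * N, C), dim=-1)
    k = F.normalize(globals_, dim=-1)                     # (B, C)
    logits = q @ k.t() / temperature                      # (B*N, B)
    targets = torch.arange(B, device=q.device).repeat_interleave(N)
    return F.cross_entropy(logits, targets)

def cgcl_loss(rgb_patches: torch.Tensor, sar_patches: torch.Tensor,
              w=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """rgb_patches, sar_patches: (B, N, C) patch tokens from the two encoders."""
    B, N, C = rgb_patches.shape
    rgb_global = rgb_patches.mean(dim=1)                  # pooled image-level features (B, C)
    sar_global = sar_patches.mean(dim=1)

    # (1) Patch level: co-located patches across modalities are positives.
    l_patch = info_nce(rgb_patches.reshape(B * N, C), sar_patches.reshape(B * N, C))
    # (2) Image level: paired scenes are positives.
    l_image = info_nce(rgb_global, sar_global)
    # (3) Patch-to-global: local tokens are tied to the other modality's scene-level
    #     feature, linking local detail to global semantics despite SAR/optical distortions.
    l_p2g = 0.5 * (patch_to_global(rgb_patches, sar_global)
                   + patch_to_global(sar_patches, rgb_global))
    return w[0] * l_patch + w[1] * l_image + w[2] * l_p2g
```

In this reading, the patch-to-global term is what bridges the granularities: strictly patch-level alignment is brittle under SAR geometric distortions, while image-level alignment alone discards local detail.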
MaRS is evaluated across nine representative VHR tasks and achieves strong results compared with RSFMs, VFMs, and task-specific methods.
Qualitatively, MaRS produces sharper activations along object boundaries and more consistent activation regions across RGB and SAR, indicating improved modality-invariant representations and finer-grained detail modeling.
@inproceedings{yang2026mars,
  title={MaRS: A Multi-Modality Very-High-Resolution Remote Sensing Foundation Model with Cross-Granularity Meta-Modality Learning},
  author={Ruoyu Yang and Yinhe Liu and Heng Yan and Yiheng Zhou and Yihan Fu and Han Luo and Yanfei Zhong},
  booktitle={AAAI Conference on Artificial Intelligence},
  year={2026}
}