GraVLM: A Hierarchical Graph-Aligned Vision-Language Model for Cross-Modal Retrieval in Remote Sensing
Posted: 2026-03-12
- DOI:
- 10.1109/tgrs.2026.3671643
- Affiliation:
- School of Computer Science, Wuhan University
- Journal:
- IEEE Transactions on Geoscience and Remote Sensing
- Keywords:
- Remote sensing, cross-modal retrieval, vision-language models, hierarchical modeling, graph-based alignment
- Abstract:
- Cross-modal text–image retrieval in remote sensing (RS) aims to align RS images with natural-language descriptions for efficient access to large-scale RS image archives. Benefiting from large-scale image–text pretraining, recent vision–language models (VLMs) have demonstrated strong capability in modeling global scene-level semantics and have become a dominant paradigm for RS cross-modal text–image retrieval. However, unlike natural images, a single RS image often contains multiple objects organized in complex spatial layouts, whereas existing RS-oriented retrieval VLMs mainly emphasize scene-level alignment and lack explicit modeling of object-level entities and their spatial relations. To overcome these limitations, we present GraVLM, a hierarchical graph-aligned VLM for fine-grained cross-modal retrieval in RS. GraVLM adopts a hierarchical design that integrates scene-level cross-modal interaction and object-level structural alignment. At the scene level, we introduce a cross-modal interaction mechanism (CMIM) to capture global semantic correspondence between the visual and textual modalities. At the object level, a spatial object graph (SOG) and an entity relation graph (ERG) are constructed from the visual and textual modalities, respectively, enabling graph-aligned modeling of object-centric spatial relations. Furthermore, we introduce a local-guided feature fusion (LGFF) module that performs local–global feature fusion over the SOG and the ERG to obtain cross-modal fusion embeddings. Based on these fused representations, we propose a graph-aligned contrastive loss that jointly enforces object–entity correspondence and spatial-relation consistency between the two graphs. Extensive experiments demonstrate that GraVLM consistently and significantly outperforms existing RS cross-modal retrieval models.
- Paper type:
- Journal article
- Translation:
- No
- Publication date:
- 2026-03-06
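
The abstract above describes the objective only at a high level. As an illustration, the sketch below shows one plausible way a graph-aligned contrastive objective of this kind might be implemented in PyTorch: a symmetric scene-level InfoNCE term over fused image/text embeddings plus an object–entity alignment term between SOG and ERG node features. All function names, tensor shapes, the weighting factor `lam`, and the omission of the spatial-relation consistency term are illustrative assumptions and are not taken from the GraVLM paper.

```python
# Hypothetical sketch of a graph-aligned contrastive objective in the spirit
# of the abstract above. Names, shapes, and the exact formulation are
# assumptions for illustration; they are NOT the GraVLM implementation.
import torch
import torch.nn.functional as F


def scene_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over scene-level fused embeddings of shape (B, D)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def node_alignment_loss(sog_nodes, erg_nodes, temperature=0.07):
    """Object-entity correspondence term: align SOG object nodes with ERG
    entity nodes of the same image-text pair. Both inputs are (B, N, D),
    assuming nodes have already been matched/padded to a common count N."""
    sog = F.normalize(sog_nodes, dim=-1)
    erg = F.normalize(erg_nodes, dim=-1)
    # Per-pair node-to-node similarity, contrasted within each pair.
    logits = torch.einsum('bnd,bmd->bnm', sog, erg) / temperature  # (B, N, N)
    targets = torch.arange(logits.size(-1), device=logits.device)
    targets = targets.expand(logits.size(0), -1)                   # (B, N)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))


def graph_aligned_loss(img_emb, txt_emb, sog_nodes, erg_nodes, lam=0.5):
    """Combine the scene-level and object-level terms with weight lam."""
    return scene_contrastive_loss(img_emb, txt_emb) + \
           lam * node_alignment_loss(sog_nodes, erg_nodes)


if __name__ == "__main__":
    B, N, D = 8, 6, 256  # batch size, nodes per graph, embedding dim (assumed)
    loss = graph_aligned_loss(torch.randn(B, D), torch.randn(B, D),
                              torch.randn(B, N, D), torch.randn(B, N, D))
    print(float(loss))
```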




