GraVLM: A Hierarchical Graph-Aligned Vision-Language Model for Cross-Modal Retrieval in Remote Sensing
Posted: 2026-03-12
- DOI:
- 10.1109/tgrs.2026.3671643
- Affiliation:
- School of Computer Science, Wuhan University
- Journal:
- IEEE Transactions on Geoscience and Remote Sensing
- Keywords:
- Remote sensing, cross-modal retrieval, vision-language models, hierarchical modeling, graph-based alignment
- Abstract:
- Cross-modal text–image retrieval in remote sensing (RS) aims to align RS images with natural-language descriptions for efficient access to large-scale RS image archives. Benefiting from large-scale image–text pretraining, recent vision–language models (VLMs) have demonstrated strong capability in modeling global scene-level semantics and have become a dominant paradigm for RS cross-modal text–image retrieval. However, unlike natural images, a single RS image often contains multiple objects organized in complex spatial layouts, whereas existing RS-oriented retrieval VLMs mainly emphasize scene-level alignment and lack explicit modeling of object-level entities and their spatial relations. To overcome these limitations, we present GraVLM, a hierarchical graph-aligned VLM for fine-grained cross-modal retrieval in RS. GraVLM adopts a hierarchical design that integrates scene-level cross-modal interaction and object-level structural alignment. At the scene level, we introduce a cross-modal interaction mechanism (CMIM) to capture global semantic correspondence between the visual and textual modalities. At the object level, a spatial object graph (SOG) and an entity relation graph (ERG) are constructed from the visual and textual modalities, respectively, enabling graph-aligned modeling of object-centric spatial relations. Furthermore, we introduce a local-guided feature fusion (LGFF) module that performs local–global feature fusion over the SOG and the ERG to obtain cross-modal fusion embeddings. Based on these fused representations, we propose a graph-aligned contrastive loss that jointly enforces object–entity correspondence and spatial-relation consistency between the two graphs. Extensive experiments demonstrate that GraVLM consistently and significantly outperforms existing RS cross-modal retrieval models.
- Paper type:
- Journal article
- Translation:
- No
- Publication date:
- 2026-03-06
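
The abstract above describes the objective only at a high level. As an illustration, the sketch below shows one plausible way a graph-aligned contrastive objective of this kind might be implemented in PyTorch: a symmetric scene-level InfoNCE term over fused image/text embeddings plus an object–entity alignment term between SOG and ERG node features. All function names, tensor shapes, the weighting factor `lam`, and the omission of the spatial-relation consistency term are illustrative assumptions and are not taken from the GraVLM paper.

```python
# Hypothetical sketch of a graph-aligned contrastive objective in the spirit
# of the abstract above. Names, shapes, and the exact formulation are
# assumptions for illustration; they are NOT the GraVLM implementation.
import torch
import torch.nn.functional as F


def scene_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over scene-level fused embeddings of shape (B, D)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def node_alignment_loss(sog_nodes, erg_nodes, temperature=0.07):
    """Object-entity correspondence term: align SOG object nodes with ERG
    entity nodes of the same image-text pair. Both inputs are (B, N, D),
    assuming nodes have already been matched/padded to a common count N."""
    sog = F.normalize(sog_nodes, dim=-1)
    erg = F.normalize(erg_nodes, dim=-1)
    # Per-pair node-to-node similarity, contrasted within each pair.
    logits = torch.einsum('bnd,bmd->bnm', sog, erg) / temperature  # (B, N, N)
    targets = torch.arange(logits.size(-1), device=logits.device)
    targets = targets.expand(logits.size(0), -1)                   # (B, N)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))


def graph_aligned_loss(img_emb, txt_emb, sog_nodes, erg_nodes, lam=0.5):
    """Combine the scene-level and object-level terms with weight lam."""
    return scene_contrastive_loss(img_emb, txt_emb) + \
           lam * node_alignment_loss(sog_nodes, erg_nodes)


if __name__ == "__main__":
    B, N, D = 8, 6, 256  # batch size, nodes per graph, embedding dim (assumed)
    loss = graph_aligned_loss(torch.randn(B, D), torch.randn(B, D),
                              torch.randn(B, N, D), torch.randn(B, N, D))
    print(float(loss))
```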




