Textual Enhanced Adaptive Meta-Fusion for Few-shot Visual Recognition
Posted: 2023-09-27
- DOI: 10.1109/TMM.2023.3295731
- Publisher: Institute of Electrical and Electronics Engineers Inc.
- Journal: IEEE Transactions on Multimedia
- Abstract:
- Few-shot learning (FSL) is a challenging task that aims to train a classifier to recognize novel categories, where only a few annotated examples are available in each category. Recently, many FSL approaches have been proposed based on the meta-learning paradigm, which attempts to learn transferable knowledge from similar tasks by designing a meta-learner. However, most of these approaches exploit only information from the visual modality and do not utilize information from additional modalities (e.g., textual descriptions). Since the labeled examples in FSL are limited, enriching the information available for each example is a promising way to improve classification performance. This motivates us to propose a novel meta-learning method, termed textual enhanced adaptive meta-fusion FSL (TAMF-FSL), which leverages both the visual information from images and the semantic information from language supervision. Specifically, TAMF-FSL exploits the semantic information of textual descriptions to improve visual-based models. We first employ a text encoder to learn the semantic features of each visual category, and then design a modality alignment module and a meta-fusion module to align and fuse the visual and semantic features for the final prediction. Extensive experiments show that the proposed method outperforms many recent competitive FSL counterparts on two popular datasets.
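The align-then-fuse idea in the abstract can be illustrated with a minimal, generic sketch. This is not the authors' TAMF-FSL implementation: the dimensions, the linear alignment projection, and the sigmoid-gated fusion below are all hypothetical stand-ins for the paper's modality alignment and meta-fusion modules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes (not from the paper).
d_vis, d_txt = 64, 32

def align(text_feat, W):
    """Project text features into the visual feature space (stand-in for the
    modality alignment module)."""
    return text_feat @ W

def adaptive_fuse(vis_feat, txt_feat, gate):
    """Blend the two modalities with a sigmoid-squashed scalar gate
    (stand-in for the meta-fusion module)."""
    g = 1.0 / (1.0 + np.exp(-gate))  # sigmoid maps gate into (0, 1)
    return g * vis_feat + (1.0 - g) * txt_feat

# Toy per-category features: a visual prototype and a text embedding.
vis = rng.standard_normal(d_vis)
txt = rng.standard_normal(d_txt)
W = rng.standard_normal((d_txt, d_vis)) / np.sqrt(d_txt)

fused = adaptive_fuse(vis, align(txt, W), gate=0.0)
print(fused.shape)  # fused feature lives in the visual space: (64,)
```

In the paper both the alignment and the gate would be learned end-to-end under the meta-learning objective; here they are fixed random values purely to show the data flow.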
- Co-authors: Zhan Yibing, Luo Yong, Hu Han, Du Bo
- First author: Han Mengya
- Paper type: Journal article
- Corresponding author: Su Kehua
- Pages: 1-11
- ISSN: 1520-9210
- Translation: No
- Publication date: 2023-03-01
- Indexed by: EI