CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels
Conference: AAAI 2023
Abstract
The paper proposes a two-stage strategy to solve the problem of image re-identification without concrete text labels. In the first stage, the image and text encoders from CLIP are kept fixed, and only a set of ID-specific text tokens is optimized from scratch with contrastive losses computed within a batch. In the second stage, the ID-specific text tokens and the text encoder become static and provide constraints for fine-tuning the image encoder. With the losses designed for the downstream task, the image encoder learns to represent images accurately as vectors in the feature embedding space. GitHub
Introduction
Traditional CNN-based and ViT-based methods rely heavily on pre-training on ImageNet, which means semantics outside that label set are ignored. CLIP is trained on much larger data and changes the pre-training task to matching visual features to language descriptions, so its image encoder can sense high-level semantics from the text and learns transferable features.
In the first stage, a series of learnable text tokens is introduced to describe each ID ambiguously and is optimized while both CLIP encoders stay frozen; a sketch of this stage follows.
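A minimal sketch of stage 1, under assumptions: `IDPromptLearner` and `stage1_contrastive_loss` are illustrative names (not from the paper's code), the token shapes are arbitrary, and text features are taken per sample rather than per unique ID in the batch, which simplifies the paper's formulation. The frozen CLIP encoders are assumed to be wrapped elsewhere and only supply `img_feats` and `txt_feats` here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDPromptLearner(nn.Module):
    """Holds a small set of learnable context tokens per identity.
    These tokens are the only parameters updated in stage 1."""
    def __init__(self, num_ids, num_ctx=4, ctx_dim=512):
        super().__init__()
        # One independent set of context tokens for each ID, learned from scratch.
        self.ctx = nn.Parameter(torch.randn(num_ids, num_ctx, ctx_dim) * 0.02)

    def forward(self, id_labels):
        # Return the context tokens for the IDs appearing in the current batch;
        # they would be inserted into a template like "A photo of a [X]...[X] person."
        return self.ctx[id_labels]

def stage1_contrastive_loss(img_feats, txt_feats, labels, temperature=0.07):
    """Image-to-text and text-to-image contrastive losses within a batch.
    All pairs sharing the same ID label are treated as positives."""
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    logits = img_feats @ txt_feats.t() / temperature            # (B, B) similarity matrix
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()

    def masked_ce(logit_matrix):
        log_prob = F.log_softmax(logit_matrix, dim=1)
        # Average log-probability over all positives for each anchor.
        return -(pos_mask * log_prob).sum(1).div(pos_mask.sum(1)).mean()

    # L_i2t (rows = images) + L_t2i (rows = texts)
    return masked_ce(logits) + masked_ce(logits.t())
```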
In the second stage, the learned tokens and the text encoder are kept static and provide the ambiguous description of each ID, which is used to build a cross-modality image-to-text cross-entropy loss for fine-tuning the image encoder (sketched below).
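A minimal sketch of the image-to-text cross-entropy term in stage 2, assuming `all_text_feats` are the per-ID text features produced once by the frozen prompts and text encoder after stage 1, and that gradients flow only through `img_feats` from the image encoder. The full stage-2 objective in the paper also includes the usual ReID losses (ID cross-entropy and triplet), which are omitted here.

```python
import torch.nn.functional as F

def image_to_text_ce_loss(img_feats, all_text_feats, labels, temperature=0.07):
    """Cross-modality image-to-text cross-entropy: the fixed per-ID text
    features act as a classifier over identities for each image feature."""
    img_feats = F.normalize(img_feats, dim=-1)
    all_text_feats = F.normalize(all_text_feats, dim=-1)
    logits = img_feats @ all_text_feats.t() / temperature       # (B, num_ids)
    return F.cross_entropy(logits, labels)
```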