arXiv:2407.01264

SignCLIP: Connecting Text and Sign Language by Contrastive Learning

Published on Jul 1, 2024
Abstract

We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing for a specific task or sign language, whose data is often of limited size. We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of ~500 thousand video clips in up to 44 sign languages, and evaluate it with various downstream datasets. SignCLIP discerns in-domain signing with notable text-to-video/video-to-text retrieval accuracy. It also performs competitively on out-of-domain downstream tasks such as isolated sign language recognition, given essential few-shot prompting or fine-tuning. We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights. Our code and models are openly available.
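As a rough illustration of the contrastive objective described in the abstract, the sketch below pairs a batch of spoken-language text embeddings with sign language video (or pose) embeddings using a symmetric CLIP-style InfoNCE loss. The encoder choices, embedding dimensions, temperature value, and function name here are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a CLIP-style contrastive objective between spoken-language
# text embeddings and sign language video/pose embeddings. Illustrative only:
# the encoders, dimensions, and temperature are assumptions, not SignCLIP's
# actual training code.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(text_emb: torch.Tensor,
                          sign_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (text, sign video) embeddings.

    text_emb, sign_emb: (batch, dim) outputs of the respective encoders.
    """
    # L2-normalize so the dot product is cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    sign_emb = F.normalize(sign_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = text_emb @ sign_emb.t() / temperature

    # Matching (text, video) pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average of text-to-video and video-to-text cross-entropy.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2v + loss_v2t) / 2
```

The same loss structure also explains the retrieval evaluation: at test time, the similarity matrix is ranked per row (text-to-video) or per column (video-to-text) to retrieve the best-matching clip or caption.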
