arXiv:2407.01264

SignCLIP: Connecting Text and Sign Language by Contrastive Learning

Published on Jul 1, 2024
Abstract

We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing for a specific task or sign language, whose data is often of limited size. We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of ~500 thousand video clips in up to 44 sign languages, and evaluate it with various downstream datasets. SignCLIP discerns in-domain signing with notable text-to-video/video-to-text retrieval accuracy. It also performs competitively on out-of-domain downstream tasks such as isolated sign language recognition, given essential few-shot prompting or fine-tuning. We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights. Our code and models are openly available.
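As a rough illustration of the contrastive objective described in the abstract, the sketch below pairs a batch of spoken-language text embeddings with sign language video (or pose) embeddings using a symmetric CLIP-style InfoNCE loss. The encoder choices, embedding dimensions, temperature value, and function name here are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a CLIP-style contrastive objective between spoken-language
# text embeddings and sign language video/pose embeddings. Illustrative only:
# the encoders, dimensions, and temperature are assumptions, not SignCLIP's
# actual training code.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(text_emb: torch.Tensor,
                          sign_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (text, sign video) embeddings.

    text_emb, sign_emb: (batch, dim) outputs of the respective encoders.
    """
    # L2-normalize so the dot product is cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    sign_emb = F.normalize(sign_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = text_emb @ sign_emb.t() / temperature

    # Matching (text, video) pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average of text-to-video and video-to-text cross-entropy.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2v + loss_v2t) / 2
```

The same loss structure also explains the retrieval evaluation: at test time, the similarity matrix is ranked per row (text-to-video) or per column (video-to-text) to retrieve the best-matching clip or caption.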
