Do zero-shot classification models have a maximum token length?

#34
by stathacker - opened

I have a database that consists of very long strings. Do different zero-shot NLP models have different maximum token lengths, and if so, how can I find that out for each one?
If there is a limit, can I break up my text into smaller sentences and average the scores of all the sentences to get a single score for the larger text?

I think the limit corresponds to the "max_position_embeddings": 1024 config parameter. I suggest tokenizing your text string before you try to embed it. Trying to embed a text longer than 1024 tokens will likely cause an error.
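For example, with the transformers library you can read the limit from the model config or the tokenizer and count the tokens in your string first. A minimal sketch, assuming facebook/bart-large-mnli as the checkpoint (swap in whatever zero-shot model you actually use):

```python
from transformers import AutoConfig, AutoTokenizer

model_name = "facebook/bart-large-mnli"  # assumed checkpoint; replace with yours

# The model config exposes the positional embedding limit
config = AutoConfig.from_pretrained(model_name)
print(config.max_position_embeddings)  # e.g. 1024 for BART-based models

# The tokenizer also knows the maximum sequence length it was set up for
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer.model_max_length)

# Count how many tokens your own string produces before sending it to the model
text = "your very long document ..."
n_tokens = len(tokenizer(text)["input_ids"])
print(n_tokens)
```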

If you want to squeeze as much information as possible into 1024-token segments, you could remove stopwords from your strings before embedding them.
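As for averaging over chunks, that is a reasonable workaround. Here is a rough sketch using the zero-shot pipeline; the labels, chunk size, and naive word-based splitting are just placeholder assumptions, and sentence-level splitting (e.g. with nltk) would work the same way:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
labels = ["sports", "politics", "technology"]  # hypothetical candidate labels

def classify_long_text(text, chunk_size=200):
    # Split the long string into chunks that fit under the model's token limit
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

    # Classify each chunk and accumulate the per-label scores
    totals = {label: 0.0 for label in labels}
    for chunk in chunks:
        result = classifier(chunk, candidate_labels=labels)
        for label, score in zip(result["labels"], result["scores"]):
            totals[label] += score

    # Average the scores across chunks to get one score per label for the whole text
    return {label: total / len(chunks) for label, total in totals.items()}
```

Note that averaging treats every chunk as equally important, which may or may not be what you want for your documents.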
