Do zero-shot classification models have a maximum token length?

#34
by stathacker - opened

I have a database that consists of very long strings. Do different zero-shot NLP models have different maximum token lengths, and if so, how can I find that out for each one?
If there is a limit, can I break up my text into smaller sentences and average the scores of all the sentences to get a single score for the larger text?

I think the limit corresponds to the "max_position_embeddings": 1024 config parameter. I suggest tokenizing your text string before you try to embed it. Trying to embed a text longer than 1024 tokens will likely cause an error.
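For example, with the transformers library you can read the limit from the model config or the tokenizer and count the tokens in your string first. A minimal sketch, assuming facebook/bart-large-mnli as the checkpoint (swap in whatever zero-shot model you actually use):

```python
from transformers import AutoConfig, AutoTokenizer

model_name = "facebook/bart-large-mnli"  # assumed checkpoint; replace with yours

# The model config exposes the positional embedding limit
config = AutoConfig.from_pretrained(model_name)
print(config.max_position_embeddings)  # e.g. 1024 for BART-based models

# The tokenizer also knows the maximum sequence length it was set up for
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer.model_max_length)

# Count how many tokens your own string produces before sending it to the model
text = "your very long document ..."
n_tokens = len(tokenizer(text)["input_ids"])
print(n_tokens)
```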

If you want to squeeze as much information as possible into 1024-token segments, you could remove stopwords from your strings before embedding them.
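As for averaging over chunks, that is a reasonable workaround. Here is a rough sketch using the zero-shot pipeline; the labels, chunk size, and naive word-based splitting are just placeholder assumptions, and sentence-level splitting (e.g. with nltk) would work the same way:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
labels = ["sports", "politics", "technology"]  # hypothetical candidate labels

def classify_long_text(text, chunk_size=200):
    # Split the long string into chunks that fit under the model's token limit
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

    # Classify each chunk and accumulate the per-label scores
    totals = {label: 0.0 for label in labels}
    for chunk in chunks:
        result = classifier(chunk, candidate_labels=labels)
        for label, score in zip(result["labels"], result["scores"]):
            totals[label] += score

    # Average the scores across chunks to get one score per label for the whole text
    return {label: total / len(chunks) for label, total in totals.items()}
```

Note that averaging treats every chunk as equally important, which may or may not be what you want for your documents.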
