NaN values when input is longer than context window?

#11
by AHuguet - opened

I am trying the example code you provided, changing the texts to one of about 1000 characters, and it is giving me NaN rows.

In fact, this code:

# Requires transformers>=4.48.0
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

input_texts = [
    "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim. Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. Phasellus viverra nulla ut metus varius laoreet. Quisque rutrum. Aenean imperdiet. Etiam ultricies nisi vel augue. Curabitur ullamcorper ultricies nisi. Nam eget dui. Etiam rhoncus. Maecenas tempus, tellus eget condimentum rhoncus, sem quam semper libero, sit amet adipiscing sem neque sed ipsum. N",
    "Lorem ipsum dolor sit amet, consectetuer adipiscing",
]

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")
embeddings = model.encode(input_texts)
print(embeddings.shape)

similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)

It outputs:

(2, 768)
tensor([[nan]])

Also, if I print(embeddings):

[[ 0.7699229  -0.7961636  -0.5763464  ... -0.39776576  1.          0.44405958]
 [        nan         nan         nan ...         nan         nan         nan]]
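A quick way to check which rows contain NaN values (a small sketch, assuming embeddings is the NumPy array returned by model.encode):

import numpy as np

# Flag any row of the embedding matrix that contains at least one NaN
nan_rows = np.isnan(embeddings).any(axis=1)
print(nan_rows)  # e.g. [False  True] for the two inputs above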

If I reduce the first input to fewer characters, then it works fine:

# Requires transformers>=4.48.0
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

input_texts = [
    "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.",
    "Lorem ipsum dolor sit amet, consectetuer adipiscing",
    "Lorem ipsum dolor sit amet, consectetuer adipiscing",
    "Lorem ipsum dolor sit amet, consectetuer adipiscing",
]

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")
embeddings = model.encode(input_texts)
print(embeddings.shape)

similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)

print(embeddings)

Gives:

(4, 768)
tensor([[0.9162, 0.9162, 0.9162]])
[[ 1.6280938   0.11482729 -0.48527887 ...  0.37245387  1.4648473
   0.5769439 ]
 [ 1.2851498   0.14559714 -0.6529583  ... -0.6485674   1.296109
  -0.26772726]
 [ 1.2851498   0.14559714 -0.6529583  ... -0.6485674   1.296109
  -0.26772726]
 [ 1.2851498   0.14559714 -0.6529583  ... -0.6485674   1.296109
  -0.26772726]]

Is it because the input exceeds the model's context window?
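As a rough check, the token count of each input can be compared against the model's configured limit, for example:

# Compare each input's token count to the model's sequence limit
for text in input_texts:
    n_tokens = len(model.tokenizer(text)["input_ids"])
    print(f"{n_tokens} tokens (max_seq_length: {model.max_seq_length})")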

AHuguet changed discussion title from NaN values when calling encode? to NaN values when input is longer than context window?

Hello!
I think there was a bug in torch that sometimes resulted in NaN values with the ModernBERT architecture. Could you try upgrading your torch version?

pip install -U torch
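After upgrading, you can confirm which version is active in your environment, for example:

import torch

# Print the installed torch version to verify the upgrade took effect
print(torch.__version__)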
  • Tom Aarsen
