How to fine-tune this model?

#47
by ququwowo - opened

Hi all! Can anyone share or advise on how to fine-tune this model, e.g. using sentence-transformers or other tools? A tutorial or code examples would be great :)

Thanks!

Jina AI org

Hi @ququwowo , unfortunately we don't have any tutorials or code examples right now. SentenceTransformerTrainer might work with some changes, but we haven't tested it, so I can't give any tips yet. We will most likely publish a simple fine-tuning tutorial in the next few weeks and I'll let you know when it's ready.

That's good news!

Hi @jupyterjazz

Hope all is well! May I follow up on this item?

Thanks!

I can report that the sentence-transformers code that successfully fine-tunes v3 will NOT work when simply pointed at v4; it fails with this error:

File "/opt/conda/lib/python3.11/site-packages/sentence_transformers/trainer.py", line 186, in __init__
    if tokenizer is None and isinstance(model.tokenizer, PreTrainedTokenizerBase):
                                        ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1940, in __getattr__
    raise AttributeError(
AttributeError: 'SentenceTransformer' object has no attribute 'tokenizer'

Hello, I would also be interested in more information on how to fine-tune this multimodal embedding model for domain-specific use cases.

I was able to get fine-tuning to work with the SentenceTransformerTrainer by passing tokenizer=model.tokenize when instantiating the class.

I'll note that this model is VERY memory-intensive. On a 48 GB GPU I was only able to train with the retrieval task type at a batch size of 1 triplet; any larger batch size resulted in an OOM error.
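
Putting those two observations together, here is a minimal sketch of what that workaround looks like. I haven't tested this end-to-end against v4; everything except tokenizer=model.tokenize and the batch size of 1 is a placeholder (the tiny triplet dataset, the MultipleNegativesRankingLoss choice, the output_dir), and I'm assuming the model this thread is about is jinaai/jina-embeddings-v4:

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Assumption: this thread's model is jinaai/jina-embeddings-v4, which ships
# custom modeling code, hence trust_remote_code=True.
model = SentenceTransformer("jinaai/jina-embeddings-v4", trust_remote_code=True)

# Placeholder triplet data (anchor, positive, negative); your real dataset goes here.
train_dataset = Dataset.from_dict({
    "anchor": ["What is the capital of France?"],
    "positive": ["Paris is the capital of France."],
    "negative": ["Berlin is the capital of Germany."],
})

args = SentenceTransformerTrainingArguments(
    output_dir="jina-v4-finetune",   # placeholder path
    per_device_train_batch_size=1,   # anything larger OOM'd on a 48 GB GPU
    num_train_epochs=1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),  # placeholder loss choice
    tokenizer=model.tokenize,  # the workaround: bypasses the model.tokenizer lookup
)
trainer.train()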

Unless you need multimodal embeddings, I'd personally suggest sticking with jina-embeddings-v3, which is better supported by sentence-transformers and has a FAR smaller memory footprint. Here is some boilerplate code you can use to get started with that: https://huggingface.co/jinaai/jina-embeddings-v3/discussions/128#683a0102dca13af58b586ebd (pay attention to my later comments on that post: you'll need to set the default task every time you load the model, or each time you call encode).
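
For what it's worth, the per-call task route looks roughly like this. The task kwarg, prompt_name, and the task name ("retrieval.query") are taken from the v3 model card as I remember it, so verify them there rather than trusting this sketch:

from sentence_transformers import SentenceTransformer

# v3 also ships custom modeling code, hence trust_remote_code=True
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

# Supply the task on each encode call (or set a default task at load time,
# per the linked discussion). Task names follow the v3 model card.
task = "retrieval.query"
embeddings = model.encode(
    ["How do I fine-tune this model?"],
    task=task,
    prompt_name=task,
)
print(embeddings.shape)  # (1, 1024) at the default embedding size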
