Compared with the "e5-base" model, what is the main update in this "e5-base-v2" version?

#1 opened by Zihao


Nothing new; the v2 models are simply pre-trained on a larger and more diverse set of text pair datasets.

Hi @intfloat, does this repo have the unsupervised weights (Table 1 in the paper), or the weights from after fine-tuning on MS MARCO/BEIR (Table 2)? (paper)

@bergum We do not have plans to release its unsupervised weights. Embedding models without supervised fine-tuning do not perform very well and are not suitable for out-of-the-box use. If you'd like to fine-tune from the unsupervised weights, you can build on https://huggingface.co/intfloat/e5-base-unsupervised (small and large versions are also available).
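For reference, encoding with that checkpoint should look something like the snippet below; it follows the average-pooling and "query: "/"passage: " prefix convention from the supervised E5 model cards (treat it as a sketch, not the exact training setup):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states, attention_mask):
    # Zero out padding positions, then average over the sequence length.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-unsupervised")
model = AutoModel.from_pretrained("intfloat/e5-base-unsupervised")

# E5 models expect a "query: " or "passage: " prefix on every input text.
texts = [
    "query: how much protein should a female eat",
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day.",
]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings[0] @ embeddings[1])  # cosine similarity between query and passage
```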

Thanks for confirming, @intfloat .

I'm asking because I can't reproduce the BEIR results reported in the paper, or anything close to them. This could be explained if, by mistake, the wrong weights were uploaded.

With e5-base-v2 on TREC-COVID, I get 0.69633 ndcg_at_10, which is well below the 0.79 reported in the paper (a very good result for a dense model on TREC-COVID).

Edit: note that this was run on CPU; I haven't tested on GPU yet, and I only tested TREC-COVID.

```
python3 mteb_beir_eval.py --model-name-or-path intfloat/e5-base-v2
...
[2023-05-30 01:16:30,748 INFO] Evaluation for TRECCOVID on test took 89452.90 seconds
[2023-05-30 01:16:30,748 INFO] Scores: {'ndcg_at_1': 0.75, 'ndcg_at_3': 0.74397, 'ndcg_at_5': 0.73222, 'ndcg_at_10': 0.69633, 'ndcg_at_100': 0.52017, 'ndcg_at_1000': 0.48872, 'map_at_1': 0.00215, 'map_at_3': 0.00602, 'map_at_5': 0.00968, 'map_at_10': 0.01753, 'map_at_100': 0.09263, 'map_at_1000': 0.23437, 'recall_at_1': 0.00215, 'recall_at_3': 0.0065, 'recall_at_5': 0.01057, 'recall_at_10': 0.01961, 'recall_at_100': 0.12825, 'recall_at_1000': 0.46435, 'precision_at_1': 0.84, 'precision_at_3': 0.8, 'precision_at_5': 0.784, 'precision_at_10': 0.74, 'precision_at_100': 0.5326, 'precision_at_1000': 0.21844, 'mrr_at_1': 0.84, 'mrr_at_3': 0.91333, 'mrr_at_5': 0.91333, 'mrr_at_10': 0.91333, 'mrr_at_100': 0.91333, 'mrr_at_1000': 0.91333, 'evaluation_time': 89452.9}
```
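For anyone else trying to reproduce this: the same check can also be run through the MTEB library directly. The sketch below is my own approximation of what mteb_beir_eval.py does; in particular, the E5Wrapper class and its prefix handling are my assumptions, not the script's actual code.

```python
# Rough sketch: running TREC-COVID through the MTEB library, as an
# alternative to the repo's mteb_beir_eval.py script.
from mteb import MTEB
from sentence_transformers import SentenceTransformer


class E5Wrapper:
    """Adds the 'query: ' / 'passage: ' prefixes that E5 models expect."""

    def __init__(self, model_name: str):
        self.model = SentenceTransformer(model_name)

    def encode_queries(self, queries, batch_size=32, **kwargs):
        return self.model.encode(["query: " + q for q in queries],
                                 batch_size=batch_size, **kwargs)

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        # MTEB retrieval tasks pass corpus entries as dicts with title/text.
        texts = ["passage: " + (doc.get("title", "") + " " + doc["text"]).strip()
                 for doc in corpus]
        return self.model.encode(texts, batch_size=batch_size, **kwargs)


evaluation = MTEB(tasks=["TRECCOVID"])
evaluation.run(E5Wrapper("intfloat/e5-base-v2"), output_folder="results/e5-base-v2")
```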

@bergum The results in the paper correspond to https://huggingface.co/intfloat/e5-base, not the v2 models.

Your results are consistent with ours, which you can check in the "Evaluation results" section of https://huggingface.co/intfloat/e5-base-v2. Note that software versions and hardware can cause very minor differences.

By the way, the TREC-COVID dataset is very small, so performance fluctuates considerably when fine-tuning with different random seeds. We mainly focus on the average results across all BEIR datasets.

Perfect, @intfloat. Thank you for taking the time to explain this. I wrongly assumed v1 and v2 would be similar. I see now that the self-reported ndcg_at_10 is 69.596, which is close to my number and easily explained. Thank you for publishing this work and for making it easy to reproduce!
