Compared with the "e5-base" model, what is the main update in this "e5-base-v2" version?

#1 opened by Zihao


Nothing new; the v2 models are simply pre-trained on a larger and more diverse set of text pair datasets.

Hi @intfloat, does this repo have the unsupervised weights (Table 1 in the paper), or the weights from after fine-tuning on MS MARCO/BEIR (Table 2)? (paper)

@bergum We do not have plans to release its unsupervised weights. Embedding models without supervised fine-tuning do not perform very well and are not suitable for out-of-the-box use. If you'd like to fine-tune from the unsupervised weights, you can build on https://huggingface.co/intfloat/e5-base-unsupervised (small and large versions are also available).
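For reference, encoding with that checkpoint should look something like the snippet below; it follows the average-pooling and "query: "/"passage: " prefix convention from the supervised E5 model cards (treat it as a sketch, not the exact training setup):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states, attention_mask):
    # Zero out padding positions, then average over the sequence length.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-unsupervised")
model = AutoModel.from_pretrained("intfloat/e5-base-unsupervised")

# E5 models expect a "query: " or "passage: " prefix on every input text.
texts = [
    "query: how much protein should a female eat",
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day.",
]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings[0] @ embeddings[1])  # cosine similarity between query and passage
```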

Thanks for confirming, @intfloat .

I'm asking because I can't reproduce the BEIR results reported in the paper, or anything close to them. This could be explained if, by mistake, the wrong weights were uploaded.

With e5-base-v2 on TREC-COVID, I get 0.69633 ndcg_at_10, which is well below the 0.79 reported in the paper (a very good result for a dense model on TREC-COVID).

Edit: note that this was run on CPU; I haven't tested on GPU yet, and I only tested TREC-COVID.

```
python3 mteb_beir_eval.py --model-name-or-path intfloat/e5-base-v2
...
[2023-05-30 01:16:30,748 INFO] Evaluation for TRECCOVID on test took 89452.90 seconds
[2023-05-30 01:16:30,748 INFO] Scores: {'ndcg_at_1': 0.75, 'ndcg_at_3': 0.74397, 'ndcg_at_5': 0.73222, 'ndcg_at_10': 0.69633, 'ndcg_at_100': 0.52017, 'ndcg_at_1000': 0.48872, 'map_at_1': 0.00215, 'map_at_3': 0.00602, 'map_at_5': 0.00968, 'map_at_10': 0.01753, 'map_at_100': 0.09263, 'map_at_1000': 0.23437, 'recall_at_1': 0.00215, 'recall_at_3': 0.0065, 'recall_at_5': 0.01057, 'recall_at_10': 0.01961, 'recall_at_100': 0.12825, 'recall_at_1000': 0.46435, 'precision_at_1': 0.84, 'precision_at_3': 0.8, 'precision_at_5': 0.784, 'precision_at_10': 0.74, 'precision_at_100': 0.5326, 'precision_at_1000': 0.21844, 'mrr_at_1': 0.84, 'mrr_at_3': 0.91333, 'mrr_at_5': 0.91333, 'mrr_at_10': 0.91333, 'mrr_at_100': 0.91333, 'mrr_at_1000': 0.91333, 'evaluation_time': 89452.9}
```
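For anyone else trying to reproduce this: the same check can also be run through the MTEB library directly. The sketch below is my own approximation of what mteb_beir_eval.py does; in particular, the E5Wrapper class and its prefix handling are my assumptions, not the script's actual code.

```python
# Rough sketch: running TREC-COVID through the MTEB library, as an
# alternative to the repo's mteb_beir_eval.py script.
from mteb import MTEB
from sentence_transformers import SentenceTransformer


class E5Wrapper:
    """Adds the 'query: ' / 'passage: ' prefixes that E5 models expect."""

    def __init__(self, model_name: str):
        self.model = SentenceTransformer(model_name)

    def encode_queries(self, queries, batch_size=32, **kwargs):
        return self.model.encode(["query: " + q for q in queries],
                                 batch_size=batch_size, **kwargs)

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        # MTEB retrieval tasks pass corpus entries as dicts with title/text.
        texts = ["passage: " + (doc.get("title", "") + " " + doc["text"]).strip()
                 for doc in corpus]
        return self.model.encode(texts, batch_size=batch_size, **kwargs)


evaluation = MTEB(tasks=["TRECCOVID"])
evaluation.run(E5Wrapper("intfloat/e5-base-v2"), output_folder="results/e5-base-v2")
```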

@bergum The results in the paper correspond to https://huggingface.co/intfloat/e5-base, not the v2 models.

Your results are consistent with ours, which you can check in the "Evaluation results" section of https://huggingface.co/intfloat/e5-base-v2. Note that software versions and hardware can cause very minor differences.

By the way, the TREC-COVID dataset is very small, so performance fluctuates considerably when fine-tuning with different random seeds. We mainly focus on the average results across all BEIR datasets.

Perfect, @intfloat. Thank you for taking the time to explain this. I wrongly assumed v1 and v2 would be similar. I see now that the self-reported ndcg_at_10 is 69.596, which is close to my number and easily explained. Thank you for publishing this work and for making it easy to reproduce!
