| | --- |
| | pipeline_tag: translation |
| | library_name: comet |
| | language: |
| | - multilingual |
| | - af |
| | - am |
| | - ar |
| | - as |
| | - az |
| | - be |
| | - bg |
| | - bn |
| | - br |
| | - bs |
| | - ca |
| | - cs |
| | - cy |
| | - da |
| | - de |
| | - el |
| | - en |
| | - eo |
| | - es |
| | - et |
| | - eu |
| | - fa |
| | - fi |
| | - fr |
| | - fy |
| | - ga |
| | - gd |
| | - gl |
| | - gu |
| | - ha |
| | - he |
| | - hi |
| | - hr |
| | - hu |
| | - hy |
| | - id |
| | - is |
| | - it |
| | - ja |
| | - jv |
| | - ka |
| | - kk |
| | - km |
| | - kn |
| | - ko |
| | - ku |
| | - ky |
| | - la |
| | - lo |
| | - lt |
| | - lv |
| | - mg |
| | - mk |
| | - ml |
| | - mn |
| | - mr |
| | - ms |
| | - my |
| | - ne |
| | - nl |
| | - 'no' |
| | - om |
| | - or |
| | - pa |
| | - pl |
| | - ps |
| | - pt |
| | - ro |
| | - ru |
| | - sa |
| | - sd |
| | - si |
| | - sk |
| | - sl |
| | - so |
| | - sq |
| | - sr |
| | - su |
| | - sv |
| | - sw |
| | - ta |
| | - te |
| | - th |
| | - tl |
| | - tr |
| | - ug |
| | - uk |
| | - ur |
| | - uz |
| | - vi |
| | - xh |
| | - yi |
| | - zh |
| | license: apache-2.0 |
| | base_model: |
| | - FacebookAI/xlm-roberta-large |
| | --- |
| | |
| | # PreCOMET-diff |
| |
|
| | This is a source-only COMET model used for efficient evaluation subset selection. |
| | Specifically, this model predicts `difficulty`, distilled from an IRT model trained on WMT data up to WMT2022 (inclusive). |
| | The higher the score, the more useful the segment is for evaluation, because models are more likely to fail to translate it. |
| | It is not compatible with Unbabel's original COMET. To run it, you have to install [github.com/zouharvi/PreCOMET](https://github.com/zouharvi/PreCOMET): |
| | ```bash |
| | pip3 install git+https://github.com/zouharvi/PreCOMET.git |
| | ``` |
| |
|
| | You can then use it in Python: |
| | ```python |
| | import precomet |
| | model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-diff")) |
| | model.predict([ |
| | {"src": "This is an easy source sentence."}, |
| | {"src": "this is a much more complicated source sen-tence that will pro·bably lead to loww scores 🤪"} |
| | ])["scores"] |
| | # > [-0.3407433331012726, 0.6234546899795532] |
| | ``` |
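Since higher scores mark segments that are more useful for evaluation, one simple way to use the predictions under a fixed budget is to keep the highest-scoring segments. The sketch below is only an illustration: `select_hardest` is a hypothetical helper (not part of `precomet` or `subset2evaluate`), and the scores are made-up values standing in for `model.predict(...)["scores"]`.

```python
def select_hardest(segments, scores, budget):
    """Return the `budget` segments with the highest predicted difficulty."""
    if len(segments) != len(scores):
        raise ValueError("segments and scores must have the same length")
    # Sort indices by score, most difficult first; ties keep their input order.
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    return [segments[i] for i in order[:budget]]


# Illustrative inputs; in practice the scores would come from the model.
segments = [
    {"src": "This is an easy source sentence."},
    {"src": "A longer and trickier source sentence."},
    {"src": "Something in between."},
]
scores = [-0.34, 0.62, 0.10]

for segment in select_hardest(segments, scores, budget=2):
    print(segment["src"])
```

With these toy values, the second and third segments are kept and the easiest one is dropped.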
| |
|
| | The primary use of this model is from the [subset2evaluate](https://github.com/zouharvi/subset2evaluate) package: |
| |
|
| | ```python |
| | import subset2evaluate |
| | |
| | data_full = subset2evaluate.utils.load_data("wmt23/en-cs") |
| | data_random = subset2evaluate.select_subset.basic(data_full, method="random") |
| | subset2evaluate.evaluate.eval_subset_clusters(data_random[:100]) |
| | # > 1 |
| | subset2evaluate.evaluate.eval_subset_correlation(data_random[:100], data_full) |
| | # > 0.71 |
| | ``` |
| | Random selection gives us only one cluster and a system-level Spearman correlation of 0.71 when we have a budget for only 100 segments. However, by using this model: |
| | ```python |
| | data_precomet = subset2evaluate.select_subset.basic(data_full, method="precomet_diff") |
| | subset2evaluate.evaluate.eval_subset_clusters(data_precomet[:100]) |
| | # > 1 |
| | subset2evaluate.evaluate.eval_subset_correlation(data_precomet[:100], data_full) |
| | # > 0.93 |
| | ``` |
| | we get a higher correlation (0.93 instead of 0.71), though still only one cluster. |
| | Note that this is not the best PreCOMET model and you can expect a bigger effect on a larger scale, as described in the paper. |
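For intuition about what a system-level Spearman correlation measures, here is a minimal pure-Python sketch. It assumes each segment is a dictionary mapping a system name to that system's score on the segment; this data layout, the function names, and the toy numbers are all hypothetical, and it is not the implementation used by `subset2evaluate.evaluate.eval_subset_correlation`.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            result[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if all values on one side tie


def subset_correlation(subset, full):
    """Compare the system ranking on the subset with the ranking on all data."""
    systems = sorted(full[0])
    sub_means = [mean(seg[s] for seg in subset) for s in systems]
    full_means = [mean(seg[s] for seg in full) for s in systems]
    return spearman(sub_means, full_means)


# Hypothetical toy data: three systems scored on four segments.
full = [
    {"sysA": 0.9, "sysB": 0.7, "sysC": 0.5},
    {"sysA": 0.8, "sysB": 0.6, "sysC": 0.7},
    {"sysA": 0.7, "sysB": 0.8, "sysC": 0.4},
    {"sysA": 0.9, "sysB": 0.5, "sysC": 0.3},
]
print(subset_correlation(full[:2], full))
print(subset_correlation(full[2:3], full))
```

A subset that orders the systems the same way as the full data yields a correlation of 1, while a subset that swaps two systems yields a lower value.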
| |
|
| |
|
| | This work is described in [How to Select Datapoints for Efficient Human Evaluation of NLG Models?](https://arxiv.org/abs/2501.18251). |
| | Cite as: |
| | ``` |
| | @misc{zouhar2025selectdatapointsefficienthuman, |
| | title={How to Select Datapoints for Efficient Human Evaluation of NLG Models?}, |
| | author={Vilém Zouhar and Peng Cui and Mrinmaya Sachan}, |
| | year={2025}, |
| | eprint={2501.18251}, |
| | archivePrefix={arXiv}, |
| | primaryClass={cs.CL}, |
| | url={https://arxiv.org/abs/2501.18251}, |
| | } |
| | ``` |