Update README
README.md (changed)
# Metric Card for SemScore

## Metric Description

SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to correlate strongly with human judgment at the system level when evaluating the instruction-following capabilities of language models. Given a set of model-generated outputs and target completions, a text embedding model is used to calculate cosine similarities between them.
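The core computation described above can be sketched as follows. The embedding vectors here are toy stand-ins; a real run would encode the texts with a model such as `sentence-transformers/all-mpnet-base-v2`.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for an encoded model output and its
# target completion (a real run would obtain these from the model).
prediction_emb = np.array([0.6, 0.8, 0.0])
reference_emb = np.array([0.6, 0.8, 0.0])

sim = cosine_similarity(prediction_emb, reference_emb)  # identical vectors -> 1.0
```

Identical embeddings yield a similarity of 1.0; unrelated (orthogonal) embeddings yield 0.0.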
## How to Use

When loading SemScore, you can choose any pre-trained model uploaded to the HF Hub to compute the score. The default model (if no `model_name` is specified) is `sentence-transformers/all-mpnet-base-v2`.

```python
import evaluate
```
Its optional arguments are:

- `batch_size`: the batch size for calculating the score (default value is `32`).
- `device`: the CPU/GPU device on which the score will be calculated (default value is `None`, i.e. `cpu`).
- `pooling`: the type of pooling used to aggregate token embeddings into a text embedding (default value is `mean`; the optional value `last` is intended for decoder-only models).
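A minimal sketch of the difference between the two pooling modes, under the assumption that pooling operates over the per-token embeddings produced by the underlying model (the toy values below stand in for real embeddings):

```python
import numpy as np

# Per-token embeddings for one text: shape (seq_len, hidden_dim).
# Toy values; a real run would take these from the embedding model.
token_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

mean_pooled = token_embeddings.mean(axis=0)  # `mean` (default): average over all tokens
last_pooled = token_embeddings[-1]           # `last`: final token only (decoder-only models)
```

`last` pooling suits decoder-only models because their causal attention means only the final token has attended to the full sequence.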
The output of SemScore is a dictionary with the following values:

- `semscore`: the aggregated system-level SemScore.
- `similarities`: the cosine similarities between individual prediction-reference pairs.

The computed similarity scores are constrained to the [0, 1] range and then scaled by 100. As a result, the final SemScore range is [0, 100].
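A hedged sketch of the scaling and aggregation described above, assuming the [0, 1] constraint is applied by clipping (the exact mechanism is not shown in this excerpt):

```python
import numpy as np

def aggregate_semscore(similarities: np.ndarray) -> dict:
    # Assumed behavior: constrain raw cosine similarities to [0, 1]
    # by clipping, scale by 100, then average into a single
    # system-level score.
    scaled = np.clip(similarities, 0.0, 1.0) * 100
    return {
        "semscore": float(scaled.mean()),
        "similarities": scaled.tolist(),
    }

result = aggregate_semscore(np.array([1.0, 0.8, -0.2]))
# -0.2 is clipped to 0; scores become [100.0, 80.0, 0.0], mean 60.0
```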
#### Values from Popular Papers

The [SemScore paper](https://arxiv.org/abs/2401.17072) reports the correlation of SemScore with human ratings in comparison to other popular metrics that rely on "gold" references for predictions, as well as to reference-free LLM-based evaluation methods. The comparison is based on an evaluation of instruction-tuned LLMs.
## Limitations and Bias

One limitation of SemScore is its dependence on an underlying transformer model to compute semantic textual similarity between model and target outputs. By default, this implementation uses the strongest sentence-transformer model as reported by the authors of the `sentence-transformers` library. However, better embedding models have become available since the publication of the SemScore paper (e.g. those listed on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard)).

In addition, a more general limitation is that SemScore requires at least one gold-standard target output against which to compare a generated response. This target output should be human-created or at least human-vetted.