aynetdia committed
Commit 6d82c6c · verified · 1 Parent(s): af6eae2

update readme

Files changed (1):
1. README.md +6 -3
README.md CHANGED
@@ -13,10 +13,10 @@ pinned: false
 # Metric Card for SemScore
 
 ## Metric Description
-SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to strongly correlate with human judgment on a system-level when evaluating the instructing following capabilities of language models. Given a set of model-generated outputs and target completions, a pre-trained [sentence transformer](https://www.sbert.net) is used to calculate cosine similarities between them.
+SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to correlate strongly with human judgment at the system level when evaluating the instruction-following capabilities of language models. Given a set of model-generated outputs and target completions, a text embedding model is used to calculate cosine similarities between them.
 
 ## How to Use
-When loading SemScore, you can choose any pre-trained encoder-only model uploaded to HF Hub in order to compute the score. The default model (if no `model_name` is specified) is `sentence-transformers/all-mpnet-base-v2`.
+When loading SemScore, you can choose any pre-trained model uploaded to the HF Hub in order to compute the score. The default model (if no `model_name` is specified) is `sentence-transformers/all-mpnet-base-v2`.
 
 ```python
 import evaluate
@@ -34,6 +34,7 @@ Its optional arguments are:
 
 - `batch_size`: the batch size for calculating the score (default value is `32`).
 - `device`: CPU/GPU device on which the score will be calculated (default value is `None`, i.e. `cpu`).
+- `pooling`: the type of pooling used to aggregate token embeddings into a sentence embedding (default value is `mean`; `last` is available for decoder-only models).
 
 
 ```python
@@ -48,11 +49,13 @@ The output of SemScore is a dictionary with the following values:
 - `semscore`: aggregated system-level SemScore.
 - `similarities`: cosine similarities between individual prediction-reference pairs.
 
+The computed similarity scores are constrained to the [0, 1] range and then scaled by 100. As a result, the final SemScore range is [0, 100].
+
 #### Values from Popular Papers
 The [SemScore paper](https://arxiv.org/abs/2401.17072) reports the correlation of SemScore with human ratings in comparison to other popular metrics that rely on "gold" references for predictions, as well as to reference-free LLM-based evaluation methods. The comparison is based on the evaluation of instruction-tuned LLMs.
 
 ## Limitations and Bias
-One limitation of SemScore is its dependence on an underlying transformer model to compute semantic textual similarity between model and target outputs. This implementation relies on the strongest sentence transformer model, as reported by the authors of the `sentence-transformers` library, by default. However, better embedding models have become available since the publication of the SemScore paper (e.g. those listed in the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard)).
+One limitation of SemScore is its dependence on an underlying transformer model to compute semantic textual similarity between model and target outputs. By default, this implementation uses the strongest sentence transformer model as reported by the authors of the `sentence-transformers` library. However, better embedding models have become available since the publication of the SemScore paper (e.g. those listed in the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard)).
 
 In addition, a more general limitation is that SemScore requires at least one gold-standard target output against which to compare a generated response. This target output should be human-created or at least human-vetted.
 
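The pipeline the updated card describes — pool token embeddings into a sentence vector, take the cosine similarity per prediction/reference pair, constrain it to [0, 1], and scale by 100 — can be sketched in plain NumPy. This is an illustrative toy, not the metric's actual implementation: the token embeddings are hand-made stand-ins for real sentence-transformer outputs, and clipping is an assumption about how the [0, 1] constraint is applied.

```python
import numpy as np

def pool(token_embeddings: np.ndarray, pooling: str = "mean") -> np.ndarray:
    """Collapse per-token embeddings of shape (seq_len, dim) into one sentence vector."""
    if pooling == "mean":   # default in the metric card
        return token_embeddings.mean(axis=0)
    if pooling == "last":   # last-token pooling, as for decoder-only models
        return token_embeddings[-1]
    raise ValueError(f"unknown pooling: {pooling!r}")

def toy_semscore(pred_tokens, ref_tokens, pooling: str = "mean") -> dict:
    """Cosine similarity per prediction/reference pair, clipped to [0, 1], scaled by 100."""
    similarities = []
    for p, r in zip(pred_tokens, ref_tokens):
        a, b = pool(p, pooling), pool(r, pooling)
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        # NOTE: clipping is an assumption about how the [0, 1] constraint is enforced.
        similarities.append(float(np.clip(cos, 0.0, 1.0)) * 100)
    return {"semscore": sum(similarities) / len(similarities),
            "similarities": similarities}

# Toy per-token embeddings standing in for real model outputs.
pred = [np.array([[1.0, 0.0], [1.0, 0.0]])]  # mean-pools to [1.0, 0.0]
ref = [np.array([[1.0, 0.0], [0.0, 0.0]])]   # mean-pools to [0.5, 0.0]
out = toy_semscore(pred, ref)
print(out["semscore"])  # identical directions -> cosine 1.0 -> score 100.0
```

In the real metric the embeddings come from the chosen Hub model; the `mean` vs `last` pooling choice matters because decoder-only models concentrate sequence information in the final token's hidden state rather than spreading it evenly across tokens.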