Update README.md
README.md
CHANGED
@@ -73,7 +73,7 @@ This model card corresponds to the 2B base version.
 
 To visit the model cards of other Salamandra versions, please refer to the [Model Index](#model-index).
 
-The entire Salamandra family is released under a permissive [Apache 2.0 license](
+The entire Salamandra family is released under a permissive [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).
 Along with the open weights, all training scripts and configuration files are made publicly available in [this GitHub repository](https://github.com/langtech-bsc/salamandra).
 
 ---
@@ -87,7 +87,7 @@ The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
 
-The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/
+The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/blob/main/configs/bsc_2b.yaml).
 
 ### Architecture
 
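As a rough sketch of how a handful of typical pre-training hyperparameters combine, the snippet below computes the effective global batch size. All values are invented placeholders for illustration only, not the actual settings from `configs/bsc_2b.yaml`:

```python
# Hypothetical hyperparameter values -- NOT taken from bsc_2b.yaml.
micro_batch_size = 4        # sequences per GPU per forward/backward step (assumed)
gradient_accumulation = 8   # micro-steps accumulated per optimizer update (assumed)
data_parallel_size = 64     # number of data-parallel model replicas (assumed)

# The global batch size seen by each optimizer step is the product of the three.
global_batch_size = micro_batch_size * gradient_accumulation * data_parallel_size
print(global_batch_size)  # 2048 sequences per optimizer step
```

The real config file linked above is the authoritative source for these values.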
@@ -141,7 +141,7 @@ All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/mar
 operated by Barcelona Supercomputing Center.
 
 The accelerated partition is composed of 1,120 nodes with the following specifications:
-- 4x Nvidia Hopper GPUs with
+- 4x Nvidia Hopper GPUs with 64GB HBM2 memory
 - 2x Intel Sapphire Rapids 8460Y+ at 2.3Ghz and 32c each (64 cores)
 - 4x NDR200 (BW per node 800Gb/s)
 - 512 GB of Main memory (DDR5)
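The partition-wide totals follow directly from the per-node figures listed above; a minimal sketch of the arithmetic:

```python
# Aggregate figures for the 1,120-node accelerated partition, derived
# from the per-node specifications in the list above.
nodes = 1120
gpus_per_node = 4          # Nvidia Hopper, 64 GB HBM2 each
cores_per_node = 64        # 2x Intel Sapphire Rapids 8460Y+, 32 cores each
ram_per_node_gb = 512      # DDR5 main memory per node

total_gpus = nodes * gpus_per_node
total_hbm_gb = total_gpus * 64
total_cores = nodes * cores_per_node

print(total_gpus)    # 4480 GPUs
print(total_hbm_gb)  # 286720 GB of HBM2 across the partition
print(total_cores)   # 71680 CPU cores
```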
@@ -725,7 +725,7 @@ We only use tasks that are either human generated, human translated, or with a s
 
 During the implementation of the evaluation we observed a series of issues worth considering when replicating and interpreting the results presented. These issues include ≈1.5% variances in performance in some tasks depending on the version of the `transformers` library used, and depending on the use (or lack of use) of tensor parallelism when loading a model. When implementing existing tasks, we carry out a comprehensive quality evaluation of the dataset, the Harness task itself, and what kind of input models see during evaluation. Our implementation (see links above) addresses multiple existing problems such as errors in datasets and prompts, and lack of pre-processing. All this means that results will vary if using other Harness implementations, and may slightly vary depending on the replication setup.
 
-It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the
+It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the model's capabilities and potential. We thus advise caution when reading and interpreting the results.
 
 A full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving with task implementation will soon be available in the technical report.
 
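One practical consequence of the ≈1.5% variance noted above is that a replication should flag any task whose score drifts beyond that band rather than expect exact agreement. A minimal sketch of such a check; the task names and scores below are hypothetical:

```python
def within_replication_band(reference, replicated, tolerance=1.5):
    """Flag tasks whose replicated score deviates from the reference by
    more than `tolerance` accuracy points -- mirroring the ~1.5% variance
    observed across `transformers` versions and tensor-parallelism setups."""
    return {task: abs(reference[task] - replicated[task]) <= tolerance
            for task in reference}

# Hypothetical scores, for illustration only:
flags = within_replication_band(
    {"task_a": 62.0, "task_b": 48.3},  # reference run
    {"task_a": 63.2, "task_b": 45.9},  # replication attempt
)
print(flags)  # {'task_a': True, 'task_b': False}
```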
@@ -1062,7 +1062,7 @@ All results reported below are on a 5-shot setting.
 
 We examine the presence of undesired societal and cognitive biases present in this model using different benchmarks. For societal biases, we test performance using the BBQ dataset (Parrish et al., 2022) in the original English and the Regard dataset (Sheng et al., 2019). We report inadequate accuracies in both ambiguous and disambiguated contexts, which is indicative of the presence of societal biases which need to be addressed in post-training phases.
 
-Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings. For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe moderate to strong to very strong primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers. We measure effects of majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We detect moderate effects, implying that outputs can be influenced by the prompts.
+Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings. For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe moderate to very strong primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers. We measure majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We detect moderate effects, implying that outputs can be influenced by the prompts.
 
 Our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources in all languages present in the training data. We aim to gradually extend and expand our analyses in future work.
 
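A primacy effect of the kind described above can be quantified by shuffling answer orders and checking how often the model's pick lands on each position. A minimal sketch on toy data; the predictions below are invented for illustration:

```python
from collections import Counter

def positional_preference(predicted_positions, n_options=4):
    """Fraction of model picks landing on each answer position.

    With no positional bias each fraction should sit near 1/n_options;
    a large excess at position 0 is a primacy effect."""
    counts = Counter(predicted_positions)
    total = len(predicted_positions)
    return [counts.get(i, 0) / total for i in range(n_options)]

# Toy picks over 10 shuffled multiple-choice items (hypothetical data):
prefs = positional_preference([0, 0, 0, 1, 0, 2, 0, 0, 3, 0])
print(prefs)  # [0.7, 0.1, 0.1, 0.1] -> position 0 chosen far above chance
```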