Text Generation
Transformers
Safetensors
llama
text-generation-inference
dtamayo committed on
Commit c28b67e · verified · 1 Parent(s): 4840920

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -73,7 +73,7 @@ This model card corresponds to the 2B base version.
 
 To visit the model cards of other Salamandra versions, please refer to the [Model Index](#model-index).
 
-The entire Salamandra family is released under a permissive [Apache 2.0 license]((https://www.apache.org/licenses/LICENSE-2.0)).
+The entire Salamandra family is released under a permissive [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).
 Along with the open weights, all training scripts and configuration files are made publicly available in [this GitHub repository](https://github.com/langtech-bsc/salamandra).
 
 ---
@@ -87,7 +87,7 @@ The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
 
-The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/tree/main/configs).
+The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/blob/main/configs/bsc_2b.yaml).
 
 ### Architecture
 
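The updated link points to a single YAML file rather than the configs directory. As a quick way to inspect it programmatically, here is a minimal sketch (not part of the model card); it assumes the GitHub `blob` URL maps to the usual raw.githubusercontent.com path and that PyYAML is installed:

```python
# Minimal sketch: fetch and print the 2B training config linked above.
# Assumptions: the raw-file URL below resolves, the file is a top-level
# YAML mapping, and PyYAML is available (`pip install pyyaml`).
import urllib.request

import yaml

RAW_URL = (
    "https://raw.githubusercontent.com/langtech-bsc/salamandra/"
    "main/configs/bsc_2b.yaml"
)

with urllib.request.urlopen(RAW_URL) as resp:
    config = yaml.safe_load(resp.read())

# List the hyperparameter entries without assuming specific key names.
for key, value in config.items():
    print(f"{key}: {value}")
```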
@@ -141,7 +141,7 @@ All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/mar
 operated by Barcelona Supercomputing Center.
 
 The accelerated partition is composed of 1,120 nodes with the following specifications:
-- 4x Nvidia Hopper GPUs with 64 HBM2 memory
+- 4x Nvidia Hopper GPUs with 64GB HBM2 memory
 - 2x Intel Sapphire Rapids 8460Y+ at 2.3GHz and 32c each (64 cores)
 - 4x NDR200 (BW per node 800Gb/s)
 - 512 GB of main memory (DDR5)
@@ -725,7 +725,7 @@ We only use tasks that are either human generated, human translated, or with a s
 
 During the implementation of the evaluation we observed a series of issues worth considering when replicating and interpreting the results presented. These issues include ≈1.5% variances in performance in some tasks depending on the version of the `transformers` library used, and depending on the use (or lack of use) of tensor parallelism when loading a model. When implementing existing tasks, we carry out a comprehensive quality evaluation of the dataset, the Harness task itself, and what kind of input models see during evaluation. Our implementation (see links above) addresses multiple existing problems such as errors in datasets and prompts, and lack of pre-processing. All this means that results will vary if using other Harness implementations, and may slightly vary depending on the replication setup.
 
-It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the models capabilities and potential. We thus advise caution when reading and interpreting the results.
+It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the model's capabilities and potential. We thus advise caution when reading and interpreting the results.
 
 A full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving with task implementation will soon be available in the technical report.
 
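Since the paragraph above attributes ≈1.5% score variance to the `transformers` version and to tensor parallelism, the safest replication baseline is a pinned library version and a single-device load. A minimal sketch, not from the model card; the Hub id, the version pin, and the bfloat16 dtype are all assumptions:

```python
# Minimal replication sketch. Pin the library version in your environment
# first, e.g. `pip install "transformers==4.44.0"` (illustrative; the card
# does not name a version), and load on one device so no tensor
# parallelism is involved.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "BSC-LT/salamandra-2b"  # assumed Hub id for the 2B base version

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # an assumption; match your eval setup
).to(device)

# Quick smoke test before running the full Harness suite.
inputs = tokenizer("The capital of Spain is", return_tensors="pt").to(device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```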
@@ -1062,7 +1062,7 @@ All results reported below are on a 5-shot setting.
 
 We examine the presence of undesired societal and cognitive biases in this model using different benchmarks. For societal biases, we test performance using the BBQ dataset (Parrish et al., 2022) in the original English and the Regard dataset (Sheng et al., 2019). We report inadequate accuracies in both ambiguous and disambiguated contexts, which is indicative of societal biases that need to be addressed in post-training phases.
 
-Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings. For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe moderate to strong to very strong primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers. We measure effects of majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We detect moderate effects, implying that outputs can be influenced by the prompts.
+Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings. For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe moderate to very strong primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers. We measure majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We detect moderate effects, implying that outputs can be influenced by the prompts.
 
 Our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources in all languages present in the training data. We aim to gradually extend and expand our analyses in future work.