Add paper link, project page and clarify training procedure
#1, opened by nielsr (HF Staff)

README.md (CHANGED)
@@ -1,18 +1,17 @@
-
 ---
-license: apache-2.0
 datasets:
 - oscar-corpus/OSCAR-2109
 language:
 - en
 - nl
-pipeline_tag: text-generation
 library_name: transformers
+license: apache-2.0
+pipeline_tag: text-generation
 ---

 # B-GPT_en_nl_sequential

-This is a bilingual GPT-2 style model.
+This is a bilingual GPT-2 style model trained using a sequential approach. The first half of training used only English data, followed by the second half using only Dutch data. The final model has been exposed to roughly equal proportions of English and Dutch text (50% each). The tokenizer was also trained on a similar proportion of English and Dutch data.

 ## Model details:

@@ -38,31 +37,28 @@ Load the model:

 Note: if you do not specify a revision, it will load the final checkpoint of the model. See above for the list of checkpoints. The checkpoint step is the name of the revision.

-```
-from transformers import AutoTokenizer,
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM

 tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_en_nl_sequential")
-model =
-
-
-````
+model = AutoModelForCausalLM.from_pretrained("catherinearnett/B-GPT_en_nl_sequential", revision = "128000")
+```

 Text Generation:

-```
+```python
 from transformers import pipeline

 pipe = pipeline("text-generation", model="catherinearnett/B-GPT_en_nl_sequential")

-pipe("I am a")
-
+print(pipe("I am a", max_length=20)[0]["generated_text"])
 ```

 ## Citation

 If you use this model, please cite:

-```
+```bibtex
 @article{arnett2025acquisition,
 author = {Catherine Arnett and Tyler A. Chang and James A. Michaelov and Benjamin K. Bergen},
 title = {On the Acquisition of Shared Grammatical Representations in Bilingual Language Models},
@@ -71,3 +67,7 @@ If you use this model, please cite:
 url = {https://arxiv.org/abs/2503.03962}
 }
 ```
+
+This model was presented in the paper [On the Acquisition of Shared Grammatical Representations in Bilingual Language Models](https://arxiv.org/abs/2503.03962).
+
+Project Page: https://osf.io/5cw2e/
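The revised README loads a specific training step by passing its name as the `revision` argument. As a minimal sketch of how to enumerate those checkpoint names with the `huggingface_hub` client, assuming the per-step checkpoints are published as branches of the model repository (as the `revision = "128000"` example suggests):

```python
# Sketch: list the checkpoint revisions of the model repository.
# Assumes per-step checkpoints are stored as branches, not tags.
from huggingface_hub import HfApi

api = HfApi()
refs = api.list_repo_refs("catherinearnett/B-GPT_en_nl_sequential")

# Keep only numeric branch names (training steps) and sort them by step.
checkpoints = sorted((ref.name for ref in refs.branches if ref.name.isdigit()), key=int)
print(checkpoints)  # step names usable as `revision`, e.g. "128000"
```

Each listed name can then be passed as the `revision` argument in the `from_pretrained` call shown in the README.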