Add paper link, project page and clarify training procedure

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +14 -14
README.md CHANGED
@@ -1,18 +1,17 @@
-
  ---
- license: apache-2.0
  datasets:
  - oscar-corpus/OSCAR-2109
  language:
  - en
  - nl
- pipeline_tag: text-generation
  library_name: transformers
+ license: apache-2.0
+ pipeline_tag: text-generation
  ---
 
  # B-GPT_en_nl_sequential
 
- This is a bilingual GPT-2 style model. For the first half of training, this model was trained only on English data. In the second half of training, the model was trained on only Dutch data. At the end of training, 50% of training data seen by the model is English and 50% is Dutch. The tokenizer was trained on the same overall proportions of data as the language model at the final step.
+ This is a bilingual GPT-2 style model trained using a sequential approach. The first half of training used only English data, followed by the second half using only Dutch data. The final model has been exposed to roughly equal proportions of English and Dutch text (50% each). The tokenizer was also trained on a similar proportion of English and Dutch data.
 
  ## Model details:
 
@@ -38,31 +37,28 @@ Load the model:
 
  Note: if you do not specify a revision, it will load the final checkpoint of the model. See above for the list of checkpoints. The checkpoint step is the name of the revision.
 
- ```
- from transformers import AutoTokenizer, AutoModel
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
 
  tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_en_nl_sequential")
- model = AutoModel.from_pretrained("catherinearnett/B-GPT_en_nl_sequential", revision = "128000")
-
-
- ````
+ model = AutoModelForCausalLM.from_pretrained("catherinearnett/B-GPT_en_nl_sequential", revision = "128000")
+ ```
 
  Text Generation:
 
- ```
+ ```python
  from transformers import pipeline
 
  pipe = pipeline("text-generation", model="catherinearnett/B-GPT_en_nl_sequential")
 
- pipe("I am a")
-
+ print(pipe("I am a", max_length=20)[0]["generated_text"])
  ```
 
  ## Citation
 
  If you use this model, please cite:
 
- ```
+ ```bibtex
  @article{arnett2025acquisition,
  author = {Catherine Arnett and Tyler A. Chang and James A. Michaelov and Benjamin K. Bergen},
  title = {On the Acquisition of Shared Grammatical Representations in Bilingual Language Models},
@@ -71,3 +67,7 @@ If you use this model, please cite:
  url = {https://arxiv.org/abs/2503.03962}
  }
  ```
+
+ This model was presented in the paper [On the Acquisition of Shared Grammatical Representations in Bilingual Language Models](https://arxiv.org/abs/2503.03962).
+
+ Project Page: https://osf.io/5cw2e/
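
The updated card notes that each training checkpoint is published as a git revision named after its training step, with the final checkpoint loaded by default. As a minimal sketch of how an intermediate checkpoint could be loaded and queried directly rather than through `pipeline` (assuming the `128000` step used in the card's own example and standard `transformers` generation; other step names are listed in the model card and not reproduced here):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# The revision name is the checkpoint step; "128000" is the step used in the
# card's own example. See the model card for the full list of checkpoint steps.
checkpoint_step = "128000"

tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_en_nl_sequential")
model = AutoModelForCausalLM.from_pretrained(
    "catherinearnett/B-GPT_en_nl_sequential",
    revision=checkpoint_step,
)

# Generate a short continuation from the selected checkpoint.
inputs = tokenizer("I am a", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same `revision` argument can also be passed to `pipeline(...)` if you prefer to run the card's text-generation example against a specific checkpoint.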