Update README.md

This is a small GPT-2 model retrained on Arabic Wikipedia circa September 2020
(due to memory limits, only the first 600,000 lines of the Wiki dump were used).

## Training

Training notebook: https://colab.research.google.com/drive/1Z_935vTuZvbseOsExCjSprrqn1MsQT57

Steps to training:

```python
# ... (earlier steps omitted)
from transformers import AutoModel

# convert the trained TensorFlow checkpoint into a PyTorch model
am = AutoModel.from_pretrained('./argpt', from_tf=True)
am.save_pretrained("./")
```
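After conversion, the uploaded weights can also be sanity-checked with the plain `transformers` API. A minimal sketch (the model id is the one used in the generation examples below; the sampling settings here are illustrative, not from the training notebook):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# load the converted GPT-2 checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/sanaa")
model = AutoModelForCausalLM.from_pretrained("monsoon-nlp/sanaa")

# generate a short continuation of an Arabic prompt ("my school")
inputs = tokenizer("مدرستي", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```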

## Generating text in SimpleTransformers

Finetuning notebook: https://colab.research.google.com/drive/1fXFH7g4nfbxBo42icI4ZMy-0TAGAxc2i

```python
from simpletransformers.language_generation import LanguageGenerationModel

# load the model from the Hugging Face Hub and generate from an Arabic prompt ("my school")
model = LanguageGenerationModel("gpt2", "monsoon-nlp/sanaa")
model.generate("مدرستي")
```

## Finetuning dialects in SimpleTransformers

I finetuned this model on different Arabic dialects to generate a new model
(monsoon-nlp/sanaa-dialect on HuggingFace) with some additional control tokens.
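The control tokens are prepended to the text, as in the generation example at the end of this section. A hypothetical sketch of building `train.txt` that way (the file layout and the `corpus` variable are assumptions for illustration, not the original preprocessing):

```python
# hypothetical data prep: prepend a dialect control token to each training line
corpus = [
    ("[EGYPTIAN]", "..."),  # (dialect tag, sentence) pairs -- placeholder text
    ("[MSA]", "..."),
]
with open("train.txt", "w", encoding="utf-8") as f:
    for tag, text in corpus:
        f.write(tag + text + "\n")
```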

Finetuning notebook: https://colab.research.google.com/drive/1fXFH7g4nfbxBo42icI4ZMy-0TAGAxc2i

```python
from simpletransformers.language_modeling import LanguageModelingModel

# train_args was not defined in the original snippet -- illustrative values only
train_args = {
    "num_train_epochs": 1,
    "output_dir": "./dialects",
    "overwrite_output_dir": True,
}

ft_model = LanguageModelingModel('gpt2', 'monsoon-nlp/sanaa', args=train_args)

# add the dialect control tokens and resize the embeddings to match
ft_model.tokenizer.add_tokens(["[EGYPTIAN]", "[MSA]", "[LEVANTINE]", "[GULF]"])
ft_model.model.resize_token_embeddings(len(ft_model.tokenizer))
ft_model.train_model("./train.txt", eval_file="./test.txt")

# exported model: load the finetuned weights and generate with a control token
from simpletransformers.language_generation import LanguageGenerationModel
model = LanguageGenerationModel("gpt2", "./dialects")
model.generate('[EGYPTIAN]' + "مدرستي")
```
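If you just want to use the dialect model without re-running the finetuning, the published checkpoint can be loaded the same way as the base model. A small sketch (assuming monsoon-nlp/sanaa-dialect loads through SimpleTransformers just like monsoon-nlp/sanaa above):

```python
from simpletransformers.language_generation import LanguageGenerationModel

# load the already-finetuned dialect model from the Hugging Face Hub
model = LanguageGenerationModel("gpt2", "monsoon-nlp/sanaa-dialect")
model.generate('[MSA]' + "مدرستي")
```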