sho-takase committed (verified)
Commit 392c934 · Parent(s): 6f0a8bf

Add description to readme

Files changed (1): README.md (+68 -3)
README.md (changed): the previous front matter, which contained only `license: mit`, is replaced by the full model card below.
---
license: mit
language:
- ja
- en
---

# Sarashina2-70B

This repository provides large language models trained by [SB Intuitions](https://www.sbintuitions.co.jp/).

## How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed

# Load the 70B checkpoint in bfloat16 and shard it across the available GPUs.
model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina2-70b", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-70b")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
set_seed(123)

# Prompt: "おはようございます、今日の天気は" ("Good morning, the weather today is ...")
text = generator(
    "おはようございます、今日の天気は",
    max_length=30,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=3,
)

for t in text:
    print(t)
```
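
If you prefer to call the model directly rather than through `pipeline`, the following is a minimal sketch using the generic `transformers` `generate` API with the same checkpoint and prompt; it is not taken from the model card, so treat the sampling settings as illustrative placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "sbintuitions/sarashina2-70b", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-70b")

# Encode the raw prompt and move it to the device holding the first model shard.
inputs = tokenizer("おはようございます、今日の天気は", return_tensors="pt").to(model.device)

# Sample one continuation; max_new_tokens here is an illustrative value.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=30,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```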

## Configuration

| Parameters | Vocab size | Training tokens | Architecture | Position type | Layers | Hidden dim | Attention heads |
| :-----: | :-----------: | :-------------: | :------------ | :-----------: | :----: | :--------: | :-------------: |
| [7B](https://huggingface.co/sbintuitions/sarashina2-7b) | 102400 | 2.1T | Llama2 | RoPE | 32 | 4096 | 32 |
| [13B](https://huggingface.co/sbintuitions/sarashina2-13b) | 102400 | 2.1T | Llama2 | RoPE | 40 | 5120 | 40 |
| [70B](https://huggingface.co/sbintuitions/sarashina2-70b) | 102400 | 2.1T | Llama2 | RoPE | 80 | 8192 | 64 |
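
As a quick sanity check of the table, the architecture hyperparameters can be read off the published config without downloading the weights; the sketch below assumes the checkpoint exposes the standard Llama-style field names (`num_hidden_layers`, `hidden_size`, `num_attention_heads`, `vocab_size`).

```python
from transformers import AutoConfig

# Fetch only the config (no weights) and print the architecture hyperparameters.
config = AutoConfig.from_pretrained("sbintuitions/sarashina2-70b")
print(config.vocab_size)           # expected: 102400
print(config.num_hidden_layers)    # expected: 80
print(config.hidden_size)          # expected: 8192
print(config.num_attention_heads)  # expected: 64
```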

## Training Corpus

For our Japanese training data, we used the Japanese portion of the [Common Crawl corpus](https://commoncrawl.org/), the largest web corpus available.
To clean the training corpus, we used [CCNet](https://github.com/facebookresearch/cc_net) and [HojiChar](https://github.com/HojiChar/HojiChar).
After cleaning, our Japanese training data contains about 1T tokens.

For our English training data, we extracted English documents from [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B), excluding the books3 corpus because of copyright issues.
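
The actual cleaning rules live in the CCNet and HojiChar pipelines linked above; purely as an illustration of the kind of heuristic document filtering such pipelines perform (the thresholds and rules below are hypothetical, not the ones used for Sarashina2), a toy filter might look like this:

```python
# Toy heuristic filter, illustrative only -- not the actual CCNet/HojiChar pipeline.
def keep_document(text: str) -> bool:
    # Hypothetical rule 1: drop very short pages.
    if len(text) < 200:
        return False
    # Hypothetical rule 2: require a minimum ratio of Japanese characters
    # (hiragana, katakana, or CJK ideographs).
    japanese = sum(
        1 for ch in text
        if "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"
    )
    return japanese / len(text) >= 0.3

docs = ["おはようございます。今日は良い天気ですね。" * 20, "too short"]
cleaned = [d for d in docs if keep_document(d)]
print(len(cleaned))  # 1
```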

## Tokenization

We use a [sentencepiece](https://github.com/google/sentencepiece) tokenizer with a unigram language model and byte-fallback.
We do not apply pre-tokenization with a Japanese tokenizer.
Thus, users can feed raw sentences directly into the tokenizer.
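
For example, a raw Japanese sentence can be passed straight to the Hugging Face tokenizer with no word segmentation beforehand; the snippet below is a small sketch (the printed token split is not documented here, so treat it as illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-70b")

# No external Japanese pre-tokenizer is needed before calling the tokenizer.
text = "おはようございます、今日の天気は"
print(tokenizer.tokenize(text))                          # subword pieces
ids = tokenizer(text)["input_ids"]
print(tokenizer.decode(ids, skip_special_tokens=True))   # round-trips back to the raw text
```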

## Ethical Considerations and Limitations

Sarashina2 has not been tuned to follow instructions yet.
Therefore, Sarashina2 might generate meaningless sequences, inaccurate instances, or biased/objectionable outputs.
Before using Sarashina2, we would like developers to tune this model based on human preferences and safety considerations.

## License

[MIT License](https://huggingface.co/sbintuitions/sarashina2-70b/blob/main/LICENSE)