numiros committed on
Commit 7beb941 · verified · 1 Parent(s): c22c97f

Update README.md

Files changed (1): README.md (+178 -1)

README.md (updated contents):
datasets:
  - DataProvenanceInitiative/Commercially-Verified-Licenses
  - sablo/oasst2_curated
  - Anthropic/hh-rlhf
  - CohereLabs/aya_dataset
pipeline_tag: text-generation
library_name: transformers
---

# Comma Epsilon v0.1

Comma Epsilon v0.1 is an experimental finetune of [Comma v0.1 2T](https://huggingface.co/common-pile/comma-v0.1-2t), trained on commercially licensed* instruction and preference data.

## Sample usage

<details>

```python
from transformers import AutoTokenizer, LlamaForCausalLM

model_name = 'numiros/Comma-Epsilon-v0.1'

# Load the finetuned model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name,
                                         torch_dtype="auto",
                                         device_map="auto")

def generate(messages):
    # Format the chat history with the model's chat template and generate a reply
    gen_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
    input_ids = gen_input['input_ids']
    attention_mask = gen_input['attention_mask']
    generated_ids = model.generate(input_ids=input_ids,
                                   attention_mask=attention_mask,
                                   max_new_tokens=750,
                                   temperature=0.3,
                                   min_p=0.1,
                                   repetition_penalty=1.2,
                                   do_sample=True,
                                   eos_token_id=tokenizer.eos_token_id,
                                   pad_token_id=tokenizer.pad_token_id)
    # Decode only the newly generated tokens
    response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return response

# A simple (but inefficient) chat example:

def get_prompt():
    lines = []
    while True:
        x = input('> ')
        if x == '/end':
            return '\n'.join(lines).strip()
        lines.append(x)

empty_messages = [
    # {'role': 'system', 'content': 'You are a helpful assistant.'}  # not recommended
]

print('Type /end on a new line to complete your message')
messages = list(empty_messages)
while True:
    prompt = get_prompt()
    if prompt == '/clear':
        messages = list(empty_messages)
        continue
    messages.append({'role': 'user', 'content': prompt})
    response = generate(messages)
    print(response)
    messages.append({'role': 'assistant', 'content': response})
```

</details>

## Recipe

<details>

### Datasets

We used the following datasets:

- https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses
- https://huggingface.co/datasets/sablo/oasst2_curated
- English subset of https://huggingface.co/datasets/CohereLabs/aya_dataset
- https://huggingface.co/datasets/Anthropic/hh-rlhf (the version over at https://huggingface.co/datasets/yakazimir/preference_tuning_hh)

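Purely as an illustration (this is not part of the original recipe), the sources above can be pulled from the Hub roughly as shown below; the split names and the Aya `language` column are assumptions that should be checked against each dataset card.

```python
# Illustrative sketch only: loading the listed sources from the Hub.
from datasets import load_dataset

# The Data Provenance collection ships many subsets; load the ones you need.
oasst2 = load_dataset("sablo/oasst2_curated", split="train")

# Aya: keep only the English subset (column and value names assumed).
aya = load_dataset("CohereLabs/aya_dataset", split="train")
aya_en = aya.filter(lambda ex: ex["language"] == "English")

# HH-RLHF preference data, via the repackaged version linked above.
hh = load_dataset("yakazimir/preference_tuning_hh", split="train")
```
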
<details>

<summary> SFT data mixture 1 </summary>

This data mixture was ~50M tokens, corresponding to ~200k samples. The datasets below (other than Open Assistant v2) were sourced from https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses.

Datasets used in full:

- Dolly 15k
- Open Assistant OctoPack
- Open Assistant v2
- Aya Dataset
- StarCoder Self-Instruct
- Joke Explanation

Datasets sampled (approximate token count of the sampled fraction in parentheses):

- Flan Collection (Chain-of-Thought) (~12M tokens)
- Flan Collection (Super-NaturalInstructions) (~12M tokens)
- Tasksource Instruct (~10M tokens)
- OIG (~5M tokens)
- Flan Collection (Flan 2021) (~3M tokens)
- CommitPackFT (~1M tokens)

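The exact subsampling procedure is not specified; the sketch below shows one way such token-budgeted sampling could be done (the `text` column name and the tokenizer are placeholders).

```python
# Rough sketch: shuffle a dataset and keep examples until an approximate
# token budget is reached. Column name and tokenizer are placeholders.
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("common-pile/comma-v0.1-2t")

def sample_to_token_budget(dataset, token_budget, text_column="text", seed=42):
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    kept, total = [], 0
    for i in indices:
        total += len(tokenizer(dataset[i][text_column])["input_ids"])
        kept.append(i)
        if total >= token_budget:
            break
    return dataset.select(kept)

# e.g. take roughly 5M tokens from a source dataset:
# oig_sample = sample_to_token_budget(oig, 5_000_000)
```
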
</details>

<details>

<summary> SFT data mixture 2 </summary>

Same as SFT data mixture 1, but with a different random seed wherever sampling is done.

</details>

<details>

<summary> DPO data mixture </summary>

Effectively a 10% sample of the HH-RLHF dataset, since DPO was stopped early (after 0.1 epochs).

</details>

### Training details

Training was done on two Nvidia RTX 3090s using axolotl and took about 24 hours in total (12 + 10 + 2 across the three phases).

It consisted of three phases: SFT-1, SFT-2, and DPO. All training used relatively high-rank LoRA adapters, with the model loaded in 8-bit precision. The SFT stages used Cut Cross-Entropy loss and Liger kernels, while DPO did not.

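The run itself was configured through axolotl, and that config is not reproduced here. Purely as a rough sketch, an equivalent adapter setup in plain `transformers`/`peft` might look like the following; the target modules are an assumption.

```python
# Illustrative PEFT equivalent of the setup described above (the actual run
# used axolotl): 8-bit base model plus a high-rank LoRA adapter.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "common-pile/comma-v0.1-2t",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

lora_config = LoraConfig(
    r=256,                      # SFT-1 used rank 256; SFT-2 and DPO used 128
    lora_alpha=512,             # alpha = 2 * rank in all phases
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
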
#### SFT-1

Chat template tokens were added to the tokenizer in this phase, so the embeddings and the LM head were trained as well.

Parameters:
- 1 epoch
- Peak LR = 4e-5, cosine annealing
- Optimizer: AdamW 8-bit
- Global batch size = 8
- Warmup for 3% of the training
- LoRA rank = 256, alpha = 512

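As a sketch of what that tokenizer step involves (the token strings below are hypothetical, not the actual chat template), adding special tokens grows the embedding matrix, which is why the embeddings and LM head had to be trainable in this phase:

```python
# Hypothetical illustration: the real chat-template tokens may differ.
from transformers import AutoTokenizer, LlamaForCausalLM

tokenizer = AutoTokenizer.from_pretrained("common-pile/comma-v0.1-2t")
model = LlamaForCausalLM.from_pretrained("common-pile/comma-v0.1-2t")

tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|user|>", "<|assistant|>", "<|end|>"]}
)

# New rows are appended to the embedding matrix (and LM head), so these
# weights must be trained for the new tokens to carry any meaning.
model.resize_token_embeddings(len(tokenizer))
```
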
#### SFT-2

Parameters:
- 1 epoch
- Peak LR = 2e-5, cosine annealing
- Optimizer: AdamW 8-bit
- Global batch size = 8
- Warmup for 3% of the training
- LoRA rank = 128, alpha = 256

#### DPO

Parameters:
- 0.1 epochs
- Peak LR = 1e-5, cosine annealing
- Optimizer: AdamW 8-bit
- Global batch size = 32
- Warmup for the first 20 steps
- LoRA rank = 128, alpha = 256

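For reference, the objective optimized in this phase is the standard DPO loss written out below; `beta` is shown with a common default, since the value used for this run is not stated.

```python
# Standard DPO loss over per-sequence log-probabilities; beta=0.1 is a
# common default, not necessarily the value used for this run.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen completion over the rejected one,
    # relative to the frozen reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```
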
</details>

## Limitations & Intended Use

This model is best thought of as a research artifact, not a polished product. It is essentially the result of a single training run carried out under significant data and compute constraints. For any production use case, you should consider performing an additional layer of fine-tuning and alignment.

Because of those constraints, and the scarcity of high-quality data, there was very little experimentation (read: this was a one-shot run, so things might be off). We did not scale training to the usual post-training scales, nor did we do any form of RL for math, coding, structured outputs, or tool use. We also did not perform mid-training, a costly but effective technique used in many SOTA models. Consequently, this model might not perform up to your expectations.

The model has only limited preference alignment, from a small sample of the HH-RLHF dataset, and may generate misaligned outputs from time to time. Furthermore, it was not trained with a system prompt due to a lack of useful data, which can reduce its steerability.

We have not performed any specific debiasing. The training data is sourced from broad internet and instructional datasets and inevitably contains the biases present in that data. The model can and will generate text that reflects these societal biases. Handle it with care and be aware of this when using it for any downstream task.

All limitations of the base model also apply here. We strongly recommend reviewing its model card.

## Footnotes and disclaimer

**\*This is not legal advice.**