numiros committed on
Commit 7beb941 · verified · 1 Parent(s): c22c97f

Update README.md

Files changed (1): README.md (+178 -1)

README.md (updated contents):
datasets:
  - DataProvenanceInitiative/Commercially-Verified-Licenses
  - sablo/oasst2_curated
  - Anthropic/hh-rlhf
  - CohereLabs/aya_dataset
pipeline_tag: text-generation
library_name: transformers
---

# Comma Epsilon v0.1

Comma Epsilon v0.1 is an experimental finetune of [Comma v0.1 2T](https://huggingface.co/common-pile/comma-v0.1-2t), trained on commercially licensed* instruction and preference data.

## Sample usage

<details>

```python
from transformers import AutoTokenizer, LlamaForCausalLM

model_name = 'numiros/Comma-Epsilon-v0.1'

# Load the finetuned model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name,
                                         torch_dtype="auto",
                                         device_map="auto")

def generate(messages):
    # Format the chat history with the model's chat template and generate a reply
    gen_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
    input_ids = gen_input['input_ids']
    attention_mask = gen_input['attention_mask']
    generated_ids = model.generate(input_ids=input_ids,
                                   attention_mask=attention_mask,
                                   max_new_tokens=750,
                                   temperature=0.3,
                                   min_p=0.1,
                                   repetition_penalty=1.2,
                                   do_sample=True,
                                   eos_token_id=tokenizer.eos_token_id,
                                   pad_token_id=tokenizer.pad_token_id)
    # Decode only the newly generated tokens
    response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return response

# A simple (but inefficient) chat example:

def get_prompt():
    lines = []
    while True:
        x = input('> ')
        if x == '/end':
            return '\n'.join(lines).strip()
        lines.append(x)

empty_messages = [
    # {'role': 'system', 'content': 'You are a helpful assistant.'}  # not recommended
]

print('Type /end on a new line to complete your message')
messages = list(empty_messages)
while True:
    prompt = get_prompt()
    if prompt == '/clear':
        messages = list(empty_messages)
        continue
    messages.append({'role': 'user', 'content': prompt})
    response = generate(messages)
    print(response)
    messages.append({'role': 'assistant', 'content': response})
```

</details>

## Recipe

<details>

### Datasets

We used the following datasets:

- https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses
- https://huggingface.co/datasets/sablo/oasst2_curated
- English subset of https://huggingface.co/datasets/CohereLabs/aya_dataset
- https://huggingface.co/datasets/Anthropic/hh-rlhf (the version over at https://huggingface.co/datasets/yakazimir/preference_tuning_hh)

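Purely as an illustration (this is not part of the original recipe), the sources above can be pulled from the Hub roughly as shown below; the split names and the Aya `language` column are assumptions that should be checked against each dataset card.

```python
# Illustrative sketch only: loading the listed sources from the Hub.
from datasets import load_dataset

# The Data Provenance collection ships many subsets; load the ones you need.
oasst2 = load_dataset("sablo/oasst2_curated", split="train")

# Aya: keep only the English subset (column and value names assumed).
aya = load_dataset("CohereLabs/aya_dataset", split="train")
aya_en = aya.filter(lambda ex: ex["language"] == "English")

# HH-RLHF preference data, via the repackaged version linked above.
hh = load_dataset("yakazimir/preference_tuning_hh", split="train")
```
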
<details>

<summary> SFT data mixture 1 </summary>

This data mixture was ~50M tokens, corresponding to ~200k samples. The datasets below (other than Open Assistant v2) were sourced from https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses.

Datasets used in full:

- Dolly 15k
- Open Assistant OctoPack
- Open Assistant v2
- Aya Dataset
- StarCoder Self-Instruct
- Joke Explanation

Datasets sampled (approximate token count of the sampled fraction in parentheses):

- Flan Collection (Chain-of-Thought) (~12M tokens)
- Flan Collection (Super-NaturalInstructions) (~12M tokens)
- Tasksource Instruct (~10M tokens)
- OIG (~5M tokens)
- Flan Collection (Flan 2021) (~3M tokens)
- CommitPackFT (~1M tokens)

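The exact subsampling procedure is not specified; the sketch below shows one way such token-budgeted sampling could be done (the `text` column name and the tokenizer are placeholders).

```python
# Rough sketch: shuffle a dataset and keep examples until an approximate
# token budget is reached. Column name and tokenizer are placeholders.
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("common-pile/comma-v0.1-2t")

def sample_to_token_budget(dataset, token_budget, text_column="text", seed=42):
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    kept, total = [], 0
    for i in indices:
        total += len(tokenizer(dataset[i][text_column])["input_ids"])
        kept.append(i)
        if total >= token_budget:
            break
    return dataset.select(kept)

# e.g. take roughly 5M tokens from a source dataset:
# oig_sample = sample_to_token_budget(oig, 5_000_000)
```
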
</details>

<details>

<summary> SFT data mixture 2 </summary>

Same as SFT data mixture 1, but with a different random seed wherever sampling is done.

</details>

<details>

<summary> DPO data mixture </summary>

Effectively a 10% sample of the HH-RLHF dataset, since DPO was stopped early (after 0.1 epochs).

</details>

### Training details

Training was done on two Nvidia RTX 3090s using axolotl and took about 24 hours in total (12 + 10 + 2 across the three phases).

It consisted of three phases: SFT-1, SFT-2, and DPO. All training used relatively high-rank LoRA adapters, with the model loaded in 8-bit precision. The SFT stages used Cut Cross-Entropy loss and Liger kernels, while DPO did not.

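The run itself was configured through axolotl, and that config is not reproduced here. Purely as a rough sketch, an equivalent adapter setup in plain `transformers`/`peft` might look like the following; the target modules are an assumption.

```python
# Illustrative PEFT equivalent of the setup described above (the actual run
# used axolotl): 8-bit base model plus a high-rank LoRA adapter.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "common-pile/comma-v0.1-2t",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

lora_config = LoraConfig(
    r=256,                      # SFT-1 used rank 256; SFT-2 and DPO used 128
    lora_alpha=512,             # alpha = 2 * rank in all phases
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
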
#### SFT-1

Chat template tokens were added to the tokenizer in this phase, so the embeddings and the LM head were trained as well.

Parameters:
- 1 epoch
- Peak LR = 4e-5, cosine annealing
- Optimizer: AdamW 8-bit
- Global batch size = 8
- Warmup for 3% of the training
- LoRA rank = 256, alpha = 512

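As a sketch of what that tokenizer step involves (the token strings below are hypothetical, not the actual chat template), adding special tokens grows the embedding matrix, which is why the embeddings and LM head had to be trainable in this phase:

```python
# Hypothetical illustration: the real chat-template tokens may differ.
from transformers import AutoTokenizer, LlamaForCausalLM

tokenizer = AutoTokenizer.from_pretrained("common-pile/comma-v0.1-2t")
model = LlamaForCausalLM.from_pretrained("common-pile/comma-v0.1-2t")

tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|user|>", "<|assistant|>", "<|end|>"]}
)

# New rows are appended to the embedding matrix (and LM head), so these
# weights must be trained for the new tokens to carry any meaning.
model.resize_token_embeddings(len(tokenizer))
```
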
#### SFT-2

Parameters:
- 1 epoch
- Peak LR = 2e-5, cosine annealing
- Optimizer: AdamW 8-bit
- Global batch size = 8
- Warmup for 3% of the training
- LoRA rank = 128, alpha = 256

#### DPO

Parameters:
- 0.1 epochs
- Peak LR = 1e-5, cosine annealing
- Optimizer: AdamW 8-bit
- Global batch size = 32
- Warmup for the first 20 steps
- LoRA rank = 128, alpha = 256

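For reference, the objective optimized in this phase is the standard DPO loss written out below; `beta` is shown with a common default, since the value used for this run is not stated.

```python
# Standard DPO loss over per-sequence log-probabilities; beta=0.1 is a
# common default, not necessarily the value used for this run.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen completion over the rejected one,
    # relative to the frozen reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```
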
</details>

## Limitations & Intended Use

This model is best thought of as a research artifact, not a polished product. It is essentially the result of a single training run carried out under significant data and compute constraints. For any production use case, you should consider performing an additional layer of fine-tuning and alignment.

Because of those constraints, and the scarcity of high-quality data, there was very little experimentation (read: this was a one-shot run, so things might be off). We did not scale training to the usual post-training scales, nor did we do any form of RL for math, coding, structured outputs, or tool use. We also did not perform mid-training, a costly but effective technique used in many SOTA models. Consequently, this model might not perform up to your expectations.

The model has only limited preference alignment, from a small sample of the HH-RLHF dataset, and may generate misaligned outputs from time to time. Furthermore, it was not trained with a system prompt due to a lack of useful data, which can reduce its steerability.

We have not performed any specific debiasing. The training data is sourced from broad internet and instructional datasets and inevitably contains the biases present in that data. The model can and will generate text that reflects these societal biases. Handle it with care and be aware of this when using it for any downstream task.

All limitations of the base model also apply here. We strongly recommend reviewing its model card.

## Footnotes and disclaimer

**\*This is not legal advice.**