doubledsbv committed · verified · Commit f6fd023 · 1 Parent(s): 3a5baef

Update README.md

Files changed (1)
  1. README.md +8 -5
README.md CHANGED
@@ -19,14 +19,17 @@ pipeline_tag: text-generation
 
 # Model Description
 
-**KafkaLM‑15B‑Base** is a 15‑billion‑parameter, sparsity‑aware language model distilled from *Mistral‑Small‑24B‑Base‑2501*.
-This experimental model was created in three stages:
+**KafkaLM‑15B‑Base** is a 15‑billion‑parameter, sparsity‑aware language model distilled from *Mistral‑Small‑24B‑Base‑2501* and further post‑trained (SFT + DPO + GRPO with verifiable rewards).
+
+This experimental model was created in five stages:
 
 | Stage | What we did | Why it matters |
 |-------|-------------|----------------|
-| **1. SimplePrune** | Applied a hierarchical, hardware‑aware pruning pipeline that combines block‑, channel‑ and 2:4 structured sparsity (≈37.5% parameter reduction) | Slashes memory footprint while minimizing perplexity degradation |
-| **2. Teacher calibration** | Briefly fine‑tuned the unpruned 24B teacher on a 10B‑token multilingual European corpus on an AMD MI300A cluster | Produces stable logits and hidden states for distillation |
-| **3. Knowledge distillation** | Distilled the calibrated teacher into the pruned 15B student using a **fused loss**:<br/>`L = L_PooledSquareHead + L_KL + 0.25*L_CE` | Transfers teacher capabilities effectively with <15B tokens **(<2 epochs)** on 64 MI300A nodes |
+| **1. SimplePrune** | Applied a hierarchical, hardware‑aware pruning pipeline that combines block‑, channel‑ and 2:4 structured sparsity (≈ 37.5 % parameter reduction) | Slashes memory footprint while minimizing perplexity degradation |
+| **2. Teacher calibration** | Briefly fine‑tuned the unpruned 24 B teacher on a 10 B‑token multilingual European corpus on an AMD MI300A cluster | Produces stable logits and hidden states for distillation |
+| **3. Knowledge distillation** | Distilled the calibrated teacher into the pruned 15 B student using a **fused loss**:<br/>`L = L_PooledSquareHead + L_KL + 0.25 * L_CE` | Transfers teacher capabilities effectively with < 15 B tokens **(< 2 epochs)** on 64 MI300A nodes |
+| **4. SFT+DPO** | Supervised fine‑tuning (SFT) + Direct Preference Optimization (DPO) on curated open‑source multilingual and multitask datasets | Enhances model alignment with human preferences while preserving multilingual capabilities |
+| **5. RL** | Trained GRPO as a separate LoRA adapter so it is easy to serve and optional to use | Enables flexible deployment with optional reinforcement‑learning benefits without modifying the base model |
 
  **Key capabilities**
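As a side note on the 2:4 pattern named in stage 1 of the table above: it keeps at most two non‑zero weights in every group of four consecutive weights. The snippet below is a minimal magnitude‑based sketch of that constraint only; it is not the SimplePrune pipeline, whose block‑ and channel‑level pruning steps are omitted here.

```python
import torch

def magnitude_2to4_mask(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude entries in every group of 4 along the
    last dim, yielding the 2:4 structured-sparsity pattern (50 % zeros)."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "2:4 sparsity needs the inner dim divisible by 4"
    groups = weight.reshape(rows, cols // 4, 4)
    # Indices of the two smallest-magnitude weights per group of four
    _, drop = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop, 0.0)
    return (groups * mask).reshape(rows, cols)

# Example: a 4x8 weight matrix keeps exactly 2 non-zeros per group of 4
w = torch.randn(4, 8)
print(magnitude_2to4_mask(w))
```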
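A minimal sketch of how the stage‑3 fused objective `L = L_PooledSquareHead + L_KL + 0.25 * L_CE` can be assembled. The layer mapping/pooling between the 24 B teacher and the 15 B student and the SquareHead normalization used below are assumptions for illustration, not the training code used for this model.

```python
import torch
import torch.nn.functional as F

def fused_distillation_loss(student_logits, teacher_logits,
                            student_hidden, teacher_hidden,
                            labels, ce_weight=0.25, temperature=1.0):
    """Illustrative fused loss: pooled SquareHead + KL + 0.25 * CE.

    student_hidden / teacher_hidden: lists of hidden states, assumed already
    pooled/projected to matching shapes for each mapped layer pair.
    """
    # SquareHead-style term: MSE between hidden states, normalized by the
    # teacher's second moment so every layer contributes on a similar scale
    squarehead = torch.stack([
        F.mse_loss(h_s, h_t) / (h_t.pow(2).mean() + 1e-6)
        for h_s, h_t in zip(student_hidden, teacher_hidden)
    ]).mean()

    # KL divergence between teacher and student next-token distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature ** 2

    # Standard cross-entropy against the ground-truth tokens
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return squarehead + kl + ce_weight * ce
```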
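Because the GRPO stage is shipped as a separate LoRA adapter (stage 5), serving code can attach it or skip it at load time. A hedged sketch using `transformers` + `peft`; the repository ids below are placeholders, not confirmed artifact names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "your-org/KafkaLM-15B-Base"               # placeholder: base model repo
GRPO_ADAPTER_ID = "your-org/KafkaLM-15B-GRPO-LoRA"  # placeholder: LoRA adapter repo

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype="auto", device_map="auto")

# Optional: layer the GRPO-trained LoRA weights on top of the frozen base model.
USE_GRPO_ADAPTER = True
if USE_GRPO_ADAPTER:
    model = PeftModel.from_pretrained(model, GRPO_ADAPTER_ID)

prompt = "Give a short summary of Franz Kafka's work."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```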