Update README.md
README.md
CHANGED
# Model Description

**KafkaLM‑15B‑Base** is a 15‑billion‑parameter, sparsity‑aware language model distilled from *Mistral‑Small‑24B‑Base‑2501* and further post‑trained (SFT + DPO + GRPO with verifiable rewards).

This experimental model was created in five stages:

| Stage | What we did | Why it matters |
|-------|-------------|----------------|
| **1. SimplePrune** | Applied a hierarchical, hardware‑aware pruning pipeline that combines block‑, channel‑ and 2:4 structured sparsity (≈ 37.5 % parameter reduction; see the 2:4 sketch below the table) | Slashes the memory footprint while minimizing perplexity degradation |
| **2. Teacher calibration** | Briefly fine‑tuned the unpruned 24 B teacher on a 10 B‑token multilingual European corpus on an AMD MI300A cluster | Produces stable logits and hidden states for distillation |
| **3. Knowledge distillation** | Distilled the calibrated teacher into the pruned 15 B student using a **fused loss** (sketched below): `L = L_PooledSquareHead + L_KL + 0.25 * L_CE` | Transfers teacher capabilities effectively with < 15 B tokens **(< 2 epochs)** on 64 MI300A nodes |
| **4. SFT + DPO** | Supervised fine‑tuning + Direct Preference Optimization on curated open‑source multilingual and multitask datasets | Enhances alignment with human preferences while preserving multilingual capabilities |
| **5. RL** | Trained GRPO as a separate LoRA adapter, so it is easy to serve and optional to use (serving sketch below) | Enables flexible deployment with optional reinforcement‑learning gains without modifying the base model |
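
For intuition on the 2:4 structured‑sparsity component of stage 1, here is a generic magnitude‑based sketch: in every contiguous group of four weights, only the two largest‑magnitude entries survive. This illustrates the sparsity pattern only; it is not the SimplePrune pipeline itself, whose block‑ and channel‑pruning steps use their own selection criteria.

```python
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in each contiguous group of 4."""
    w = weight.reshape(-1, 4)
    keep = w.abs().topk(2, dim=1).indices      # the 2 largest |w| per group
    mask = torch.zeros_like(w)
    mask.scatter_(1, keep, 1.0)
    return (w * mask).reshape(weight.shape)

w = torch.randn(8, 8)                          # numel must be divisible by 4
print(prune_2_of_4(w))                         # exactly 2 non-zeros per group of 4
```

The 2:4 pattern is worth the constraint because it maps onto the structured‑sparsity support of modern accelerators, so it buys real inference speedups rather than just smaller checkpoints.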
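
The stage 3 fused loss reads as a pooled SquareHead feature‑matching term plus a KL term on the output distributions plus a down‑weighted cross‑entropy term. A minimal PyTorch sketch of one plausible reading follows; the sequence‑mean pooling, one‑to‑one layer matching, matching hidden sizes, temperature, and epsilon are assumptions, not details from this card.

```python
import torch
import torch.nn.functional as F

def fused_distillation_loss(student_logits, teacher_logits,
                            student_hiddens, teacher_hiddens,
                            labels, temperature=1.0, ce_weight=0.25):
    # 1) Pooled SquareHead: per-layer MSE on sequence-mean-pooled hidden
    #    states, normalized by the teacher's own magnitude. Assumes the two
    #    lists are already matched layer-for-layer and share a hidden size
    #    (a learned projection would be needed otherwise).
    terms = []
    for h_s, h_t in zip(student_hiddens, teacher_hiddens):
        p_s, p_t = h_s.mean(dim=1), h_t.mean(dim=1)
        terms.append(F.mse_loss(p_s, p_t) / (p_t.pow(2).mean() + 1e-6))
    l_squarehead = torch.stack(terms).mean()

    # 2) KL divergence between teacher and student token distributions.
    t = temperature
    l_kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.log_softmax(teacher_logits / t, dim=-1),
        log_target=True, reduction="batchmean",
    ) * (t * t)

    # 3) Next-token cross-entropy on ground-truth labels (assumed
    #    pre-shifted), weighted by 0.25 as in the card's formula.
    l_ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                           labels.view(-1), ignore_index=-100)

    return l_squarehead + l_kl + ce_weight * l_ce
```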
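Because the stage 5 GRPO policy ships as a separate LoRA adapter, serving can attach it at load time or leave the base model untouched. A usage sketch with `transformers` and `peft` follows; the model and adapter ids are placeholders, not confirmed repository paths.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "org/KafkaLM-15B-Base"        # placeholder ids: check the model
ADAPTER_ID = "org/KafkaLM-15B-GRPO"     # card for the actual repositories

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.bfloat16, device_map="auto")

# Optional: attach the GRPO LoRA adapter; skip this line for the base model.
model = PeftModel.from_pretrained(model, ADAPTER_ID)

inputs = tokenizer("Explain knowledge distillation in one paragraph.",
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
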
**Key capabilities**