doubledsbv committed · verified · Commit f6fd023 · 1 Parent(s): 3a5baef

Update README.md

Files changed (1)
  1. README.md +8 -5
README.md CHANGED
@@ -19,14 +19,17 @@ pipeline_tag: text-generation
 
 # Model Description
 
-**KafkaLM‑15B‑Base** is a 15‑billion‑parameter, sparsity‑aware language model distilled from *Mistral‑Small‑24B‑Base‑2501*.
-This experimental model was created in three stages:
+**KafkaLM‑15B‑Base** is a 15‑billion‑parameter, sparsity‑aware language model distilled from *Mistral‑Small‑24B‑Base‑2501* and further post‑trained (SFT + DPO + GRPO with verifiable rewards).
+
+This experimental model was created in five stages:
 
 | Stage | What we did | Why it matters |
 |-------|-------------|----------------|
-| **1. SimplePrune** | Applied a hierarchical, hardware‑aware pruning pipeline that combines block‑, channel‑ and 2:4 structured sparsity (≈37.5% parameter reduction) | Slashes memory footprint while minimizing perplexity degradation |
-| **2. Teacher calibration** | Briefly fine‑tuned the unpruned 24B teacher on a 10B‑token multilingual European corpus on an AMD MI300A cluster | Produces stable logits and hidden states for distillation |
-| **3. Knowledge distillation** | Distilled the calibrated teacher into the pruned 15B student using a **fused loss**:<br/>`L = L_PooledSquareHead + L_KL + 0.25*L_CE` | Transfers teacher capabilities effectively with <15B tokens **(<2 epochs)** on 64 MI300A nodes |
+| **1. SimplePrune** | Applied a hierarchical, hardware‑aware pruning pipeline that combines block‑, channel‑ and 2:4 structured sparsity (≈ 37.5 % parameter reduction) | Slashes memory footprint while minimizing perplexity degradation |
+| **2. Teacher calibration** | Briefly fine‑tuned the unpruned 24 B teacher on a 10 B‑token multilingual European corpus on an AMD MI300A cluster | Produces stable logits and hidden states for distillation |
+| **3. Knowledge distillation** | Distilled the calibrated teacher into the pruned 15 B student using a **fused loss**:<br/>`L = L_PooledSquareHead + L_KL + 0.25 * L_CE` | Transfers teacher capabilities effectively with < 15 B tokens **(< 2 epochs)** on 64 MI300A nodes |
+| **4. SFT+DPO** | Supervised fine‑tuning (SFT) + Direct Preference Optimization (DPO) on curated open‑source multilingual and multitask datasets | Enhances model alignment with human preferences while preserving multilingual capabilities |
+| **5. RL** | Trained GRPO as a separate LoRA adapter so it is easy to serve and optional to use | Enables flexible deployment with optional reinforcement‑learning benefits without modifying the base model |
 
  **Key capabilities**
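As a side note on the 2:4 pattern named in stage 1 of the table above: it keeps at most two non‑zero weights in every group of four consecutive weights. The snippet below is a minimal magnitude‑based sketch of that constraint only; it is not the SimplePrune pipeline, whose block‑ and channel‑level pruning steps are omitted here.

```python
import torch

def magnitude_2to4_mask(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude entries in every group of 4 along the
    last dim, yielding the 2:4 structured-sparsity pattern (50 % zeros)."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "2:4 sparsity needs the inner dim divisible by 4"
    groups = weight.reshape(rows, cols // 4, 4)
    # Indices of the two smallest-magnitude weights per group of four
    _, drop = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop, 0.0)
    return (groups * mask).reshape(rows, cols)

# Example: a 4x8 weight matrix keeps exactly 2 non-zeros per group of 4
w = torch.randn(4, 8)
print(magnitude_2to4_mask(w))
```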
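A minimal sketch of how the stage‑3 fused objective `L = L_PooledSquareHead + L_KL + 0.25 * L_CE` can be assembled. The layer mapping/pooling between the 24 B teacher and the 15 B student and the SquareHead normalization used below are assumptions for illustration, not the training code used for this model.

```python
import torch
import torch.nn.functional as F

def fused_distillation_loss(student_logits, teacher_logits,
                            student_hidden, teacher_hidden,
                            labels, ce_weight=0.25, temperature=1.0):
    """Illustrative fused loss: pooled SquareHead + KL + 0.25 * CE.

    student_hidden / teacher_hidden: lists of hidden states, assumed already
    pooled/projected to matching shapes for each mapped layer pair.
    """
    # SquareHead-style term: MSE between hidden states, normalized by the
    # teacher's second moment so every layer contributes on a similar scale
    squarehead = torch.stack([
        F.mse_loss(h_s, h_t) / (h_t.pow(2).mean() + 1e-6)
        for h_s, h_t in zip(student_hidden, teacher_hidden)
    ]).mean()

    # KL divergence between teacher and student next-token distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature ** 2

    # Standard cross-entropy against the ground-truth tokens
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return squarehead + kl + ce_weight * ce
```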
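Because the GRPO stage is shipped as a separate LoRA adapter (stage 5), serving code can attach it or skip it at load time. A hedged sketch using `transformers` + `peft`; the repository ids below are placeholders, not confirmed artifact names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "your-org/KafkaLM-15B-Base"               # placeholder: base model repo
GRPO_ADAPTER_ID = "your-org/KafkaLM-15B-GRPO-LoRA"  # placeholder: LoRA adapter repo

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype="auto", device_map="auto")

# Optional: layer the GRPO-trained LoRA weights on top of the frozen base model.
USE_GRPO_ADAPTER = True
if USE_GRPO_ADAPTER:
    model = PeftModel.from_pretrained(model, GRPO_ADAPTER_ID)

prompt = "Give a short summary of Franz Kafka's work."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```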