
Djuunaa

djuna


Recent Activity

liked a model about 7 hours ago

aquiffoo/aquif-3.5-8B-Think

liked a model about 7 hours ago

seedboxai/KafkaLM-15B

liked a model about 7 hours ago

IPADS-SAI/MobiMind-Grounder-3B


liked 4 models about 7 hours ago

liked a model 2 days ago

meituan-longcat/LongCat-Flash-Chat

Text Generation • 562B • Updated 2 days ago • 218 • 306

New activity in WestZhang/VibeVoice-Large-pt 3 days ago

Also MIT licensed?

#1 opened 7 days ago by

mrfakename

liked 2 models 3 days ago

WestZhang/VibeVoice-Large-pt

9B • Updated about 23 hours ago • 19.4k • 84

lodestones/Chroma1-HD

Text-to-Image • Updated 10 days ago • 41.3k • 166

reacted to codelion's post with 🔥 4 days ago

Post

5035

I wanted to share a technique that's been working really well for recovering performance after INT4 quantization.

Quantizing an LLM to INT4 for inference (unlike, say, INT8) typically incurs some accuracy loss. Instead of accepting the quality loss, we used the FP16 model as a teacher to train a tiny LoRA adapter (rank=16) for the quantized model. The cool part: the model generates its own training data using the Magpie technique, so no external datasets are needed. This is critical because we want to stay as close as possible to the distribution of the model's natural responses.
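To make the teacher/student idea concrete, here is a minimal sketch of a distillation objective for a single token position. The post does not say which loss was used, so the KL divergence below is an assumption (a common choice for distillation), and the logits are made-up illustrative values, not outputs of any real model.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position.

    Assumption: KL divergence is the distillation loss; the post does not
    specify the actual objective.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative (made-up) next-token logits over a 3-token vocabulary.
teacher = [2.0, 1.0, 0.1]          # FP16 teacher
student_before = [1.5, 1.4, 0.3]   # quantized student without an adapter
student_after = [1.9, 1.0, 0.2]    # quantized student with an adapter

loss_before = distillation_kl(teacher, student_before)
loss_after = distillation_kl(teacher, student_after)
print(f"KL before: {loss_before:.4f}, KL after: {loss_after:.4f}")
```

In a real training loop this quantity would be averaged over all token positions of the self-generated responses, and only the adapter weights would receive gradients.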

Last year, Apple's foundation models paper (https://arxiv.org/pdf/2407.21075) proposed a similar technique and found: "By using accuracy-recovery LoRA adapters with only rank 16, Alpaca win rate can be improved by 7-18%, GMS8K accuracy is boosted by 5-10%." (page 47).

We saw similar results on Qwen3-0.6B:

Perplexity: 2.40 → 2.09 (only 5.7% degradation from FP16 baseline)
Memory: Only 0.28GB vs 1.0GB for FP16 (75% reduction)
Speed: 3.0x faster inference than FP16
Quality: Generates correct, optimized code solutions
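
As a rough sanity check on the perplexity line, the quoted figures can be used to back out the implied FP16 baseline and the share of the quantization gap that the adapter closes. This assumes 2.40 is the INT4 perplexity before the adapter, 2.09 is the perplexity after, and the 5.7% is measured relative to the FP16 perplexity; the derived values are not reported in the post.

```python
# Figures quoted in the post.
ppl_int4 = 2.40              # assumed: INT4 model without the adapter
ppl_recovered = 2.09         # assumed: INT4 model with the adapter
degradation_vs_fp16 = 0.057  # "5.7% degradation from FP16 baseline"

# Derived values (not stated in the post).
implied_fp16 = ppl_recovered / (1 + degradation_vs_fp16)
gap_before = ppl_int4 - implied_fp16
gap_after = ppl_recovered - implied_fp16
fraction_recovered = 1 - gap_after / gap_before

print(f"Implied FP16 perplexity: {implied_fp16:.3f}")
print(f"Share of the gap recovered: {fraction_recovered:.1%}")
```

Under those assumptions the implied FP16 perplexity is about 1.98, and the adapter closes roughly 73% of the gap between the INT4 and FP16 models.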

- Pre-trained adapter: codelion/Qwen3-0.6B-accuracy-recovery-lora
- GitHub repo: https://github.com/codelion/ellora

Happy to answer questions about the implementation or help anyone trying to replicate this. The key insight is that quantization errors are systematic and learnable - a small adapter can bridge the gap without negating the benefits of quantization.
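
One way to see why quantization error is "systematic": with round-to-nearest quantization, the error is a deterministic function of the weights and is bounded by half a quantization step. The sketch below uses a generic per-tensor symmetric INT4 scheme with made-up weights; the post does not state which quantization scheme was actually used, so treat this as an illustration only.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to integer codes in [-8, 7].

    Assumption: one scale per tensor; real INT4 schemes often use
    per-group scales and other refinements.
    """
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

# Illustrative (made-up) weights.
weights = [0.82, -0.41, 0.07, -0.93, 0.55, 0.12]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
errors = [w - r for w, r in zip(weights, restored)]

print("codes: ", codes)
print("errors:", [round(e, 4) for e in errors])
```

Because the same weights always produce the same errors, a small trainable correction such as a low-rank adapter has a fixed target to learn rather than random noise.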

Has anyone else experimented with self-distillation for quantization recovery? Would love to hear about different approaches!

liked a model 4 days ago

inclusionAI/Qwen3-32B-AWorld

Text Generation • 33B • Updated 1 day ago • 22 • 10

liked a model 6 days ago

OpenBuddy/OpenBuddy-Qwen3-Coder-30B-A3B-Base

31B • Updated 7 days ago • 25 • 3

reacted to codelion's post with 🔥 6 days ago

Post

4904

I recently added a recipe to ellora that improves the reasoning capabilities of Gemma-3-1B using self-supervised learning. The model now shows step-by-step thinking in <think> tags before answering.
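
For anyone consuming output in this format, a small helper can separate the reasoning from the final answer. This is a sketch that assumes a single <think>...</think> block followed by the answer; the function name and the sample string are hypothetical and not taken from the ellora repo.

```python
import re

# Assumed format: one <think>...</think> block, then the final answer.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, answer); reasoning is None if no tags are found."""
    match = THINK_PATTERN.search(text)
    if match is None:
        return None, text.strip()
    return match.group(1).strip(), match.group(2).strip()

sample = "<think>\n2 + 2 = 4\n</think>\nThe answer is 4."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```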

Logic puzzle accuracy: 61% → 84%. 3 hours of training on a single GPU. 🧠

Used GRPO, where the model generates multiple responses to each prompt and learns to prefer the better reasoning. Works surprisingly well for making smaller models more transparent.
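
The "prefer better reasoning" step in GRPO is usually implemented by scoring each response relative to the other responses sampled for the same prompt. Below is a minimal sketch of that group-relative advantage. The reward values are made up, and the use of the population standard deviation is an assumption; the recipe's actual reward function and normalization are not described in the post.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Responses above the group mean get a positive advantage and are
    reinforced; those below the mean get a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]

# Illustrative (made-up) rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.5, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Since the baseline is the group mean, no separate value model is needed, which helps keep the approach affordable on a single GPU.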

🔗 Colab: https://colab.research.google.com/github/codelion/ellora/blob/main/Ellora_Recipe_2_Reasoning_LoRA_with_Self-Rewarding_GRPO.ipynb

🤗 Model: codelion/gemma-3-1b-it-reasoning-grpo-lora

💻 Code: https://github.com/codelion/ellora