# toddric-3b-merged-v3-bnb4

Type: Qwen2.5-3B-Instruct, bnb-4bit (NF4, double-quant, bf16 compute)

What: 4-bit export of toddric-3b-merged-v3 for lower VRAM
## TL;DR
- Same model as `…-merged-v3`, packaged in bitsandbytes 4-bit.
- Lower VRAM footprint (runs comfortably on 8 GB).
- Slightly slower than bf16 on the same GPU (trades some speed for lower VRAM).
- Requires `bitsandbytes` at runtime (a quick environment check is sketched below).
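An optional pre-flight check, sketched here for convenience (not part of the card's tooling): confirm `bitsandbytes` is importable and see how much GPU memory is free before loading.

```python
# Optional pre-flight check: bitsandbytes present, and how much VRAM is free.
import importlib.util

import torch

assert importlib.util.find_spec("bitsandbytes") is not None, "pip install bitsandbytes"
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
else:
    print("No CUDA device found; this 4-bit export is intended for GPU use.")
```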
## How to load (Transformers + bitsandbytes)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

mid = "toddie314/toddric-3b-merged-v3-bnb4"
tok = AutoTokenizer.from_pretrained(mid, use_fast=True)
tok.padding_side = "left"
tok.pad_token = tok.pad_token or tok.eos_token

# Quantization settings are saved in quantization_config.json
model = AutoModelForCausalLM.from_pretrained(
    mid,
    device_map={"": 0},
    attn_implementation="eager",
    low_cpu_mem_usage=True,
)
model.config.use_cache = True
model.generation_config.pad_token_id = tok.pad_token_id
```
If your stack doesn’t auto-install bitsandbytes, `pip install bitsandbytes`.
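Once loaded, a minimal generation call looks like the following. This is a sketch: the prompt is just an example, and the greedy settings mirror the shipped defaults.

```python
# Assumes `tok` and `model` from the loading snippet above.
messages = [{"role": "user", "content": "Give me three ideas for a weekend project."}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,  # greedy, matching generation_config.json
    # For creative tasks, sample instead, e.g.:
    # do_sample=True, temperature=0.7, top_p=0.9,
)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```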
## Evaluation (acceptance pack)
57-prompt persona acceptance suite (content constraints + speed).

- Result (bnb-4bit, RTX 4060 8 GB): 100% pass; median ~9.5 tok/s (min ~9.0, max ~9.8) with `attn_implementation="eager"`.

Expect slightly lower throughput than bf16 on the same card; the benefit is memory.
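The tok/s figure can be sanity-checked with a rough timing loop like the one below. This is an illustration only, not the acceptance harness itself.

```python
# Rough tokens-per-second check. Assumes `tok` and `model` are already loaded.
import time

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Summarize the benefits of 4-bit quantization."}],
    add_generation_prompt=True,
    tokenize=False,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"~{new_tokens / elapsed:.1f} tok/s")
```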
## Notes / Gotchas
- Use `attn_implementation="eager"` on 8–12 GB GPUs for predictable speed.
- If you see warnings about unused quant keys, ensure `quantization_config.json` matches bitsandbytes NF4 (a matching config is sketched after this list).
- Greedy decoding defaults are provided via `generation_config.json`. Enable sampling for creative tasks.
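For reference, the NF4 / double-quant / bf16-compute setup described at the top of this card corresponds to a `BitsAndBytesConfig` like the one below. Treat it as a sketch (useful e.g. for re-quantizing the bf16 model on the fly); the config shipped in this repo is authoritative.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 + double quantization + bf16 compute, matching the card description.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Example: quantize the bf16 card on the fly instead of using this export.
model = AutoModelForCausalLM.from_pretrained(
    "toddie314/toddric-3b-merged-v3",
    quantization_config=bnb_config,
    device_map={"": 0},
    attn_implementation="eager",
)
```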
## Provenance
This is a 4-bit export of `toddie314/toddric-3b-merged-v3`. See that card for training details and data notes.