# toddric-3b-merged-v3-bnb4

Type: Qwen2.5-3B-Instruct, bnb-4bit (NF4, double-quant, bf16 compute)

What: 4-bit export of toddric-3b-merged-v3 for lower VRAM
## TL;DR
- Same model as `…-merged-v3`, packaged in bitsandbytes 4-bit.
- Lower VRAM footprint (runs comfortably on 8 GB).
- Slightly slower than bf16 on the same GPU (trades some speed for lower VRAM).
- Requires `bitsandbytes` at runtime (a quick environment check is sketched below).
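An optional pre-flight check, sketched here for convenience (not part of the card's tooling): confirm `bitsandbytes` is importable and see how much GPU memory is free before loading.

```python
# Optional pre-flight check: bitsandbytes present, and how much VRAM is free.
import importlib.util

import torch

assert importlib.util.find_spec("bitsandbytes") is not None, "pip install bitsandbytes"
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
else:
    print("No CUDA device found; this 4-bit export is intended for GPU use.")
```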
## How to load (Transformers + bitsandbytes)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

mid = "toddie314/toddric-3b-merged-v3-bnb4"
tok = AutoTokenizer.from_pretrained(mid, use_fast=True)
tok.padding_side = "left"
tok.pad_token = tok.pad_token or tok.eos_token

# Quantization settings are saved in quantization_config.json
model = AutoModelForCausalLM.from_pretrained(
    mid,
    device_map={"": 0},
    attn_implementation="eager",
    low_cpu_mem_usage=True,
)
model.config.use_cache = True
model.generation_config.pad_token_id = tok.pad_token_id
```
If your stack doesn’t auto-install bitsandbytes, `pip install bitsandbytes`.
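Once loaded, a minimal generation call looks like the following. This is a sketch: the prompt is just an example, and the greedy settings mirror the shipped defaults.

```python
# Assumes `tok` and `model` from the loading snippet above.
messages = [{"role": "user", "content": "Give me three ideas for a weekend project."}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,  # greedy, matching generation_config.json
    # For creative tasks, sample instead, e.g.:
    # do_sample=True, temperature=0.7, top_p=0.9,
)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```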
## Evaluation (acceptance pack)
57-prompt persona acceptance suite (content constraints + speed).

- Result (bnb-4bit, RTX 4060 8 GB): 100% pass; median ~9.5 tok/s (min ~9.0, max ~9.8) with `attn_implementation="eager"`.

Expect slightly lower throughput than bf16 on the same card; the benefit is memory.
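The tok/s figure can be sanity-checked with a rough timing loop like the one below. This is an illustration only, not the acceptance harness itself.

```python
# Rough tokens-per-second check. Assumes `tok` and `model` are already loaded.
import time

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Summarize the benefits of 4-bit quantization."}],
    add_generation_prompt=True,
    tokenize=False,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"~{new_tokens / elapsed:.1f} tok/s")
```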
## Notes / Gotchas
- Use `attn_implementation="eager"` on 8–12 GB GPUs for predictable speed.
- If you see warnings about unused quant keys, ensure `quantization_config.json` matches bitsandbytes NF4 (a matching config is sketched after this list).
- Greedy decoding defaults are provided via `generation_config.json`. Enable sampling for creative tasks.
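For reference, the NF4 / double-quant / bf16-compute setup described at the top of this card corresponds to a `BitsAndBytesConfig` like the one below. Treat it as a sketch (useful e.g. for re-quantizing the bf16 model on the fly); the config shipped in this repo is authoritative.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 + double quantization + bf16 compute, matching the card description.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Example: quantize the bf16 card on the fly instead of using this export.
model = AutoModelForCausalLM.from_pretrained(
    "toddie314/toddric-3b-merged-v3",
    quantization_config=bnb_config,
    device_map={"": 0},
    attn_implementation="eager",
)
```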
## Provenance
This is a 4-bit export of `toddie314/toddric-3b-merged-v3`. See that card for training details and data notes.