---
language:
- en
license: other
library_name: transformers
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-3B-Instruct
- toddie314/toddric-3b-merged-v3
tags:
- qwen
- qwen2.5
- instruct
- lora-merged
- persona
- assistant
- bitsandbytes
- 4-bit
- nf4
metrics:
- acceptance-pass-rate
- tokens-per-second
model-index:
- name: toddric-3b-merged-v3-bnb4
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: persona_acceptance_suite
      type: internal
    metrics:
    - name: acceptance-pass-rate
      type: exact_match
      value: 1.0
    - name: tokens-per-second
      type: throughput
      value: 9.5
---
# toddric-3b-merged-v3-bnb4

**Type:** Qwen2.5-3B-Instruct, bnb-4bit (NF4, double-quant, bf16 compute)
**What:** 4-bit export of toddric-3b-merged-v3 for lower VRAM
## TL;DR

- Same model as `…-merged-v3`, packaged in bitsandbytes 4-bit.
- Lower VRAM footprint (runs comfortably on 8 GB).
- Slightly slower than bf16 on the same GPU (trades a little speed for lower VRAM).
- Requires `bitsandbytes` at runtime.
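A back-of-envelope sketch of why the 4-bit export fits on an 8 GB card (weights only; the ~3.1B parameter count is an assumption for a "3B" model, and real usage adds KV cache, activations, and framework overhead):

```python
# Rough weight-memory estimate; parameter count is approximate.
params = 3.1e9  # assumed parameter count for a "3B"-class model

bf16_gb = params * 2 / 1024**3    # bf16: 2 bytes per weight
nf4_gb = params * 0.5 / 1024**3   # NF4: 4 bits = 0.5 bytes per weight

print(f"bf16 weights: ~{bf16_gb:.1f} GB")  # roughly 5.8 GB
print(f"nf4  weights: ~{nf4_gb:.1f} GB")   # roughly 1.4 GB
```

The bf16 weights alone nearly fill an 8 GB GPU once cache and activations are added; the NF4 weights leave ample headroom.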
## How to load (Transformers + bitsandbytes)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

mid = "toddie314/toddric-3b-merged-v3-bnb4"

tok = AutoTokenizer.from_pretrained(mid, use_fast=True)
tok.padding_side = "left"
tok.pad_token = tok.pad_token or tok.eos_token

# Quantization settings are saved in quantization_config.json
model = AutoModelForCausalLM.from_pretrained(
    mid,
    device_map={"": 0},
    attn_implementation="eager",
    low_cpu_mem_usage=True,
)
model.config.use_cache = True
model.generation_config.pad_token_id = tok.pad_token_id
```

If your stack doesn't auto-install bitsandbytes: `pip install bitsandbytes`.
## Evaluation (acceptance pack)

57-prompt persona acceptance suite (content constraints + speed).

- Result (bnb-4bit, RTX 4060 8 GB): 100% pass; median ~9.5 tok/s (min ~9.0, max ~9.8) with `attn="eager"`.
- Expect slightly lower throughput than bf16 on the same card; the benefit is memory.
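For latency planning, the median throughput converts directly to rough wall-clock generation times (simple arithmetic, assuming a steady ~9.5 tok/s decode rate; the token counts below are illustrative):

```python
tok_per_s = 9.5  # median decode throughput from the acceptance run

# Approximate wall-clock time to decode a given number of tokens.
for n_tokens in (64, 256, 512):
    print(f"{n_tokens} tokens ~= {n_tokens / tok_per_s:.1f} s")
```

So a 256-token reply takes on the order of half a minute on this card.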
## Notes / Gotchas

- Use `attn_implementation="eager"` on 8–12 GB GPUs for predictable speed.
- If you see warnings about unused quant keys, ensure `quantization_config.json` matches bitsandbytes NF4.
- Greedy decoding defaults are provided via `generation_config.json`. Enable sampling for creative tasks.
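One way to opt into sampling at generation time without touching the shipped greedy defaults is a per-call `GenerationConfig`. A sketch (the temperature and top-p values here are illustrative choices, not from this card):

```python
from transformers import GenerationConfig

# Illustrative sampling settings; the repo ships greedy defaults in
# generation_config.json, so sampling must be enabled explicitly.
sampling_cfg = GenerationConfig(
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)

# Use it per call:
# out = model.generate(**inputs, generation_config=sampling_cfg)
```

Passing the config per call keeps deterministic greedy behavior for acceptance-style prompts while allowing variation for creative ones.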
## Provenance

This is a 4-bit export of `toddie314/toddric-3b-merged-v3`. See that card for training details and data notes.