---
language:
  - en
license: other
library_name: transformers
pipeline_tag: text-generation
base_model:
  - Qwen/Qwen2.5-3B-Instruct
  - toddie314/toddric-3b-merged-v3
tags:
  - qwen
  - qwen2.5
  - instruct
  - lora-merged
  - persona
  - assistant
  - bitsandbytes
  - 4-bit
  - nf4
metrics:
  - acceptance-pass-rate
  - tokens-per-second
model-index:
  - name: toddric-3b-merged-v3-bnb4
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: persona_acceptance_suite
          type: internal
        metrics:
          - name: acceptance-pass-rate
            type: exact_match
            value: 1
          - name: tokens-per-second
            type: throughput
            value: 9.5
---

# toddric-3b-merged-v3-bnb4

- **Type:** Qwen2.5-3B-Instruct, bnb-4bit (NF4, double-quant, bf16 compute)
- **What:** 4-bit export of toddric-3b-merged-v3 for lower VRAM


## TL;DR

- Same model as …-merged-v3, packaged in bitsandbytes 4-bit (see the config sketch below).
- Lower VRAM footprint (runs comfortably on 8 GB).
- Slightly slower than bf16 on the same GPU (you trade a bit of speed for VRAM headroom).
- Requires bitsandbytes at runtime.
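
For reference, the NF4 / double-quant / bf16-compute recipe above corresponds to a bitsandbytes config along these lines. This is only a sketch of quantizing the bf16 parent on the fly; you do not need it to load this repo, since the settings are already saved with the model.

```python
# Reference only: the quantization recipe expressed as an explicit bitsandbytes config.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NF4
    bnb_4bit_use_double_quant=True,          # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # bf16 compute
)

model = AutoModelForCausalLM.from_pretrained(
    "toddie314/toddric-3b-merged-v3",        # bf16 parent model
    quantization_config=bnb_config,
    device_map={"": 0},
)
```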

## How to load (Transformers + bitsandbytes)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

mid = "toddie314/toddric-3b-merged-v3-bnb4"
tok = AutoTokenizer.from_pretrained(mid, use_fast=True)
tok.padding_side = "left"
tok.pad_token = tok.pad_token or tok.eos_token

# Quantization settings are saved with the model config (quantization_config),
# so no BitsAndBytesConfig is needed here.
model = AutoModelForCausalLM.from_pretrained(
    mid,
    device_map={"": 0},
    attn_implementation="eager",
    low_cpu_mem_usage=True,
)
model.config.use_cache = True
model.generation_config.pad_token_id = tok.pad_token_id
```

If your stack doesn’t install bitsandbytes automatically, run `pip install bitsandbytes`.
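
A minimal generation sketch, reusing `tok` and `model` from the snippet above (the prompt text is just an illustration; the Qwen2.5 chat template is applied by the tokenizer):

```python
messages = [{"role": "user", "content": "Give me three tips for writing clear commit messages."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```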


## Evaluation (acceptance pack)

A 57-prompt persona acceptance suite covering content constraints and generation speed.

- Result (bnb-4bit, RTX 4060 8 GB): 100% pass; median ~9.5 tok/s (min ~9.0, max ~9.8) with `attn_implementation="eager"`.

Expect slightly lower throughput than bf16 on the same card; the benefit is the memory savings.
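
If you want to sanity-check throughput on your own hardware, a rough measurement looks like this (a generic sketch, not the actual acceptance harness; the prompt is illustrative and `tok`/`model` come from the loading snippet above):

```python
import time

# Run once as a warm-up before timing if you want stable numbers.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Summarize NF4 quantization in two sentences."}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

start = time.perf_counter()
out = model.generate(prompt, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - prompt.shape[-1]
print(f"{new_tokens / elapsed:.1f} tok/s")
```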


## Notes / Gotchas

- Use `attn_implementation="eager"` on 8–12 GB GPUs for predictable speed.
- If you see warnings about unused quant keys, check that the quantization_config saved with the model matches bitsandbytes NF4.
- Greedy defaults are provided via generation_config.json. Enable sampling for creative tasks (see the sketch below).
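
To override the greedy defaults at generation time, pass sampling arguments to `generate` (the values here are illustrative, not tuned recommendations for this model):

```python
out = model.generate(
    inputs,                  # input ids from apply_chat_template, as in the loading example
    max_new_tokens=256,
    do_sample=True,          # switch off the greedy default
    temperature=0.7,         # illustrative values, not tuned
    top_p=0.9,
)
```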

## Provenance

This is a 4-bit export of `toddie314/toddric-3b-merged-v3`. See that card for training details and data notes.