C4AI Command A - Quantized Models

This repository contains quantized versions of the C4AI Command A model, an open weights research release by Cohere and Cohere For AI. The original model is a 111 billion parameter language model optimized for enterprise use cases, excelling in agentic, multilingual, and retrieval-augmented generation (RAG) tasks while being deployable on minimal hardware (e.g., two GPUs). Here, we provide multiple quantized variants to further reduce memory footprint and enhance deployment flexibility across various hardware setups, including multi-GPU environments.

For details on the original model, refer to the official model card below.


Quantized Models

We have quantized the original CohereForAI/c4ai-command-a-03-2025 model using the bitsandbytes library with various configurations to balance performance, memory efficiency, and accuracy. Below are the available quantized versions:

| Quantization Type | Description | Compute Dtype | Double Quantization | Notes |
|---|---|---|---|---|
| 4bit_nf4_double | 4-bit quantization with NF4 (4-bit NormalFloat) | bfloat16 | Yes | High precision with reduced memory usage |
| 4bit_fp4 | 4-bit quantization with FP4 (4-bit floating point) | bfloat16 | No | Lightweight, slightly less precise |
| 8bit_standard | Standard 8-bit quantization | bfloat16 | N/A | Balanced memory and accuracy |
| 8bit_mixed | 8-bit quantization with mixed precision and CPU offloading capability | float16 | N/A | Flexible for constrained environments |
| 4bit_nf4_no_double | 4-bit quantization with NF4, no double quantization | bfloat16 | No | Minimal memory footprint |

These models are optimized for multi-GPU deployment using the accelerate library, which handles efficient distribution across available GPUs. Each variant is hosted in its own sub-repository (a download sketch follows the list):

  • Tonic/c4ai-command-a-03-2025-4bit_nf4_double
  • Tonic/c4ai-command-a-03-2025-4bit_fp4
  • Tonic/c4ai-command-a-03-2025-8bit_standard
  • Tonic/c4ai-command-a-03-2025-8bit_mixed
  • Tonic/c4ai-command-a-03-2025-4bit_nf4_no_double
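
If you want to fetch a variant ahead of time, huggingface_hub (included in the installation step below) can pre-download the weights to the local cache. A minimal sketch; any of the sub-repositories above can be substituted:

from huggingface_hub import snapshot_download

# Pre-download a quantized variant to the local Hugging Face cache and
# return the local directory containing the weights.
local_path = snapshot_download(repo_id="Tonic/c4ai-command-a-03-2025-4bit_nf4_double")
print(f"Weights cached at: {local_path}")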

Usage

To use a quantized model, install the required dependencies and load the desired variant as shown below. Multi-GPU support is handled under the hood by accelerate via device_map="auto".

Installation

pip install transformers bitsandbytes accelerate torch huggingface_hub

Example: Loading and Generating Text

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Specify the quantized model ID
model_id = "Tonic/c4ai-command-a-03-2025-4bit_nf4_double"  # Replace with desired variant
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" shards the model across all available GPUs via accelerate.
# Note: do not additionally call accelerator.prepare() on a model dispatched
# this way; it is unnecessary and can raise an error.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)

# Format message with chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Generate text
gen_tokens = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
)
gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)

Notes

  • Device Mapping: device_map="auto" ensures the model is distributed across all available GPUs.
  • Compute Dtype: Adjust torch_dtype (e.g., torch.bfloat16 or torch.float16) based on your hardware and the quantization type.
  • Memory: Quantized models significantly reduce VRAM requirements compared to the original 111B-parameter model, making them suitable for deployment on consumer-grade GPUs; the sketch after this list shows a quick way to check the footprint.
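
To verify the savings on your own hardware, transformers provides a per-model memory footprint helper. A minimal sketch, assuming model has already been loaded as in the Usage example above:

# get_memory_footprint() reports the memory occupied by the model's
# parameters and buffers, in bytes.
footprint_gib = model.get_memory_footprint() / 1024**3
print(f"Model memory footprint: {footprint_gib:.1f} GiB")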

Quantization Details

The quantization process leverages bitsandbytes with the following configurations:

  • 4-bit Variants: Use nf4 or fp4 quantization types, with optional double quantization, which also quantizes the quantization constants to save additional memory at negligible accuracy cost.
  • 8-bit Variants: Offer standard or mixed precision options, with the latter supporting CPU offloading for additional flexibility.
  • Multi-GPU Optimization: The accelerate library handles model sharding and distribution, allowing deployment on systems with multiple GPUs. A configuration sketch follows this list.
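
The table above maps onto bitsandbytes configurations along the following lines. A minimal sketch, assuming standard BitsAndBytesConfig options; the original quantization run may have used additional settings:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 with double quantization (the 4bit_nf4_double variant).
nf4_double = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 8-bit with fp32 CPU offload for modules that do not fit on GPU
# (the 8bit_mixed variant).
int8_mixed = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Quantize the original model on the fly with one of the configs above.
model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/c4ai-command-a-03-2025",
    quantization_config=nf4_double,
    device_map="auto",
)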

For the exact quantization script, see this Gist (replace with a link to your script if hosted).


Model Card for C4AI Command A

Below is the original model card for C4AI Command A, adapted for this repository.


Model Summary

C4AI Command A is an open weights research release of a 111 billion parameter model optimized for demanding enterprises that require fast, secure, and high-quality AI. Compared to other leading proprietary and open-weights models, Command A delivers maximum performance with minimum hardware costs, excelling on business-critical agentic and multilingual tasks while being deployable on just two GPUs.

Try C4AI Command A

You can try the original model before downloading weights in the hosted Hugging Face Space.


Model Details

  • Input: Text only
  • Output: Text only
  • Model Architecture: Auto-regressive language model with an optimized transformer architecture, featuring sliding window attention (window size 4096) with RoPE and a global attention layer without positional embeddings (a config-inspection sketch follows this list).
  • Languages: Supports 23 languages including English, French, Spanish, German, Japanese, Chinese, Arabic, and more (see full list in the original model card).
  • Context Length: 256K
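
These details can be checked programmatically from the released configuration. A minimal sketch; the attribute names queried here (sliding_window, max_position_embeddings) are assumptions about the config rather than guarantees, hence the guarded lookups:

from transformers import AutoConfig

# Load only the configuration (no weights) and inspect attention settings.
config = AutoConfig.from_pretrained("CohereForAI/c4ai-command-a-03-2025")
print(getattr(config, "sliding_window", "sliding_window not in config"))
print(getattr(config, "max_position_embeddings", "max_position_embeddings not in config"))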

Chat Capabilities

Command A is configured as a conversational model by default with two safety modes: contextual (default, fewer constraints) and strict (avoids sensitive topics). See Command A prompt format docs for details.


RAG Capabilities

Command A excels in Retrieval Augmented Generation (RAG) tasks. Use the apply_chat_template method with document snippets for RAG functionality. Example:

# tokenizer and model as loaded in the Usage example above
conversation = [{"role": "user", "content": "What has Man always dreamed of?"}]
documents = [
    {"heading": "The Moon", "body": "Man has always dreamed of destroying the moon..."},
    {"heading": "Love", "body": "Man's dream has always been to find love..."}
]
# The documents are rendered into the grounded-generation prompt by the chat template.
input_ids = tokenizer.apply_chat_template(conversation, documents=documents, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
gen_tokens = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(gen_tokens[0]))

Tool Use Capabilities

Command A supports conversational tool use with JSON schema-based tool descriptions. See the tool use example in the original model card for implementation details.
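
Recent transformers releases accept JSON-schema tool definitions directly through apply_chat_template. A minimal sketch, reusing tokenizer and model from the Usage example; the get_weather tool is hypothetical, not part of the model release:

# A tool described as a JSON schema; the name and parameters are illustrative.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the weather in Toronto?"}]
# The chat template renders the tool schemas into the prompt so the model
# can emit a structured tool call.
input_ids = tokenizer.apply_chat_template(messages, tools=tools, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
gen_tokens = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(gen_tokens[0]))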


Code Capabilities

The model performs well on enterprise-relevant code tasks (e.g., SQL generation, code translation). Use low temperature or greedy decoding for optimal code generation.
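
Greedy decoding is selected by disabling sampling in generate. A minimal sketch, reusing tokenizer and model from the Usage example; the SQL prompt is illustrative:

messages = [{"role": "user", "content": "Write a SQL query that returns the ten most recent orders."}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

# do_sample=False disables sampling, giving deterministic greedy decoding,
# which tends to work best for code generation.
gen_tokens = model.generate(input_ids, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(gen_tokens[0]))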


Terms of Use

This model is released under a CC-BY-NC license for non-commercial use only, adhering to C4AI's Acceptable Use Policy. For commercial inquiries, contact Cohere’s Sales team.


Contact

For issues or questions, reach out to [email protected].

