C4AI Command A - Quantized Models

This repository contains quantized versions of the C4AI Command A model, an open weights research release by Cohere and Cohere For AI. The original model is a 111 billion parameter language model optimized for enterprise use cases, excelling in agentic, multilingual, and retrieval-augmented generation (RAG) tasks while being deployable on minimal hardware (e.g., two GPUs). Here, we provide multiple quantized variants to further reduce memory footprint and enhance deployment flexibility across various hardware setups, including multi-GPU environments.

For details on the original model, refer to the official model card below.


Quantized Models

We have quantized the original CohereForAI/c4ai-command-a-03-2025 model using the bitsandbytes library with various configurations to balance performance, memory efficiency, and accuracy. Below are the available quantized versions:

| Quantization Type | Description | Compute Dtype | Double Quantization | Notes |
|---|---|---|---|---|
| 4bit_nf4_double | 4-bit quantization with NF4 (4-bit NormalFloat) | bfloat16 | Yes | High precision with reduced memory usage |
| 4bit_fp4 | 4-bit quantization with FP4 (4-bit floating point) | bfloat16 | No | Lightweight, slightly less precise |
| 8bit_standard | Standard 8-bit quantization | bfloat16 | N/A | Balanced memory and accuracy |
| 8bit_mixed | 8-bit quantization with mixed precision and CPU offloading capability | float16 | N/A | Flexible for constrained environments |
| 4bit_nf4_no_double | 4-bit quantization with NF4, no double quantization | bfloat16 | No | Minimal memory footprint |

These models are optimized for multi-GPU deployment using the accelerate library, which handles efficient distribution across available GPUs. Each variant is hosted in its own sub-repository (a download sketch follows the list):

  • Tonic/c4ai-command-a-03-2025-4bit_nf4_double
  • Tonic/c4ai-command-a-03-2025-4bit_fp4
  • Tonic/c4ai-command-a-03-2025-8bit_standard
  • Tonic/c4ai-command-a-03-2025-8bit_mixed
  • Tonic/c4ai-command-a-03-2025-4bit_nf4_no_double
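
If you want to fetch a variant ahead of time, huggingface_hub (included in the installation step below) can pre-download the weights to the local cache. A minimal sketch; any of the sub-repositories above can be substituted:

from huggingface_hub import snapshot_download

# Pre-download a quantized variant to the local Hugging Face cache and
# return the local directory containing the weights.
local_path = snapshot_download(repo_id="Tonic/c4ai-command-a-03-2025-4bit_nf4_double")
print(f"Weights cached at: {local_path}")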

Usage

To use a quantized model, install the required dependencies and load the desired variant as shown below. Multi-GPU support is handled under the hood by accelerate via device_map="auto".

Installation

pip install transformers bitsandbytes accelerate torch huggingface_hub

Example: Loading and Generating Text

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Specify the quantized model ID
model_id = "Tonic/c4ai-command-a-03-2025-4bit_nf4_double"  # Replace with desired variant
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" shards the model across all available GPUs via accelerate.
# Note: do not additionally call accelerator.prepare() on a model dispatched
# this way; it is unnecessary and can raise an error.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)

# Format message with chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Generate text
gen_tokens = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
)
gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)

Notes

  • Device Mapping: device_map="auto" ensures the model is distributed across all available GPUs.
  • Compute Dtype: Adjust torch_dtype (e.g., torch.bfloat16 or torch.float16) based on your hardware and the quantization type.
  • Memory: Quantized models significantly reduce VRAM requirements compared to the original 111B-parameter model, making them suitable for deployment on consumer-grade GPUs; the sketch after this list shows a quick way to check the footprint.
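
To verify the savings on your own hardware, transformers provides a per-model memory footprint helper. A minimal sketch, assuming model has already been loaded as in the Usage example above:

# get_memory_footprint() reports the memory occupied by the model's
# parameters and buffers, in bytes.
footprint_gib = model.get_memory_footprint() / 1024**3
print(f"Model memory footprint: {footprint_gib:.1f} GiB")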

Quantization Details

The quantization process leverages bitsandbytes with the following configurations:

  • 4-bit Variants: Use nf4 or fp4 quantization types, with optional double quantization, which also quantizes the quantization constants to save additional memory at negligible accuracy cost.
  • 8-bit Variants: Offer standard or mixed precision options, with the latter supporting CPU offloading for additional flexibility.
  • Multi-GPU Optimization: The accelerate library handles model sharding and distribution, allowing deployment on systems with multiple GPUs. A configuration sketch follows this list.
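
The table above maps onto bitsandbytes configurations along the following lines. A minimal sketch, assuming standard BitsAndBytesConfig options; the original quantization run may have used additional settings:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 with double quantization (the 4bit_nf4_double variant).
nf4_double = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 8-bit with fp32 CPU offload for modules that do not fit on GPU
# (the 8bit_mixed variant).
int8_mixed = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Quantize the original model on the fly with one of the configs above.
model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/c4ai-command-a-03-2025",
    quantization_config=nf4_double,
    device_map="auto",
)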

For the exact quantization script, see this Gist (replace with a link to your script if hosted).


Model Card for C4AI Command A

Below is the original model card for C4AI Command A, adapted for this repository.


Model Summary

C4AI Command A is an open weights research release of a 111 billion parameter model optimized for demanding enterprises that require fast, secure, and high-quality AI. Compared to other leading proprietary and open-weights models, Command A delivers maximum performance with minimum hardware costs, excelling on business-critical agentic and multilingual tasks while being deployable on just two GPUs.

Try C4AI Command A

You can try the original model before downloading weights in the hosted Hugging Face Space.


Model Details

  • Input: Text only
  • Output: Text only
  • Model Architecture: Auto-regressive language model with an optimized transformer architecture, featuring sliding window attention (window size 4096) with RoPE and a global attention layer without positional embeddings (a config-inspection sketch follows this list).
  • Languages: Supports 23 languages including English, French, Spanish, German, Japanese, Chinese, Arabic, and more (see full list in the original model card).
  • Context Length: 256K
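
These details can be checked programmatically from the released configuration. A minimal sketch; the attribute names queried here (sliding_window, max_position_embeddings) are assumptions about the config rather than guarantees, hence the guarded lookups:

from transformers import AutoConfig

# Load only the configuration (no weights) and inspect attention settings.
config = AutoConfig.from_pretrained("CohereForAI/c4ai-command-a-03-2025")
print(getattr(config, "sliding_window", "sliding_window not in config"))
print(getattr(config, "max_position_embeddings", "max_position_embeddings not in config"))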

Chat Capabilities

Command A is configured as a conversational model by default with two safety modes: contextual (default, fewer constraints) and strict (avoids sensitive topics). See Command A prompt format docs for details.


RAG Capabilities

Command A excels in Retrieval Augmented Generation (RAG) tasks. Use the apply_chat_template method with document snippets for RAG functionality. Example:

# tokenizer and model as loaded in the Usage example above
conversation = [{"role": "user", "content": "What has Man always dreamed of?"}]
documents = [
    {"heading": "The Moon", "body": "Man has always dreamed of destroying the moon..."},
    {"heading": "Love", "body": "Man's dream has always been to find love..."}
]
# The documents are rendered into the grounded-generation prompt by the chat template.
input_ids = tokenizer.apply_chat_template(conversation, documents=documents, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
gen_tokens = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(gen_tokens[0]))

Tool Use Capabilities

Command A supports conversational tool use with JSON schema-based tool descriptions. See the tool use example in the original model card for implementation details.
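
Recent transformers releases accept JSON-schema tool definitions directly through apply_chat_template. A minimal sketch, reusing tokenizer and model from the Usage example; the get_weather tool is hypothetical, not part of the model release:

# A tool described as a JSON schema; the name and parameters are illustrative.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the weather in Toronto?"}]
# The chat template renders the tool schemas into the prompt so the model
# can emit a structured tool call.
input_ids = tokenizer.apply_chat_template(messages, tools=tools, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
gen_tokens = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(gen_tokens[0]))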


Code Capabilities

The model performs well on enterprise-relevant code tasks (e.g., SQL generation, code translation). Use low temperature or greedy decoding for optimal code generation.
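
Greedy decoding is selected by disabling sampling in generate. A minimal sketch, reusing tokenizer and model from the Usage example; the SQL prompt is illustrative:

messages = [{"role": "user", "content": "Write a SQL query that returns the ten most recent orders."}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

# do_sample=False disables sampling, giving deterministic greedy decoding,
# which tends to work best for code generation.
gen_tokens = model.generate(input_ids, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(gen_tokens[0]))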


Terms of Use

This model is released under a CC-BY-NC license for non-commercial use only, adhering to C4AI's Acceptable Use Policy. For commercial inquiries, contact Cohere’s Sales team.


Contact

For issues or questions, reach out to [email protected].

