---
license: apache-2.0
---
# Reka Flash 3.1

At Reka, we build intelligence from the ground up to power our multimodal platform, from atomic capabilities such as [visual understanding](https://www.reka.ai/news/reka-visual-language-models) and [reasoning to use tools](https://www.reka.ai/news/introducing-reka-flash) to the system-level optimizations that serve them at scale.

Today, we are excited to open source a few of our building blocks:

**Reka Flash 3.1**, an improved version of Reka Flash 3, made possible by significant advances in our reinforcement learning stack. Reka Flash 3.1 is particularly strong on coding and as a base model to be finetuned on agentic tasks.

**Reka Flash Quantized (Reka Quant)**, a 3.5-bit quantized version of Reka Flash 3.1 that delivers state-of-the-art performance at low bitwidths using calibrated error reduction and self-distillation.

Our quantization library supports self-distillation, fast distributed proxy Hessian computation for fast LDLQ, and export to popular llama.cpp datatypes such as Q3_K and Q4_K.

Reka Flash 3.1 improves by 10 points on LiveCodeBench v5 (Full set) over Reka Flash 3. On coding-related tasks, Reka Flash 3.1 is competitive with models such as Qwen3-32B, o3-mini, and Gemini 2.5 Flash Thinking. If you want to learn more about the reinforcement learning behind these improvements, please check out this post.

While Reka Flash 3.1 is already compact at 21 billion parameters, quantization reduces its memory footprint even further, allowing it to run in resource-constrained settings and be served cost-efficiently. Reka Quant achieves near-lossless quantization to 3.5 bits when quantizing Reka Flash 3.1 to the Q3_K_S datatype in llama.cpp, incurring only a 1.6-point average performance degradation. In contrast, the standard Q3_K_S quantization routine results in a 6.8-point average performance degradation. We provide a more detailed discussion of our quantization approach in this post.
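
For reference, the baseline numbers above correspond to llama.cpp's stock quantization path. A minimal sketch of that baseline, assuming a local copy of the checkpoint and a built llama.cpp tree (file names are illustrative; Reka Quant's own error-reduction and self-distillation pipeline is separate from this stock routine):

```shell
# Convert the downloaded Hugging Face checkpoint to a GGUF file (paths are illustrative).
python convert_hf_to_gguf.py ./reka-flash-3.1 --outfile reka-flash-3.1-f16.gguf

# Quantize to Q3_K_S with llama.cpp's built-in routine (the 6.8-point baseline above).
./llama-quantize reka-flash-3.1-f16.gguf reka-flash-3.1-Q3_K_S.gguf Q3_K_S
```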

Strong reasoning and coding skills are important capabilities for supporting multimodal agentic use cases, and near-lossless quantization allows us to deploy our models anywhere. A multimodal version of Reka Flash 3.1 serves as a base model for our core products, Reka Research and Reka Vision. Please contact us for more information about how you can use them in your organization.

![Benchmark results](./benchmarks.png)

## Quick Start

For ease of deployment, Reka Flash 3.1 is released in a Llama-compatible format. You may use any library compatible with Llama to run the model.

### Via Hugging Face

```python
import transformers

model_name = "RekaAI/reka-flash-3.1"

# Load the tokenizer and model; device_map='auto' places weights on available GPUs.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, torch_dtype='auto', device_map='auto')

# Reka's chat template uses the "human" role for user turns.
prompt = {"role": "human", "content": "Write a poem about large language models."}
text = tokenizer.apply_chat_template([prompt], tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_new_tokens=65536)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Via vLLM

```shell
docker run --rm -it --network=host --gpus '"device=0"' --shm-size=10.24gb vllm/vllm-openai:latest serve RekaAI/reka-flash-3.1 --dtype auto -tp 1
```
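
Once the container is serving, you can query vLLM's OpenAI-compatible API. A minimal sketch, assuming the default port 8000 on localhost (adjust host, port, and sampling parameters to your deployment):

```shell
# Send a chat completion request to the vLLM OpenAI-compatible endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RekaAI/reka-flash-3.1",
        "messages": [{"role": "user", "content": "Write a poem about large language models."}]
      }'
```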