Update README.md
README.md
CHANGED
@@ -45,7 +45,7 @@ The model implements several key improvements over standard RWKV architectures:
 
 ### Hybrid Design Benefits
 
-- **Linear Attention Inference**: RWKV blocks enable O(1) memory complexity during inference, and
+- **Linear Attention Inference**: RWKV blocks enable O(1) memory complexity during inference, and the hybrid approach reduces the KV cache to 1/7 of that of full GQA.
 - **Enhanced Needle Tasks**: Strategic placement of GQA layers significantly improves performance on needle-in-haystack retrieval tasks, addressing a known limitation of pure linear attention models
 - **Implicit Position Encoding**: Interestingly, the model achieves better performance when RoPE (Rotary Position Embedding) is not applied to GQA layers, suggesting that RWKV blocks provide implicit positional encoding capabilities
 
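To make the 1/7 figure in the added "Linear Attention Inference" bullet concrete, here is a minimal back-of-the-envelope sketch. It assumes the ratio comes from interleaving one GQA layer per seven layers, with the remaining layers being RWKV blocks that keep a constant-size recurrent state instead of a KV cache; the layer count, head configuration, context length, and the `kv_cache_bytes` helper are all illustrative assumptions, not values taken from this repository.

```python
# Back-of-the-envelope KV-cache arithmetic for the hybrid design.
# All concrete numbers below are assumptions for illustration only.

def kv_cache_bytes(num_gqa_layers, seq_len, num_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Bytes needed to cache K and V for seq_len tokens across the GQA layers."""
    # Each GQA layer stores K and V: 2 tensors of (seq_len x num_kv_heads x head_dim).
    return num_gqa_layers * seq_len * num_kv_heads * head_dim * 2 * bytes_per_elem

num_layers = 28       # hypothetical total depth
seq_len = 32_768      # hypothetical context length

full_gqa = kv_cache_bytes(num_layers, seq_len)      # every layer uses GQA
hybrid = kv_cache_bytes(num_layers // 7, seq_len)   # 1 GQA layer per 7;
                                                    # RWKV layers need no KV cache

print(f"full GQA: {full_gqa / 2**30:.2f} GiB")
print(f"hybrid:   {hybrid / 2**30:.2f} GiB ({hybrid / full_gqa:.2%} of full GQA)")
```

Under these assumptions the hybrid cache is exactly 4/28 = 1/7 of the full-GQA cache, and the fraction is independent of the context length.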