OpenMOSE/HRWKV7-Reka-Flash3-Preview


OpenMOSE committed on Jul 6

Commit 88fb0d7 · verified · 1 Parent(s): 3bb3693

Update README.md

Files changed (1)

README.md +4 -0


@@ -7,6 +7,10 @@ license: apache-2.0
   <img src="./hxa079.png" style="border-radius: 15px; width: 60%; height: 60%; object-fit: cover;  box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="PRWKV" />
 </div>
+> I'm simply exploring the possibility of linearizing existing Transformer models.
+> It's still far from perfect,
+> but I hope you'll bear with me as I continue this journey.
 ### Model Description
 HRWKV7-Reka-Flash3-Preview is an experimental hybrid architecture model that combines RWKV v7's linear attention mechanism with Group Query Attention (GQA) layers. Built upon the Reka-flash3 21B foundation, this model replaces most Transformer attention blocks with RWKV blocks while strategically maintaining some GQA layers to enhance performance on specific tasks.
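
For readers of this diff, the hybrid layout in the Model Description can be pictured as a per-layer plan in which most blocks are RWKV and a few remain GQA. The sketch below is illustrative only: the README does not say which layers keep GQA, so the function name `hybrid_layer_plan`, the 12-layer depth, and the "every 4th layer" interval are all hypothetical assumptions, not the model's actual configuration.

```python
def hybrid_layer_plan(n_layers: int, gqa_every: int) -> list[str]:
    """Return a block type per layer: 'gqa' for every `gqa_every`-th layer, else 'rwkv'.

    Hypothetical helper for illustration; the real model's layer placement
    is not documented in this README.
    """
    if n_layers <= 0 or gqa_every <= 0:
        raise ValueError("n_layers and gqa_every must be positive")
    return [
        "gqa" if (i + 1) % gqa_every == 0 else "rwkv"
        for i in range(n_layers)
    ]


# Assumed values: 12 layers, one GQA layer kept in every group of 4.
plan = hybrid_layer_plan(n_layers=12, gqa_every=4)
for index, kind in enumerate(plan):
    print(f"layer {index:2d}: {kind}")
```

With these assumed values, most layers are linear-attention RWKV blocks and a minority stay as GQA, which matches the shape of the design described above but not necessarily its real proportions.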