---
license: apache-2.0
---

# Linear_Tiny_87M

## Introduction

Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state that lowers inference cost. However, their original formulation suffers from poor scaling and underperforms compute-matched transformers. Recent linear models such as RWKV and Mamba have attempted to address these shortcomings by proposing novel time-mixing and gating architectures, but pre-training large language models requires significant data and compute investments. Thus, the search for subquadratic architectures is limited by the availability of compute and quality pre-training datasets.

As a cost-effective alternative to pre-training linear transformers, we propose Scalable UPtraining for Recurrent Attention (SUPRA). For more details, refer to the [paper](https://arxiv.org/abs/2405.06640).

Linear_Tiny_87M is a linear model trained on a subset of the RedPajama dataset for 1 epoch on **1x A4000**. Training took almost 4 hours.

## Usage

Download the checkpoint, then run the following code snippet:

```bash
cd scripts

python generate.py \
  --model open_lm_87m \
  --checkpoint /path/to/checkpoint.pt \
  --positional-embedding-type head_rotary \
  --input-text "Machine Learning is a"
```
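
The `generate.py` script and the `open_lm_87m` configuration above are assumed to come from the open_lm codebase; a minimal setup sketch under that assumption (the repository URL and install command below are illustrative, not confirmed by this card):

```bash
# Assumed setup: clone the codebase that provides scripts/generate.py and the
# open_lm_87m config. The repository URL and install command are assumptions.
git clone https://github.com/mlfoundations/open_lm.git
cd open_lm
pip install -r requirements.txt
```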