---
license: apache-2.0
---

# Linear_Tiny_87M

## Introduction

Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have attracted significant interest because their fixed-size recurrent state lowers inference cost. However, their original formulation scales poorly and underperforms compute-matched transformers. Recent linear models such as RWKV and Mamba attempt to address these shortcomings with novel time-mixing and gating architectures, but pre-training large language models requires significant data and compute. The search for subquadratic architectures is therefore limited by the availability of compute and high-quality pre-training datasets.

As a cost-effective alternative to pre-training linear transformers, we propose Scalable UPtraining for Recurrent Attention (SUPRA). For more detail, refer to the [paper](https://arxiv.org/abs/2405.06640).

Linear_Tiny_87M is a linear model trained on a subset of the RedPajama dataset for 1 epoch on **1x A4000**. Training took almost 4 hours.

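The fixed-size recurrent state mentioned in the introduction can be illustrated with a small, dependency-free sketch. It is not the code behind this checkpoint and does not reproduce the SUPRA layer. It assumes the generic linear attention formulation with an `elu(x) + 1` feature map and a running normalizer, and every function name in it is hypothetical. The sketch computes causal attention twice, once token by token with a constant-size state and once by explicitly attending over all previous tokens, and checks that the two agree.

```python
import math
import random


def feature_map(x):
    """elu(x) + 1: a positive feature map (an assumption, not this model's)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def linear_attention_recurrent(queries, keys, values):
    """Causal linear attention, one token at a time, with a fixed-size state."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_q, phi_k = feature_map(q), feature_map(k)
        # Update the state; its size does not depend on the sequence length.
        for i in range(d_k):
            norm[i] += phi_k[i]
            for j in range(d_v):
                state[i][j] += phi_k[i] * v[j]
        denom = dot(phi_q, norm)
        outputs.append(
            [sum(phi_q[i] * state[i][j] for i in range(d_k)) / denom
             for j in range(d_v)]
        )
    return outputs


def linear_attention_quadratic(queries, keys, values):
    """Same quantity, computed by attending over all previous tokens."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        phi_q = feature_map(q)
        weights = [dot(phi_q, feature_map(keys[s])) for s in range(t + 1)]
        total = sum(weights)
        outputs.append(
            [sum(w * values[s][j] for s, w in enumerate(weights)) / total
             for j in range(d_v)]
        )
    return outputs


def random_sequence(rng, length, dim):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(length)]


rng = random.Random(0)
q = random_sequence(rng, 6, 4)
k = random_sequence(rng, 6, 4)
v = random_sequence(rng, 6, 3)

recurrent = linear_attention_recurrent(q, k, v)
quadratic = linear_attention_quadratic(q, k, v)
max_diff = max(
    abs(a - b)
    for row_r, row_q in zip(recurrent, quadratic)
    for a, b in zip(row_r, row_q)
)
print(f"max difference between the two forms: {max_diff:.2e}")
```

The recurrent version stores only `state` and `norm`, whose sizes depend on the head dimensions and not on the number of tokens processed. This is why generation cost per token stays constant as the context grows.
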
## Usage

Download the checkpoint, then run the following commands:

```bash
# Run from the directory that contains the scripts/ folder.
cd scripts

# Generate a continuation of the prompt with the downloaded checkpoint.
python generate.py \
    --model open_lm_87m \
    --checkpoint /path/to/checkpoint.pt \
    --positional-embedding-type head_rotary \
    --input-text "Machine Learning is a"
```