Update README.md
README.md CHANGED
@@ -16,40 +16,9 @@ Who needs em, we all have em, they're just like us. Unusable models, compute opt
Removed:

C-Class Models: 76 x Million Params tokens in training set.
D-Class Models: 142 x Million Params tokens in training set.

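The class scheme is just a tokens-per-parameter budget. A minimal sketch of the arithmetic, with multipliers taken from the class definitions; the helper below is illustrative, not code from the repo:

```python
# Gerbil compute classes set the training-token budget as a multiple of
# the parameter count (A-Class at 20 tokens/param is roughly the
# Chinchilla-optimal ratio).
CLASS_MULTIPLIERS = {"A": 20, "B": 42, "C": 76, "D": 142}

def training_token_budget(n_params: float, model_class: str) -> float:
    """Approximate training-set size in tokens for a given class."""
    return CLASS_MULTIPLIERS[model_class] * n_params

# A 6.7m-parameter C-Class model:
print(f"{training_token_budget(6.7e6, 'C') / 1e6:.0f}M")  # -> 509M, matching the table
```
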
| Model | Params | Class | Tokens/Param | Tokens | Batch (tokens) | Loss | hellaswag | lambada ppl | lambada acc | winogrande acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **[GerbilLab/Gerbil-A-6.7m](https://hf.co/GerbilLab/Gerbil-A-6.7m)** | 6.7m | A-Class | 20 | 134M | 131k | 6.0741 | 26.04 | 614655.4052 | 0 | 51.7 |
| [GerbilLab/Gerbil-B-6.7m](https://hf.co/GerbilLab/Gerbil-B-6.7m) | 6.7m | B-Class | 42 | 281M | 131k | 5.5132 | 25.74 | 370243.6771 | 0 | 52.64 |
| [GerbilLab/Gerbil-C-6.7m](https://hf.co/GerbilLab/Gerbil-C-6.7m) | 6.7m | C-Class | 76 | 509M | 131k | 5.1098 | 25.54 | 199753.1491 | 0 | 52.72 |
| [GerbilLab/Gerbil-D-6.7m](https://hf.co/GerbilLab/Gerbil-D-6.7m) | 6.7m | D-Class | 142 | 852M | 131k | 4.8186 | 25.32 | 127810.4082 | 0 | 52.88 |
| **[GerbilLab/Gerbil-A-15m](https://hf.co/GerbilLab/Gerbil-A-15m)** | 15m | A-Class | 20 | 280M | 131k | 4.9999 | 25.56 | 190773.1317 | 0 | 52.17 |
| **[GerbilLab/Gerbil-A-32m](https://hf.co/GerbilLab/Gerbil-A-32m)** | 32m | A-Class | 20 | 640M | 262K | 4.0487 | 25.9 | 19358.5197 | 0.83 | 51.14 |

| Model | Params | Class | Tokens/Param | Tokens | Batch (tokens) | Loss | hellaswag | lambada ppl | lambada acc | winogrande acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **[GerbilLab/GerbilBlender-A-6.7m](https://hf.co/GerbilLab/GerbilBlender-A-6.7m)** | 6.7m | A-Class | 20 | 134M | 131k | 6.0908 | 26 | 627506.0021 | 0 | 50.12 |
| **[GerbilLab/GerbilBlender-D-6.7m](https://hf.co/GerbilLab/GerbilBlender-D-6.7m)** | 6.7m | D-Class | 142 | 852M | 131k | 4.8381 | 25.55 | 151087.4082 | 0.02 | 51.3 |
| **[GerbilLab/GerbilBlender-E-star-6.7m](https://hf.co/GerbilLab/GerbilBlender-E-star-6.7m)** | 6.7m | E*-Class | 284 | 1704M | 131k-262k | 4.7547 | 25.47 | 122411.0728 | 0.06 | 50.59 |
| **[GerbilLab/GerbilBlender-A-15m](https://hf.co/GerbilLab/GerbilBlender-A-15m)** | 15m | A-Class | 20 | 280M | 131k | 4.9642 | 25.8 | 203830.1811 | 0 | 49.96 |
| **[GerbilLab/GerbilBlender-A-32m](https://hf.co/GerbilLab/GerbilBlender-A-32m)** | 32m | A-Class | 20 | 640M | 262K | 4.127 | 25.81 | 16491.0141 | 2.66 | 51.93 |
| **[GerbilLab/GerbilBlender-A-77m](https://hf.co/GerbilLab/GerbilBlender-A-77m)** | 77m | A-Class | 20 | 1520M | 262K | 3.3334 | 26.06 | 1908.0661 | 18.2 | 52.09 |
| **[GerbilLab/GerbilBlender-B-star-77m](https://hf.co/GerbilLab/GerbilBlender-B-star-77m)** | 77m | B*-Class | 40 | 3040M | 262K-524K | 3.1879 | 26.33 | 1766.5002 | 18.24 | 53.43 |
| **[GerbilLab/GerbilBlender-C-star-77m](https://hf.co/GerbilLab/GerbilBlender-C-star-77m)** | 77m | C*-Class | 60 | 4560M | 262K-524K | coming soon | | | | |
| **[GerbilLab/GerbilBlender-A-104m](https://hf.co/GerbilLab/GerbilBlender-A-104m)** | 104m | A-Class | 20 | 2060M | 1M (too big; would outperform A-77 if it had used 524K) | 3.592 | 26.41 | 2972.4260 | 17.31 | 49.6 |

<!---
| [GerbilLab/T5Blender-A-24m](https://hf.co/GerbilLab/T5Blender-A-24m) | 24m | A-Class | 20 | 460M | 131K | 5.5642 | 25.85 | 57122770.9237 | 0 | 52.25 |
| [GerbilLab/T5Blender-B-star-24m](https://hf.co/GerbilLab/T5Blender-B-star-24m) | 24m | B*-Class | 40 | 920M | 131K-262K | 5.419 |
| [GerbilLab/T5Blender-C-star-24m](https://hf.co/GerbilLab/T5Blender-C-star-24m) | 24m | C*-Class | 60 | 1380M | 131K-262K | coming soon |
--->
Scores to beat:

| Model Name | Parameters | Tokens | hellaswag | lambada ppl | lambada acc | winogrande acc |
| --- | --- | --- | --- | --- | --- | --- |
| EleutherAI/pythia-70m-deduped | 70m | 300B(?) | 27.36 | 90.5683 | 25.25 | 52.25 |

Nearly every base model that isn't finetuned for a specific task was trained on the deduplicated Pile dataset and is a decoder-only model. "Blender" models, inspired by UL2 pretraining, are trained equally on fill-in-the-middle, causal language modelling, and masked language modelling tasks. Special tokens for these models include:

```
'<fitm_start>', '<multiple_tok_mask>', '<fitm_result>', '<causal>', '<mlm_start>', '<single_tok_mask>', '<mlm_end>'
```

@@ -63,10 +32,4 @@

```
# Example masked language modelling
'<mlm_start> this is an <single_tok_mask> text for masked language modelling <mlm_end> example <|endoftext|>'
```
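
To make the format concrete, here is a minimal sketch of running the masked-language-modelling prompt above through a Blender checkpoint. It assumes the checkpoints load as ordinary causal LMs via `transformers` (`AutoModelForCausalLM`); the README doesn't show this code, so treat it as illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: Blender checkpoints are decoder-only models that work with
# the standard causal-LM auto classes.
name = "GerbilLab/GerbilBlender-A-15m"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# MLM task format from the block above: the model should continue after
# <mlm_end> with the text that fills <single_tok_mask>.
prompt = "<mlm_start> this is an <single_tok_mask> text for masked language modelling <mlm_end>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=2, do_sample=False)
print(tokenizer.decode(outputs[0]))
```
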
Some applications where I can imagine these being useful: warm-starting very small encoder-decoder models, fitting a new scaling law that takes smaller models into account, or acting as a "fuzzy wrapper" around an API. They could also be usable on their own (for classification or other tasks) when finetuned on more specific datasets. I don't expect the 3.3m models to be useful for any task whatsoever. Every model was trained on a single GPU: an RTX 2060, an RTX 3060, or a T4.

I'd, uh, appreciate help in evaluating all these models, probably with the lm-evaluation-harness!

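A sketch of what that might look like with EleutherAI's lm-evaluation-harness, covering the same tasks reported in the tables; the `lm_eval.simple_evaluate` call and task names reflect recent harness versions and are an assumption, not something tested against these checkpoints:

```python
import lm_eval

# Evaluate one Gerbil checkpoint on hellaswag, lambada (ppl + acc),
# and winogrande -- the columns used in the tables above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=GerbilLab/Gerbil-A-6.7m",
    tasks=["hellaswag", "lambada_openai", "winogrande"],
    batch_size=8,
)
print(results["results"])
```
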
Other "small-scale" models that are not Gerbil/Blender but still beneficial to low-resource neural computing that I create will be uploaded here as well.

Added:

Evaluations for every Gerbil model can be found here: https://github.com/aicrumb/notebook-hosting/blob/main/GerbilLabEvaluations.md

Special tokens for "Blender" models' pretraining include:

```
'<fitm_start>', '<multiple_tok_mask>', '<fitm_result>', '<causal>', '<mlm_start>', '<single_tok_mask>', '<mlm_end>'
```