crumb committed on
Commit fa26741 · 1 Parent(s): adc8d42

Update README.md

Files changed (1):
  1. README.md +3 -40

README.md CHANGED
@@ -16,40 +16,9 @@ Who needs em, we all have em, they're just like us. Unusable models, compute opt
  - C-Class Models: 76 x Million Params tokens in training set.
  - D-Class Models: 142 x Million Params tokens in training set.

- The B, C, and D classes are derived from the tokens-per-parameter ratio of LLaMA, as LLaMA 65B is nearly Chinchilla-optimal with a ratio of 21 x Million Params tokens in training. Descending through the model sizes per training set gives us these classes. Any "star"-class model only approximates its class, and is finetuned from a previous checkpoint to save time.
+ Evaluations for every Gerbil model can be found here: https://github.com/aicrumb/notebook-hosting/blob/main/GerbilLabEvaluations.md

- | Model Name | Parameters | Class | Ratio | Tokens | Batch Size (Tokens) | Training Loss ↓ | hellaswag acc ↑ | lambada ppl ↓ | lambada acc ↑ | winogrande acc ↑ |
- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | **[GerbilLab/Gerbil-A-6.7m](https://hf.co/GerbilLab/Gerbil-A-6.7m)** | 6.7m | A-Class | 20 | 134M | 131k | 6.0741 | 26.04 | 614655.4052 | 0 | 51.7 |
- | [GerbilLab/Gerbil-B-6.7m](https://hf.co/GerbilLab/Gerbil-B-6.7m) | 6.7m | B-Class | 42 | 281M | 131k | 5.5132 | 25.74 | 370243.6771 | 0 | 52.64 |
- | [GerbilLab/Gerbil-C-6.7m](https://hf.co/GerbilLab/Gerbil-C-6.7m) | 6.7m | C-Class | 76 | 509M | 131k | 5.1098 | 25.54 | 199753.1491 | 0 | 52.72 |
- | [GerbilLab/Gerbil-D-6.7m](https://hf.co/GerbilLab/Gerbil-D-6.7m) | 6.7m | D-Class | 142 | 852M | 131k | 4.8186 | 25.32 | 127810.4082 | 0 | 52.88 |
- | **[GerbilLab/Gerbil-A-15m](https://hf.co/GerbilLab/Gerbil-A-15m)** | 15m | A-Class | 20 | 280M | 131k | 4.9999 | 25.56 | 190773.1317 | 0 | 52.17 |
- | **[GerbilLab/Gerbil-A-32m](https://hf.co/GerbilLab/Gerbil-A-32m)** | 32m | A-Class | 20 | 640M | 262K | 4.0487 | 25.9 | 19358.5197 | 0.83 | 51.14 |
- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | **[GerbilLab/GerbilBlender-A-6.7m](https://hf.co/GerbilLab/GerbilBlender-A-6.7m)** | 6.7m | A-Class | 20 | 134M | 131k | 6.0908 | 26 | 627506.0021 | 0 | 50.12 |
- | **[GerbilLab/GerbilBlender-D-6.7m](https://hf.co/GerbilLab/GerbilBlender-D-6.7m)** | 6.7m | D-Class | 142 | 852M | 131k | 4.8381 | 25.55 | 151087.4082 | 0.02 | 51.3 |
- | **[GerbilLab/GerbilBlender-E-star-6.7m](https://hf.co/GerbilLab/GerbilBlender-E-star-6.7m)** | 6.7m | E*-Class | 284 | 1704M | 131k-262k | 4.7547 | 25.47 | 122411.0728 | 0.06 | 50.59 |
- | **[GerbilLab/GerbilBlender-A-15m](https://hf.co/GerbilLab/GerbilBlender-A-15m)** | 15m | A-Class | 20 | 280M | 131k | 4.9642 | 25.8 | 203830.1811 | 0 | 49.96 |
- | **[GerbilLab/GerbilBlender-A-32m](https://hf.co/GerbilLab/GerbilBlender-A-32m)** | 32m | A-Class | 20 | 640M | 262K | 4.127 | 25.81 | 16491.0141 | 2.66 | 51.93 |
- | **[GerbilLab/GerbilBlender-A-77m](https://hf.co/GerbilLab/GerbilBlender-A-77m)** | 77m | A-Class | 20 | 1520M | 262K | 3.3334 | 26.06 | 1908.0661 | 18.2 | 52.09 |
- | **[GerbilLab/GerbilBlender-B-star-77m](https://hf.co/GerbilLab/GerbilBlender-B-star-77m)** | 77m | B*-Class | 40 | 3040M | 262K-524K | 3.1879 | 26.33 | 1766.5002 | 18.24 | 53.43 |
- | **[GerbilLab/GerbilBlender-C-star-77m](https://hf.co/GerbilLab/GerbilBlender-C-star-77m)** | 77m | C*-Class | 60 | 4560M | 262K-524K | coming soon | | | | |
- | **[GerbilLab/GerbilBlender-A-104m](https://hf.co/GerbilLab/GerbilBlender-A-104m)** | 104m | A-Class | 20 | 2060M | 1M (too large; would likely outperform A-77 with 524K) | 3.592 | 26.41 | 2972.4260 | 17.31 | 49.6 |
- <!---
- | [GerbilLab/T5Blender-A-24m](https://hf.co/GerbilLab/T5Blender-A-24m) | 24m | A-Class | 20 | 460M | 131K | 5.5642 | 25.85 | 57122770.9237 | 0 | 52.25 |
- | [GerbilLab/T5Blender-B-star-24m](https://hf.co/GerbilLab/T5Blender-B-star-24m) | 24m | B*-Class | 40 | 920M | 131K-262K | 5.419 |
- | [GerbilLab/T5Blender-C-star-24m](https://hf.co/GerbilLab/T5Blender-C-star-24m) | 24m | C*-Class | 60 | 1380M | 131K-262K | coming soon |
-
- --->
- Scores to beat:
-
- | Model Name | Parameters | Tokens | hellaswag | lambada ppl | lambada acc | winogrande acc |
- | --- | --- | --- | --- | --- | --- | --- |
- | EleutherAI/pythia-70m-deduped | 70m | 300B(?) | 27.36 | 90.5683 | 25.25 | 52.25 |
-
- Nearly every base model that isn't finetuned for a specific task was trained on the deduplicated Pile dataset and is a decoder-only model. "Blender" models, inspired by UL2 pretraining, are trained equally on fill-in-the-middle, causal language modelling, and masked language modelling tasks. Special tokens for these models include:
+ Special tokens for "Blender" models' pretraining include:

  ```
  '<fitm_start>', '<multiple_tok_mask>', '<fitm_result>', '<causal>', '<mlm_start>', '<single_tok_mask>', '<mlm_end>'
@@ -63,10 +32,4 @@ Nearly every base model that isn't finetuned for a specific task was trained on
  # Example masked language modelling
  '<mlm_start> this is an <single_tok_mask> text for masked language modelling <mlm_end> example <|endoftext|>'

- ```
-
- Some applications where I can imagine these models being useful: warm-starting very small encoder-decoder models, fitting a new scaling law that accounts for smaller models, or serving as a "fuzzy wrapper" around an API. They could also be usable on their own (for classification or otherwise) when finetuned on more specific datasets. I don't expect the 3.3m models to be useful for any task whatsoever. Every model was trained on a single GPU: an RTX 2060, an RTX 3060, or a T4.
-
- I'd, uh, appreciate help evaluating all these models, probably with the lm-evaluation-harness!
-
- Other "small-scale" models I create that are not Gerbil/Blender but are still beneficial to low-resource neural computing will be uploaded here as well.
+ ```
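
The README kept by this commit lists the Blender special tokens and one masked-language-modelling example, but not the code that builds the training strings. The sketch below illustrates how such examples might be assembled; only the `format_mlm` output is taken directly from the README's example, while `format_causal`, `format_fitm`, and all function names are hypothetical assumptions about the fill-in-the-middle and causal formats.

```python
# Hypothetical sketch of the three "Blender" pretraining formats built from the
# special tokens listed in the README. The actual training code is not part of
# this diff, so every name and the FITM/causal layouts are assumptions.
EOT = "<|endoftext|>"

def format_causal(text: str) -> str:
    # Plain causal language modelling: tag the mode, then the raw text.
    return f"<causal> {text} {EOT}"

def format_mlm(words: list[str], idx: int) -> str:
    # Single-token mask: the masked word reappears after <mlm_end>, matching
    # the README's example string.
    masked = words.copy()
    target = masked[idx]
    masked[idx] = "<single_tok_mask>"
    return f"<mlm_start> {' '.join(masked)} <mlm_end> {target} {EOT}"

def format_fitm(words: list[str], start: int, end: int) -> str:
    # Fill-in-the-middle (assumed layout): a multi-token span is cut out and
    # moved after <fitm_result> so the model predicts it at the end.
    middle = " ".join(words[start:end])
    rest = words[:start] + ["<multiple_tok_mask>"] + words[end:]
    return f"<fitm_start> {' '.join(rest)} <fitm_result> {middle} {EOT}"

words = "this is an example text for masked language modelling".split()
print(format_mlm(words, 3))
# → <mlm_start> this is an <single_tok_mask> text for masked language modelling <mlm_end> example <|endoftext|>
```

Training equally on the three tasks would then just mean sampling one of the three formatters per document, which is consistent with the UL2-style mixture the README describes.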