Merge-Effect - a pietrolesci Collection

Merge-Effect

updated 6 days ago


JeanKaddour/minipile

Viewer • Updated Jun 20, 2023 • 1.01M • 2.7k • 120

Note Original dataset used to train tokenisers and models.
pietrolesci/tokenisers

Updated 26 days ago

Note Tokenisers trained on the MiniPile. The `_raw_tokenisers` folder contains the original tokenisers, trained with a vocabulary size of 320k. Each of the other folders contains a `transformers`-compatible tokeniser with a smaller vocabulary.
pietrolesci/minipile

Viewer • Updated 26 days ago • 6.06M • 419

Note Tokenised MiniPile dataset(s). Each split corresponds to a tokeniser in `pietrolesci/tokenisers`.
pietrolesci/finewebedu-20B

Viewer • Updated 9 days ago • 40.4M • 161

Note This dataset is a subset of the `fineweb-edu/sample/100BT` dataset containing its first 20.2M documents (roughly 20B tokens). The default configuration is the raw data. Every other configuration is a tokenised version, named after the tokeniser used to produce it (see `pietrolesci/tokenisers`).
pietrolesci/me57M-tied_minipile_bpe8064minipile

Updated 11 days ago

Note Model trained for 50k steps on the MiniPile dataset. Each branch is a different checkpoint, saved every 2k steps.
pietrolesci/me57M-tied_minipile_bpe32000minipile

Updated 11 days ago

Note Model trained for 50k steps on the MiniPile dataset. Each branch is a different checkpoint, saved every 2k steps.
pietrolesci/me57M-tied_minipile_bpe128000minipile

Updated 11 days ago

Note Model trained for 50k steps on the MiniPile dataset. Each branch is a different checkpoint, saved every 2k steps.
pietrolesci/me57M-tied_minipile_wordpiece32000minipile

Updated 11 days ago

Note Model trained for 50k steps on the MiniPile dataset. Each branch is a different checkpoint, saved every 2k steps.
pietrolesci/me57M-tied_minipile_bpe2wp32000minipile

Updated 11 days ago

Note Model trained for 50k steps on the MiniPile dataset. Each branch is a different checkpoint, saved every 2k steps. The `bpe2wp` name means that the merges are chosen using the BPE objective, and the MiniPile is then tokenised using the resulting vocabulary with the WordPiece tokenisation function (i.e., longest prefix match).
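
The longest-prefix-match step mentioned in the note above can be illustrated with a minimal sketch. This is not the code used to build these tokenisers: the function name, the `[UNK]` fallback, and the toy vocabulary are assumptions made for illustration, and the sketch ignores pre-tokenisation and any continuation-prefix convention (such as `##`) that a real WordPiece implementation may use.

```python
def longest_prefix_tokenise(word, vocab, unk="[UNK]"):
    """Greedily split `word` into the longest vocabulary entries, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing in the vocabulary matches at this position.
            return [unk]
        tokens.append(word[start:end])
        start = end
    return tokens


# Toy vocabulary, made up for illustration (not taken from the repository).
toy_vocab = {"t", "o", "k", "e", "n", "i", "s", "r", "to", "ken", "token", "is", "er"}
print(longest_prefix_tokenise("tokeniser", toy_vocab))  # ['token', 'is', 'er']
```

Under this scheme the vocabulary alone determines the segmentation, whereas BPE tokenisation replays the learned merges in order, so the same vocabulary can yield different token sequences under the two functions.
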
pietrolesci/me340M-tied_minipile_bpe32000minipile

Updated 11 days ago • 56

Note Model trained for 50k steps on the MiniPile dataset. Each branch is a different checkpoint, saved every 2k steps.
pietrolesci/me850M_minipile_bpe32000minipile

Updated 11 days ago • 54

Note Model trained for 50k steps on the MiniPile dataset. Each branch is a different checkpoint, saved every 2k steps.
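
Since the model repositories above store one checkpoint per branch, a specific checkpoint can likely be selected through the `revision` argument of `from_pretrained`. The sketch below rests on several assumptions: that the first checkpoint is saved at step 2k (so there is no step-0 branch), that the models load with `AutoModelForCausalLM`, and that the caller looks up the actual branch names in the repository, since the naming scheme is not stated here. The helper names are hypothetical.

```python
def checkpoint_steps(total_steps=50_000, save_every=2_000):
    """Training steps with a checkpoint, assuming the first save is at `save_every`."""
    return list(range(save_every, total_steps + 1, save_every))


def load_checkpoint(repo_id, branch):
    """Load one checkpoint; `branch` must be a branch that exists in the repository."""
    # Imported lazily so the step arithmetic above runs without `transformers` installed.
    from transformers import AutoModelForCausalLM

    # `revision` selects a branch, tag, or commit of a Hub repository.
    return AutoModelForCausalLM.from_pretrained(repo_id, revision=branch)


steps = checkpoint_steps()
print(len(steps), steps[0], steps[-1])  # 25 2000 50000
```
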
pietrolesci/me-minipile-evals

Viewer • Updated 11 days ago • 1.82M • 140

Note Log-probabilities computed on the validation set of the MiniPile dataset using the models above.
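
As a rough illustration of how per-token log-probabilities such as these are commonly summarised, the sketch below converts them to perplexity and to bits per token. It assumes natural-log values passed as a plain list of floats; the actual column names and layout of the dataset are not described here, so the function names and inputs are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def bits_per_token(token_logprobs):
    """Average negative log-probability per token, converted from nats to bits."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / (len(token_logprobs) * math.log(2))


# Four tokens, each predicted with probability 0.5.
example = [math.log(0.5)] * 4
print(perplexity(example), bits_per_token(example))
```

Per-token averages are not directly comparable across tokenisers, because different vocabularies split the same text into different numbers of tokens.
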
pietrolesci/me100M-tied_finewebedu-20B_bpe32000minipile

Updated 6 days ago • 51
