Commit History
streaming multipack for pretraining dataset (#959)
553c80f
unverified
fix: revert local dir dataset load (#878)
575a082
unverified
Nanobit
don't train if eval split is too small (#873)
797f3dd
unverified
winglian
Feat: Add dataset loading from S3, GCS (#765)
3cc67d2
unverified
Nanobit
cleanup the old multipack dataloader (#841)
1a6309c
unverified
winglian
multipack w batch sampler (#795)
641e6f7
unverified
winglian
update table for rwkv4 support, fix process count for dataset (#822)
cdc71f7
unverified
winglian
Create preprocess CLI (#785)
e50ab07
unverified
casperhansen
catch ConnectionError when checking dataset from HuggingFace (#743)
992d57f
unverified
Napuh
improve handling of the prepared ds path and other cfg defaults (#701)
1c412c7
unverified
winglian
Fix: Future deprecation warning with use_auth_token (#680)
69fac9a
unverified
Nanobit
prepared dataset caching, other misc fixes (#665)
e50a64e
unverified
winglian
add support for defined train split (#654)
409ca0f
unverified
winglian
Fix bug in dataset loading (#284)
8fe0e63
unverified
ethanhs
use fastchat conversations template (#578)
e7d3e2d
unverified
winglian
attention_mask not needed for training (#642)
e8cbf50
unverified
winglian
Feat(data): Allow loading local csv and text (#594)
00dce35
unverified
Nanobit
support custom field for completion from yml (#580)
f7a2263
unverified
winglian
remove columns after tokenizing for pretraining (#571)
1157950
unverified
winglian
Fix pretraining with iterable/streaming Dataset (#556)
2f586d1
unverified
Jan Philipp Harries
workaround for md5 variations (#533)
0b4cf5b
unverified
winglian
support for datasets with multiple names (#480)
5ac3392
unverified
winglian
improve llama pad token handling (#475)
cb9797e
unverified
winglian
support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (#348)
d2e7f27
unverified
winglian
add utils.data.prepare_dataset
2e22404
tmm1
use context manager to run things on rank0 before others (#397)
fc2d6be
unverified
winglian
Attention mask and position id fixes for packing (#285)
2bb0b78
unverified
winglian
experimental llama 2 chat support (#296)
3392270
unverified
Jan Philipp Harries
optimize the iteration when tokenizing large datasets (#332)
fe28543
unverified
winglian
Merge pull request #276 from theobjectivedad/logging_enhancement
6f16c45
unverified
winglian
Fixed pre-commit problems, fixed small bug in logging_config to handle LOG_LEVEL env var
b1f4f7a
theobjectivedad
Add ability to pass 'name' argument to load_dataset
88089e8
chargoddard
Adding logging enhancement
553a86b
theobjectivedad
Support loading data files from a local directory
9bdd30c
utensil
Merge branch 'main' into flash-optimum
fd2c981
unverified
winglian
add new sharegpt, refactor prompt so it can be customized later, add exception if no data is processed
aac4b76
winglian
address PR feedback
0c6f928
winglian
add streaming dataset support for pretraining datasets
eea2731
winglian
more gpt-neox long ctx fixes
ab5cd28
winglian
more tweaks to do pre-training with bettertransformers
1210dc8
winglian
experimental expansion of ctx len
488a67d
winglian
Set to use cfg.seed or 42 for backward compat
2cfe9e9
Nanobit
fix batch size calculation
5a631b3
winglian
Fix security issue or ignore false positives
a1f9850
Nanobit
Apply isort then black
37293dc
Nanobit
Fix mypy typing
e9650d3
Nanobit
Black formatting
b832a0a
Nanobit
Refactor
4c0eddb
Nanobit