Commit History
e50a64e prepared dataset caching, other misc fixes (#665)
409ca0f add support for defined train split (#654)
8fe0e63 Fix bug in dataset loading (#284)
e7d3e2d use fastchat conversations template (#578)
e8cbf50 attention_mask not needed for training (#642)
00dce35 Feat(data): Allow loading local csv and text (#594)
f7a2263 support custom field for completion from yml (#580)
1157950 remove columns after tokenizing for pretraining (#571)
2f586d1 Fix pretraining with iterable/streaming Dataset (#556)
Jan Philipp Harries committed
0b4cf5b workaround for md5 variations (#533)
5ac3392 support for datasets with multiple names (#480)
cb9797e improve llama pad token handling (#475)
d2e7f27 support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (#348)
2e22404 add utils.data.prepare_dataset
fc2d6be use context manager to run things on rank0 before others (#397)
2bb0b78 Attention mask and position id fixes for packing (#285)
3392270 experimental llama 2 chat support (#296)
Jan Philipp Harries committed