winglian committed
Commit 1157950 · unverified · 1 Parent(s): 3b18c96

remove columns after tokenizing for pretraining (#571)

Files changed (1)
  1. src/axolotl/utils/data.py +3 -3
src/axolotl/utils/data.py CHANGED
@@ -644,8 +644,8 @@ def load_pretraining_dataset(path, tokenizer, max_tokens=2048, seed=42):
         encode,
         batched=True,
         input_columns="text",
-        remove_columns=[
-            "text",
-        ],
+        # remove all the existing columns after mapping since they end up having
+        # a different length than the encoded/tokenized column
+        remove_columns=dataset.features.keys(),
     )
     return dataset
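
The sketch below is a minimal, self-contained illustration of the failure mode this patch addresses; it is not the axolotl `encode_pretraining` implementation. The gpt2 tokenizer, the toy "meta" column, and the chunking `encode` function are assumptions for demonstration only. Because the batched encode packs raw texts into fixed-size chunks, it returns a different number of rows than it received, so any leftover source column (like "meta", which the old `remove_columns=["text"]` never dropped) would have a mismatched length and make `map` raise an error.

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # tokenizer chosen only for illustration

raw = Dataset.from_dict(
    {
        "text": ["hello world, this is a longer pretraining sample", "another document"],
        "meta": ["source-a", "source-b"],  # extra column the old code never removed
    }
)

def encode(texts, max_tokens=4):
    # Tokenize everything, concatenate, and split into fixed-size chunks;
    # the number of chunks is unrelated to the number of input rows.
    ids = []
    for text in texts:
        ids.extend(tokenizer(text)["input_ids"])
    chunks = [ids[i : i + max_tokens] for i in range(0, len(ids), max_tokens)]
    return {"input_ids": chunks, "attention_mask": [[1] * len(c) for c in chunks]}

packed = raw.map(
    encode,
    batched=True,
    input_columns="text",
    # mirror the patch: drop *all* source columns, not just "text"
    remove_columns=list(raw.features.keys()),
)
print(packed.column_names)    # ['input_ids', 'attention_mask']
print(len(raw), len(packed))  # packed row count is independent of the raw row count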