hunterhector committed
Update README.md
README.md CHANGED
@@ -36,26 +36,26 @@ We also performed the same finetuning on the last **CrystalCoder** checkpoint of

# Instruction Tuning Data

The fine-tuning data is a mix of publicly available language and code datasets, plus **WebAlpaca**, an instruction dataset that we created ourselves and used as part of our instruction-tuning training data. We will release the WebAlpaca dataset in a separate repository soon.

The fine-tuning data is summarized below:

<!-- <center><img src="data_table.jpg" alt="Instruction Data"/></center> -->

| Subset | #Tokens | Avg. #Q | Avg. Query Len | Avg. #R | Avg. Reply Len |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| [OASST1-guanaco](https://huggingface.co/datasets/openaccess-ai-collective/oasst1-guanaco-extended-sharegpt) | 4,464,640 | 1.36 | 38.28 | 1.36 | 271.69 |
| [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) | 225,628,160 | 1.00 | 259.16 | 1.00 | 151.12 |
| [ShareGPT](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) | 112,914,432 | 3.28 | 94.53 | 3.64 | 365.81 |
| [Evol-ShareGPT](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | 85,954,560 | 1.00 | 145.99 | 1.00 | 425.17 |
| [ChatLogs](https://huggingface.co/datasets/winglian/chatlogs-en-cleaned) | 29,337,600 | 3.39 | 95.58 | 3.24 | 191.42 |
| [CodeAlpaca](https://huggingface.co/datasets/lucasmccabe-lmi/CodeAlpaca-20k) | 2,623,488 | 1.00 | 32.46 | 1.00 | 67.68 |
| [Rosetta Code](https://github.com/sahil280114/codealpaca/blob/master/data/rosetta_alpaca.json) | 7,987,200 | 1.00 | 450.09 | 1.00 | 533.52 |
| [Evol-CodeAlpaca 1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 73,803,776 | 1.00 | 210.33 | 1.00 | 437.92 |
| [Evol-CodeAlpaca 2](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 34,910,208 | 1.00 | 114.99 | 1.00 | 300.29 |
| WebAlpaca | 43,673,600 | 1.00 | 96.29 | 1.00 | 746.52 |
| [General Textbooks](https://huggingface.co/datasets/open-phi/textbooks) | 85,590,016 | Not instruction data | | | |
| [Programming Books](https://huggingface.co/datasets/open-phi/programming_books_llama) | 395,628,544 | Not instruction data | | | |
| Total | 1,102,516,224 | | | | |

The HTML Instruction dataset was curated by LLM360 and will be made available shortly.
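For readers curious how per-subset statistics of this kind are derived, the sketch below shows one way to compute them. This is not the pipeline used for the table above: the conversation schema (`"queries"`/`"replies"` lists) and the whitespace tokenizer are illustrative assumptions, whereas the reported #Tokens figures would typically come from the model's own tokenizer over the actual datasets.

```python
# Minimal sketch of computing table-style subset statistics.
# Assumptions (not from the original pipeline): conversations are dicts
# with "queries" and "replies" string lists; tokens = whitespace splits.

def subset_stats(conversations):
    """Return (total_tokens, avg_q, avg_query_len, avg_r, avg_reply_len)."""
    n = len(conversations)
    queries = [q for c in conversations for q in c["queries"]]
    replies = [r for c in conversations for r in c["replies"]]
    q_lens = [len(q.split()) for q in queries]   # query lengths in tokens
    r_lens = [len(r.split()) for r in replies]   # reply lengths in tokens
    return (
        sum(q_lens) + sum(r_lens),               # total #Tokens
        len(queries) / n,                        # Avg. #Q per conversation
        sum(q_lens) / len(queries),              # Avg. Query Len
        len(replies) / n,                        # Avg. #R per conversation
        sum(r_lens) / len(replies),              # Avg. Reply Len
    )

sample = [
    {"queries": ["What is Python?"], "replies": ["A programming language."]},
    {"queries": ["Hi", "Write code"], "replies": ["Hello!", "def f(): pass"]},
]
print(subset_stats(sample))
```

Single-turn subsets such as CodeAlpaca yield Avg. #Q = Avg. #R = 1.00 under this scheme, while multi-turn chat subsets such as ShareGPT average several queries and replies per conversation, matching the pattern in the table.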