hunterhector committed on
Commit
3c4b9d4
·
verified ·
1 Parent(s): a8e8433

Update README.md

  1. README.md +17 -17
README.md CHANGED
@@ -36,26 +36,26 @@ We also performed the same finetuning on the last **CrystalCoder** checkpoint of
36
 
37
  # Instruction Tuning Data
38
 
39
- The instruction tuning data is a mix of publicly available language and code datasets, plus a orginally created dataset called **WebAlpaca**. The WebAlpaca dataset is created by us and is used as part of our instruction tuning training data. We will release the WebAlpaca dataset in a separate repository.
40
 
41
- The summary of the instruction tuning data is as follows:
42
 
43
  <!-- <center><img src="data_table.jpg" alt="Instruction Data"/></center> -->
44
- | Subset | Tokens (Million) |
45
- | ----------- | ----------- |
46
- | [OASST1-guanaco](https://huggingface.co/datasets/openaccess-ai-collective/oasst1-guanaco-extended-sharegpt) | 4.46 |
47
- | [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) | 225.63 |
48
- | [ShareGPT](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) | 112.91 |
49
- | [Evol-ShareGPT](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | 85.95 |
50
- | [ChatLogs](https://huggingface.co/datasets/winglian/chatlogs-en-cleaned) | 29.34 |
51
- | [CodeAlpaca](https://huggingface.co/datasets/lucasmccabe-lmi/CodeAlpaca-20k) | 2.62 |
52
- | [Rosetta Code](https://github.com/sahil280114/codealpaca/blob/master/data/rosetta_alpaca.json) | 7.99 |
53
- | [Evol-CodeAlpaca 1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 73.80 |
54
- | [Evol-CodeAlpaca 2](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 34.91 |
55
- | WebAlpaca | 43.67 |
56
- | [General Textbooks](https://huggingface.co/datasets/open-phi/textbooks) | 85.59 |
57
- | [Programming Books](https://huggingface.co/datasets/open-phi/programming_books_llama) | 395.63 |
58
- | Total | 1102.52 |
59
 
60
  The HTML Instruction dataset was curated by LLM360 and will be made available shortly.
61
 
 
36
 
37
  # Instruction Tuning Data
38
 
39
+ The fine-tuning data is a mix of publicly available language and code datasets, plus an originally created dataset called **WebAlpaca**. We created WebAlpaca and use it as part of our instruction tuning data. We will release the WebAlpaca dataset in a separate repository soon.
40
 
41
+ The summary of the fine-tuning data is as follows:
42
 
43
  <!-- <center><img src="data_table.jpg" alt="Instruction Data"/></center> -->
44
+ | Subset | #Tokens | Avg. #Q | Avg. Query Len | Avg. #R | Avg. Reply Len |
45
+ | ----------- | ----------- |----------- |----------- |----------- |----------- |
46
+ | [OASST1-guanaco](https://huggingface.co/datasets/openaccess-ai-collective/oasst1-guanaco-extended-sharegpt) | 4,464,640 | 1.36 | 38.28 | 1.36 | 271.69 |
47
+ | [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) | 225,628,160 | 1.00 | 259.16 | 1.00 | 151.12 |
48
+ | [ShareGPT](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) | 112,914,432 | 3.28 | 94.53 | 3.64 | 365.81 |
49
+ | [Evol-ShareGPT](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | 85,954,560 | 1.00 | 145.99 | 1.00 | 425.17 |
50
+ | [ChatLogs](https://huggingface.co/datasets/winglian/chatlogs-en-cleaned) | 29,337,600 | 3.39 | 95.58 | 3.24 | 191.42 |
51
+ | [CodeAlpaca](https://huggingface.co/datasets/lucasmccabe-lmi/CodeAlpaca-20k) | 2,623,488 | 1.00 | 32.46 | 1.00 | 67.68 |
52
+ | [Rosetta Code](https://github.com/sahil280114/codealpaca/blob/master/data/rosetta_alpaca.json) | 7,987,200 | 1.00 | 450.09 | 1.00 | 533.52 |
53
+ | [Evol-CodeAlpaca 1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 73,803,776 | 1.00 | 210.33 | 1.00 | 437.92 |
54
+ | [Evol-CodeAlpaca 2](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 34,910,208 | 1.00 | 114.99 | 1.00 | 300.29 |
55
+ | WebAlpaca | 43,673,600 | 1.00 | 96.29 | 1.00 | 746.52 |
56
+ | [General Textbooks](https://huggingface.co/datasets/open-phi/textbooks) | 85,590,016 | Not instruction data | | | |
57
+ | [Programming Books](https://huggingface.co/datasets/open-phi/programming_books_llama) | 395,628,544 | Not instruction data | | | |
58
+ | Total | 1,102,516,224 | | | | |
59
 
60
  The HTML Instruction dataset was curated by LLM360 and will be made available shortly.
61
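
As a quick consistency check on the updated table, the sketch below sums the per-subset token counts and separates the two book subsets marked "Not instruction data" from the rest. The counts are copied from the table; the dictionary layout and the helper name `summarize` are illustrative assumptions and are not part of the repository.

```python
# Token counts per subset, copied from the "#Tokens" column of the updated table.
TOKENS = {
    "OASST1-guanaco": 4_464_640,
    "SlimOrca": 225_628_160,
    "ShareGPT": 112_914_432,
    "Evol-ShareGPT": 85_954_560,
    "ChatLogs": 29_337_600,
    "CodeAlpaca": 2_623_488,
    "Rosetta Code": 7_987_200,
    "Evol-CodeAlpaca 1": 73_803_776,
    "Evol-CodeAlpaca 2": 34_910_208,
    "WebAlpaca": 43_673_600,
    "General Textbooks": 85_590_016,
    "Programming Books": 395_628_544,
}

# Total reported in the last row of the table.
REPORTED_TOTAL = 1_102_516_224

# Subsets the table marks as "Not instruction data".
NON_INSTRUCTION = {"General Textbooks", "Programming Books"}


def summarize(tokens, non_instruction):
    """Return (total, instruction_tokens, book_tokens) for the given counts."""
    total = sum(tokens.values())
    book_tokens = sum(tokens[name] for name in non_instruction)
    return total, total - book_tokens, book_tokens


total, instruction_tokens, book_tokens = summarize(TOKENS, NON_INSTRUCTION)

# The per-subset counts should add up to the reported total.
assert total == REPORTED_TOTAL

# Print each subset's share of the full mix, largest first.
for name, count in sorted(TOKENS.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} {count:>13,d} {count / total:6.1%}")

print(f"Instruction data:    {instruction_tokens:>13,d} {instruction_tokens / total:6.1%}")
print(f"Book (non-instr.):   {book_tokens:>13,d} {book_tokens / total:6.1%}")
```

The per-subset counts do sum to the reported total, and the split shows how much of the mix comes from the two book subsets rather than from query/reply-style instruction data.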