LLM360/CrystalChat

@@ -36,26 +36,26 @@ We also performed the same finetuning on the last **CrystalCoder** checkpoint of
 # Instruction Tuning Data
-The instruction tuning data is a mix of publicly available language and code datasets, plus an originally created dataset called **WebAlpaca**. The WebAlpaca dataset is created by us and is used as part of our instruction tuning training data. We will release the WebAlpaca dataset in a separate repository.
-The summary of the instruction tuning data is as follows:
 <!-- <center><img src="data_table.jpg" alt="Instruction Data"/></center> -->
-| Subset      | Tokens (Million) |
-| ----------- | ----------- |
-| [OASST1-guanaco](https://huggingface.co/datasets/openaccess-ai-collective/oasst1-guanaco-extended-sharegpt)      | 4.46       |
-| [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca)   | 225.63        |
-| [ShareGPT](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered)   | 112.91        |
-| [Evol-ShareGPT](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k)   | 85.95        |
-| [ChatLogs](https://huggingface.co/datasets/winglian/chatlogs-en-cleaned)   | 29.34        |
-| [CodeAlpaca](https://huggingface.co/datasets/lucasmccabe-lmi/CodeAlpaca-20k)   | 2.62        |
-| [Rosetta Code](https://github.com/sahil280114/codealpaca/blob/master/data/rosetta_alpaca.json)   | 7.99        |
-| [Evol-CodeAlpaca 1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1)   | 73.80        |
-| [Evol-CodeAlpaca 2](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1)   | 34.91        |
-| WebAlpaca  | 43.67        |
-| [General Textbooks](https://huggingface.co/datasets/open-phi/textbooks)   | 85.59        |
-| [Programming Books](https://huggingface.co/datasets/open-phi/programming_books_llama)   | 395.63        |
-| Total | 1102.52 |
 The HTML Instruction dataset was curated by LLM360 and will be made available shortly.

 # Instruction Tuning Data
+The fine-tuning data is a mix of publicly available language and code datasets, plus **WebAlpaca**, a dataset we created ourselves and use as part of our instruction-tuning data. We will release WebAlpaca in a separate repository soon.
+The fine-tuning data is summarized below:
 <!-- <center><img src="data_table.jpg" alt="Instruction Data"/></center> -->
+| Subset | #Tokens | Avg. #Queries | Avg. Query Len | Avg. #Replies | Avg. Reply Len |
+| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
+| [OASST1-guanaco](https://huggingface.co/datasets/openaccess-ai-collective/oasst1-guanaco-extended-sharegpt) | 4,464,640 | 1.36 | 38.28 | 1.36 | 271.69 |
+| [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) | 225,628,160 | 1.00 | 259.16 | 1.00 | 151.12 |
+| [ShareGPT](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) | 112,914,432 | 3.28 | 94.53 | 3.64 | 365.81 |
+| [Evol-ShareGPT](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) | 85,954,560 | 1.00 | 145.99 | 1.00 | 425.17 |
+| [ChatLogs](https://huggingface.co/datasets/winglian/chatlogs-en-cleaned) | 29,337,600 | 3.39 | 95.58 | 3.24 | 191.42 |
+| [CodeAlpaca](https://huggingface.co/datasets/lucasmccabe-lmi/CodeAlpaca-20k) | 2,623,488 | 1.00 | 32.46 | 1.00 | 67.68 |
+| [Rosetta Code](https://github.com/sahil280114/codealpaca/blob/master/data/rosetta_alpaca.json) | 7,987,200 | 1.00 | 450.09 | 1.00 | 533.52 |
+| [Evol-CodeAlpaca 1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 73,803,776 | 1.00 | 210.33 | 1.00 | 437.92 |
+| [Evol-CodeAlpaca 2](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 34,910,208 | 1.00 | 114.99 | 1.00 | 300.29 |
+| WebAlpaca | 43,673,600 | 1.00 | 96.29 | 1.00 | 746.52 |
+| [General Textbooks](https://huggingface.co/datasets/open-phi/textbooks) | 85,590,016 | Not instruction data | | | |
+| [Programming Books](https://huggingface.co/datasets/open-phi/programming_books_llama) | 395,628,544 | Not instruction data | | | |
+| Total | 1,102,516,224 | | | | |
 The HTML Instruction dataset was curated by LLM360 and will be made available shortly.
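
As a quick sanity check on the updated table, the per-subset token counts can be summed and compared with the reported total. The sketch below is illustrative only: the names `TOKENS`, `REPORTED_TOTAL`, `NON_INSTRUCTION`, and `token_shares` are hypothetical, and the numbers are copied by hand from the `#Tokens` column rather than read from any released file.

```python
# Token counts copied by hand from the "#Tokens" column of the table above.
TOKENS = {
    "OASST1-guanaco": 4_464_640,
    "SlimOrca": 225_628_160,
    "ShareGPT": 112_914_432,
    "Evol-ShareGPT": 85_954_560,
    "ChatLogs": 29_337_600,
    "CodeAlpaca": 2_623_488,
    "Rosetta Code": 7_987_200,
    "Evol-CodeAlpaca 1": 73_803_776,
    "Evol-CodeAlpaca 2": 34_910_208,
    "WebAlpaca": 43_673_600,
    "General Textbooks": 85_590_016,
    "Programming Books": 395_628_544,
}

# Value from the "Total" row of the table.
REPORTED_TOTAL = 1_102_516_224

# Subsets the table marks as "Not instruction data".
NON_INSTRUCTION = {"General Textbooks", "Programming Books"}


def token_shares(tokens):
    """Return each subset's fraction of the summed token count."""
    total = sum(tokens.values())
    return {name: count / total for name, count in tokens.items()}


if __name__ == "__main__":
    total = sum(TOKENS.values())
    print(f"computed total: {total:,} (reported: {REPORTED_TOTAL:,})")

    shares = token_shares(TOKENS)
    # List subsets from largest to smallest share of the mix.
    for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<20} {TOKENS[name]:>13,} {share:7.2%}")

    book_share = sum(shares[name] for name in NON_INSTRUCTION)
    print(f"non-instruction (textbook) share: {book_share:.2%}")
```

The computed sum agrees with the `Total` row, and the breakdown makes it easy to see how much of the mix comes from the two textbook subsets as opposed to the instruction-style subsets.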