Spaces:

ml-energy
/

leaderboard

Running

AmberLJC commited on Jun 15, 2023

Commit

906b628

1 Parent(s): 19b22c9

sharegpt

Files changed (4) hide show

sharegpt/.DS_Store ADDED Viewed

Binary file (6.15 kB). View file

sharegpt/README.md ADDED Viewed

+## Download ShareGPT :
+```
+https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/HTML_cleaned_raw_dataset/sg_90k_part1_html_cleaned.json
+https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/HTML_cleaned_raw_dataset/sg_90k_part2_html_cleaned.json
+```
+## Install Fastchat
+```
+pip3 install fastchat
+```
+## Clean data:
+```
+pip3 install polyglot pyicu pycld2
+python3 -m fastchat.data.optional_clean --in sg_90k_part1_html_cleaned.json --out sg_90k_part1_html_cleaned_lang.json --keep-lang en
+```
+## Extract first sentence (optional)
+```
+python extract_first.py --in-file sg_90k_part1_html_cleaned_lang.json --out-file sg_90k_part1_html_cleaned_lang_first.json
+```
+## Sample data (optional)
+```
+python3 -m fastchat.data.sample --in sg_90k_part1_html_cleaned_lang_first.json --out sg_90k_part1_html_cleaned_lang_first_sampled.json --end 10000 --max-length 10000
+```
+## ShareGPT Feeder Usage
+```
+from sharegpt_feeder import generator
+sharegpt_generator = generator()
+print(next(sharegpt_generator))
+print(next(sharegpt_generator))
+```

sharegpt/extract_first.py ADDED Viewed

+import argparse
+import json
+def extract_first_sen(content):
+    result = []
+    for item in content:
+        tmp = item
+        tmp['conversations'] = [item['conversations'][0]]
+        result.append(tmp)
+    return result
+def main(args):
+    content = json.load(open(args["in_file"], "r"))
+    content = extract_first_sen(content )
+    json.dump(content, open(args["out_file"], "w"), indent=2)
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--in-file", type=str, default = 'sg_90k_part1_html_cleaned_lang.json' )
+    parser.add_argument("--out-file", type=str, default = "sg_90k_part1_html_cleaned_lang_first.json")
+    args = parser.parse_args()
+    main(vars(args))

sharegpt/sharegpt_feeder.py ADDED Viewed

+''' Usage
+sharegpt_generator = sharegpt_generator()
+print(next(sharegpt_generator))
+print(next(sharegpt_generator))
+print(next(sharegpt_generator))
+'''
+import json
+def sharegpt_generator(file = 'sg_90k_part1_html_cleaned_lang.json'):
+    content = json.load(open(file, "r"))
+    for item in content:
+        yield item['conversations'][0]['value']