---
license: apache-2.0
datasets:
- common-pile/comma_v0.1_training_dataset
language:
- en
base_model:
- common-pile/comma-v0.1-2t
pipeline_tag: text-generation
---

## Model Description

q4 EXL2 quant of common-pile/comma-v0.1-2t.

Quantization: EXL2, 4.0 bits per weight

Calibration dataset: a 200-row subset of wikitext-2-raw-v1

max_seq_len: 4096
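
The exact 200 calibration rows are not recorded here; the sketch below shows one way such a Parquet calibration file could be assembled from wikitext-2-raw-v1 with the Hugging Face `datasets` library and handed to exllamav2's converter. The file name and row selection are illustrative assumptions, not the exact procedure used for this quant.

```python
# Illustrative sketch only: build a 200-row calibration parquet from wikitext-2-raw-v1.
# The exact rows used for this quant are not specified; this simply takes the first 200.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
wikitext.select(range(200)).to_parquet("wikitext2_cal_200.parquet")

# The resulting file can be passed to exllamav2's convert.py via its -c/--cal_dataset flag.
```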

## Model Sources

Base model:

- **Repository:** https://huggingface.co/common-pile/comma-v0.1-2t

Comma v0.1-2T is a 7 billion parameter language model trained on 2 trillion tokens from [the Comma v0.1 dataset](https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset), comprising openly licensed text from [the Common Pile](https://huggingface.co/collections/common-pile/common-pile-v01-68307d37df48e36f02717f21). Comma v0.1-2T is a "base model" that can be used as the starting point for finetuning and post-training.
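
## Usage

One way to run the quant is with the exllamav2 runtime. The following is a minimal loading-and-generation sketch, assuming exllamav2 is installed and this repository has been downloaded locally; the model path and prompt are placeholders.

```python
# Minimal sketch: load the 4.0 bpw EXL2 quant with exllamav2 and generate a completion.
# Assumes `pip install exllamav2` and a local copy of this repo; the path is a placeholder.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = "/path/to/comma-v0.1-2t-exl2-4.0bpw"  # placeholder local path

config = ExLlamaV2Config(model_dir)
config.max_seq_len = 4096                 # context length this quant was prepared with

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)               # split weights across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

# Pass paged=False here if flash-attn is unavailable on your system.
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="The Common Pile is", max_new_tokens=128))
```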

## Citation

```bibtex
@article{kandpal2025common,
  title={{The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text}},
  author={Nikhil Kandpal and Brian Lester and Colin Raffel and Sebastian Majstorovic and Stella Biderman and Baber Abbasi and Luca Soldaini and Enrico Shippole and A. Feder Cooper and Aviya Skowron and Shayne Longpre and Lintang Sutawika and Alon Albalak and Zhenlin Xu and Guilherme Penedo and Loubna Ben Allal and Elie Bakouch and John David Pressman and Honglu Fan and Dashiell Stander and Guangyu Song and Aaron Gokaslan and John Kirchenbauer and Tom Goldstein and Brian R. Bartoldson and Bhavya Kailkhura and Tyler Murray},
  journal={arXiv preprint},
  year={2025}
}
```