---
language:
- en
library_name: transformers
license: cc-by-4.0
tags:
- kl3m
- kl3m-002
- legal
- financial
- enterprise
- slm
date: '2024-02-20T00:00:00.000Z'
pipeline_tag: text-generation
widget:
- text: "Medical devices are regulated by"
inference:
  parameters:
    temperature: 0.3
    do_sample: true
---

# kl3m-170m Model

kl3m-170m is a (very) small language model (SLM) trained on clean, legally permissible data. Originally
developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
kl3m-170m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
with a focus on low toxicity and high efficiency.

Given its small size and lack of instruction-aligned training data, kl3m-170m is best suited either for
SLM fine-tuning or as a component in training larger models without relying on unethical data or models.

The model was originally trained between November 2023 and January 2024 on a 12xRTX4090 node with DDP. A similar model is
being provided with complete source and data replication as part of the `kl3m-004` family, to be released in Q4 2024.

## Source

[https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)

40
+ ## Training Data
41
+ While the original training data collection and training infrastructure relies on software that was not donated by
42
+ 273 Ventures, ALEA Institute is open-sourcing an improved dataset, including both replication and an API.
43
+
44
+ [https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)
45
+
46
+ Data is available upon request at this time via S3 under a Requester Pays model. We are actively working on a
47
+ zero-cost distribution model as soon as we can obtain additional support.
48
+
49
+ This model, the original `kl3m-002-170m` model, was trained on a US-only subset of the Kelvin Legal DataPack that
50
+ we believe is trained solely on public domain material. However, so as to enforce maximum transparency to all
51
+ downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.
52
+
## Model Details

### Summary

- **Architecture**: GPT-NeoX (i.e., ~GPT-3 architecture)
- **Parameters**: 170 million
- **Context Window**: 4,096 tokens (true size, no sliding window)
- **Language(s)**: Primarily English
- **Tokenizer**: kl3m-001-32k BPE tokenizer (32,768 vocabulary size with unorthodox whitespace handling)
- **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to the [ALEA Institute](https://aleainstitute.ai)
- **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- **Hardware Requirements**: Runs in real time in fp32 on a MacBook Air M1

## Performance Metrics

### Perplexity Scores

| Dataset        | Score |
|----------------|-------|
| Wiki           | 19.58 |
| CNN/Daily Mail | 11.20 |
| Legal Domain   |  2.31 |

The model demonstrates particularly strong per-parameter performance on legal domain content, outperforming many
larger models available as of its training date.

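For reference, perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. A minimal sketch of the computation from per-token log-probabilities (the function name and inputs are illustrative, not part of the kl3m evaluation code):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model assigning uniform probability 1/4 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 10))
```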
## Key Features

- **Clean Training Data**: Built on what was originally referred to as the Kelvin Legal DataPack, ensuring all training data is ethically sourced and legally permissible.
- **Low Toxicity**: [Empirically lower toxicity and bias](https://github.com/alea-institute/kl3m-toxicity).
- **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
- **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

## Use Cases

- Basic regulatory question answering
- Contract provision drafting
- Structured JSON information extraction
- Foundation for downstream optimization
- Base model for domain-specific fine-tuning

## Getting Started

```python
import json

from transformers import pipeline

# Load the model and tokenizer for CPU inference
p = pipeline('text-generation', 'alea-institute/kl3m-002-170m', device='cpu')

# Sample three completions for a short prompt
text = "Under this"
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.5, num_return_sequences=3, max_new_tokens=32)
        ],
        indent=2
    )
)
```

```json
[
  "Under this proposed rule, the Federal agency must determine the effect on State, local, and",
  "Under this proposed rule, we are proposing to amend the definition of \u201ccovered product\u201d in ",
  "Under this proposed rule, the FAA is considering issuing this proposed rule after evaluating the information"
]
```

## Contract Example

Reusing the pipeline `p` loaded above:

```python
text = "Governing Law.\n"
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.3, num_return_sequences=3, max_new_tokens=32)
        ],
        indent=2
    )
)
```

```json
[
  "Governing Law.\n The provisions of the Plan shall be construed and enforced in accordance with",
  "Governing Law.\n The laws of the State of Delaware shall govern the validity, construction, and",
  "Governing Law.\n The laws of the State of New York shall govern the validity, construction, enforcement"
]
```

## Technical Implementation

The model implements several techniques during training:

- Hybrid NTP and SFT co-training
- Dynamic, document-aware segmentation
- Randomized padding
- Traditional fixed-context attention mechanisms

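As an illustration of the randomized padding idea above, here is a hedged sketch under assumed conventions (the `pad_id`, the left/right split, and the function name are assumptions, not the actual kl3m training code):

```python
import random

def pad_randomized(token_ids, block_size, pad_id=0, seed=None):
    # Sketch of randomized padding: rather than always right-padding a short
    # sequence, split the pad tokens randomly between the left and right sides
    # so padded positions do not always occupy the same slots in the block.
    rng = random.Random(seed)
    n_pad = block_size - len(token_ids)
    if n_pad <= 0:
        # Sequence already fills the block: truncate instead of padding.
        return list(token_ids[:block_size])
    left = rng.randint(0, n_pad)
    return [pad_id] * left + list(token_ids) + [pad_id] * (n_pad - left)

print(pad_randomized([11, 12, 13], block_size=8, seed=0))
```

The intent is simply that the model sees pad tokens in varying positions during training, complementing the document-aware segmentation listed above.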
## License

This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.

The model weights are released under the CC-BY 4.0 License.

## Contact

The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries:

- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: [email protected]
- Website: https://aleainstitute.ai

## Acknowledgments

Special thanks to 273 Ventures for developing and donating this model to the open-source community through the ALEA Institute.

## Citation

Tokenizer, dataset, and model publications are pending.

![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)