nielsr (HF Staff) committed
Commit d7bf2c5 · verified · 1 Parent(s): af0b10c

Update pipeline tag and add library name for `ettin-decoder-32m`


This PR improves the model card for the `jhu-clsp/ettin-decoder-32m` model by:

* Changing the `pipeline_tag` from `fill-mask` to `text-generation`, accurately reflecting its primary use case as a decoder-only model for generative tasks. This ensures the model appears under the correct filter on the Hugging Face Hub (https://huggingface.co/models?pipeline_tag=text-generation).
* Adding `library_name: transformers` to the metadata, enabling the "how to use" widget and better integration with the Hugging Face ecosystem, as the model is compatible with the library.
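With the updated metadata, the model card's own decoder usage pattern applies directly: the model is loaded as a causal LM and used for generation. A minimal sketch with `transformers` (the prompt and generation settings below are illustrative, not taken from the card):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Decoder-only Ettin model, loaded the same way as the card's other decoder examples.
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-32m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-32m")

# Illustrative prompt; any short text works.
inputs = tokenizer("The key advantage of decoder-only models is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```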

Files changed (1):
1. README.md (+45 / -42)
README.md CHANGED
@@ -1,9 +1,11 @@
  ---
- license: mit
  language:
  - en
- pipeline_tag: fill-mask
+ license: mit
+ pipeline_tag: text-generation
+ library_name: transformers
  ---
+
  # Ettin: an Open Suite of Paired Encoders and Decoders

  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -82,11 +84,11 @@ model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

  Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:

- 1. **Identical training data** - Same high-quality mixture across all models
- 2. **Open Training Data** - Data is available now with batch-level training data for each of the 250+ checkpoints
- 3. **Matched architectures** - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
- 4. **Consistent training recipe** - Three-phase training with 2T tokens
- 5. **Multiple scales** - From 17M to 1B parameters
+ 1. **Identical training data** - Same high-quality mixture across all models
+ 2. **Open Training Data** - Data is available now with batch-level training data for each of the 250+ checkpoints
+ 3. **Matched architectures** - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
+ 4. **Consistent training recipe** - Three-phase training with 2T tokens
+ 5. **Multiple scales** - From 17M to 1B parameters

  This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.

@@ -94,10 +96,10 @@ This approach allows for true apples-to-apples comparisons between encoder and d

  The training data is publicly available and split across different phases:

- - **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
- - **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
- - **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
- - **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step)
+ - **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
+ - **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
+ - **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
+ - **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step)

  ## Model Family

@@ -143,13 +145,13 @@ These models demonstrate what happens when you continue training encoders as dec
  **Load as decoders** using `AutoModelForCausalLM`:

  | Size | Model | Parameters | Description | Download |
- |:-----|:------|:-----------|:------------|:---------|
+ |:-----|:------|:-----------|:---------|:---------|
  | XXS | [ettin-decoder-from-encoder-17m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) | 17M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) |
  | XS | [ettin-decoder-from-encoder-32m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) | 32M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) |
- | Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) |
- | Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) |
- | Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) |
- | XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) | 1B | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) |
+ | Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-68m) |
+ | Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-150m) | 150M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-150m) |
+ | Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-400m) |
+ | XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-1b) | 1B | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-1b) |

  **Example Usage for Cross-Objective Models:**
  ```python
@@ -174,9 +176,9 @@ All raw training checkpoints are available in the [jhu-clsp/ettin-checkpoints](h
  #### HuggingFace Format Checkpoints
  Each model repository contains multiple tagged versions representing different training stages:

- - **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
- - **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
- - **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)
+ - **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
+ - **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
+ - **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -209,27 +211,27 @@ This checkpoint availability enables detailed analysis of training dynamics, los

  Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:

- - **Identical Training Data**: Same 2T token mixture across all models
- - **Matched Architectures**: Only attention patterns and objectives differ
- - **Open Everything**: Training data, model weights, and batch-level training order
- - **Multiple Scales**: Fair comparison from 17M to 1B parameters
- - **250+ Checkpoints**: Complete training trajectory analysis
+ - **Identical Training Data**: Same 2T token mixture across all models
+ - **Matched Architectures**: Only attention patterns and objectives differ
+ - **Open Everything**: Training data, model weights, and batch-level training order
+ - **Multiple Scales**: Fair comparison from 17M to 1B parameters
+ - **250+ Checkpoints**: Complete training trajectory analysis

  ### Use Cases for Researchers

- - **Architecture Studies**: Compare encoder vs decoder capabilities fairly
- - **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
- - **Scaling Laws**: Study how architectural advantages change with scale
- - **Transfer Learning**: Investigate cross-objective training effectiveness
- - **Replication Studies**: First open replication of ModernBERT training recipe
+ - **Architecture Studies**: Compare encoder vs decoder capabilities fairly
+ - **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
+ - **Scaling Laws**: Study how architectural advantages change with scale
+ - **Transfer Learning**: Investigate cross-objective training effectiveness
+ - **Replication Studies**: First open replication of ModernBERT training recipe

  ### Reproducibility

  All training artifacts are publicly available:
- - Training data with exact batch ordering
- - Model checkpoints every 8.5B tokens
- - Complete hyperparameter configurations
- - Training code and evaluation scripts
+ - Training data with exact batch ordering
+ - Model checkpoints every 8.5B tokens
+ - Complete hyperparameter configurations
+ - Training code and evaluation scripts

  ## Training Details

@@ -238,14 +240,14 @@ All training artifacts are publicly available:
  **Architecture:** Transformer with RoPE, GLU activations, and prenorm layers

  **Training Phases:**
- - **Pre-training**: 1.7T tokens with diverse data mixture
- - **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
- - **Decay phase**: 100B tokens with premium data sources
+ - **Pre-training**: 1.7T tokens with diverse data mixture
+ - **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
+ - **Decay phase**: 100B tokens with premium data sources

  **Key Features:**
- - Context length: Up to 8K tokens
- - Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- - Deep but efficient architectures following MobileLLM principles
+ - Context length: Up to 8K tokens
+ - Vocabulary: 50,368 tokens (ModernBERT tokenizer)
+ - Deep but efficient architectures following MobileLLM principles

  ## Model Architecture

@@ -262,7 +264,7 @@ All training artifacts are publicly available:

  ### Encoder: Masked Language Modeling
  <details>
- <summary>Click to expand <strong>encoder</strong> usage examples</summary>
+ <summary>Click to expand **encoder** usage examples</summary>

  ```python
  from transformers import AutoTokenizer, AutoModelForMaskedLM
@@ -296,7 +298,7 @@ print(f"Predictions: {predictions}")
  ### Decoder: Text Generation

  <details>
- <summary>Click to expand <strong>decoder text generation</strong></summary>
+ <summary>Click to expand **decoder text generation**</summary>

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -783,7 +785,8 @@ def main():
  model.push_to_hub(run_name)
  except Exception:
  logging.error(
- f"Error uploading model to the Hugging Face Hub:\n{traceback.format_exc()}To upload it manually, you can run "
+ f"Error uploading model to the Hugging Face Hub:
+ {traceback.format_exc()}To upload it manually, you can run "
  f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
  f"and saving it using `model.push_to_hub('{run_name}')`."
  )
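One usage note on the training-stage tags mentioned in the hunks above (`step{number}`, `ext{number}`, `decay{number}`): a specific tagged checkpoint can be loaded by passing the tag as the `revision` argument. A minimal sketch, assuming the example tag quoted in the card exists on this repository (actual tags may differ per model):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Example tag taken from the card's checkpoint naming scheme; verify it exists on the repo.
revision = "step599525"
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-32m", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-32m", revision=revision)
```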