---
library_name: transformers
license: apache-2.0
language:
- sv
- 'no'
- da
- is
tags:
- masked-lm
- fill-mask
- long-context
- modernbert
pipeline_tag: fill-mask
inference: false
base_model: answerdotai/ModernBERT-large
---

## Overview

This checkpoint continues the pre-training of [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) on Scandinavian text, extending the model’s knowledge with ~1.2 trillion additional masked-language-model (MLM) tokens drawn from [The Nordic Pile](https://arxiv.org/pdf/2303.17183) and [SWEb](https://arxiv.org/pdf/2410.04456) while preserving the original 8k-token context window.

Our tokenizer is trained from scratch on a subset of 11 985 103 472 tokens. Training is done in a single stage with 8 192 tokens per sample for the whole run.

## Data Sources

| Corpus | Size | Selected Languages | Highlights |
|---|---|---|---|
| **The Nordic Pile** | 1.2 TB raw text | sv, no, da, is | Nine diverse categories (CC, Wikipedia, Books, Code, etc.), filtered and deduplicated for high quality |
| **SWEb** | 1 T+ tokens (~3.6 TB) | sv, no, da, is | 98 Common-Crawl snapshots with model-based HTML extraction; 1.2 B documents |

## Training Setup

| Setting | Value |
|---|---|
| Parameters | 395 M |
| Context length | 8 192 tokens (RoPE + local-global attention) |
| Tokens processed | 9.82 × 10¹¹ / 1.20 × 10¹² (≈ 82 %) |
| Tokens per batch | 1 572 864 |
| Global batch | 192 sequences (micro-batch = 3) |
| Optimizer & schedule | Decoupled StableAdamW, lr 2e-4, cosine decay (1 % warm-up) |
| Precision | AMP-bf16 |
| Hardware | 8 nodes × 8 AMD MI250X GPUs (64 GPUs) on the EuroHPC **LUMI-G** system |

See training details [here](https://github.com/timpal0l/ModernBERT/blob/main/training/trainer_lumi.yaml).

## Training Stats

```python
[token=982585522155/1198510347252]:
  Train time/batch: 716208
  Train time/sample: 137511936
  Train time/batch_in_epoch: 716208
  Train time/sample_in_epoch: 137511936
  Train time/token: 982584117341
  Train time/token_in_epoch: 982584117341
  Train trainer/device_train_microbatch_size: 3
  Train loss/train/total: 0.8162
  Train throughput/batches_per_sec: 0.6466
  Train throughput/samples_per_sec: 124.1393
  Train throughput/device/batches_per_sec: 0.0101
  Train throughput/device/samples_per_sec: 1.9397
  Train throughput/tokens_per_sec: 887795.9110
  Train throughput/device/tokens_per_sec: 13871.8111
  Train time/train: 317.5722
  Train time/val: 0.0000
  Train time/total: 317.5722
  Train lr-StableAdamW/group0: 0.0000
  Train lr-StableAdamW/group1: 0.0000
```

## Intended Use

* Fill-mask inference, embedding extraction and fine-tuning for Scandinavian downstream NLP tasks (classification, NER, QA, etc.).
* Drop-in replacement for BERT-style encoders (omit `token_type_ids`); a minimal sketch is shown under “Embedding extraction” below.
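## Embedding extraction

For the embedding-extraction use case listed above, the following is a minimal sketch that loads the checkpoint with the generic `AutoTokenizer`/`AutoModel` classes and drops `token_type_ids` before the forward pass. The example sentences and the mask-aware mean pooling are illustrative assumptions, not a prescribed recipe.

```python
# Minimal sketch: sentence embeddings without token_type_ids.
# The example texts and the mean-pooling strategy are illustrative choices.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "AI-Sweden-Models/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

texts = [
    "Huvudstaden i Sverige är Stockholm.",
    "København er hovedstaden i Danmark.",
]
batch = tokenizer(texts, padding=True, truncation=True, max_length=8192, return_tensors="pt")
batch.pop("token_type_ids", None)  # BERT-style segment ids are not used by this encoder

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

# Mask-aware mean pooling over the sequence dimension.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(embeddings.shape)  # e.g. torch.Size([2, 1024])
```

Any other pooling (for example, taking the first token's final hidden state) works the same way; the key point is that no segment embeddings are passed to the model.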
## Fill-mask

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='AI-Sweden-Models/ModernBERT-large')
unmasker("Huvudstaden i Sverige är [MASK].")
```

```python
[{'score': 0.5732529759407043, 'token': 2961, 'token_str': ' Stockholm', 'sequence': 'Huvudstaden i Sverige är Stockholm.'},
 {'score': 0.06222670152783394, 'token': 4481, 'token_str': ' Göteborg', 'sequence': 'Huvudstaden i Sverige är Göteborg.'},
 {'score': 0.02539575845003128, 'token': 5882, 'token_str': ' Malmö', 'sequence': 'Huvudstaden i Sverige är Malmö.'},
 {'score': 0.024683712050318718, 'token': 19931, 'token_str': ' Norrköping', 'sequence': 'Huvudstaden i Sverige är Norrköping.'},
 {'score': 0.02418600209057331, 'token': 28202, 'token_str': ' Solna', 'sequence': 'Huvudstaden i Sverige är Solna.'}]
```

## Limitations & Biases

* Web corpora can contain noise, stereotypes and sensitive content despite filtering.
* RoPE extrapolation beyond 8k tokens is untested and may degrade; see the “Long inputs” sketch at the end of this card.

## Code to reproduce

* [Training](https://github.com/timpal0l/ModernBERT/tree/main/training)
* [Data Processing](https://github.com/timpal0l/ModernBERT/tree/main/tokenizer)
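## Long inputs

Extrapolation beyond the 8 192-token window is untested (see Limitations & Biases), so a pragmatic option is to cap inputs at tokenization time. The sketch below does this; the repeated dummy text, the hard truncation strategy, and the variable names are illustrative assumptions.

```python
# Sketch: cap inputs at the trained 8,192-token window via tokenizer truncation.
# The repeated dummy text is only there to exceed the window.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "AI-Sweden-Models/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

long_text = "Huvudstaden i Sverige är Stockholm. " * 2000  # far more than 8k tokens
inputs = tokenizer(long_text, truncation=True, max_length=8192, return_tensors="pt")
inputs.pop("token_type_ids", None)  # segment ids are not used by this encoder

with torch.no_grad():
    logits = model(**inputs).logits

print(inputs["input_ids"].shape)  # capped at (1, 8192)
print(logits.shape)               # (1, 8192, vocab_size)
```

For genuinely longer documents, chunking the text into windows of at most 8 192 tokens and encoding each chunk separately is a safer choice than relying on positional extrapolation.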