Update README.md
Browse files
README.md
CHANGED
|
@@ -23,13 +23,13 @@ Our ModernBERT-Ja-30M is trained on a high-quality corpus of Japanese and Englis
|
|
| 23 |
You can use our models directly with the transformers library v4.48.0 or higher:
|
| 24 |
|
| 25 |
```bash
|
| 26 |
-
pip install -U transformers>=4.48.0
|
| 27 |
```
|
| 28 |
|
| 29 |
Additionally, if your GPUs support Flash Attention 2, we recommend using our models with Flash Attention 2.
|
| 30 |
|
| 31 |
```
|
| 32 |
-
pip install flash-attn
|
| 33 |
```
|
| 34 |
|
| 35 |
### Example Usage
|
|
@@ -55,6 +55,8 @@ for result in results:
|
|
| 55 |
|
| 56 |
## Model Series
|
| 57 |
|
|
|
|
|
|
|
| 58 |
|ID| #Param. | #Param.<br>w/o Emb.|Dim.|Inter. Dim.|#Layers|
|
| 59 |
|-|-|-|-|-|-|
|
| 60 |
|[**sbintuitions/modernbert-ja-30m**](https://huggingface.co/sbintuitions/modernbert-ja-30m)|37M|10M|256|1024|10|
|
|
@@ -62,6 +64,13 @@ for result in results:
|
|
| 62 |
|[sbintuitions/modernbert-ja-130m](https://huggingface.co/sbintuitions/modernbert-ja-130m)|132M|80M|512|2048|19|
|
| 63 |
|[sbintuitions/modernbert-ja-310m](https://huggingface.co/sbintuitions/modernbert-ja-310m)|315M|236M|768|3072|25|
|
| 64 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 65 |
|
| 66 |
## Model Description
|
| 67 |
|
|
@@ -79,7 +88,7 @@ Next, we conducted two phases of context length extension.
|
|
| 79 |
- The sequence length is **8,192** with [best-fit packing](https://arxiv.org/abs/2404.10830).
|
| 80 |
- Masking rate is **30%** (with 80-10-10 rule).
|
| 81 |
3. **Context Extension (CE): Phase 2**
|
| 82 |
-
- Training with **450B tokens**,
|
| 83 |
- The sequence length is **8,192** without sequence packing.
|
| 84 |
- Masking rate is **15%** (with 80-10-10 rule).
|
| 85 |
|
|
@@ -145,20 +154,22 @@ For datasets with predefined `train`, `validation`, and `test` sets, we simply t
|
|
| 145 |
|
| 146 |
| Model | #Param. | #Param.<br>w/o Emb. | **Avg.** | [JComQA](https://github.com/yahoojapan/JGLUE)<br>(Acc.) | [RCQA](https://www.cl.ecei.tohoku.ac.jp/rcqa/)<br>(Acc.) | [JCoLA](https://github.com/osekilab/JCoLA)<br>(Acc.) | [JNLI](https://github.com/yahoojapan/JGLUE)<br>(Acc.) | [JSICK](https://github.com/verypluming/JSICK)<br>(Acc.) | [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)<br>(Acc.) | [KU RTE](https://nlp.ist.i.kyoto-u.ac.jp/index.php?Textual+Entailment+%E8%A9%95%E4%BE%A1%E3%83%87%E3%83%BC%E3%82%BF)<br>(Acc.) | [JSTS](https://github.com/yahoojapan/JGLUE)<br>(Spearman's ρ) | [Livedoor](https://www.rondhuit.com/download.html)<br>(Acc.) | [Toxicity](https://llm-jp.nii.ac.jp/llm/2024/08/07/llm-jp-toxicity-dataset.html)<br>(Acc.) | [MARC-ja](https://github.com/yahoojapan/JGLUE)<br>(Acc.) | [WRIME](https://github.com/ids-cv/wrime)<br>(Acc.) |
|
| 147 |
| ------ | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
|
| 148 |
-
| [**ModernBERT-Ja-30M**](https://huggingface.co/sbintuitions/modernbert-ja-30m)<br>(this model) | 37M | 10M |
|
| 149 |
| [ModernBERT-Ja-70M](https://huggingface.co/sbintuitions/modernbert-ja-70m) | 70M | 31M | 86.77 | 85.65 | 83.51 | 80.26 | 90.33 | 85.01 | 92.73 | 60.08 | 87.59 | 96.34 | 91.01 | 96.13 | 92.59 |
|
| 150 |
| [ModernBERT-Ja-130M](https://huggingface.co/sbintuitions/modernbert-ja-130m) | 132M | 80M | 88.95 | 91.01 | 85.28 | 84.18 | 92.03 | 86.61 | 94.01 | 65.56 | 89.20 | 97.42 | 91.57 | 96.48 | 93.99 |
|
| 151 |
| [ModernBERT-Ja-310M](https://huggingface.co/sbintuitions/modernbert-ja-310m) | 315M | 236M | 89.83 | 93.53 | 86.18 | 84.81 | 92.93 | 86.87 | 94.48 | 68.79 | 90.53 | 96.99 | 91.24 | 96.39 | 95.23 |
|
|
|
|
|
|
|
| 152 |
| [Tohoku BERT-base v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3)| 111M | 86M | 86.74 | 82.82 | 83.65 | 81.50 | 89.68 | 84.96 | 92.32 | 60.56 | 87.31 | 96.91 | 93.15 | 96.13 | 91.91 |
|
| 153 |
| [LUKE-japanese-base-lite](https://huggingface.co/studio-ousia/luke-japanese-base-lite)| 133M | 107M | 87.15 | 82.95 | 83.53 | 82.39 | 90.36 | 85.26 | 92.78 | 60.89 | 86.68 | 97.12 | 93.48 | 96.30 | 94.05 |
|
| 154 |
| [Kyoto DeBERTa-v3](https://huggingface.co/ku-nlp/deberta-v3-base-japanese)| 160M | 86M | 88.31 | 87.44 | 84.90 | 84.35 | 91.91 | 86.22 | 93.41 | 63.31 | 88.51 | 97.10 | 92.58 | 96.32 | 93.64 |
|
| 155 |
| [KoichiYasuoka/modernbert-base-japanese-wikipedia](https://huggingface.co/KoichiYasuoka/modernbert-base-japanese-wikipedia)| 160M | 110M | 82.41 | 62.59 | 81.19 | 76.80 | 84.11 | 82.01 | 90.51 | 60.48 | 81.74 | 97.10 | 90.34 | 94.85 | 87.25 |
|
| 156 |
| | | | | | | | | | | | | | | | |
|
| 157 |
-
| [Tohoku BERT-large v2](https://huggingface.co/tohoku-nlp/bert-large-japanese-v2)| 337M | 303M | 88.36 | 86.93 | 84.81 | 82.89 | 92.05 | 85.33 | 93.32 | 64.60 | 89.11 | 97.64 | 94.38 | 96.46 | 92.77 |
|
| 158 |
| [Tohoku BERT-large char v2](https://huggingface.co/cl-tohoku/bert-large-japanese-char-v2)| 311M | 303M | 87.23 | 85.08 | 84.20 | 81.79 | 90.55 | 85.25 | 92.63 | 61.29 | 87.64 | 96.55 | 93.26 | 96.25 | 92.29 |
|
|
|
|
| 159 |
| [Waseda RoBERTa-large (Seq. 512)](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp)| 337M | 303M | 88.37 | 88.81 | 84.50 | 82.34 | 91.37 | 85.49 | 93.97 | 61.53 | 88.95 | 96.99 | 95.06 | 96.38 | 95.09 |
|
| 160 |
| [Waseda RoBERTa-large (Seq. 128)](https://huggingface.co/nlp-waseda/roberta-large-japanese-with-auto-jumanpp)| 337M | 303M | 88.36 | 89.35 | 83.63 | 84.26 | 91.53 | 85.30 | 94.05 | 62.82 | 88.67 | 95.82 | 93.60 | 96.05 | 95.23 |
|
| 161 |
-
| [LUKE-japanese-large-lite](https://huggingface.co/studio-ousia/luke-japanese-large-lite)| 414M | 379M |
|
| 162 |
| [RetrievaBERT](https://huggingface.co/retrieva-jp/bert-1.3b)| 1.30B | 1.15B | 86.79 | 80.55 | 84.35 | 80.67 | 89.86 | 85.24 | 93.46 | 60.48 | 87.30 | 97.04 | 92.70 | 96.18 | 93.61 |
|
| 163 |
| | | | | | | | | | | | | | | | |
|
| 164 |
| [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased)| 178M | 86M | 83.48 | 66.08 | 82.76 | 77.32 | 88.15 | 84.20 | 91.25 | 60.56 | 84.18 | 97.01 | 89.21 | 95.05 | 85.99 |
|
|
|
|
| 23 |
You can use our models directly with the transformers library v4.48.0 or higher:
|
| 24 |
|
| 25 |
```bash
|
| 26 |
+
pip install -U "transformers>=4.48.0"
|
| 27 |
```
|
| 28 |
|
| 29 |
Additionally, if your GPUs support Flash Attention 2, we recommend using our models with Flash Attention 2.
|
| 30 |
|
| 31 |
```
|
| 32 |
+
pip install flash-attn --no-build-isolation
|
| 33 |
```
|
| 34 |
|
| 35 |
### Example Usage
|
|
|
|
| 55 |
|
| 56 |
## Model Series
|
| 57 |
|
| 58 |
+
We provide ModernBERT-Ja in several model sizes. Below is a summary of each model.
|
| 59 |
+
|
| 60 |
|ID| #Param. | #Param.<br>w/o Emb.|Dim.|Inter. Dim.|#Layers|
|
| 61 |
|-|-|-|-|-|-|
|
| 62 |
|[**sbintuitions/modernbert-ja-30m**](https://huggingface.co/sbintuitions/modernbert-ja-30m)|37M|10M|256|1024|10|
|
|
|
|
| 64 |
|[sbintuitions/modernbert-ja-130m](https://huggingface.co/sbintuitions/modernbert-ja-130m)|132M|80M|512|2048|19|
|
| 65 |
|[sbintuitions/modernbert-ja-310m](https://huggingface.co/sbintuitions/modernbert-ja-310m)|315M|236M|768|3072|25|
|
| 66 |
|
| 67 |
+
For all models,
|
| 68 |
+
the vocabulary size is 102,400,
|
| 69 |
+
the head dimension is 64,
|
| 70 |
+
and the activation function is GELU.
|
| 71 |
+
The configuration for global attention and sliding window attention consists of 1 layer + 2 layers (global–local–local).
|
| 72 |
+
The sliding window attention window context size is 128, with global_rope_theta set to 160,000 and local_rope_theta set to 10,000.
|
| 73 |
+
|
| 74 |
|
| 75 |
## Model Description
|
| 76 |
|
|
|
|
| 88 |
- The sequence length is **8,192** with [best-fit packing](https://arxiv.org/abs/2404.10830).
|
| 89 |
- Masking rate is **30%** (with 80-10-10 rule).
|
| 90 |
3. **Context Extension (CE): Phase 2**
|
| 91 |
+
- Training with **450B tokens**, including 150B tokens of high-quality Japanese data, over 3 epochs.
|
| 92 |
- The sequence length is **8,192** without sequence packing.
|
| 93 |
- Masking rate is **15%** (with 80-10-10 rule).
|
| 94 |
|
|
|
|
| 154 |
|
| 155 |
| Model | #Param. | #Param.<br>w/o Emb. | **Avg.** | [JComQA](https://github.com/yahoojapan/JGLUE)<br>(Acc.) | [RCQA](https://www.cl.ecei.tohoku.ac.jp/rcqa/)<br>(Acc.) | [JCoLA](https://github.com/osekilab/JCoLA)<br>(Acc.) | [JNLI](https://github.com/yahoojapan/JGLUE)<br>(Acc.) | [JSICK](https://github.com/verypluming/JSICK)<br>(Acc.) | [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)<br>(Acc.) | [KU RTE](https://nlp.ist.i.kyoto-u.ac.jp/index.php?Textual+Entailment+%E8%A9%95%E4%BE%A1%E3%83%87%E3%83%BC%E3%82%BF)<br>(Acc.) | [JSTS](https://github.com/yahoojapan/JGLUE)<br>(Spearman's ρ) | [Livedoor](https://www.rondhuit.com/download.html)<br>(Acc.) | [Toxicity](https://llm-jp.nii.ac.jp/llm/2024/08/07/llm-jp-toxicity-dataset.html)<br>(Acc.) | [MARC-ja](https://github.com/yahoojapan/JGLUE)<br>(Acc.) | [WRIME](https://github.com/ids-cv/wrime)<br>(Acc.) |
|
| 156 |
| ------ | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
|
| 157 |
+
| [**ModernBERT-Ja-30M**](https://huggingface.co/sbintuitions/modernbert-ja-30m)<br>(this model) | 37M | 10M | <u>85.67</u> | 80.95 | 82.35 | 78.85 | 88.69 | 84.39 | 91.79 | 61.13 | 85.94 | 97.20 | 89.33 | 95.87 | 91.61 |
|
| 158 |
| [ModernBERT-Ja-70M](https://huggingface.co/sbintuitions/modernbert-ja-70m) | 70M | 31M | 86.77 | 85.65 | 83.51 | 80.26 | 90.33 | 85.01 | 92.73 | 60.08 | 87.59 | 96.34 | 91.01 | 96.13 | 92.59 |
|
| 159 |
| [ModernBERT-Ja-130M](https://huggingface.co/sbintuitions/modernbert-ja-130m) | 132M | 80M | 88.95 | 91.01 | 85.28 | 84.18 | 92.03 | 86.61 | 94.01 | 65.56 | 89.20 | 97.42 | 91.57 | 96.48 | 93.99 |
|
| 160 |
| [ModernBERT-Ja-310M](https://huggingface.co/sbintuitions/modernbert-ja-310m) | 315M | 236M | 89.83 | 93.53 | 86.18 | 84.81 | 92.93 | 86.87 | 94.48 | 68.79 | 90.53 | 96.99 | 91.24 | 96.39 | 95.23 |
|
| 161 |
+
| | | | | | | | | | | | | | | | |
|
| 162 |
+
| [LINE DistillBERT](https://huggingface.co/line-corporation/line-distilbert-base-japanese)| 68M | 43M | 85.32 | 76.39 | 82.17 | 81.04 | 87.49 | 83.66 | 91.42 | 60.24 | 84.57 | 97.26 | 91.46 | 95.91 | 92.16 |
|
| 163 |
| [Tohoku BERT-base v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3)| 111M | 86M | 86.74 | 82.82 | 83.65 | 81.50 | 89.68 | 84.96 | 92.32 | 60.56 | 87.31 | 96.91 | 93.15 | 96.13 | 91.91 |
|
| 164 |
| [LUKE-japanese-base-lite](https://huggingface.co/studio-ousia/luke-japanese-base-lite)| 133M | 107M | 87.15 | 82.95 | 83.53 | 82.39 | 90.36 | 85.26 | 92.78 | 60.89 | 86.68 | 97.12 | 93.48 | 96.30 | 94.05 |
|
| 165 |
| [Kyoto DeBERTa-v3](https://huggingface.co/ku-nlp/deberta-v3-base-japanese)| 160M | 86M | 88.31 | 87.44 | 84.90 | 84.35 | 91.91 | 86.22 | 93.41 | 63.31 | 88.51 | 97.10 | 92.58 | 96.32 | 93.64 |
|
| 166 |
| [KoichiYasuoka/modernbert-base-japanese-wikipedia](https://huggingface.co/KoichiYasuoka/modernbert-base-japanese-wikipedia)| 160M | 110M | 82.41 | 62.59 | 81.19 | 76.80 | 84.11 | 82.01 | 90.51 | 60.48 | 81.74 | 97.10 | 90.34 | 94.85 | 87.25 |
|
| 167 |
| | | | | | | | | | | | | | | | |
|
|
|
|
| 168 |
| [Tohoku BERT-large char v2](https://huggingface.co/cl-tohoku/bert-large-japanese-char-v2)| 311M | 303M | 87.23 | 85.08 | 84.20 | 81.79 | 90.55 | 85.25 | 92.63 | 61.29 | 87.64 | 96.55 | 93.26 | 96.25 | 92.29 |
|
| 169 |
+
| [Tohoku BERT-large v2](https://huggingface.co/tohoku-nlp/bert-large-japanese-v2)| 337M | 303M | 88.36 | 86.93 | 84.81 | 82.89 | 92.05 | 85.33 | 93.32 | 64.60 | 89.11 | 97.64 | 94.38 | 96.46 | 92.77 |
|
| 170 |
| [Waseda RoBERTa-large (Seq. 512)](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp)| 337M | 303M | 88.37 | 88.81 | 84.50 | 82.34 | 91.37 | 85.49 | 93.97 | 61.53 | 88.95 | 96.99 | 95.06 | 96.38 | 95.09 |
|
| 171 |
| [Waseda RoBERTa-large (Seq. 128)](https://huggingface.co/nlp-waseda/roberta-large-japanese-with-auto-jumanpp)| 337M | 303M | 88.36 | 89.35 | 83.63 | 84.26 | 91.53 | 85.30 | 94.05 | 62.82 | 88.67 | 95.82 | 93.60 | 96.05 | 95.23 |
|
| 172 |
+
| [LUKE-japanese-large-lite](https://huggingface.co/studio-ousia/luke-japanese-large-lite)| 414M | 379M | 88.94 | 88.01 | 84.84 | 84.34 | 92.37 | 86.14 | 94.32 | 64.68 | 89.30 | 97.53 | 93.71 | 96.49 | 95.59 |
|
| 173 |
| [RetrievaBERT](https://huggingface.co/retrieva-jp/bert-1.3b)| 1.30B | 1.15B | 86.79 | 80.55 | 84.35 | 80.67 | 89.86 | 85.24 | 93.46 | 60.48 | 87.30 | 97.04 | 92.70 | 96.18 | 93.61 |
|
| 174 |
| | | | | | | | | | | | | | | | |
|
| 175 |
| [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased)| 178M | 86M | 83.48 | 66.08 | 82.76 | 77.32 | 88.15 | 84.20 | 91.25 | 60.56 | 84.18 | 97.01 | 89.21 | 95.05 | 85.99 |
|