Commit c805715
Parent(s): 6c96c5e
Update README for model card

README.md
---
---

# Model Card for japanese-spoken-language-bert

<!-- Provide a quick summary of what the model is/does. [Optional] -->
These BERT models are pre-trained on written Japanese (Wikipedia) and fine-tuned on spoken Japanese.
We used CSJ and the Japanese National Diet records.
CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/).
We only provide the model parameters; you have to download the other config files to use these models (see [How to Get Started with the Model](#how-to-get-started-with-the-model)).

We provide the three models below:

- **1-6 layer-wise** (Folder Name: models/1-6_layer-wise)

  Only the 1st-6th layers of the encoder were fine-tuned on CSJ.

- **TAPT512 60k** (Folder Name: models/tapt512_60k)

  Fine-tuned on CSJ.

- **DAPT128-TAPT512** (Folder Name: models/dapt128-tap512)

  Fine-tuned on the National Diet records and CSJ.

# Table of Contents

- [Model Card for japanese-spoken-language-bert](#model-card-for-japanese-spoken-language-bert)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
    - [Testing Data](#testing-data)
    - [Factors](#factors)
    - [Metrics](#metrics)
  - [Results](#results)
- [Citation](#citation)
- [More Information](#more-information)
- [Model Card Authors](#model-card-authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)


# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
These BERT models are pre-trained on written Japanese (Wikipedia) and fine-tuned on spoken Japanese.
We used CSJ and the Japanese National Diet records.
CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/).
We only provide the model parameters; you have to download the other config files to use these models.

We provide the three models below:

- 1-6 layer-wise (Folder Name: models/1-6_layer-wise)

  Only the 1st-6th layers of the encoder were fine-tuned on CSJ.

- TAPT512 60k (Folder Name: models/tapt512_60k)

  Fine-tuned on CSJ.

- DAPT128-TAPT512 (Folder Name: models/dapt128-tap512)

  Fine-tuned on the National Diet records and CSJ.

- **Model type:** Language model
- **Language(s) (NLP):** ja
- **License:** Copyright (c) 2021 National Institute for Japanese Language and Linguistics and Retrieva, Inc. Licensed under the Apache License, Version 2.0 (the “License”).


# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- 1-6 layer-wise: CSJ
- TAPT512 60k: CSJ
- DAPT128-TAPT512: the Japanese National Diet records and CSJ


## Training Procedure

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

We continued pre-training the existing Japanese BERT model ([cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking); "written BERT") on the spoken-language data described above.

For details, see the [Japanese blog post](https://tech.retrieva.jp/entry/2021/04/01/114943) or the [Japanese paper](https://www.anlp.jp/proceedings/annual_meeting/2021/pdf_dir/P4-17.pdf).

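The exact procedure and hyperparameters are described in the blog post and paper above. Purely as an illustration of continued masked-language-model pre-training from the written-BERT checkpoint, a minimal Hugging Face Transformers setup might look like the sketch below; the data file, output directory, batch size, and step count are placeholder assumptions, not the settings actually used.

```python
# Minimal sketch of continued MLM pre-training (TAPT/DAPT style) from the
# written-BERT checkpoint. Paths and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "cl-tohoku/bert-base-japanese-whole-word-masking"  # "written BERT"
tokenizer = AutoTokenizer.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)

# Hypothetical file with one spoken-language sentence/utterance per line.
dataset = load_dataset("text", data_files={"train": "spoken_japanese_train.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="continued_pretraining",  # placeholder output directory
        per_device_train_batch_size=8,       # placeholder
        max_steps=60_000,                    # placeholder
        save_steps=10_000,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```
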
# Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

## Testing Data, Factors & Metrics

### Testing Data

<!-- This should link to a Data Card if possible. -->

We use CSJ for the evaluation.


### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

We evaluate the following tasks on CSJ:

- Dependency Parsing
- Sentence Boundary Detection
- Important Sentence Extraction

### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

- Dependency Parsing: Undirected Unlabeled Attachment Score (UUAS)
- Sentence Boundary Detection: F1 score
- Important Sentence Extraction: F1 score

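UUAS is less common than F1, so a compact definition may help: it is the fraction of gold dependency edges that the predicted tree recovers when edge direction (and labels) are ignored. The helper below is only a hypothetical illustration of the metric, not the evaluation code used for this card.

```python
# Illustrative UUAS: fraction of gold dependency edges recovered when
# direction is ignored. Not the actual evaluation script.
def uuas(gold_heads: list[int], pred_heads: list[int]) -> float:
    """heads[i] is the head index of token i; the root token has head -1."""
    gold_edges = {frozenset((i, h)) for i, h in enumerate(gold_heads) if h >= 0}
    pred_edges = {frozenset((i, h)) for i, h in enumerate(pred_heads) if h >= 0}
    return len(gold_edges & pred_edges) / len(gold_edges)

# Toy example: 3 of the 4 gold edges are recovered.
print(uuas([1, -1, 1, 2, 3], [1, -1, 3, 2, 3]))  # 0.75
```
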
## Results

|                 | Dependency Parsing (UUAS) | Sentence Boundary Detection (F1) | Important Sentence Extraction (F1) |
| :-------------- | ------------------------: | -------------------------------: | ---------------------------------: |
| written BERT    | 39.4 | 61.6 | 36.8 |
| 1-6 layer-wise  | 44.6 | 64.8 | 35.4 |
| TAPT512 60k     | -    | -    | 40.2 |
| DAPT128-TAPT512 | 42.9 | 64.0 | 39.7 |


# Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bib
@inproceedings{csjbert2021,
  title = {CSJを用いた日本語話し言葉BERTの作成},
  author = {勝又智 and 坂田大直},
  booktitle = {言語処理学会第27回年次大会},
  year = {2021},
}
```


# More Information

https://tech.retrieva.jp/entry/2021/04/01/114943 (in Japanese)

# Model Card Authors

<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->

Satoru Katsumata

# Model Card Contact

More information needed

# How to Get Started with the Model

Use the code below to get started with the model.

<details>
<summary> Click to expand </summary>

1. Run download_wikipedia_bert.py to download the BERT model files that were trained on Wikipedia.

```bash
python download_wikipedia_bert.py
```

This script downloads the config files and a vocab file, provided by the Inui Laboratory of Tohoku University, from the Hugging Face Model Hub.
https://github.com/cl-tohoku/bert-japanese
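
The provided script is the supported path; as a rough sketch of what it assembles, the snippet below takes the tokenizer and config from cl-tohoku/bert-base-japanese-whole-word-masking and applies the parameters from one of the folders above. The weight file name (pytorch_model.bin) and loading details are assumptions for illustration only.

```python
# Sketch of combining the written-BERT tokenizer/config with the spoken-language
# parameters from this repository. The weight file name is an assumption.
import torch
from transformers import AutoConfig, AutoTokenizer, BertForMaskedLM

base = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(base)  # vocab comes from written BERT
config = AutoConfig.from_pretrained(base)        # config.json comes from written BERT

model = BertForMaskedLM(config)
state_dict = torch.load("models/tapt512_60k/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)  # apply the fine-tuned parameters
model.eval()
```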

2. Run sample_mlm.py to confirm you can use our models.

```bash
python sample_mlm.py
```
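
sample_mlm.py itself is not reproduced here; as a rough idea of the kind of sanity check it performs, a masked-word prediction with the assembled model could look like the following. The model directory and the example sentence are placeholders, not taken from sample_mlm.py.

```python
# Hypothetical masked-LM sanity check; model directory and sentence are placeholders.
from transformers import AutoTokenizer, BertForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
model = BertForMaskedLM.from_pretrained("models/tapt512_60k")  # assumes step 1 placed the config here

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for candidate in fill_mask(f"今日は{tokenizer.mask_token}に行きます。"):
    print(candidate["token_str"], candidate["score"])
```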

</details>