LoneWolfgang committed
Commit 265a925 · verified · 1 parent: 8b38d41

Update README.md

Files changed (1)
  1. README.md +16 -5
README.md CHANGED
@@ -5,15 +5,26 @@ tags: []

  # BERT for Japanese Twitter

- This is a pre-trained BERT model that specializes in Japanese Twitter.
+ This is a [Japanese BERT](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3) that has been adapted to Twitter.

+ It began with the base Japanese BERT by Tohoku NLP and continued pretraining on a Twitter corpus.

- ## Model Details
+ It is recommended for Japanese SNS tasks.

- This model was adapted from the base version (v3) of Japanese BERT by the Tohoku NLP group.

- It was pre-trained on a corpus of 28 million tweets with masked language modelling.
- The vocabulary was prepared using the WordPieceTrainer
+ ## Training Data
+
+ The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.
+
+ N-gram based deduplication was used to reduce spam content and improve the diversity of the training corpus.
+ The refined training corpus contains 28 million tweets.
+
+ ## Tokenization
+
+ The vocabulary was prepared using the [WordPieceTrainer](https://huggingface.co/docs/tokenizers/api/trainers) with the Twitter training corpus.
+ It shares 60% of its vocabulary with Japanese BERT.
+
+ The vocabulary includes colloquialisms, neologisms, emoji and kaomoji expressions that are common on Twitter.

  ### Model Description

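
Note on the "Training Data" text added in this commit: it mentions n-gram based deduplication but does not describe the method. The sketch below is only an illustration of one plausible approach, not the author's pipeline. It assumes character trigrams, Jaccard similarity, and a threshold of 0.8; those choices and the names `char_ngrams`, `jaccard`, and `dedupe` are hypothetical and do not come from the README.

```python
from typing import List, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of `text` (n=3 is an assumption)."""
    if len(text) < n:
        # Short strings are treated as a single n-gram; empty strings give an empty set.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two n-gram sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(tweets: List[str], threshold: float = 0.8, n: int = 3) -> List[str]:
    """Greedy near-duplicate filter (hypothetical, illustrative only).

    A tweet is kept only if its n-gram Jaccard similarity to every
    previously kept tweet is below `threshold`.
    """
    kept: List[str] = []
    kept_grams: List[Set[str]] = []
    for tweet in tweets:
        grams = char_ngrams(tweet, n)
        if all(jaccard(grams, seen) < threshold for seen in kept_grams):
            kept.append(tweet)
            kept_grams.append(grams)
    return kept


if __name__ == "__main__":
    sample = [
        "今日はいい天気ですね!",
        "今日はいい天気ですね!!",  # near-duplicate of the first tweet
        "新しいカフェに行ってきた",
    ]
    for tweet in dedupe(sample):
        print(tweet)
```

This greedy comparison is quadratic in the number of kept tweets, so it only suits small samples. At the scale of tens of millions of tweets, an approximate scheme such as MinHash with locality-sensitive hashing would be a more realistic choice.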