LoneWolfgang/bert-for-japanese-twitter

LoneWolfgang committed on Aug 5, 2024

Commit 265a925 · verified · 1 Parent(s): 8b38d41

Update README.md

Files changed (1)

README.md +16 -5

@@ -5,15 +5,26 @@ tags: []
 # BERT for Japanese Twitter
-This is a pre-trained BERT model that specializes in Japanese Twitter.
-## Model Details
-This model was adapted from the base version (v3) of Japanese BERT by the Tohoku NLP group.
-It was pre-trained on a corpus of 28 million tweets with masked language modelling.
-The vocabulary was prepared using the WordPieceTrainer
 ### Model Description

 # BERT for Japanese Twitter
+This is a [Japanese BERT](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3) that has been adapted to Twitter.
+It began with the base Japanese BERT by Tohoku NLP and continued pretraining on a Twitter corpus.
+It is recommended for Japanese SNS tasks.
+## Training Data
+The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.
+N-gram based deduplication was used to reduce spam content and improve the diversity of the training corpus.
+The refined training corpus contains 28 million tweets.
+## Tokenization
+The vocabulary was prepared using the [WordPieceTrainer](https://huggingface.co/docs/tokenizers/api/trainers) with the Twitter training corpus.
+It shares 60% of its vocabulary with Japanese BERT.
+The vocabulary includes colloquialisms, neologisms, emoji, and kaomoji expressions that are common on Twitter.
 ### Model Description
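
The updated README mentions n-gram based deduplication and a 60% vocabulary overlap with Japanese BERT, but gives no implementation details. The following is a minimal sketch of what those two steps could look like, not the author's actual pipeline. Everything specific here is an assumption: the use of character trigrams, Jaccard similarity, the `0.8` threshold, the helper names, and the choice to measure overlap as a fraction of the new vocabulary.

```python
from typing import Iterable, List, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams in `text`.

    Character-level n-grams are an assumption; the README does not say
    whether characters or tokens were used.
    """
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two n-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(tweets: Iterable[str], n: int = 3, threshold: float = 0.8) -> List[str]:
    """Keep a tweet only if it is not a near-duplicate of one already kept.

    The threshold is a hypothetical value chosen for illustration.
    """
    kept: List[str] = []
    kept_grams: List[Set[str]] = []
    for tweet in tweets:
        grams = char_ngrams(tweet, n)
        if any(jaccard(grams, seen) >= threshold for seen in kept_grams):
            continue  # too similar to an earlier tweet, likely spam or a repost
        kept.append(tweet)
        kept_grams.append(grams)
    return kept


def shared_vocab_fraction(new_vocab: Iterable[str], base_vocab: Iterable[str]) -> float:
    """Fraction of the new vocabulary that also appears in the base vocabulary.

    Using the new vocabulary as the denominator is an assumption about how
    the 60% figure was computed.
    """
    new_set = set(new_vocab)
    if not new_set:
        return 0.0
    return len(new_set & set(base_vocab)) / len(new_set)


if __name__ == "__main__":
    sample = ["今日はいい天気ですね", "今日はいい天気ですね!", "ラーメン食べたい"]
    print(dedupe(sample))
    print(shared_vocab_fraction(["a", "b", "c", "d", "e"], ["a", "b", "c", "x"]))
```

This pairwise comparison is quadratic in the number of tweets, so it only illustrates the idea. A corpus of 28 million tweets would need a scalable approach such as hashing or indexing the n-grams.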