---
library_name: transformers
license: apache-2.0
language:
- ja
pipeline_tag: fill-mask
widget:
- text: 彼のダンス、めっちゃ[MASK]!😂
---
# BERT for Japanese Twitter

This is a Japanese BERT model adapted to Twitter. Starting from the base Japanese BERT released by Tohoku NLP, pretraining was continued on a Twitter corpus. It is recommended for Japanese SNS tasks.
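A minimal fill-mask usage sketch is shown below. The card does not state the model's Hub id, so `username/bert-japanese-twitter` is a placeholder:

```python
# Usage sketch: "username/bert-japanese-twitter" is a placeholder,
# not the model's actual Hub id.
from transformers import pipeline

def predict_mask(text: str, model_id: str = "username/bert-japanese-twitter"):
    """Return fill-mask predictions for `text` (must contain one [MASK])."""
    fill_mask = pipeline("fill-mask", model=model_id)
    return fill_mask(text)
```

For example, `predict_mask("彼のダンス、めっちゃ[MASK]!😂")` returns candidate completions ranked by score.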
## Training Data
The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.
N-gram-based deduplication was applied to reduce spam content and improve the diversity of the training corpus. The refined training corpus contains 28 million tweets.
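The deduplication step can be sketched as follows. This is a naive illustration assuming character trigrams and a Jaccard-similarity threshold; the card does not specify the exact n-gram size, threshold, or matching strategy used (large-scale pipelines typically use approximate methods such as MinHash rather than this quadratic scan):

```python
# Sketch of n-gram-based near-duplicate filtering.
# n=3 and threshold=0.7 are illustrative assumptions, not the card's settings.

def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two n-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(tweets, threshold: float = 0.7):
    """Keep a tweet only if it is not too similar to any already-kept tweet."""
    kept, kept_grams = [], []
    for tweet in tweets:
        grams = char_ngrams(tweet)
        if all(jaccard(grams, g) < threshold for g in kept_grams):
            kept.append(tweet)
            kept_grams.append(grams)
    return kept

tweets = [
    "今日はいい天気ですね",
    "今日はいい天気ですね!",   # near-duplicate spam variant, filtered out
    "新しいカフェに行ってきた",
]
print(deduplicate(tweets))  # the near-duplicate second tweet is dropped
```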
## Tokenization
The vocabulary was prepared with the WordPieceTrainer on the Twitter training corpus; it shares 60% of its entries with the base Japanese BERT vocabulary.
The vocabulary includes colloquialisms, neologisms, emoji, and kaomoji expressions that are common on Twitter.
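The vocabulary-building step can be sketched with the 🤗 tokenizers library. The corpus and `vocab_size` below are toy stand-ins, and whitespace pre-tokenization is a simplification (the base Japanese BERT relies on MeCab word segmentation):

```python
# Sketch of WordPiece vocabulary training with the tokenizers library.
# The corpus and vocab_size are toy values, not the settings used for this model.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

corpus = [
    "めっちゃ うまい 😂",
    "今日 の 天気 は 最高",
    "新しい カフェ なう",
]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # simplification; Japanese BERT segments with MeCab
trainer = WordPieceTrainer(
    vocab_size=200,  # toy size; BERT vocabularies are typically ~32k entries
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)
print(len(tokenizer.get_vocab()))
```

Training on the Twitter corpus rather than reusing the base vocabulary is what lets emoji and kaomoji enter the vocabulary as first-class tokens.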
## Model Description
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
- Developed by: Jordan Wolfgang Klein, Master's candidate at the University of Malta.
- Model type: BERT
- Language(s) (NLP): Japanese
- License: Apache 2.0
- Finetuned from model: the base Japanese BERT by Tohoku NLP