---
library_name: transformers
license: apache-2.0
language:
- ja
pipeline_tag: fill-mask
widget:
- text: 彼のダンス、めっちゃ[MASK]!😂
---
# BERT for Japanese Twitter
This is a [Japanese BERT](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3) model that has been adapted to Twitter.
It started from the base Japanese BERT released by Tohoku NLP and continued pretraining on a Twitter corpus.
It is recommended for Japanese SNS tasks.
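A minimal usage sketch with the 🤗 `fill-mask` pipeline, reusing the widget example above; the repository ID below is a placeholder and should be replaced with this model's actual ID on the Hub:

```python
from transformers import pipeline

# Placeholder repository ID; replace with this model's actual ID on the Hub.
model_id = "your-namespace/bert-for-japanese-twitter"

fill_mask = pipeline("fill-mask", model=model_id)

# Widget example above: "His dancing is so [MASK]! 😂"
for prediction in fill_mask("彼のダンス、めっちゃ[MASK]!😂"):
    print(prediction["token_str"], round(prediction["score"], 3))
```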
## Training Data
The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.
N-gram based deduplication was used to reduce spam content and improve the diversity of the training corpus.
The refined training corpus was 28 million tweets.
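The exact deduplication procedure is not described here; the sketch below illustrates one common approach, character n-gram Jaccard overlap, purely as an example (the function names and the 0.8 threshold are illustrative, not the settings used for this corpus):

```python
def char_ngrams(text: str, n: int = 5) -> set:
    """Set of character n-grams for a tweet."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def deduplicate(tweets, n: int = 5, threshold: float = 0.8):
    """Keep a tweet only if its n-gram Jaccard overlap with every kept tweet stays below the threshold."""
    kept, kept_grams = [], []
    for tweet in tweets:
        grams = char_ngrams(tweet, n)
        if all(len(grams & seen) / len(grams | seen) < threshold for seen in kept_grams):
            kept.append(tweet)
            kept_grams.append(grams)
    return kept

# The near-duplicate second tweet is dropped; the unrelated third one is kept.
print(deduplicate(["今日はいい天気☀️", "今日はいい天気☀️✨", "新曲出ました、聴いてね"]))
```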
## Tokenization
The vocabulary was prepared using the [WordPieceTrainer](https://huggingface.co/docs/tokenizers/api/trainers) with the Twitter training corpus.
It shares 60% of its vocabulary with Japanese BERT.
The vocabulary includes colloquialisms, neologisms, emoji, and kaomoji expressions that are common on Twitter.
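As a rough illustration, the snippet below trains a WordPiece vocabulary on a tweet corpus with the `tokenizers` library; the corpus file name, vocabulary size, and whitespace pre-tokenization are assumptions for the example, not the exact settings used for this model (Japanese BERT pipelines typically segment with MeCab before applying WordPiece):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Assumed corpus file with one tweet per line (hypothetical name).
corpus_files = ["tweets.txt"]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # placeholder; real Japanese pipelines usually segment with MeCab first

trainer = WordPieceTrainer(
    vocab_size=32768,  # assumed; roughly matches bert-base-japanese-v3's vocabulary size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(corpus_files, trainer)
tokenizer.save("twitter-wordpiece-tokenizer.json")
```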
## Model Description
This is the model card of a 🤗 transformers model that has been pushed to the Hub. Parts of this model card were automatically generated.
- **Developed by:** Jordan Wolfgang Klein, Master's candidate at the University of Malta.
- **Model type:** BERT
- **Language(s) (NLP):** Japanese
- **License:** apache-2.0
- **Finetuned from model:** [tohoku-nlp/bert-base-japanese-v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3)