|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
language: |
|
- ja |
|
pipeline_tag: fill-mask |
|
widget: |
|
- text: 彼のダンス、めっちゃ[MASK]!😂 |
|
--- |
|
|
|
# BERT for Japanese Twitter |
|
|
|
This is a [Japanese BERT](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3) that has been adapted to Twitter. |
|
|
|
It began with the base Japanese BERT by Tohoku NLP and continued pretraining on a Twitter corpus.
|
|
|
It is recommended for Japanese SNS (social media) tasks.
|
|
|
|
|
## Training Data |
|
|
|
The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.
|
|
|
N-gram based deduplication was used to reduce spam content and improve the diversity of the training corpus. |
|
The refined training corpus was 28 million tweets. |
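As a rough illustration, n-gram-based deduplication can be sketched with character n-gram Jaccard similarity. This is a minimal sketch only; the exact method, n-gram size, and threshold used for this corpus are not specified in this card and are assumptions here.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams for a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def is_near_duplicate(a, b, n=3, threshold=0.7):
    """True if the Jaccard similarity of the two n-gram sets meets the threshold."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return False
    return len(ga & gb) / len(ga | gb) >= threshold

def deduplicate(tweets, threshold=0.7):
    """Keep each tweet only if it is not a near-duplicate of an already-kept tweet."""
    kept = []
    for tweet in tweets:
        if not any(is_near_duplicate(tweet, k, threshold=threshold) for k in kept):
            kept.append(tweet)
    return kept

tweets = [
    "今日の天気は最高!",
    "今日の天気は最高!!",   # near-duplicate spam variant, filtered out
    "新曲をリリースしました",
]
print(deduplicate(tweets))
```

Filtering near-duplicates like this reduces repetitive spam so the model sees a more diverse sample of Twitter language.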
|
|
|
## Tokenization |
|
|
|
The vocabulary was prepared using the [WordPieceTrainer](https://huggingface.co/docs/tokenizers/api/trainers) with the Twitter training corpus. |
|
It shares 60% of its vocabulary with Japanese BERT. |
|
|
|
The vocabulary includes colloquialisms, neologisms, emoji, and kaomoji expressions that are common on Twitter.
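For illustration, training a WordPiece vocabulary with the `tokenizers` library looks roughly like the following. This is a toy sketch: the real vocabulary was trained on the full 28-million-tweet corpus with a much larger vocabulary size, and details such as the pre-tokenizer are assumptions here.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Toy stand-in corpus; the actual vocabulary was trained on the Twitter corpus.
corpus = [
    "彼のダンス、めっちゃ 上手い !",
    "今日 も いい 天気 (^^)",
    "新曲 リリース しました 🎉",
]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # assumption: the real setup may differ

trainer = WordPieceTrainer(
    vocab_size=200,  # toy size; BERT vocabularies are typically ~32k entries
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.get_vocab_size())
print(tokenizer.encode("めっちゃ いい").tokens)
```

Because the vocabulary is learned from tweets, frequent informal expressions, emoji, and kaomoji end up as whole tokens rather than being split into many fragments or mapped to `[UNK]`.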
|
|
|
### Model Description |
|
|
|
|
|
|
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
|
|
|
- **Developed by:** Jordan Wolfgang Klein, as a Master's candidate at the University of Malta.
|
- **Model type:** BERT |
|
- **Language(s) (NLP):** Japanese |
|
- **License:** apache-2.0
|
- **Finetuned from model:** [tohoku-nlp/bert-base-japanese-v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3)