---
library_name: transformers
license: apache-2.0
language:
- ja
pipeline_tag: fill-mask
widget:
- text: 彼のダンス、めっちゃ[MASK]!😂
---

# BERT for Japanese Twitter

This is a [Japanese BERT](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3) that has been adapted to Twitter.

It started from the base Japanese BERT released by Tohoku NLP and continued pretraining on a Twitter corpus.

It is recommended for Japanese SNS (social media) tasks.


## Training Data

The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.

N-gram based deduplication was used to reduce spam content and improve the diversity of the training corpus.
The refined training corpus was 28 million tweets.
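The exact deduplication scheme is not documented here, but an n-gram-based filter can be sketched as a character n-gram Jaccard-similarity check — the `n=3` and `threshold=0.7` values below are illustrative assumptions:

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of overlapping character n-grams for one tweet."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def dedupe(tweets, n: int = 3, threshold: float = 0.7):
    """Keep a tweet only if its n-gram Jaccard similarity to every
    already-kept tweet is below the threshold (near-duplicate filter)."""
    kept, kept_grams = [], []
    for tweet in tweets:
        grams = char_ngrams(tweet, n)
        is_dup = any(
            len(grams & g) / (len(grams | g) or 1) >= threshold
            for g in kept_grams
        )
        if not is_dup:
            kept.append(tweet)
            kept_grams.append(grams)
    return kept

# Near-identical spam variants collapse to one entry.
print(dedupe(["abcdefg", "abcdefg!", "zzzzyyy"]))  # → ['abcdefg', 'zzzzyyy']
```

A production pipeline would typically use MinHash/LSH rather than this O(n²) pairwise scan, but the idea is the same.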

## Tokenization

The vocabulary was prepared using the [WordPieceTrainer](https://huggingface.co/docs/tokenizers/api/trainers) on the Twitter training corpus.
It shares 60% of its vocabulary with Japanese BERT.

The vocabulary includes colloquialisms, neologisms, emoji, and kaomoji expressions that are common on Twitter.
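Training such a vocabulary with the `tokenizers` library can be sketched as follows. The vocabulary size, special tokens, and pre-tokenizer here are assumptions for illustration (the base Japanese BERT actually relies on MeCab-based word segmentation, which is omitted for brevity):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Toy stand-in for the 28M-tweet corpus.
corpus = ["彼のダンス、めっちゃウケる!😂", "今日もお疲れさま (´▽`)"]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # simplification; real setup uses MeCab

trainer = WordPieceTrainer(
    vocab_size=32768,  # illustrative, not the model's actual size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)
print(tokenizer.get_vocab_size())
```

Training on tweets directly is what lets emoji and kaomoji sequences surface as vocabulary entries instead of being split into unknown pieces.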

## Model Description


- **Developed by:** Jordan Wolfgang Klein, as a Master's candidate at the University of Malta.
- **Model type:** BERT
- **Language(s) (NLP):** Japanese
- **License:** Apache 2.0
- **Finetuned from model:** [tohoku-nlp/bert-base-japanese-v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3)