---
library_name: transformers
license: apache-2.0
language:
- ja
pipeline_tag: fill-mask
widget:
- text: 彼のダンス、めっちゃ[MASK]!😂
---
# BERT for Japanese Twitter
This is a [Japanese BERT](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3) that has been adapted to Twitter.
It was initialized from the base Japanese BERT by Tohoku NLP and then further pretrained on a Twitter corpus.
It is recommended for Japanese SNS tasks.
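
Since the card declares `pipeline_tag: fill-mask`, a masked-token query is the most direct way to try the model. The following is a minimal sketch rather than an official usage snippet: it assumes the `transformers` library is installed, and `model_id` is left as a parameter because this card does not state the Hub repository ID. The helper names `summarize_predictions` and `fill_mask` are hypothetical.

```python
from typing import Dict, List, Tuple


def summarize_predictions(predictions: List[Dict]) -> List[Tuple[str, float]]:
    """Reduce fill-mask pipeline output to (token, score) pairs, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 3)) for p in ranked]


def fill_mask(model_id: str, text: str, top_k: int = 5) -> List[Tuple[str, float]]:
    """Run the fill-mask pipeline on `text`, which should contain one [MASK] token.

    `model_id` is an assumption supplied by the caller: pass this model's
    Hub repository ID, which is not stated in this card.
    """
    # Imported lazily so summarize_predictions works without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    return summarize_predictions(unmasker(text, top_k=top_k))


# Example call, using the widget sentence from the card metadata:
# fill_mask("<this-model-repo-id>", "彼のダンス、めっちゃ[MASK]!😂")
```

The tokenizer may need additional packages depending on how it was saved; if loading fails, the error message from `transformers` should name the missing dependency.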
## Training Data
The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.
N-gram based deduplication was used to reduce spam content and improve the diversity of the training corpus.
The refined training corpus contains 28 million tweets.
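
The card does not describe the deduplication procedure in detail, so the following is only an illustrative sketch of one common form of n-gram based deduplication. It assumes character trigrams (a natural unit for unsegmented Japanese text), Jaccard similarity, and a threshold of 0.7; all three are assumptions, not the settings used for this corpus. The pairwise comparison is quadratic, so a corpus of this size would need something like MinHash in practice.

```python
from typing import Iterable, List, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams in `text` (the whole text if it is short)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two n-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(tweets: Iterable[str], n: int = 3, threshold: float = 0.7) -> List[str]:
    """Keep a tweet only if it is not a near-duplicate of any tweet kept so far.

    `n` and `threshold` are illustrative defaults, not values from the model card.
    """
    kept: List[str] = []
    kept_ngrams: List[Set[str]] = []
    for tweet in tweets:
        grams = char_ngrams(tweet, n)
        if all(jaccard(grams, seen) < threshold for seen in kept_ngrams):
            kept.append(tweet)
            kept_ngrams.append(grams)
    return kept


sample = ["今日はいい天気ですね", "今日はいい天気ですね!", "ラーメン食べたい"]
unique_tweets = deduplicate(sample)
```

In this toy sample the second tweet differs from the first by a single trailing character, so it is dropped as a near-duplicate while the unrelated third tweet is kept.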
## Tokenization
The vocabulary was prepared using the [WordPieceTrainer](https://huggingface.co/docs/tokenizers/api/trainers) with the Twitter training corpus.
It shares 60% of its vocabulary with Japanese BERT.
The vocabulary includes colloquialisms, neologisms, emoji, and kaomoji expressions that are common on Twitter.
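
A figure like the 60% overlap can be checked by comparing token sets. The sketch below assumes the overlap is measured as the fraction of the new vocabulary that also appears in the base vocabulary; the card does not say which denominator was used. The function name `shared_vocab_fraction` and the toy vocabularies are hypothetical.

```python
from typing import Iterable


def shared_vocab_fraction(new_vocab: Iterable[str], base_vocab: Iterable[str]) -> float:
    """Fraction of tokens in `new_vocab` that also occur in `base_vocab`."""
    new_tokens = set(new_vocab)
    if not new_tokens:
        return 0.0
    return len(new_tokens & set(base_vocab)) / len(new_tokens)


# Toy vocabularies for illustration only; real ones hold tens of thousands of tokens.
twitter_vocab = ["[MASK]", "めっちゃ", "草", "😂", "する"]
base_vocab = ["[MASK]", "する", "です", "草"]
overlap = shared_vocab_fraction(twitter_vocab, base_vocab)
```

With real tokenizers loaded through `transformers`, the dictionaries returned by each tokenizer's `get_vocab()` method can be passed in directly, since iterating over a dictionary yields its token strings.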
### Model Description
- **Developed by:** Jordan Wolfgang Klein, as a Master's candidate at the University of Malta.
- **Model type:** BERT
- **Language(s) (NLP):** Japanese
- **License:** apache-2.0
- **Finetuned from model:** [tohoku-nlp/bert-base-japanese-v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3)