---
library_name: transformers
license: apache-2.0
language:
- ja
pipeline_tag: fill-mask
widget:
- text: 彼のダンス、めっちゃ[MASK]!😂
---

# BERT for Japanese Twitter

This is a base BERT model that has been adapted for Japanese Twitter.

It was adapted from [Japanese BERT](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3) by preparing a specialized vocabulary and continuing pretraining on a Twitter corpus.

This model is recommended for Japanese SNS tasks, such as [sentiment analysis](https://huggingface.co/datasets/shunk031/wrime) and [defamation detection](https://huggingface.co/datasets/kubota/defamation-japanese-twitter).


This model has been used to finetune a series of models. The main ones are [BERT for Japanese Twitter Sentiment](https://huggingface.co/LoneWolfgang/bert-for-japanese-twitter-sentiment) and [BERT for Japanese Twitter Emotion](https://huggingface.co/LoneWolfgang/bert-for-japanese-twitter-emotion).
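
Since this is a `fill-mask` checkpoint, it can be tried directly with the `transformers` pipeline. A minimal sketch, assuming the repository ID `LoneWolfgang/bert-for-japanese-twitter` (inferred from the finetuned checkpoints linked above, not stated on this card) and that any MeCab dependencies the tokenizer may inherit from the parent model (`fugashi`, `unidic-lite`) are installed:

```python
from transformers import pipeline

# Repo ID inferred from the finetuned checkpoints above; treat as an assumption.
fill_mask = pipeline("fill-mask", model="LoneWolfgang/bert-for-japanese-twitter")

# The widget example from this card: "His dancing is so [MASK]! 😂"
for prediction in fill_mask("彼のダンス、めっちゃ[MASK]!😂"):
    print(prediction["token_str"], f"{prediction['score']:.3f}")
```
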
## Training Data

The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.

N-gram-based deduplication was used to reduce spam content and improve the diversity of the training corpus.
After deduplication, the refined corpus contained 28 million tweets.
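
The card does not specify the exact deduplication procedure. The sketch below shows one plausible reading: near-duplicate filtering by character n-gram Jaccard similarity. The n-gram size and threshold are assumptions, and a 28-million-tweet corpus would need an approximate method such as MinHash LSH rather than the pairwise loop shown here.

```python
def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Character n-grams; suited to Japanese, which has no whitespace word boundaries."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def deduplicate(tweets: list[str], threshold: float = 0.5) -> list[str]:
    """Drop a tweet when its n-gram Jaccard overlap with an already-kept tweet
    meets the threshold. O(n^2), so illustrative only."""
    kept: list[tuple[str, set[str]]] = []
    for tweet in tweets:
        grams = char_ngrams(tweet)
        is_duplicate = any(
            len(grams & seen) / len(grams | seen) >= threshold
            for _, seen in kept
            if grams | seen
        )
        if not is_duplicate:
            kept.append((tweet, grams))
    return [tweet for tweet, _ in kept]
```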

## Tokenization

The vocabulary was prepared by running the [WordPieceTrainer](https://huggingface.co/docs/tokenizers/api/trainers) on the Twitter training corpus.
It shares 60% of its entries with the Japanese BERT vocabulary.
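
A minimal sketch of how such a vocabulary can be built with the `tokenizers` library; the corpus file name, vocabulary size, and normalization settings below are placeholders, not the card's actual configuration.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=False)  # keep Japanese text intact
tokenizer.pre_tokenizer = BertPreTokenizer()

trainer = WordPieceTrainer(
    vocab_size=32768,  # placeholder; matches bert-base-japanese-v3, not confirmed here
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["tweets.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("tokenizer.json")
```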

The vocabulary includes colloquialisms, neologisms, emoji and kaomoji expressions that are common on Twitter.
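
The 60% overlap figure above can be checked by intersecting the two vocabularies. This sketch relies on the same repo-ID assumption as before, and loading the parent tokenizer additionally requires `fugashi` and `unidic-lite`.

```python
from transformers import AutoTokenizer

# Repo ID assumed; see the usage sketch above.
twitter = set(AutoTokenizer.from_pretrained("LoneWolfgang/bert-for-japanese-twitter").get_vocab())
base = set(AutoTokenizer.from_pretrained("tohoku-nlp/bert-base-japanese-v3").get_vocab())

print(f"shared fraction: {len(twitter & base) / len(twitter):.0%}")
```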