---
library_name: transformers
license: apache-2.0
language:
- ja
pipeline_tag: fill-mask
widget:
- text: 彼のダンス、めっちゃ[MASK]!😂
---

# BERT for Japanese Twitter

This is a [Japanese BERT](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3) that has been adapted to Twitter.

It started from the base Japanese BERT released by Tohoku NLP and continued pretraining on a Twitter corpus.

It is recommended for Japanese SNS (social media) tasks.


## Training Data

The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.

N-gram based deduplication was used to reduce spam content and improve the diversity of the training corpus.
The refined training corpus was 28 million tweets.
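The exact deduplication scheme is not documented here, but an n-gram-based filter can be sketched as a character n-gram Jaccard-similarity check — the `n=3` and `threshold=0.7` values below are illustrative assumptions:

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of overlapping character n-grams for one tweet."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def dedupe(tweets, n: int = 3, threshold: float = 0.7):
    """Keep a tweet only if its n-gram Jaccard similarity to every
    already-kept tweet is below the threshold (near-duplicate filter)."""
    kept, kept_grams = [], []
    for tweet in tweets:
        grams = char_ngrams(tweet, n)
        is_dup = any(
            len(grams & g) / (len(grams | g) or 1) >= threshold
            for g in kept_grams
        )
        if not is_dup:
            kept.append(tweet)
            kept_grams.append(grams)
    return kept

# Near-identical spam variants collapse to one entry.
print(dedupe(["abcdefg", "abcdefg!", "zzzzyyy"]))  # → ['abcdefg', 'zzzzyyy']
```

A production pipeline would typically use MinHash/LSH rather than this O(n²) pairwise scan, but the idea is the same.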

## Tokenization

The vocabulary was prepared using the [WordPieceTrainer](https://huggingface.co/docs/tokenizers/api/trainers) on the Twitter training corpus.
It shares 60% of its vocabulary with Japanese BERT.

The vocabulary includes colloquialisms, neologisms, emoji, and kaomoji expressions that are common on Twitter.
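Training such a vocabulary with the `tokenizers` library can be sketched as follows. The vocabulary size, special tokens, and pre-tokenizer here are assumptions for illustration (the base Japanese BERT actually relies on MeCab-based word segmentation, which is omitted for brevity):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Toy stand-in for the 28M-tweet corpus.
corpus = ["彼のダンス、めっちゃウケる!😂", "今日もお疲れさま (´▽`)"]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # simplification; real setup uses MeCab

trainer = WordPieceTrainer(
    vocab_size=32768,  # illustrative, not the model's actual size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)
print(tokenizer.get_vocab_size())
```

Training on tweets directly is what lets emoji and kaomoji sequences surface as vocabulary entries instead of being split into unknown pieces.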

## Model Description


- **Developed by:** Jordan Wolfgang Klein, as a Master's candidate at the University of Malta.
- **Model type:** BERT
- **Language(s) (NLP):** Japanese
- **License:** Apache 2.0
- **Finetuned from model:** [tohoku-nlp/bert-base-japanese-v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3)