metadata
library_name: transformers
license: apache-2.0
language:
  - ja
pipeline_tag: fill-mask
widget:
  - text: 彼のダンス、めっちゃ[MASK]!😂

BERT for Japanese Twitter

This is a Japanese BERT that has been adapted to Twitter.

It starts from the base Japanese BERT by Tohoku NLP, with pretraining continued on a Twitter corpus.

It is recommended for Japanese SNS tasks.
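As a minimal usage sketch, the model can be queried through the standard 🤗 transformers `fill-mask` pipeline, which matches the `pipeline_tag` in the metadata. The repository ID is not stated in this card, so `MODEL_ID` below is a hypothetical placeholder that must be replaced with the model's actual Hub ID. The helper names `load_fill_mask` and `top_predictions` are also illustrative, not part of any library.

```python
from typing import Callable, List, Tuple

# Hypothetical placeholder: replace with this model's actual Hub repository ID.
MODEL_ID = "<hub-user>/<model-name>"


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline for the given Hub repository ID."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask: Callable, text: str, k: int = 5) -> List[Tuple[str, float]]:
    """Return the k most likely fillers for a single [MASK] as (token, score) pairs."""
    # For one input string with one mask, the pipeline returns a list of dicts
    # that include the keys "token_str" and "score".
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], float(r["score"])) for r in results]


# Example call (downloads the model, so it is left commented out):
# fill_mask = load_fill_mask("<hub-user>/<model-name>")
# print(top_predictions(fill_mask, "彼のダンス、めっちゃ[MASK]!😂"))
```

The example sentence is the widget text from the metadata. Depending on how the tokenizer is configured, additional Japanese tokenization packages may be required; this card does not say which.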

Training Data

The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.

N-gram based deduplication was used to reduce spam content and improve the diversity of the training corpus. The refined training corpus contained 28 million tweets.
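The card does not describe the exact deduplication procedure. As an illustration only, the sketch below shows one common form of n-gram based deduplication: each tweet is reduced to its set of character n-grams, and a tweet is dropped when its Jaccard similarity to an already kept tweet reaches a threshold. The n-gram size (5), the threshold (0.8), and all function names are assumptions, not the settings used for this model.

```python
from typing import Iterable, List, Set


def char_ngrams(text: str, n: int = 5) -> Set[str]:
    """Return the set of character n-grams in `text`, ignoring whitespace."""
    compact = "".join(text.split())
    if len(compact) <= n:
        # Short texts are treated as a single n-gram (or none if empty).
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(tweets: Iterable[str], n: int = 5, threshold: float = 0.8) -> List[str]:
    """Keep the first tweet of each group of near-duplicates, preserving order."""
    kept: List[str] = []
    kept_grams: List[Set[str]] = []
    for tweet in tweets:
        grams = char_ngrams(tweet, n)
        # Skip the tweet if it is too similar to anything already kept.
        if any(jaccard(grams, seen) >= threshold for seen in kept_grams):
            continue
        kept.append(tweet)
        kept_grams.append(grams)
    return kept
```

This pairwise comparison is quadratic in the number of tweets, so it only suits small samples. At the scale of tens of millions of tweets, an approximate method such as MinHash with locality-sensitive hashing would be the usual substitute.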

Tokenization

The vocabulary was prepared using the WordPieceTrainer with the Twitter training corpus. It shares 60% of its vocabulary with Japanese BERT.
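A shared-vocabulary figure like the 60% above can be checked by comparing two BERT-style vocabulary files, which list one token per line. The sketch below assumes the percentage is measured against the new (Twitter) vocabulary, which the card does not state explicitly. The function names and file paths are illustrative.

```python
from typing import Iterable, Set


def load_vocab(lines: Iterable[str]) -> Set[str]:
    """Read a BERT-style vocabulary (one token per line) into a set."""
    return {line.rstrip("\n") for line in lines if line.strip()}


def shared_fraction(new_vocab: Set[str], base_vocab: Set[str]) -> float:
    """Fraction of tokens in `new_vocab` that also appear in `base_vocab`."""
    if not new_vocab:
        return 0.0
    return len(new_vocab & base_vocab) / len(new_vocab)


# Example with files (paths are hypothetical):
# with open("twitter_vocab.txt", encoding="utf-8") as f:
#     twitter_vocab = load_vocab(f)
# with open("base_vocab.txt", encoding="utf-8") as f:
#     base_vocab = load_vocab(f)
# print(f"{shared_fraction(twitter_vocab, base_vocab):.0%}")
```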

The vocabulary includes colloquialisms, neologisms, emoji, and kaomoji expressions that are common on Twitter.

Model Description


  • Developed by: Jordan Wolfgang Klein, a Master's candidate at the University of Malta.
  • Model type: BERT
  • Language(s) (NLP): Japanese
  • License: apache-2.0
  • Finetuned from model: base Japanese BERT by Tohoku NLP