|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
language: |
|
- ja |
|
pipeline_tag: fill-mask |
|
widget: |
|
- text: 彼のダンス、めっちゃ[MASK]!😂 |
|
--- |
|
|
|
# BERT for Japanese Twitter |
|
|
|
This is a base BERT model that has been adapted for Japanese Twitter. |
|
|
|
It was adapted from [Japanese BERT](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3) by preparing a specialized vocabulary and continuing pretraining on a Twitter corpus. |
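
As a rough illustration of the second step, continued masked-language-model pretraining can be run with the `transformers` Trainer. This is a minimal sketch with assumed file names and default hyperparameters; note that because the vocabulary was replaced rather than extended, the actual adaptation presumably also remapped the embeddings of wordpieces shared with the parent model.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Assumed paths: the new Twitter WordPiece tokenizer and a one-tweet-per-line corpus.
tokenizer = AutoTokenizer.from_pretrained("./twitter-wordpiece-tokenizer")
model = AutoModelForMaskedLM.from_pretrained("tohoku-nlp/bert-base-japanese-v3")
model.resize_token_embeddings(len(tokenizer))  # align the embedding matrix with the new vocab size

dataset = load_dataset("text", data_files={"train": "tweets.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-japanese-twitter-mlm"),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```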
|
|
|
This model is recommended for Japanese SNS tasks such as [sentiment analysis](https://huggingface.co/datasets/shunk031/wrime) and [defamation detection](https://huggingface.co/datasets/kubota/defamation-japanese-twitter).
|
|
|
|
|
This model has served as the base for a series of fine-tuned models. The main ones are [BERT for Japanese Twitter Sentiment](https://huggingface.co/LoneWolfgang/bert-for-japanese-twitter-sentiment) and [BERT for Japanese Twitter Emotion](https://huggingface.co/LoneWolfgang/bert-for-japanese-twitter-emotion).
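
A minimal fill-mask example, assuming this repository's model ID is `LoneWolfgang/bert-for-japanese-twitter`; if the tokenizer uses MeCab word segmentation like its parent model, `fugashi` and `unidic-lite` must also be installed:

```python
from transformers import pipeline

# Model ID assumed for illustration; substitute this repository's actual ID.
fill_mask = pipeline("fill-mask", model="LoneWolfgang/bert-for-japanese-twitter")

# The widget example above: "His dancing is so [MASK]! 😂"
for prediction in fill_mask("彼のダンス、めっちゃ[MASK]!😂"):
    print(prediction["token_str"], round(prediction["score"], 3))
```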
|
## Training Data |
|
|
|
The Twitter API was used to collect Japanese tweets from June 2022 to April 2023. |
|
|
|
N-gram-based deduplication was used to reduce spam content and improve the diversity of the training corpus.

The refined training corpus contained 28 million tweets.
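
The card does not specify the exact procedure; the sketch below illustrates one common approach, greedy filtering on character n-gram Jaccard similarity, with hypothetical names and thresholds throughout. A production pipeline over tens of millions of tweets would use an approximate method such as MinHash/LSH rather than pairwise comparison.

```python
def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Character n-grams; practical for Japanese, which has no whitespace word boundaries."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def deduplicate(tweets: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a tweet only if its Jaccard similarity to every kept tweet stays below the threshold."""
    kept: list[tuple[str, set[str]]] = []
    for tweet in tweets:
        grams = char_ngrams(tweet)
        if all(
            len(grams & seen) / max(len(grams | seen), 1) < threshold
            for _, seen in kept
        ):
            kept.append((tweet, grams))
    return [tweet for tweet, _ in kept]
```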
|
|
|
## Tokenization |
|
|
|
The vocabulary was prepared using the [WordPieceTrainer](https://huggingface.co/docs/tokenizers/api/trainers) with the Twitter training corpus. |
|
It shares 60% of its vocabulary with Japanese BERT. |
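
A sketch of how such a vocabulary can be built with the `tokenizers` library and how the overlap figure can be reproduced. The vocabulary size and file name are assumptions, and the corpus is presumed to be pre-segmented into words (e.g. with MeCab), as in the parent Japanese BERT:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoTokenizer

# Train a WordPiece vocabulary on the tweet corpus (one tweet per line).
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=32768,  # assumed; the card does not state the actual size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["tweets.txt"], trainer=trainer)

# Compare against the parent vocabulary (requires fugashi and unidic-lite).
base = AutoTokenizer.from_pretrained("tohoku-nlp/bert-base-japanese-v3")
new_vocab, base_vocab = set(tokenizer.get_vocab()), set(base.get_vocab())
print(f"shared: {len(new_vocab & base_vocab) / len(new_vocab):.0%}")  # ~60% per this card
```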
|
|
|
The vocabulary includes colloquialisms, neologisms, emoji, and kaomoji expressions that are common on Twitter.
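
In practice, this means Twitter-specific symbols should map to known wordpieces rather than `[UNK]`. The model ID and the exact tokenization below are assumptions for illustration:

```python
from transformers import AutoTokenizer

# Model ID assumed for illustration; substitute this repository's actual ID.
tokenizer = AutoTokenizer.from_pretrained("LoneWolfgang/bert-for-japanese-twitter")

# A tweet mixing a neologism ("推し活", "fan activities"), a kaomoji, and emoji;
# none of these should degrade to [UNK].
print(tokenizer.tokenize("今日も推し活がんばる(≧▽≦)🎬✨"))
```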