|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
language: |
|
- ja |
|
pipeline_tag: fill-mask |
|
widget: |
|
- text: 彼のダンス、めっちゃ[MASK]!😂 |
|
--- |
|
|
|
# BERT for Japanese Twitter |
|
|
|
This is a [Japanese BERT](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3) that has been adapted to Twitter. |
|
|
|
It began with the base Japanese BERT by Tohoku NLP and continued pretraining on a Twitter corpus.
|
|
|
It is recommended for Japanese SNS (social media) tasks.
|
|
|
|
|
## Training Data |
|
|
|
The Twitter API was used to collect Japanese tweets from June 2022 to April 2023.
|
|
|
N-gram based deduplication was used to reduce spam content and improve the diversity of the training corpus. |
|
The refined training corpus was 28 million tweets. |
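As a rough illustration, n-gram-based deduplication can be sketched with character n-gram Jaccard similarity. This is a minimal sketch only; the exact method, n-gram size, and threshold used for this corpus are not specified in this card and are assumptions here.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams for a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def is_near_duplicate(a, b, n=3, threshold=0.7):
    """True if the Jaccard similarity of the two n-gram sets meets the threshold."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return False
    return len(ga & gb) / len(ga | gb) >= threshold

def deduplicate(tweets, threshold=0.7):
    """Keep each tweet only if it is not a near-duplicate of an already-kept tweet."""
    kept = []
    for tweet in tweets:
        if not any(is_near_duplicate(tweet, k, threshold=threshold) for k in kept):
            kept.append(tweet)
    return kept

tweets = [
    "今日の天気は最高!",
    "今日の天気は最高!!",   # near-duplicate spam variant, filtered out
    "新曲をリリースしました",
]
print(deduplicate(tweets))
```

Filtering near-duplicates like this reduces repetitive spam so the model sees a more diverse sample of Twitter language.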
|
|
|
## Tokenization |
|
|
|
The vocabulary was prepared using the [WordPieceTrainer](https://huggingface.co/docs/tokenizers/api/trainers) with the Twitter training corpus. |
|
It shares 60% of its vocabulary with Japanese BERT. |
|
|
|
The vocabulary includes colloquialisms, neologisms, emoji, and kaomoji expressions that are common on Twitter.
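For illustration, training a WordPiece vocabulary with the `tokenizers` library looks roughly like the following. This is a toy sketch: the real vocabulary was trained on the full 28-million-tweet corpus with a much larger vocabulary size, and details such as the pre-tokenizer are assumptions here.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Toy stand-in corpus; the actual vocabulary was trained on the Twitter corpus.
corpus = [
    "彼のダンス、めっちゃ 上手い !",
    "今日 も いい 天気 (^^)",
    "新曲 リリース しました 🎉",
]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # assumption: the real setup may differ

trainer = WordPieceTrainer(
    vocab_size=200,  # toy size; BERT vocabularies are typically ~32k entries
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.get_vocab_size())
print(tokenizer.encode("めっちゃ いい").tokens)
```

Because the vocabulary is learned from tweets, frequent informal expressions, emoji, and kaomoji end up as whole tokens rather than being split into many fragments or mapped to `[UNK]`.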
|
|
|
### Model Description |
|
|
|
|
|
|
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
|
|
|
- **Developed by:** Jordan Wolfgang Klein, as a Master's candidate at the University of Malta.
|
- **Model type:** BERT |
|
- **Language(s) (NLP):** Japanese |
|
- **License:** apache-2.0
|
- **Finetuned from model:** [tohoku-nlp/bert-base-japanese-v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3)