AdminBERT-16GB: A French Language Model Adapted to Administrative Documents

AdminBERT-16GB is a French language model adapted to the administrative domain on a large corpus of 50 million French administrative texts. It is derived from the CamemBERT model, which is based on the RoBERTa architecture. AdminBERT-16GB was trained with the Whole Word Masking (WWM) objective at a 30% masking rate for 3 epochs on 24 A100 GPUs. The dataset used for training is Adminset.
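Since AdminBERT-16GB is a CamemBERT (RoBERTa) derivative, it can be used as a standard masked language model. Below is a minimal usage sketch, assuming the checkpoint is available on the Hugging Face Hub under the repo ID `taln-ls2n/AdminBERT-16GB` (the ID given at the bottom of this card) and that the `transformers` library is installed; the example sentence is illustrative.

```python
from transformers import pipeline

# Minimal fill-mask sketch (assumption: the checkpoint is published on the
# Hugging Face Hub under the repo ID "taln-ls2n/AdminBERT-16GB").
fill_mask = pipeline("fill-mask", model="taln-ls2n/AdminBERT-16GB")

# CamemBERT/RoBERTa-style models use "<mask>" as their mask token.
predictions = fill_mask("La délibération du conseil <mask> est adoptée.")
for p in predictions:
    print(f"{p['token_str']!r}  {p['score']:.3f}")
```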

Evaluation

Since, to date, no evaluation corpus composed of French administrative documents was available, we decided to create our own for the NER (Named Entity Recognition) task.

Model Performance

| Model               | P (%) | R (%) | F1 (%) |
|---------------------|-------|-------|--------|
| Wikineural-NER FT   | 77.49 | 75.40 | 75.70  |
| NERmemBERT-Large FT | 77.43 | 78.38 | 77.13  |
| CamemBERT FT        | 77.62 | 79.59 | 77.26  |
| NERmemBERT-Base FT  | 77.99 | 79.59 | 78.34  |
| AdminBERT-NER 4G    | 78.47 | 80.35 | 79.26  |
| AdminBERT-NER 16GB  | 78.79 | 82.07 | 80.11  |

To evaluate each model, we performed five runs on the Adminset-NER test set and averaged the results.
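The "FT" rows above denote checkpoints fine-tuned on Adminset-NER. As a hedged illustration of how such a fine-tuning run could be set up (not the authors' exact recipe), the sketch below attaches a token-classification head to the base checkpoint; the BIO label set is a placeholder, since this card does not list the Adminset-NER tag inventory.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative label set; the actual Adminset-NER tag inventory is not
# specified in this card, so these BIO tags are placeholders.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("taln-ls2n/AdminBERT-16GB")
model = AutoModelForTokenClassification.from_pretrained(
    "taln-ls2n/AdminBERT-16GB",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, fine-tuning on Adminset-NER would proceed with the usual
# token-classification recipe (e.g. the transformers Trainer API).
```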

Model size: 111M parameters (Safetensors, F32 tensors).

Dataset used to train taln-ls2n/AdminBERT-16GB: Adminset.