AdminBERT-4GB: A Small French Language Model Adapted to Administrative Documents

AdminBERT-4GB is a French language model adapted on a large corpus of 10 million French administrative texts. It is a derivative of the CamemBERT model, which is based on the RoBERTa architecture. AdminBERT-4GB was trained with the Whole Word Masking (WWM) objective at a 30% masking rate for 2 epochs on 8 V100 GPUs. The dataset used for training is a sample of Adminset.
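To illustrate the WWM objective, here is a minimal sketch of whole-word masking over subword tokens. It assumes tokens are grouped by word indices (as a tokenizer's word IDs would provide) and uses a placeholder `[MASK]` token; this is an illustration of the idea, not the model's actual training code.

```python
import random

def whole_word_mask(tokens, word_ids, mask_rate=0.3, mask_token="[MASK]", seed=0):
    """Mask ALL sub-tokens of randomly selected words (Whole Word Masking).

    tokens:   subword tokens, e.g. ["la", "pre", "##fecture", "de", "Nantes"]
    word_ids: word index of each subword, e.g. [0, 1, 1, 2, 3]
    """
    rng = random.Random(seed)
    words = sorted(set(word_ids))
    # Select ~mask_rate of the WORDS (not subwords), at least one.
    n_mask = max(1, round(len(words) * mask_rate))
    masked_words = set(rng.sample(words, n_mask))
    # Every subword of a selected word is masked together.
    return [mask_token if w in masked_words else t
            for t, w in zip(tokens, word_ids)]

tokens = ["la", "pre", "##fecture", "de", "Nantes"]
word_ids = [0, 1, 1, 2, 3]
masked = whole_word_mask(tokens, word_ids, mask_rate=0.3)
```

Unlike plain token masking, a split word such as "pre" + "##fecture" is always masked as a unit, which makes the prediction task harder and tends to improve downstream quality.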

Evaluation

Since, at the time, no evaluation corpus composed of French administrative documents was available, we decided to create our own for the NER (Named Entity Recognition) task.

Model Performance

Model                  P (%)   R (%)   F1 (%)
Wikineural-NER FT      77.49   75.40   75.70
NERmemBERT-Large FT    77.43   78.38   77.13
CamemBERT FT           77.62   79.59   77.26
NERmemBERT-Base FT     77.99   79.59   78.34
AdminBERT-NER 4GB      78.47   80.35   79.26
AdminBERT-NER 16GB     78.79   82.07   80.11

To evaluate each model, we performed five runs on the test set of Adminset-NER and averaged the results.
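The metrics above can be computed with span-level exact matching, where a predicted entity counts as correct only if its boundaries and label both match the gold annotation; scores from the five runs are then averaged. This is a minimal sketch of that scoring scheme, not the exact evaluation script used for these results.

```python
def prf(gold, pred):
    """Span-level precision/recall/F1 for NER.

    gold, pred: iterables of (start, end, label) entity spans;
    a prediction is a true positive only on an exact match.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_runs(runs):
    """Average (P, R, F1) tuples over several evaluation runs."""
    n = len(runs)
    return tuple(sum(run[i] for run in runs) / n for i in range(3))

gold = [(0, 2, "LOC"), (5, 6, "ORG")]
pred = [(0, 2, "LOC"), (3, 4, "PER")]
scores = prf(gold, pred)  # one true positive out of two predictions
```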

