---
license: apache-2.0
---

# Model Card for MBart50 Legal Form Remover

This is the model card for https://huggingface.co/ycffm/MBart50-legalform-remover

## Model Details

### Model Description

- **Developed by:** Cheng. Y
- **Language(s) (NLP):** Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI). For the language codes, please refer to the mBART-50 introduction.
- **Finetuned from model:** facebook/mbart-large-50

  Tang, Y., Tran, C., Li, X., Chen, P., Goyal, N., Chaudhary, V., Gu, J., & Fan, A. (2020). Multilingual Translation with Extensible Multilingual Pretraining and Finetuning. https://arxiv.org/abs/2008.00401

## Uses

The MBart50 Legal Form Remover is a multilingual sequence-to-sequence model fine-tuned from Facebook's mBART-50 for legal form simplification in company names. The model processes company names in multiple languages, identifies and removes legal entity descriptors (e.g., "Inc.", "LLC", "GmbH", "株式会社"), and outputs a simplified English version of the name while preserving its core elements.

The model is intended to strip the legal form from an organization name and transform the company name into a simplified English representation. It is particularly suited to tasks such as standardizing company names across multilingual datasets, simplifying legal documents, and preprocessing data for downstream tasks such as entity resolution, search, and text analysis.

### Direct Use

The model can be used directly to standardize company names by removing legal forms and converting them to normalized English names. Example applications include:

- Legal document simplification
- Business name standardization in multilingual contexts
- Data cleaning for entity resolution

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the fine-tuned model and tokenizer
model = MBartForConditionalGeneration.from_pretrained("ycffm/MBart50-legalform-remover")
tokenizer = MBart50TokenizerFast.from_pretrained("ycffm/MBart50-legalform-remover")

# Example usage. The first run may take several minutes while the weights
# download; alternatively, download the files and run the model locally.
test_cases = [
    {"input_text": "Volkswagen Aktiengesellschaft", "source_lang": "de_DE"},  # German
    {"input_text": "株式会社日本カストディ銀行", "source_lang": "ja_XX"},  # Japanese
    {"input_text": "مركز الشدوحي", "source_lang": "ar_AR"},  # Arabic
    {"input_text": "L'Oréal Société Anonyme", "source_lang": "fr_XX"},  # French
    {"input_text": "삼성전자 주식회사", "source_lang": "ko_KR"},  # Korean
    {"input_text": "Koninklijke Philips N.V.", "source_lang": "nl_XX"},  # Dutch
    {"input_text": "浙江海氏实业集团有限公司", "source_lang": "zh_CN"},  # Simplified Chinese
    {"input_text": "현대자동차주식회사", "source_lang": "ko_KR"},  # Korean
]

for case in test_cases:
    input_text = case["input_text"]

    # Tell the tokenizer which language the input is in
    tokenizer.src_lang = case["source_lang"]

    # Tokenize the input name
    inputs = tokenizer(input_text, return_tensors="pt", max_length=64, truncation=True)

    # Generate the simplified English name
    outputs = model.generate(inputs["input_ids"], max_length=64, num_beams=4, early_stopping=True)

    # Decode and print the result
    print(f"Input: {input_text}")
    print(f"Predicted Simplified Name: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
    print("-" * 50)
```
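If the model ever echoes a name back in the source language, a variant of the `generate` call that pins the output language may help. This is a hedged sketch, not part of the author's example: mBART-50 checkpoints normally fix the target language via `forced_bos_token_id`, and `en_XX` is assumed here to match the training setup described below.

```python
# Hedged variant of the generate call above: mBART-50 models usually pin
# the output language with forced_bos_token_id. en_XX is an assumption
# based on the training setup (the target language is always en_XX).
outputs = model.generate(
    inputs["input_ids"],
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
    max_length=64,
    num_beams=4,
    early_stopping=True,
)
```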
### Downstream Use

The model has potential for fine-tuning or adaptation to other name normalization tasks across industries.

### Out-of-Scope Use

This model is not intended for generating synthetic company names or for translating entire sentences. Misuse for generating biased or offensive content is out of scope.

## Bias, Risks, and Limitations

The model is mainly trained on EU languages, Arabic, Chinese (Simplified/Traditional), Korean, Japanese, and Russian. Performance on Arabic, however, may be compromised, given the base model's original ability in that language and the limited sample size for it.

### Recommendations

The author is actively searching for collaboration partners to improve performance on Arabic and the Nordic European languages.

## How to Get Started with the Model

You can use the code displayed above, or download the files from the repository and run the model locally.

## Training Details

### Training Data

| Dataset | Entries | Description |
| --- | --- | --- |
| real_companies_1 | 26,764 | Real company names in EU languages, Russian, and various other languages, usually in Latin script |
| real_companies_2 | 24,790 | Real company names in EU languages, Russian, and various other languages, usually in Latin script |
| real_companies_arabic | 2,317 | Real company names in Arabic |
| real_companies_ea | 20,328 | Real company names in Chinese, Korean, and Japanese |
| synthetic_companies_eu | 20,000 | Synthetic company names in EU languages |

The entire dataset was split 80/10/10 into training, validation, and test sets.

A typical data entry is a triple of (original company name with legal form in the source language, English company name without legal form, source language code), for example:

`重庆海沛汽车保险代理有限公司, Chongqing Haipai Automobile Insurance Agency, zh_CN`

The target language is always `en_XX`. A tokenization sketch for such an entry is shown below.
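As a concrete illustration, the following is a minimal sketch of how one such entry could be tokenized for seq2seq fine-tuning. This is an assumption-laden sketch, not the author's training script: the base checkpoint, `max_length`, and the use of `text_target` are all assumed.

```python
from transformers import MBart50TokenizerFast

# Minimal sketch, assuming the standard mBART-50 seq2seq recipe; this is
# NOT the author's training pipeline. Base checkpoint and max_length are
# assumptions for illustration.
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
tokenizer.src_lang = "zh_CN"  # source language code of this entry
tokenizer.tgt_lang = "en_XX"  # the target language is always en_XX

batch = tokenizer(
    "重庆海沛汽车保险代理有限公司",  # name with legal form, in the source language
    text_target="Chongqing Haipai Automobile Insurance Agency",  # simplified English name
    max_length=64,
    truncation=True,
    return_tensors="pt",
)
# batch["input_ids"] and batch["labels"] can then be fed to
# MBartForConditionalGeneration for cross-entropy training.
```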
### Training Procedure

- Phase 1: Training on the synthetic data to capture the general legal form removal process.
- Phase 2: Training on the real-world data to capture the multilingual legal forms and their removal patterns.

#### Training Hyperparameters

- **Training regime:** Mixed precision (fp16) for speed and efficiency
- **Batch size:** 8
- **Learning rate:** 5e-5
- **Epochs:** 2 per phase

#### Metrics

Cross-entropy as the loss; perplexity.

### Results

- Validation loss: 0.044 (Phase 2)
- Perplexity: 1.045 (Phase 2), i.e., exp(0.044), since perplexity is the exponential of the cross-entropy loss

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA T4 (15 GB VRAM)
- **Hours used:** 4 hours for Phase 1, 3 hours for Phase 2
- **Cloud Provider:** Google Colab

## Citation

This model is fine-tuned from mBART-50; see Tang et al. (2020) above for the base model.

**BibTeX:**

```bibtex
@misc{mbart50-legalform-remover,
  title  = {MBart50 Legal Form Remover},
  author = {Cheng, Y.},
  year   = {2024},
  url    = {https://huggingface.co/ycffm/MBart50-legalform-remover},
  note   = {Fine-tuned for multilingual legal form simplification tasks.}
}
```

**APA:**

Cheng, Y. (2024). *MBart50 Legal Form Remover*. Retrieved from https://huggingface.co/ycffm/MBart50-legalform-remover