Model Card for MBart50-legalform-remover

This is the model card for https://huggingface.co/ycffm/MBart50-legalform-remover

Model Details

Model Description

  • Developed by: Cheng Y.
  • Language(s) (NLP): see the mBART-50 documentation for the language codes. Supported languages: Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)
  • Finetuned from model: facebook/mbart-large-50 (Tang, Y., Tran, C., Li, X., Chen, P., Goyal, N., Chaudhary, V., Gu, J., & Fan, A. (2020). Multilingual Translation with Extensible Multilingual Pretraining and Finetuning. https://arxiv.org/abs/2008.00401)

Uses

The MBart50 Legal Form Remover is a multilingual sequence-to-sequence model fine-tuned from Facebook’s mBART-50 for legal form simplification in company names. It processes company names in multiple languages, identifies and removes legal entity descriptors (e.g., “Inc.”, “LLC”, “GmbH”, “株式会社”), and outputs a simplified English version of the name while preserving its core elements. The model is particularly suited to standardizing company names across multilingual datasets, simplifying legal documents, and preprocessing data for downstream tasks such as entity resolution, search, and text analysis.

Direct Use

The model is directly used to standardize company names by removing legal forms and converting them to normalized English names. Example applications include:

  • Legal document simplification
  • Business name standardization in multilingual contexts
  • Data cleaning for entity resolution

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the fine-tuned model and tokenizer
model = MBartForConditionalGeneration.from_pretrained("ycffm/MBart50-legalform-remover")
tokenizer = MBart50TokenizerFast.from_pretrained("ycffm/MBart50-legalform-remover")

# Example usage: the first run may take several minutes while the model downloads;
# you can also download the files into your own repo and run from there.
test_cases = [
    {"input_text": "Volkswagen Aktiengesellschaft", "source_lang": "de_DE"},  # German
    {"input_text": "株式会社日本カストディ銀行", "source_lang": "ja_XX"},         # Japanese
    {"input_text": "مركز الشدوحي", "source_lang": "ar_AR"},                   # Arabic
    {"input_text": "L'Oréal Société Anonyme", "source_lang": "fr_XX"},       # French
    {"input_text": "삼성전자 주식회사", "source_lang": "ko_KR"},                 # Korean
    {"input_text": "Koninklijke Philips N.V.", "source_lang": "nl_XX"},       # Dutch
    {"input_text": "浙江海氏实业集团有限公司", "source_lang": "zh_CN"},       # Simplified Chinese
    {"input_text": "현대자동차주식회사", "source_lang": "ko_KR"}       # Korean

]

for case in test_cases:
    input_text = case["input_text"]
    source_lang = case["source_lang"]

    # Set source language
    tokenizer.src_lang = source_lang

    # Tokenize input
    inputs = tokenizer(input_text, return_tensors="pt", max_length=64, truncation=True)

    # Generate output; en_XX is forced as the first generated token per the
    # mBART-50 convention, since the model always decodes to English (the
    # checkpoint's generation config may already set this)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
        max_length=64,
        num_beams=4,
        early_stopping=True,
    )

    # Decode and print result
    print(f"Input: {input_text}")
    print(f"Predicted Simplified Name: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
    print("-" * 50)

Downstream Use

Potential for fine-tuning or adaptation to other name normalization tasks across industries.
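As a minimal sketch of such an adaptation (assuming the same input format as the original training data; the two example rows, the output directory name, and the dataset here are illustrative, not the author's, while the hyperparameters mirror the values listed under Training Hyperparameters):

from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = MBartForConditionalGeneration.from_pretrained("ycffm/MBart50-legalform-remover")
tokenizer = MBart50TokenizerFast.from_pretrained("ycffm/MBart50-legalform-remover")

# Hypothetical rows for a related name-normalization task
raw = [
    {"source": "Siemens Aktiengesellschaft", "target": "Siemens", "src_lang": "de_DE"},
    {"source": "Banco Santander, S.A.", "target": "Banco Santander", "src_lang": "es_XX"},
]

def preprocess(row):
    tokenizer.src_lang = row["src_lang"]   # source language code
    tokenizer.tgt_lang = "en_XX"           # the target is always English
    return tokenizer(row["source"], text_target=row["target"],
                     max_length=64, truncation=True)

train_ds = Dataset.from_list(raw).map(
    preprocess, remove_columns=["source", "target", "src_lang"]
)

args = Seq2SeqTrainingArguments(
    output_dir="legalform-adapted",   # illustrative path
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    num_train_epochs=2,
    fp16=True,                        # mixed precision; requires a GPU
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()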

Out-of-Scope Use

This model is not intended for generating synthetic company names or translating entire sentences. Misuse for generating biased or offensive content is out of scope.

Bias, Risks, and Limitations

The model is trained mainly on EU languages, Arabic, Chinese (simplified and classical), Korean, Japanese, and Russian. Performance on Arabic, however, may be weaker, given the base model's limited coverage of the language and the small Arabic sample size.

Recommendations

The author is actively looking for collaboration partners to improve performance on Arabic and the Nordic European languages.

How to Get Started with the Model

You can use the code shown above, or download the files from the repository and run the model locally.
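One way to run it locally (a sketch using huggingface_hub; snapshot_download caches the repository files and returns the local path):

from huggingface_hub import snapshot_download
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Download (or reuse a cached copy of) the repository, then load from disk
local_dir = snapshot_download(repo_id="ycffm/MBart50-legalform-remover")
model = MBartForConditionalGeneration.from_pretrained(local_dir)
tokenizer = MBart50TokenizerFast.from_pretrained(local_dir)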

Training Details

Training Data

  • real_companies_1 (26,764 entries): real company names in EU languages, Russian, and various other languages, usually in Latin script
  • real_companies_2 (24,790 entries): real company names in EU languages, Russian, and various other languages, usually in Latin script
  • real_companies_arabic (2,317 entries): real company names in Arabic
  • real_companies_ea (20,328 entries): real company names in Chinese, Korean, and Japanese
  • synthetic_companies_eu (20,000 entries): synthetic company names in EU languages

The entire dataset was split 8-1-1 into training, validation, and test sets. A typical data entry is

重庆海沛汽车保险代理有限公司 , Chongqing Haipai Automobile Insurance Agency , zh_CN

that is: the original company name with legal form in the source language, the English company name without the legal form, and the source language code. The target language is always en_XX.
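The exact preprocessing pipeline is not published, but with the mBART-50 tokenizer such an entry would typically be turned into model inputs like this (a sketch):

from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("ycffm/MBart50-legalform-remover")

# One training entry: source name with legal form, English target, language code
source = "重庆海沛汽车保险代理有限公司"
target = "Chongqing Haipai Automobile Insurance Agency"

tokenizer.src_lang = "zh_CN"   # language code of the original name
tokenizer.tgt_lang = "en_XX"   # the target language is always en_XX

# input_ids encode the source name, labels encode the English target
batch = tokenizer(source, text_target=target, max_length=64, truncation=True)
print(batch["input_ids"], batch["labels"])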

Training Procedure

  • Phase 1: training on the synthetic data to capture the general legal form removal process.
  • Phase 2: training on the real-world data to capture multilingual legal forms and their removal patterns.
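This is not the author's published code, but with a Seq2SeqTrainer set up as in the Downstream Use sketch above, and two hypothetical preprocessed datasets synthetic_ds and real_ds, the schedule could look like:

# Phase 1 on synthetic data, then Phase 2 on real-world data (2 epochs each,
# per the hyperparameters below); synthetic_ds and real_ds are hypothetical
for phase_ds in (synthetic_ds, real_ds):
    trainer.train_dataset = phase_ds
    trainer.train()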

Training Hyperparameters

  • Training regime: mixed precision (fp16) for speed and efficiency
  • Batch size: 8
  • Learning rate: 5e-5
  • Epochs: 2 per phase

Metrics

Cross-entropy as the loss, and perplexity (perplexity = exp(cross-entropy), so the Phase 2 results below are consistent: exp(0.044) ≈ 1.045).

Results

  • Validation Loss: 0.044 (Phase 2)
  • Perplexity: 1.045 (Phase 2)

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA T4 (15 GB VRAM)
  • Hours used: 4 hours for phase 1, 3 hours for phase 2
  • Cloud Provider: Google Colab

Citation

This model is fine-tuned from mBART-50 (facebook/mbart-large-50); please also cite the mBART-50 paper referenced under Model Description.

BibTeX:

@misc{mbart50-legalform-remover,
  title  = {MBart50 Legal Form Remover},
  author = {Cheng, Y.},
  year   = {2024},
  url    = {https://huggingface.co/ycffm/MBart50-legalform-remover},
  note   = {Fine-tuned for multilingual legal form simplification tasks.}
}

APA:

Cheng, Y. (2024). MBart50 Legal Form Remover. Hugging Face. https://huggingface.co/ycffm/MBart50-legalform-remover
