Model Card for MBart50-legalform-remover
This is the model card for https://huggingface.co/ycffm/MBart50-legalform-remover
Model Details
Model Description
- Developed by: Cheng. Y
- Language(s) (NLP):
- For the language codes, please refer to the mBART-50 introduction
- Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)
- Finetuned from model: facebook/mbart-large-50 (Tang, Y., Tran, C., Li, X., Chen, P., Goyal, N., Chaudhary, V., Gu, J., & Fan, A. (2020). Multilingual Translation with Extensible Multilingual Pretraining and Finetuning. https://arxiv.org/abs/2008.00401)
Uses
The MBart50 Legal Form Remover is a multilingual sequence-to-sequence model fine-tuned from Facebook's mBART-50 for legal form simplification in company names. The model processes company names in multiple languages, identifies and removes legal entity descriptors (e.g., "Inc.", "LLC", "GmbH", "株式会社"), and outputs a simplified English version of the name that preserves its core elements. It was developed in particular for standardizing company names across multilingual datasets, simplifying legal documents, and preprocessing data for downstream tasks such as entity resolution, search, and text analysis.
Direct Use
The model can be used directly to standardize company names by removing legal forms and converting them to normalized English names. Example applications include:
- Legal document simplification
- Business name standardization in multilingual contexts
- Data cleaning for entity resolution
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the fine-tuned model and tokenizer
model = MBartForConditionalGeneration.from_pretrained("ycffm/MBart50-legalform-remover")
tokenizer = MBart50TokenizerFast.from_pretrained("ycffm/MBart50-legalform-remover")

# Example usage. The first run may take several minutes while the weights download;
# alternatively, download the files and run the model locally.
test_cases = [
    {"input_text": "Volkswagen Aktiengesellschaft", "source_lang": "de_DE"},  # German
    {"input_text": "株式会社日本カストディ銀行", "source_lang": "ja_XX"},  # Japanese
    {"input_text": "مركز الشدوحي", "source_lang": "ar_AR"},  # Arabic
    {"input_text": "L'Oréal Société Anonyme", "source_lang": "fr_XX"},  # French
    {"input_text": "삼성전자 주식회사", "source_lang": "ko_KR"},  # Korean
    {"input_text": "Koninklijke Philips N.V.", "source_lang": "nl_XX"},  # Dutch
    {"input_text": "浙江海氏实业集团有限公司", "source_lang": "zh_CN"},  # Simplified Chinese
    {"input_text": "현대자동차주식회사", "source_lang": "ko_KR"},  # Korean
]

for case in test_cases:
    input_text = case["input_text"]
    source_lang = case["source_lang"]

    # Set the source language so the tokenizer prepends the correct language code
    tokenizer.src_lang = source_lang

    # Tokenize the input
    inputs = tokenizer(input_text, return_tensors="pt", max_length=64, truncation=True)

    # Generate the simplified English name
    outputs = model.generate(**inputs, max_length=64, num_beams=4, early_stopping=True)

    # Decode and print the result
    print(f"Input: {input_text}")
    print(f"Predicted Simplified Name: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
    print("-" * 50)
```
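mBART-50 selects its output language from the forced beginning-of-sequence token, and the fine-tuned checkpoint is expected to decode to English by default. If you nevertheless see output in the source language, you can force the `en_XX` target explicitly; this is a hedged variant, assuming the standard `lang_code_to_id` mapping of `MBart50TokenizerFast`:

```python
# Hedged variant: explicitly force English output via the BOS language token
# (assumption: en_XX is always the intended target, per this card).
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
    max_length=64,
    num_beams=4,
    early_stopping=True,
)
```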
Downstream Use
The model can be fine-tuned or adapted to other name normalization tasks across industries.
Out-of-Scope Use
This model is not intended for generating synthetic company names or translating entire sentences. Misuse for generating biased or offensive content is out of scope.
Bias, Risks, and Limitations
The model is trained mainly on EU languages, Arabic, Chinese (simplified/classical), Korean, Japanese, and Russian. Performance on Arabic, however, may be compromised, given both the base model's original capability for the language and the limited sample size available for it.
Recommendations
The author is actively searching for collaboration partners to improve performance on Arabic and the Nordic European languages.
How to Get Started with the Model
You can use the code shown above, or download the files from the repository and run the model locally.
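For fully local use, one option is to snapshot the repository with `huggingface_hub` and load from disk; a minimal sketch, assuming only that `snapshot_download` returns the local path of the downloaded repo:

```python
from huggingface_hub import snapshot_download
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Download all repository files once, then load everything from the local copy
local_dir = snapshot_download("ycffm/MBart50-legalform-remover")
model = MBartForConditionalGeneration.from_pretrained(local_dir)
tokenizer = MBart50TokenizerFast.from_pretrained(local_dir)
```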
Training Details
Training Data
| Dataset | Entries | Description |
|---|---|---|
| real_companies_1 | 26,764 | Real company names in EU languages, Russian, and various other languages, usually in Latin script |
| real_companies_2 | 24,790 | Real company names in EU languages, Russian, and various other languages, usually in Latin script |
| real_companies_arabic | 2,317 | Real company names in Arabic |
| real_companies_ea | 20,328 | Real company names in Chinese, Korean, and Japanese |
| synthetic_companies_eu | 20,000 | Synthetic company names in EU languages |
The entire dataset was split 8-1-1 into training, validation, and test sets. A typical data entry is

重庆海沛汽车保险代理有限公司 , Chongqing Haipai Automobile Insurance Agency , zh_CN

i.e., the original company name with legal form in the source language, the English company name without the legal form, and the source language code. The target language is always en_XX.
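To illustrate how such an entry maps to a training example, here is a minimal sketch of tokenizing the pair above, assuming the standard `text_target` argument of recent transformers tokenizers (variable names are illustrative, not the author's script):

```python
tokenizer.src_lang = "zh_CN"  # source language code from the entry
tokenizer.tgt_lang = "en_XX"  # the target language is always English

batch = tokenizer(
    "重庆海沛汽车保险代理有限公司",  # original name with legal form
    text_target="Chongqing Haipai Automobile Insurance Agency",  # simplified English name
    max_length=64,
    truncation=True,
    return_tensors="pt",
)
# batch["input_ids"] encodes the source name; batch["labels"] encodes the target
```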
Training Procedure
- Phase 1: training on the synthetic data to capture the general legal form removal process.
- Phase 2: training on the real-world data to capture the multilingual legal forms and their removal patterns.
Training Hyperparameters
- Training regime: mixed precision (fp16) for speed and efficiency
- Batch size: 8
- Learning rate: 5e-5
- Epochs: 2 per phase
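For readers who want to reproduce one training phase with these hyperparameters, a minimal sketch using `Seq2SeqTrainer` is shown below; `train_ds` and `val_ds` are hypothetical tokenized datasets, and this is not the author's exact script:

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Hyperparameters taken from this card; output_dir is illustrative
args = Seq2SeqTrainingArguments(
    output_dir="mbart50-legalform-remover-phase1",
    fp16=True,                      # mixed precision
    per_device_train_batch_size=8,  # batch size 8
    learning_rate=5e-5,
    num_train_epochs=2,             # 2 epochs per phase
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # hypothetical: tokenized synthetic data for Phase 1
    eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()  # repeat with the real-world data for Phase 2
```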
Metrics
Cross-entropy loss and perplexity.
Results
- Validation Loss: 0.044 (Phase 2)
- Perplexity: 1.045 (Phase 2)
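As a sanity check, the reported perplexity is the exponential of the cross-entropy validation loss:

```python
import math

val_loss = 0.044           # Phase 2 validation cross-entropy loss
print(math.exp(val_loss))  # ≈ 1.045, matching the reported perplexity
```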
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: NVIDIA T4 GPU (15 GB VRAM)
- Hours used: 4 hours for Phase 1, 3 hours for Phase 2
- Cloud Provider: Google Colab
Citation for this fine-tuned model
This model is fine-tuned from mBART-50; see the Tang et al. (2020) reference above for the base model.
BibTeX:
```bibtex
@misc{mbart50-legalform-remover,
  title  = {MBart50 Legal Form Remover},
  author = {Cheng.Y},
  year   = {2024},
  url    = {https://huggingface.co/ycffm/MBart50-legalform-remover},
  note   = {Fine-tuned for multilingual legal form simplification tasks.}
}
```
APA:
Cheng.Y. (2024). MBart50 Legal Form Remover. Retrieved from https://huggingface.co/ycffm/MBart50-legalform-remover