---
license: apache-2.0
---

# Model Card for MBart50 Legal Form Remover

This is the model card for https://huggingface.co/ycffm/MBart50-legalform-remover

## Model Details

### Model Description

- **Developed by:** Cheng. Y
- **Language(s) (NLP):** Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI). For the language codes, please refer to the mBART-50 introduction.
- **Finetuned from model:** facebook/mbart-large-50

  Tang, Y., Tran, C., Li, X., Chen, P., Goyal, N., Chaudhary, V., Gu, J., & Fan, A. (2020). Multilingual Translation with Extensible Multilingual Pretraining and Finetuning. https://arxiv.org/abs/2008.00401

## Uses

The MBart50 Legal Form Remover is a multilingual sequence-to-sequence model fine-tuned from Facebook's mBART-50 for legal form simplification in company names. The model processes company names in multiple languages, identifies and removes legal entity descriptors (e.g., "Inc.", "LLC", "GmbH", "株式会社"), and outputs a simplified English version of the name while preserving its core elements.

The model is intended to strip the legal form from an organization name and transform the company name into a simplified English representation. It is particularly suited to tasks such as standardizing company names across multilingual datasets, simplifying legal documents, and preprocessing data for downstream tasks such as entity resolution, search, and text analysis.

### Direct Use

The model can be used directly to standardize company names by removing legal forms and converting them to normalized English names. Example applications include:

- Legal document simplification
- Business name standardization in multilingual contexts
- Data cleaning for entity resolution

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the fine-tuned model and tokenizer
model = MBartForConditionalGeneration.from_pretrained("ycffm/MBart50-legalform-remover")
tokenizer = MBart50TokenizerFast.from_pretrained("ycffm/MBart50-legalform-remover")

# Example usage. The first run may take several minutes while the weights
# download; alternatively, download the files and run the model locally.
test_cases = [
    {"input_text": "Volkswagen Aktiengesellschaft", "source_lang": "de_DE"},  # German
    {"input_text": "株式会社日本カストディ銀行", "source_lang": "ja_XX"},  # Japanese
    {"input_text": "مركز الشدوحي", "source_lang": "ar_AR"},  # Arabic
    {"input_text": "L'Oréal Société Anonyme", "source_lang": "fr_XX"},  # French
    {"input_text": "삼성전자 주식회사", "source_lang": "ko_KR"},  # Korean
    {"input_text": "Koninklijke Philips N.V.", "source_lang": "nl_XX"},  # Dutch
    {"input_text": "浙江海氏实业集团有限公司", "source_lang": "zh_CN"},  # Simplified Chinese
    {"input_text": "현대자동차주식회사", "source_lang": "ko_KR"},  # Korean
]

for case in test_cases:
    input_text = case["input_text"]

    # Tell the tokenizer which language the input is in
    tokenizer.src_lang = case["source_lang"]

    # Tokenize the input name
    inputs = tokenizer(input_text, return_tensors="pt", max_length=64, truncation=True)

    # Generate the simplified English name
    outputs = model.generate(inputs["input_ids"], max_length=64, num_beams=4, early_stopping=True)

    # Decode and print the result
    print(f"Input: {input_text}")
    print(f"Predicted Simplified Name: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
    print("-" * 50)
```
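If the model ever echoes a name back in the source language, a variant of the `generate` call that pins the output language may help. This is a hedged sketch, not part of the author's example: mBART-50 checkpoints normally fix the target language via `forced_bos_token_id`, and `en_XX` is assumed here to match the training setup described below.

```python
# Hedged variant of the generate call above: mBART-50 models usually pin
# the output language with forced_bos_token_id. en_XX is an assumption
# based on the training setup (the target language is always en_XX).
outputs = model.generate(
    inputs["input_ids"],
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
    max_length=64,
    num_beams=4,
    early_stopping=True,
)
```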
### Downstream Use

The model has potential for fine-tuning or adaptation to other name normalization tasks across industries.

### Out-of-Scope Use

This model is not intended for generating synthetic company names or for translating entire sentences. Misuse for generating biased or offensive content is out of scope.

## Bias, Risks, and Limitations

The model is mainly trained on EU languages, Arabic, Chinese (Simplified/Traditional), Korean, Japanese, and Russian. Performance on Arabic, however, may be compromised, given the base model's original ability in that language and the limited sample size for it.

### Recommendations

The author is actively searching for collaboration partners to improve performance on Arabic and the Nordic European languages.

## How to Get Started with the Model

You can use the code displayed above, or download the files from the repository and run the model locally.

## Training Details

### Training Data

| Dataset | Entries | Description |
| --- | --- | --- |
| real_companies_1 | 26,764 | Real company names in EU languages, Russian, and various other languages, usually in Latin script |
| real_companies_2 | 24,790 | Real company names in EU languages, Russian, and various other languages, usually in Latin script |
| real_companies_arabic | 2,317 | Real company names in Arabic |
| real_companies_ea | 20,328 | Real company names in Chinese, Korean, and Japanese |
| synthetic_companies_eu | 20,000 | Synthetic company names in EU languages |

The entire dataset was split 80/10/10 into training, validation, and test sets.

A typical data entry is a triple of (original company name with legal form in the source language, English company name without legal form, source language code), for example:

`重庆海沛汽车保险代理有限公司, Chongqing Haipai Automobile Insurance Agency, zh_CN`

The target language is always `en_XX`. A tokenization sketch for such an entry is shown below.
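As a concrete illustration, the following is a minimal sketch of how one such entry could be tokenized for seq2seq fine-tuning. This is an assumption-laden sketch, not the author's training script: the base checkpoint, `max_length`, and the use of `text_target` are all assumed.

```python
from transformers import MBart50TokenizerFast

# Minimal sketch, assuming the standard mBART-50 seq2seq recipe; this is
# NOT the author's training pipeline. Base checkpoint and max_length are
# assumptions for illustration.
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
tokenizer.src_lang = "zh_CN"  # source language code of this entry
tokenizer.tgt_lang = "en_XX"  # the target language is always en_XX

batch = tokenizer(
    "重庆海沛汽车保险代理有限公司",  # name with legal form, in the source language
    text_target="Chongqing Haipai Automobile Insurance Agency",  # simplified English name
    max_length=64,
    truncation=True,
    return_tensors="pt",
)
# batch["input_ids"] and batch["labels"] can then be fed to
# MBartForConditionalGeneration for cross-entropy training.
```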
### Training Procedure

- Phase 1: Training on the synthetic data to capture the general legal form removal process.
- Phase 2: Training on the real-world data to capture the multilingual legal forms and their removal patterns.

#### Training Hyperparameters

- **Training regime:** Mixed precision (fp16) for speed and efficiency
- **Batch size:** 8
- **Learning rate:** 5e-5
- **Epochs:** 2 per phase

#### Metrics

Cross-entropy as the loss; perplexity.

### Results

- Validation loss: 0.044 (Phase 2)
- Perplexity: 1.045 (Phase 2), i.e., exp(0.044), since perplexity is the exponential of the cross-entropy loss

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA T4 (15 GB VRAM)
- **Hours used:** 4 hours for Phase 1, 3 hours for Phase 2
- **Cloud Provider:** Google Colab

## Citation

This model is fine-tuned from mBART-50; see Tang et al. (2020) above for the base model.

**BibTeX:**

```bibtex
@misc{mbart50-legalform-remover,
  title  = {MBart50 Legal Form Remover},
  author = {Cheng, Y.},
  year   = {2024},
  url    = {https://huggingface.co/ycffm/MBart50-legalform-remover},
  note   = {Fine-tuned for multilingual legal form simplification tasks.}
}
```

**APA:**

Cheng, Y. (2024). *MBart50 Legal Form Remover*. Retrieved from https://huggingface.co/ycffm/MBart50-legalform-remover