---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: fill-mask
tags:
- climate
- biology
---

# Model Card

This domain-adapted, [RoBERTa](https://huggingface.co/roberta-base)-based, encoder-only transformer model is fine-tuned on select scientific journals and articles related to the NASA Science Mission Directorate (SMD). Its intended purpose is to aid NLP efforts within NASA, e.g., information retrieval and intelligent search and discovery.

## Model Details

- RoBERTa as the base model
- Custom tokenizer
- 125M parameters
- Masked Language Modeling (MLM) pretraining strategy

## Uses

- Named Entity Recognition (NER), information retrieval, and sentence-transformers applications (see the usage sketches at the end of this card).

## Training Details

### Training Data

The model was trained on the following datasets:

1. Wikipedia English dump of February 1, 2020
2. NASA's own data
3. NASA papers
4. NASA Earth Science papers
5. NASA Astrophysics Data System
6. PubMed abstracts
7. PMC: subset with a commercial license

The sizes of the datasets are shown in the following chart.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/CTNkn0WHS268hvidFmoqj.png)

### Training Procedure

The model was trained with fairseq 0.12.1 and PyTorch 1.9.1, using transformers version 4.2.0. Masked Language Modeling (MLM) was the pretraining strategy used.

## Evaluation

### BLURB Benchmark

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/K0IpQnTQmrfQJ1JXxn1B6.png)

### Pruned SQuAD2.0 (SQ2) Benchmark

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/R4oMJquUz4puah3lvd5Ve.png)

### NASA SMD Experts Benchmark

Work in progress.

## Citation

Please use the DOI provided by Hugging Face to cite the model.

## Model Card Authors

- Bishwaranjan Bhattacharjee, IBM Research
- Muthukumaran Ramasubramanian, NASA-IMPACT (mr0051@uah.edu)

## Model Card Contact

Muthukumaran Ramasubramanian (mr0051@uah.edu)
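
## How to Get Started with the Model

A minimal fill-mask sketch, reflecting the `fill-mask` pipeline tag above. The repository ID `<model-repo-id>` is a placeholder for this model's Hugging Face ID, and the mask token assumes the RoBERTa convention (`<mask>`) inherited from the base model.

```python
from transformers import pipeline

# "<model-repo-id>" is a placeholder; substitute this model's actual
# Hugging Face repository ID.
fill_mask = pipeline("fill-mask", model="<model-repo-id>")

# Score candidate completions for the masked token; "<mask>" is the
# RoBERTa-style mask token assumed from the base model.
for prediction in fill_mask("The atmosphere of Mars is mostly <mask>."):
    print(f"{prediction['token_str'].strip():>15}  {prediction['score']:.3f}")
```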
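For the retrieval and sentence-transformers uses listed above, one common approach is to mean-pool the encoder's last hidden states into sentence embeddings. The sketch below assumes the standard `AutoModel`/`AutoTokenizer` interface and again uses the placeholder `<model-repo-id>`; it is an illustration, not the model's prescribed embedding recipe.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "<model-repo-id>" is a placeholder for this model's repository ID.
tokenizer = AutoTokenizer.from_pretrained("<model-repo-id>")
model = AutoModel.from_pretrained("<model-repo-id>")

sentences = [
    "Sea surface temperature anomalies in the Pacific.",
    "Spectroscopy of exoplanet atmospheres.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state    # (batch, seq, dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)     # (batch, seq, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {sim:.3f}")
```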