---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: fill-mask
tags:
- climate
- biology
---

# Model Card

This domain-adapted, [RoBERTa](https://huggingface.co/roberta-base)-based, encoder-only transformer model is fine-tuned on select scientific journals and articles related to the NASA Science Mission Directorate (SMD). Its intended purpose is to aid NLP efforts within NASA, e.g., information retrieval and intelligent search and discovery.

## Model Details

- RoBERTa as the base model
- Custom tokenizer
- 125M parameters
- Masked Language Modeling (MLM) pretraining strategy

## Uses

- Named Entity Recognition (NER), information retrieval, and sentence-transformers applications (see the usage sketches at the end of this card).

## Training Details

### Training Data

The model was trained on the following datasets:

1. Wikipedia English dump of February 1, 2020
2. NASA's own data
3. NASA papers
4. NASA Earth Science papers
5. NASA Astrophysics Data System
6. PubMed abstracts
7. PMC: subset with a commercial license

The sizes of the datasets are shown in the following chart.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/CTNkn0WHS268hvidFmoqj.png)

### Training Procedure

The model was trained with fairseq 0.12.1 and PyTorch 1.9.1, using transformers version 4.2.0. Masked Language Modeling (MLM) was the pretraining strategy used.

## Evaluation

### BLURB Benchmark

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/K0IpQnTQmrfQJ1JXxn1B6.png)

### Pruned SQuAD2.0 (SQ2) Benchmark

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/R4oMJquUz4puah3lvd5Ve.png)

### NASA SMD Experts Benchmark

Work in progress.

## Citation

Please use the DOI provided by Hugging Face to cite the model.

## Model Card Authors

- Bishwaranjan Bhattacharjee, IBM Research
- Muthukumaran Ramasubramanian, NASA-IMPACT (mr0051@uah.edu)

## Model Card Contact

Muthukumaran Ramasubramanian (mr0051@uah.edu)
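
## How to Get Started with the Model

A minimal fill-mask sketch, reflecting the `fill-mask` pipeline tag above. The repository ID `<model-repo-id>` is a placeholder for this model's Hugging Face ID, and the mask token assumes the RoBERTa convention (`<mask>`) inherited from the base model.

```python
from transformers import pipeline

# "<model-repo-id>" is a placeholder; substitute this model's actual
# Hugging Face repository ID.
fill_mask = pipeline("fill-mask", model="<model-repo-id>")

# Score candidate completions for the masked token; "<mask>" is the
# RoBERTa-style mask token assumed from the base model.
for prediction in fill_mask("The atmosphere of Mars is mostly <mask>."):
    print(f"{prediction['token_str'].strip():>15}  {prediction['score']:.3f}")
```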
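For the retrieval and sentence-transformers uses listed above, one common approach is to mean-pool the encoder's last hidden states into sentence embeddings. The sketch below assumes the standard `AutoModel`/`AutoTokenizer` interface and again uses the placeholder `<model-repo-id>`; it is an illustration, not the model's prescribed embedding recipe.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "<model-repo-id>" is a placeholder for this model's repository ID.
tokenizer = AutoTokenizer.from_pretrained("<model-repo-id>")
model = AutoModel.from_pretrained("<model-repo-id>")

sentences = [
    "Sea surface temperature anomalies in the Pacific.",
    "Spectroscopy of exoplanet atmospheres.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state    # (batch, seq, dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)     # (batch, seq, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {sim:.3f}")
```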