---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: fill-mask
tags:
- climate
- biology
---

# Model Card for the NASA SMD Domain-Adapted RoBERTa Model

<!-- Provide a quick summary of what the model is/does. -->

This domain-adapted, [RoBERTa](https://huggingface.co/roberta-base)-based, encoder-only transformer model was fine-tuned on selected scientific journals and articles related to the NASA Science Mission Directorate (SMD). Its intended purpose is to aid NLP efforts within NASA, such as information retrieval and intelligent search and discovery.
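
The model can be queried directly for masked-token prediction. The snippet below is a minimal sketch using the `transformers` fill-mask pipeline; the repository identifier `nasa-impact/nasa-smd-ibm-v0.1` is only a placeholder and should be replaced with this model's actual Hugging Face Hub ID.

```python
from transformers import pipeline

# Placeholder Hub ID; replace with this model's actual repository name.
MODEL_ID = "nasa-impact/nasa-smd-ibm-v0.1"

# Build a fill-mask pipeline (the model was pretrained with the MLM objective).
fill_mask = pipeline("fill-mask", model=MODEL_ID)

# The custom tokenizer defines the mask token; query it rather than hard-coding "<mask>".
mask = fill_mask.tokenizer.mask_token
for prediction in fill_mask(f"The {mask} instrument measures sea surface temperature."):
    print(prediction["token_str"], round(prediction["score"], 3))
```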

## Model Details

- RoBERTa as the base model
- Custom tokenizer
- 125M parameters
- Masked Language Modeling (MLM) pretraining strategy

### Model Description

- **Developed by:** NASA IMPACT and IBM Research
- **Model type:** Encoder-only (RoBERTa-style) transformer
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)

## Uses

- Named Entity Recognition (NER)
- Information retrieval
- Sentence-transformers (sentence embedding) applications
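
The retrieval and sentence-transformers use cases above start from sentence embeddings. The sketch below shows one common recipe (mean pooling over the last hidden state); it is an illustrative assumption, not a recipe prescribed by the model authors, and the Hub ID is again a placeholder.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "nasa-impact/nasa-smd-ibm-v0.1"  # placeholder Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(sentences):
    """Mean-pool the last hidden state into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float() # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

docs = ["MODIS observes aerosol optical depth.", "Kepler searched for exoplanet transits."]
query_vec = embed(["Which mission studied exoplanets?"])
scores = torch.nn.functional.cosine_similarity(query_vec, embed(docs))
print(scores)  # higher score = more relevant document
```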

## Training Details

### Training Data

The model was trained on the following datasets:

1. English Wikipedia dump of February 1, 2020
2. NASA's own data
3. NASA papers
4. NASA Earth Science papers
5. NASA Astrophysics Data System
6. PubMed abstracts
7. PMC (subset with a commercial license)

The sizes of the datasets are shown in the chart below.

*(Figure: dataset sizes)*

### Training Procedure

The model was trained with fairseq 0.12.1 and PyTorch 1.9.1, using transformers version 4.2.0. Masked Language Modeling (MLM) was the pretraining strategy used.
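
The original pretraining was carried out in fairseq, but the MLM objective itself can be illustrated with the `transformers` data collator. The snippet below is only a sketch of RoBERTa-style dynamic masking (assuming the standard 15% masking rate), not the authors' fairseq configuration, and the Hub ID is a placeholder.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

MODEL_ID = "nasa-impact/nasa-smd-ibm-v0.1"  # placeholder Hub ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# RoBERTa-style MLM: 15% of tokens are selected for masking (assumed default here).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer(["The ozone layer absorbs ultraviolet radiation."], return_tensors="pt")
batch = collator([{"input_ids": encoded["input_ids"][0]}])

# input_ids now contain mask tokens; labels are -100 except at masked positions.
print(tokenizer.decode(batch["input_ids"][0]))
print(batch["labels"][0])
```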

## Evaluation

### BLURB Benchmark

*(Figure: BLURB benchmark results)*

### Pruned SQuAD2.0 (SQ2) Benchmark

*(Figure: pruned SQuAD2.0 benchmark results)*

### NASA SMD Experts Benchmark

Work in progress.

## Citation

Please use the DOI provided by Hugging Face to cite the model.

## Model Card Authors

- Bishwaranjan Bhattacharjee, IBM Research
- Muthukumaran Ramasubramanian, NASA-IMPACT ([email protected])

## Model Card Contact

Muthukumaran Ramasubramanian ([email protected])