---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: fill-mask
tags:
- climate
- biology
---
# Model Card for Model ID

This domain-adapted, [RoBERTa](https://huggingface.co/roberta-base)-based, encoder-only transformer model is fine-tuned on select scientific journals and articles related to the NASA Science Mission Directorate (SMD). Its intended purpose is to aid NLP efforts within NASA, e.g., information retrieval and intelligent search and discovery.
## Model Details
- RoBERTa as base model
- Custom tokenizer
- 125M parameters
- Masked Language Modeling (MLM) pretraining strategy
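As a minimal loading sketch (the repository ID below is a hypothetical placeholder, since the published model ID is not stated here), the model and its custom tokenizer can be pulled through the transformers Auto classes:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hypothetical repository ID -- substitute the model's actual ID on the Hub.
model_id = "nasa-impact/nasa-smd-model"

# The custom tokenizer is bundled with the model, so both load from one repo.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```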
## Uses

- Named Entity Recognition (NER)
- Information retrieval
- Sentence similarity via sentence-transformers
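For example, a minimal fill-mask sketch, reusing the hypothetical repository ID from above:

```python
from transformers import pipeline

# Hypothetical repository ID -- substitute the model's actual ID on the Hub.
fill_mask = pipeline("fill-mask", model="nasa-impact/nasa-smd-model")

# RoBERTa-style tokenizers use <mask> as the mask token.
for pred in fill_mask("The Hubble Space Telescope observes the universe in <mask> light."):
    print(pred["token_str"], pred["score"])
```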
## Training Details

### Training Data
The model was trained on the following datasets:
- English Wikipedia dump of February 1, 2020
- NASA's own data:
  - NASA papers
  - NASA Earth Science papers
  - NASA Astrophysics Data System
- PubMed abstracts
- PMC (subset with a commercial-use license)
The sizes of these datasets are shown in the chart below.
### Training Procedure

The model was trained with fairseq 0.12.1 and PyTorch 1.9.1, using transformers version 4.2.0. Masked Language Modeling (MLM) is the pretraining strategy used.
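As an illustrative sketch only (the actual pretraining ran on fairseq, not this code path), the same MLM objective can be expressed in transformers with its language-modeling data collator, which randomly masks tokens and marks unmasked positions with the ignored label -100:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# roberta-base stands in here for the model's custom tokenizer (an assumption).
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# 15% masking is the standard RoBERTa rate; the card does not state the rate used.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("NASA studies the Earth and the universe beyond.")])
print(batch["input_ids"])  # some tokens replaced by <mask>
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```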
## Evaluation

The model is being evaluated on the following benchmarks (results are a work in progress):

- BLURB benchmark
- Pruned SQuAD 2.0 (SQ2) benchmark
- NASA SMD Experts benchmark
## Citation

Please use the DOI provided by Hugging Face to cite this model.
## Model Card Authors

- Bishwaranjan Bhattacharjee, IBM Research
- Muthukumaran Ramasubramanian, NASA-IMPACT ([email protected])
## Model Card Contact
Muthukumaran Ramasubramanian ([email protected])