---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: fill-mask
tags:
- climate
- biology
---
# Model Card for Model ID

This domain-adapted, [RoBERTa](https://huggingface.co/roberta-base)-based, encoder-only transformer model is fine-tuned on select scientific journals and articles related to the NASA Science Mission Directorate (SMD). Its intended purpose is to aid NLP efforts within NASA, e.g., information retrieval and intelligent search and discovery.
## Model Details
- RoBERTa as base model
- Custom tokenizer
- 125M parameters
- Masked Language Modeling (MLM) pretraining strategy
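As a minimal loading sketch (the repository ID below is a hypothetical placeholder, since the published model ID is not stated here), the model and its custom tokenizer can be pulled through the transformers Auto classes:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hypothetical repository ID -- substitute the model's actual ID on the Hub.
model_id = "nasa-impact/nasa-smd-model"

# The custom tokenizer is bundled with the model, so both load from one repo.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```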
## Uses

- Named Entity Recognition (NER)
- Information retrieval
- Sentence similarity via sentence-transformers
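For example, a minimal fill-mask sketch, reusing the hypothetical repository ID from above:

```python
from transformers import pipeline

# Hypothetical repository ID -- substitute the model's actual ID on the Hub.
fill_mask = pipeline("fill-mask", model="nasa-impact/nasa-smd-model")

# RoBERTa-style tokenizers use <mask> as the mask token.
for pred in fill_mask("The Hubble Space Telescope observes the universe in <mask> light."):
    print(pred["token_str"], pred["score"])
```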
## Training Details

### Training Data
The model was trained on the following datasets:
- English Wikipedia dump of February 1, 2020
- NASA's own data:
  - NASA papers
  - NASA Earth Science papers
  - NASA Astrophysics Data System
- PubMed abstracts
- PMC (subset with a commercial-use license)
The sizes of these datasets are shown in the chart below.
### Training Procedure

The model was trained with fairseq 0.12.1 and PyTorch 1.9.1, using transformers version 4.2.0. Masked Language Modeling (MLM) is the pretraining strategy used.
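As an illustrative sketch only (the actual pretraining ran on fairseq, not this code path), the same MLM objective can be expressed in transformers with its language-modeling data collator, which randomly masks tokens and marks unmasked positions with the ignored label -100:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# roberta-base stands in here for the model's custom tokenizer (an assumption).
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# 15% masking is the standard RoBERTa rate; the card does not state the rate used.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("NASA studies the Earth and the universe beyond.")])
print(batch["input_ids"])  # some tokens replaced by <mask>
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```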
## Evaluation

The model is being evaluated on the following benchmarks (results are a work in progress):

- BLURB benchmark
- Pruned SQuAD 2.0 (SQ2) benchmark
- NASA SMD Experts benchmark
## Citation

Please use the DOI provided by Hugging Face to cite this model.
## Model Card Authors

- Bishwaranjan Bhattacharjee, IBM Research
- Muthukumaran Ramasubramanian, NASA-IMPACT ([email protected])
## Model Card Contact
Muthukumaran Ramasubramanian ([email protected])