---
license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: fill-mask
tags:
  - climate
  - biology
---

# Model Card for KAILAS

This domain-adapted, [RoBERTa](https://huggingface.co/roberta-base)-based, encoder-only transformer model is fine-tuned on selected scientific journals and articles related to the NASA Science Mission Directorate (SMD). Its intended purpose is to support NLP efforts within NASA, e.g., information retrieval and intelligent search and discovery.
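
For example, the model can be queried through the standard `transformers` fill-mask pipeline. A minimal sketch, where the repository ID is a placeholder (substitute this model's actual Hub ID):

```python
from transformers import pipeline

# "nasa-impact/kailas" is a placeholder ID for illustration only;
# substitute the actual Hub repository ID of this model.
fill_mask = pipeline("fill-mask", model="nasa-impact/kailas")

# RoBERTa-style checkpoints use <mask> as the mask token.
for pred in fill_mask("The <mask> mission observed the Martian atmosphere."):
    print(f"{pred['token_str']:>15}  {pred['score']:.3f}")
```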

## Model Details

- RoBERTa as the base model
- Custom tokenizer (see the loading sketch after this list)
- 125M parameters
- Masked Language Modeling (MLM) pretraining strategy
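
Because the tokenizer is custom rather than the stock `roberta-base` vocabulary, it should be loaded from the same checkpoint as the weights. A minimal sketch, again with a placeholder repository ID:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "nasa-impact/kailas"  # placeholder; use the actual Hub ID

# The custom domain tokenizer ships alongside the weights, so loading
# both from the same checkpoint keeps vocabulary and model in sync.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Sanity check: should print roughly 125M.
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```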


## Uses

- Named Entity Recognition (NER), information retrieval, and sentence-transformers-style embedding tasks (see the pooling sketch below).
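
For retrieval-style uses, one common recipe is to mean-pool the encoder's last hidden states into sentence embeddings. A minimal sketch, assuming the placeholder repository ID from above; this is a standard pooling pattern, not an officially published one:

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "nasa-impact/kailas"  # placeholder; use the actual Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = [
    "Aerosol optical depth over the Sahara",
    "Dust transport across the Atlantic",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (batch, seq, dim)

# Mean-pool over real tokens only, using the attention mask.
mask = batch["attention_mask"].unsqueeze(-1)         # (batch, seq, 1)
embeddings = (hidden * mask).sum(1) / mask.sum(1)    # (batch, dim)

# Cosine similarity between the two sentence embeddings.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(sim.item())
```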

## Training Details

### Training Data

The model was trained on the following datasets:

1. Wikipedia English dump of February 1, 2020
2. NASA's own data
3. NASA papers
4. NASA Earth Science papers
5. NASA Astrophysics Data System
6. PubMed abstracts
7. PubMed Central (PMC): the subset with a commercial-use license

The sizes of the datasets are shown in the following chart.

*(Figure: dataset sizes by source)*

### Training Procedure

The model was trained with fairseq 0.12.1 and PyTorch 1.9.1, using transformers version 4.2.0. Masked Language Modeling (MLM) was the pretraining strategy.
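
To illustrate the MLM objective (a sketch of the equivalent dynamic masking step in `transformers`, not the actual fairseq training code, which is not published here):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # stand-in tokenizer

# 15% of tokens are selected for masking, the standard RoBERTa MLM setting.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer("Clouds reflect incoming solar radiation.")])
print(batch["input_ids"])  # some tokens replaced by <mask>
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```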

## Evaluation

### BLURB Benchmark

*(Figure: BLURB benchmark results)*

### Pruned SQuAD2.0 (SQ2) Benchmark

*(Figure: pruned SQuAD2.0 benchmark results)*

### NASA SMD Experts Benchmark

Work in progress.

## Citation

Please use the DOI provided by Hugging Face to cite this model.

## Model Card Authors

- Bishwaranjan Bhattacharjee, IBM Research
- Muthukumaran Ramasubramanian, NASA-IMPACT ([email protected])

## Model Card Contact

Muthukumaran Ramasubramanian ([email protected])