---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: fill-mask
tags:
- climate
- biology
---

# Model Card for the NASA SMD Domain-Adapted RoBERTa Model

<!-- Provide a quick summary of what the model is/does. -->

This domain-adapted, [RoBERTa](https://huggingface.co/roberta-base)-based, encoder-only transformer model was fine-tuned on selected scientific journals and articles related to the NASA Science Mission Directorate (SMD). Its intended purpose is to aid NLP efforts within NASA, such as information retrieval and intelligent search and discovery.
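
The model can be queried directly for masked-token prediction. The snippet below is a minimal sketch using the `transformers` fill-mask pipeline; the repository identifier `nasa-impact/nasa-smd-ibm-v0.1` is only a placeholder and should be replaced with this model's actual Hugging Face Hub ID.

```python
from transformers import pipeline

# Placeholder Hub ID; replace with this model's actual repository name.
MODEL_ID = "nasa-impact/nasa-smd-ibm-v0.1"

# Build a fill-mask pipeline (the model was pretrained with the MLM objective).
fill_mask = pipeline("fill-mask", model=MODEL_ID)

# The custom tokenizer defines the mask token; query it rather than hard-coding "<mask>".
mask = fill_mask.tokenizer.mask_token
for prediction in fill_mask(f"The {mask} instrument measures sea surface temperature."):
    print(prediction["token_str"], round(prediction["score"], 3))
```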

## Model Details

- RoBERTa as the base model
- Custom tokenizer
- 125M parameters
- Masked Language Modeling (MLM) pretraining strategy

### Model Description

- **Developed by:** NASA IMPACT and IBM Research
- **Model type:** Encoder-only (RoBERTa-style) transformer
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)

## Uses

- Named Entity Recognition (NER)
- Information retrieval
- Sentence-transformers (sentence embedding) applications
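
The retrieval and sentence-transformers use cases above start from sentence embeddings. The sketch below shows one common recipe (mean pooling over the last hidden state); it is an illustrative assumption, not a recipe prescribed by the model authors, and the Hub ID is again a placeholder.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "nasa-impact/nasa-smd-ibm-v0.1"  # placeholder Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(sentences):
    """Mean-pool the last hidden state into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float() # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

docs = ["MODIS observes aerosol optical depth.", "Kepler searched for exoplanet transits."]
query_vec = embed(["Which mission studied exoplanets?"])
scores = torch.nn.functional.cosine_similarity(query_vec, embed(docs))
print(scores)  # higher score = more relevant document
```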

## Training Details

### Training Data

The model was trained on the following datasets:

1. English Wikipedia dump of February 1, 2020
2. NASA's own data
3. NASA papers
4. NASA Earth Science papers
5. NASA Astrophysics Data System
6. PubMed abstracts
7. PMC (subset with a commercial license)

The sizes of the datasets are shown in the chart below.

*(Figure: dataset sizes)*

### Training Procedure

The model was trained with fairseq 0.12.1 and PyTorch 1.9.1, using transformers version 4.2.0. Masked Language Modeling (MLM) was the pretraining strategy used.
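
The original pretraining was carried out in fairseq, but the MLM objective itself can be illustrated with the `transformers` data collator. The snippet below is only a sketch of RoBERTa-style dynamic masking (assuming the standard 15% masking rate), not the authors' fairseq configuration, and the Hub ID is a placeholder.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

MODEL_ID = "nasa-impact/nasa-smd-ibm-v0.1"  # placeholder Hub ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# RoBERTa-style MLM: 15% of tokens are selected for masking (assumed default here).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer(["The ozone layer absorbs ultraviolet radiation."], return_tensors="pt")
batch = collator([{"input_ids": encoded["input_ids"][0]}])

# input_ids now contain mask tokens; labels are -100 except at masked positions.
print(tokenizer.decode(batch["input_ids"][0]))
print(batch["labels"][0])
```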

## Evaluation

### BLURB Benchmark

*(Figure: BLURB benchmark results)*

### Pruned SQuAD2.0 (SQ2) Benchmark

*(Figure: pruned SQuAD2.0 benchmark results)*

### NASA SMD Experts Benchmark

Work in progress.

## Citation

Please use the DOI provided by Hugging Face to cite the model.

## Model Card Authors

- Bishwaranjan Bhattacharjee, IBM Research
- Muthukumaran Ramasubramanian, NASA-IMPACT ([email protected])

## Model Card Contact

Muthukumaran Ramasubramanian ([email protected])