---
license: apache-2.0
tags:
- question-answering
- squad
- transformers
- nlp
datasets:
- squad
language:
- en
metrics:
- exact_match
- f1
library_name: transformers
pipeline_tag: question-answering
model-index:
- name: roberta-base-qa-v1
  results:
  - task:
      type: question-answering
      name: question-answering
    dataset:
      name: squad (a subset, not official dataset)
      type: squad
    metrics:
    - type: f1
      value: 78.28
      name: f1
      verified: false
    - type: exact-match
      value: 66.00
      name: exact-match
      verified: false
---
# Model card for SaraPiscitelli/roberta-base-qa-v1
This model is a **fine-tuned** version of the base transformer model [roberta-base](https://huggingface.co/roberta-base).
It is fine-tuned for the **extractive question answering** task on a subset of the [SQuAD dataset](https://huggingface.co/datasets/squad).
You can access the training code [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/train/question_answering.py) and the evaluation code [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/evaluation/question_answering.py).
### Model Description
- **Developed by:** Sara Piscitelli
- **Model type:** Transformer encoder, `RobertaForQuestionAnswering` (124,056,578 parameters)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)
- **Maximum input tokens:** 512
### Model Sources
- **Training code:** [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/train/question_answering.py)
- **Evaluation code:** [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/evaluation/question_answering.py)
## Uses
The model can be used for extractive question answering, where both the context and the question are provided and the answer is a span of the context.
### Recommendations
This is a baseline model, so some answers may be inaccurate.
Refer to the evaluation metrics below for a better understanding of its performance.
## How to Get Started with the Model
You can use the Hugging Face `pipeline` API:
```
from transformers import pipeline
qa_model = pipeline("question-answering", model="SaraPiscitelli/roberta-base-qa-v1")
question = "Which name is also used to describe the Amazon rainforest in English?"
context = """The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."""
print(qa_model(question=question, context=context)['answer'])
```
or load it directly:
```
import torch
from typing import List, Optional
from transformers import AutoModelForQuestionAnswering, AutoTokenizer


class InferenceModel:

    def __init__(self, model_name_or_checkpoint_path: str,
                 tokenizer_name: Optional[str] = None,
                 device_type: Optional[str] = None) -> None:
        if tokenizer_name is None:
            tokenizer_name = model_name_or_checkpoint_path
        if device_type is None:
            # Prefer CUDA, then Apple MPS, then fall back to CPU.
            device_type = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
        # Load the weights, then move the model to the chosen device.
        self.model = AutoModelForQuestionAnswering.from_pretrained(model_name_or_checkpoint_path).to(device_type)
        self.model.eval()
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def inference(self, questions: List[str], contexts: List[str]) -> List[str]:
        inputs = self.tokenizer(questions, contexts,
                                padding="longest",
                                return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # outputs.start_logits and outputs.end_logits both have the shape of
        # inputs['input_ids'], i.e. (batch_size, input_length).
        answer_start_index: List[int] = outputs.start_logits.argmax(dim=-1).tolist()
        answer_end_index: List[int] = outputs.end_logits.argmax(dim=-1).tolist()
        # Decode the tokens between the predicted start and end positions (inclusive).
        answers: List[str] = [self.tokenizer.decode(inputs.input_ids[i, answer_start_index[i]: answer_end_index[i] + 1])
                              for i in range(len(questions))]
        return answers


model = InferenceModel("SaraPiscitelli/roberta-base-qa-v1")
question = "Which name is also used to describe the Amazon rainforest in English?"
context = """The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."""
print(model.inference(questions=[question], contexts=[context])[0])
```
In both cases, the answer will be printed out: "Amazonia or the Amazon Jungle"
## Training Details
### Training Data
- [squad dataset](https://huggingface.co/datasets/squad).
To retrieve the dataset, use the following code:
```
from datasets import load_dataset
squad = load_dataset("squad")
squad['train'] = squad['train'].select(range(30000))
squad['test'] = squad['validation']
squad['validation'] = squad['validation'].select(range(2000))
```
The datasets used after preprocessing are listed below:
- Train: `Dataset({features: ['id', 'title', 'context', 'question', 'answers'], num_rows: 8207})`
- Validation: `Dataset({features: ['id', 'title', 'context', 'question', 'answers'], num_rows: 637})`
#### Preprocessing
All samples with **more than 512 tokens have been removed**.
This was necessary due to the maximum input token limit accepted by the RoBERTa-base model.
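The length filter can be sketched as follows. This is not the code from the training script, only a minimal illustration under stated assumptions: `fits_in_model` and `toy_tokenize` are hypothetical helpers, and the stand-in tokenizer counts whitespace-separated words so that the example runs offline.

```python
from typing import Callable, Dict, List


def fits_in_model(sample: Dict[str, str],
                  tokenize: Callable[[str, str], List[int]],
                  max_tokens: int = 512) -> bool:
    """True when the encoded (question, context) pair has at most max_tokens tokens."""
    return len(tokenize(sample["question"], sample["context"])) <= max_tokens


def toy_tokenize(question: str, context: str) -> List[int]:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word,
    # plus the 4 special tokens RoBERTa adds to a pair: <s> q </s></s> c </s>.
    n = len(question.split()) + len(context.split()) + 4
    return list(range(n))


samples = [
    {"question": "Who?", "context": "short context"},
    {"question": "Who?", "context": "word " * 600},  # too long for the model
]
kept = [s for s in samples if fits_in_model(s, toy_tokenize)]

# With the real tokenizer, the same predicate could be used as:
# squad["train"].filter(lambda s: fits_in_model(s, lambda q, c: tokenizer(q, c)["input_ids"]))
```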
#### Training Hyperparameters
- **Training regime:** fp32
- **base_model_name_or_path:** roberta-base
- **max_tokens_length:** 512
- **training_arguments:** TrainingArguments(
    output_dir=results_dir,
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=0.00001,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    eval_accumulation_steps=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    save_strategy="steps",
    save_steps=0.2,
    logging_strategy="steps",
    logging_steps=1,
    report_to="tensorboard",
    do_train=True,
    do_eval=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    #group_by_length=True,
    dataloader_drop_last=False,
    fp16=False,
    bf16=False
    )
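Because `eval_steps`, `save_steps` and `warmup_ratio` are given as fractions, it can help to translate them into step counts. The sketch below does that arithmetic for the 8207 training rows reported above. It assumes a single device and assumes that fractional values are resolved against the total number of optimizer steps and rounded up; `schedule_summary` is a hypothetical helper, not part of the training script.

```python
import math


def schedule_summary(num_samples: int, batch_size: int, epochs: int,
                     warmup_ratio: float, eval_ratio: float,
                     grad_accum: int = 1) -> dict:
    """Derive step counts from the hyperparameters, assuming one device."""
    # dataloader_drop_last=False keeps the final partial batch.
    batches_per_epoch = math.ceil(num_samples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = updates_per_epoch * epochs
    return {
        "total_steps": total_steps,
        # Assumption: ratios are applied to total_steps and rounded up.
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "eval_every": math.ceil(total_steps * eval_ratio),
    }


summary = schedule_summary(8207, 8, 5, warmup_ratio=0.03, eval_ratio=0.2)
print(summary)
```

Under these assumptions, evaluation and checkpointing happen roughly once per epoch, since 0.2 of the run corresponds to one of the five epochs.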
### Testing Data & Evaluation Metrics
#### Testing Data
To retrieve the dataset, use the following code:
```
from datasets import load_dataset
squad = load_dataset("squad")
squad['test'] = squad['validation']
```
Test: `Dataset({features: ['id', 'title', 'context', 'question', 'answers'], num_rows: 10570})`
#### Metrics
The model was evaluated with the standard SQuAD metrics (exact match and F1), loaded through the `squad_v2` metric of the `evaluate` library:
```
import evaluate
metric_eval = evaluate.load("squad_v2")
```
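For intuition, exact match and F1 on a single prediction can be sketched in plain Python. This is a simplified illustration in the spirit of the SQuAD evaluation and not the metric implementation itself: it ignores unanswerable questions and compares against a single reference, and the prediction and reference strings are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction/reference pair, for illustration only.
em = exact_match("Amazonia or the Amazon Jungle", "Amazonia")
f1 = f1_score("Amazonia or the Amazon Jungle", "Amazonia")
```

The reported scores are averages of such per-example values, expressed on a 0 to 100 scale.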
### Results
{'exact-match': 66.00660066006601,
'f1': 78.28040573606134,
'total': 909,
'HasAns_exact': 66.00660066006601,
'HasAns_f1': 78.28040573606134,
'HasAns_total': 909,
'best_exact': 66.00660066006601,
'best_exact_thresh': 0.0,
'best_f1': 78.28040573606134,
'best_f1_thresh': 0.0}