---
language: en
license: mit
library_name: transformers
tags:
  - climate-change
  - domain-adaptation
  - masked-language-modeling
  - scientific-nlp
  - transformer
  - BERT
  - ClimateBERT
metrics:
  - f1
model-index:
  - name: SciClimateBERT
    results:
      - task:
          type: text-classification
          name: Climate NLP Tasks (ClimaBench)
        dataset:
          name: ClimaBench
          type: benchmark
        metrics:
          - type: f1
            name: Macro F1 (avg)
            value: 57.829
---

# SciClimateBERT 🌎🔬

**SciClimateBERT** is a domain-adapted version of [**ClimateBERT**](https://huggingface.co/climatebert/distilroberta-base-climate-f), further pretrained on peer-reviewed scientific papers focused on climate change. While ClimateBERT is tuned for general climate-related text, SciClimateBERT narrows the focus to high-quality academic content, improving performance in scientific NLP applications.

## 🔍 Overview

- **Base Model**: ClimateBERT (RoBERTa-based architecture)
- **Pretraining Method**: Continued pretraining (domain adaptation) with Masked Language Modeling (MLM); see the sketch below
- **Corpus**: Scientific climate change literature from top-tier journals
- **Tokenizer**: ClimateBERT tokenizer (unchanged)
- **Language**: English
- **Domain**: Scientific climate change research
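
The adaptation step is plain continued pretraining: the published ClimateBERT checkpoint is trained further with the standard MLM objective on the scientific corpus, with the tokenizer left untouched. A minimal sketch using Hugging Face `transformers` and `datasets` follows; the corpus file name and hyperparameters are illustrative assumptions, not the values used for the released model.

``` python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

base = "climatebert/distilroberta-base-climate-f"  # starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)    # tokenizer kept unchanged
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical corpus file: one scientific paragraph per line
dataset = load_dataset("text", data_files={"train": "climate_papers.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked anew in every batch
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sciclimatebert-mlm",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```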

## 📊 Performance

Evaluated on **ClimaBench**, a benchmark suite for climate-focused NLP tasks:

| Metric          | Value   |
|-----------------|---------|
| Macro F1 (avg)  | 57.83   |
| Tasks won       | 0/7     |
| Avg. std. dev.  | 0.01747 |

Built on ClimateBERT, this model is tuned to structured scientific input, making it well suited to downstream applications in climate science and research automation.

Climate performance model card:

| SciClimateBERT                           | Value                         |
|------------------------------------------|-------------------------------|
| 1. Model publicly available?             | Yes                           |
| 2. Time to train final model             | 300 h                         |
| 3. Time for all experiments              | 1,226 h (~51 days)            |
| 4. Power of GPU and CPU                  | 0.250 kW + 0.013 kW           |
| 5. Location for computations             | Croatia                       |
| 6. Energy mix at location                | 224.71 gCO<sub>2</sub>eq/kWh  |
| 7. CO<sub>2</sub>eq for final model      | 18 kg CO<sub>2</sub>eq        |
| 8. CO<sub>2</sub>eq for all experiments  | 74 kg CO<sub>2</sub>eq        |
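
Row 7 is consistent with a simple energy × grid-intensity estimate, assuming the GPU and CPU draw their listed power for the full 300 h run:

``` python
# Back-of-envelope check of the final-model footprint (row 7),
# assuming full utilization for the whole run
power_kw = 0.250 + 0.013              # GPU + CPU draw (row 4)
energy_kwh = power_kw * 300           # 300 h of training -> ~78.9 kWh
co2_kg = energy_kwh * 224.71 / 1000   # 224.71 gCO2eq/kWh (row 6)
print(f"{co2_kg:.1f} kg CO2eq")       # ~17.7 kg, reported as 18 kg
```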

## 🧪 Intended Uses

**Use for:**
- Scientific climate change text classification and extraction
- Knowledge base and graph construction in climate policy and research domains

**Not suitable for:**
- Non-scientific general-purpose text
- Multilingual applications

Example:
``` python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import torch

# Load the pretrained model and tokenizer
# (replace with this model's Hugging Face repository ID if it differs)
model_name = "P0L3/clirebert_clirevocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Select GPU (device 0) if available, otherwise CPU (-1)
device = 0 if torch.cuda.is_available() else -1

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=device)

# Example input from scientific climate literature; using the tokenizer's own
# mask token keeps the snippet correct for both BERT- and RoBERTa-style vocabularies
text = f"The increase in greenhouse gas emissions has significantly affected the {tokenizer.mask_token} balance of the Earth."

# Run prediction
predictions = fill_mask(text)

# Show top predictions
print(text)
print(10 * ">")
for p in predictions:
    print(f"{p['sequence']} — {p['score']:.4f}")
```
Output:
``` shell
The increase in greenhouse gas emissions has significantly affected the <mask> balance of the Earth.
>>>>>>>>>>
The increase in greenhouse gas ... affected the energy balance of the Earth. — 0.7897
The increase in greenhouse gas ... affected the radiation balance of the Earth. — 0.0522
The increase in greenhouse gas ... affected the mass balance of the Earth. — 0.0401
The increase in greenhouse gas ... affected the water balance of the Earth. — 0.0359
The increase in greenhouse gas ... affected the carbon balance of the Earth. — 0.0190
```
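
For the classification use cases listed above, the encoder can be fine-tuned with a sequence-classification head. A minimal sketch follows; the label count and example sentence are hypothetical, and the classification head is randomly initialized until fine-tuned:

``` python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "P0L3/clirebert_clirevocab_uncased"  # as above, use this model's Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Attaches a fresh (untrained) classification head on top of the encoder;
# num_labels=3 is an illustrative choice
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

inputs = tokenizer("Glacier mass loss has accelerated since 2000.", return_tensors="pt")
logits = model(**inputs).logits       # meaningful only after fine-tuning
print(logits.argmax(dim=-1))
```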

## ⚠️ Limitations
- May reflect biases present in the scientific literature used for pretraining
- English-only; not intended for multilingual applications
- Tuned to the scientific register, so quality may drop on informal or general-purpose text

## 🧾 Citation

If you use this model, please cite:

```bibtex
@article{poleksic_etal_2025,
  title={Climate Research Domain BERTs: Pretraining, Adaptation, and Evaluation},
  author={Poleksić, Andrija and Martinčić-Ipšić, Sanda},
  journal={PREPRINT (Version 1)},
  year={2025},
  doi={10.21203/rs.3.rs-6644722/v1}
}
```