---
license: mit
language:
- en
base_model:
- allenai/led-base-16384
pipeline_tag: text2text-generation
tags:
- chemistry
- materials
- workflow
- MAPs
- SDLs
- action-graphs
datasets:
- bruehle/SynthesisProcedures2ActionGraphs
---
# Model Card for LED-Base-16384_Chemtagger

<!-- Provide a quick summary of what the model is/does. -->

This model is part of [this](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G) publication. It translates chemical synthesis procedures given in natural language (English) into "action graphs", i.e., a simple markup language that lists synthesis actions from a pre-defined controlled vocabulary along with their process parameters.

## Model Details

### Model Description

The model was fine-tuned on a dataset containing chemical synthesis procedures from the patent literature as input, and automatically generated annotations (action graphs) as output. The annotations were created using [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), followed by post-processing and (semi-)automated cleanup.
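The underlying data is available as the dataset linked in the metadata above (`bruehle/SynthesisProcedures2ActionGraphs`). As a minimal sketch for inspecting it with the `datasets` library (assuming the default configuration; check the returned `DatasetDict` for the actual split and column names):

```python
from datasets import load_dataset

# Download the annotated synthesis procedures from the Hugging Face Hub.
# Split and column names are not assumed here; print the DatasetDict to inspect them.
ds = load_dataset('bruehle/SynthesisProcedures2ActionGraphs')
print(ds)
```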


- **Developed by:** Bastian Ruehle
- **Funded by:** [Federal Institute for Materials Research and Testing (BAM)](https://www.bam.de)
- **Model type:** LED (Longformer Encoder-Decoder)
- **Language(s) (NLP):** en
- **License:** [MIT](https://opensource.org/license/mit)
- **Finetuned from model:** allenai/led-base-16384

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** The repository accompanying this model can be found [here](https://github.com/BAMresearch/MAPz_at_BAM/tree/main/Minerva-Workflow-Generator)
- **Paper:** The papers accompanying this model can be found [here](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G) and [here](https://pubs.acs.org/doi/full/10.1021/acsnano.4c17504)

## Uses

The model is integrated into a [node editor app](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G) that generates workflows for the Self-Driving Lab platform [Minerva](https://pubs.acs.org/doi/full/10.1021/acsnano.4c17504) from synthesis procedures given in natural language.

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

Even though this is not the intended way of using the model, it can also be used "stand-alone" for creating action graphs from chemical synthesis procedures given in natural language (see the usage example below).

### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

The model was intended to be used with the [node editor app](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G) for the Self-Driving Lab platform [Minerva](https://pubs.acs.org/doi/full/10.1021/acsnano.4c17504).

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

The model works best on synthesis procedures written in a style similar to that of patents and the experimental sections of scientific journal articles from the general fields of chemistry (organic, inorganic, materials science).

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

The model might produce inaccurate results for procedures from other fields, or for procedures that cross-reference other procedures, generic recipes, etc.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should always check the feasibility of the produced output before further processing it and running a chemical reaction based on the output.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
import torch
import re


def preprocess(rawtext: str) -> str:
    # Remove stray spaces around brackets and punctuation and normalize whitespace.
    rawtext = rawtext.replace('( ', '(').replace(' )', ')').replace('[ ', '[').replace(' ]', ']').replace(' . ', '. ').replace(' , ', ', ').replace(' : ', ': ').replace(' ; ', '; ').replace('\r', ' ').replace('\n', ' ').replace('\t', '').replace('  ', ' ')
    # Normalize the micro sign (U+00B5) and the Greek mu (U+03BC) as well as multiplication signs.
    rawtext = rawtext.replace('µ', 'u').replace('μ', 'u').replace('× ', 'x').replace('×', 'x')
    # Remove the space in expressions like "3x 5" -> "3x5" (strip() would not touch internal spaces).
    for m in re.finditer(r'[0-9]x\s[0-9]', rawtext):
        rawtext = rawtext.replace(m.group(), m.group().replace(' ', ''))
    return rawtext


if __name__ == '__main__':
    rawtext = """<Insert your Synthesis Procedure here>"""

    # model_id = 'bruehle/BigBirdPegasus_Llama'
    # model_id = 'bruehle/LED-Base-16384_Llama'
    # model_id = 'bruehle/BigBirdPegasus_Chemtagger'
    model_id = 'bruehle/LED-Base-16384_Chemtagger'  # or use any of the other models

    # Cap the number of generated tokens depending on the model family.
    if 'BigBirdPegasus' in model_id:
        max_length = 512
    elif 'LED-Base-16384' in model_id:
        max_length = 1024
    
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map='auto')
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    pipe = pipeline('text2text-generation', model=model, tokenizer=tokenizer)

    print(pipe(preprocess(rawtext), max_new_tokens=max_length, do_sample=False, temperature=None, top_p=None)[0]['generated_text'])
```
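The example uses greedy decoding (`do_sample=False`, with `temperature` and `top_p` set to `None` to suppress sampling-related warnings), so the generated action graph is deterministic for a given input; `max_new_tokens` caps the length of the generated output.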

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

Models were trained on A100-80GB GPUs for 885,225 steps (5 epochs) on the training split, using a batch size of 8, an initial learning rate of 5×10⁻⁵ with a warmup ratio of 0.05, and a cosine decay schedule. All other hyperparameters used their default values.
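For orientation, these settings roughly correspond to the following Hugging Face `Seq2SeqTrainingArguments` (a minimal sketch, not the original training script; the output directory is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the hyperparameters described above (fp32 training, see "Training regime" below).
training_args = Seq2SeqTrainingArguments(
    output_dir='./led-base-16384-chemtagger',  # placeholder
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    warmup_ratio=0.05,
    lr_scheduler_type='cosine',
    fp16=False,
    bf16=False,
)
```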

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing

More information on data pre- and postprocessing can be found [here](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G).


#### Training Hyperparameters

- **Training regime:** fp32 <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

Example outputs for experimental procedures from the domains of materials science, organic chemistry, and inorganic chemistry, as well as from a patent, none of which were part of the training or evaluation dataset, can be found [here](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G).

## Technical Specifications

### Model Architecture and Objective

Longformer Encoder-Decoder Model for Text2Text/Seq2Seq Generation.
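The maximum input and output lengths of the architecture can be read from the model configuration (a minimal sketch; the values in the comments are those of the `allenai/led-base-16384` base model):

```python
from transformers import AutoConfig

# Inspect the length limits of the LED architecture used by this checkpoint.
config = AutoConfig.from_pretrained('bruehle/LED-Base-16384_Chemtagger')
print(config.model_type)                       # 'led'
print(config.max_encoder_position_embeddings)  # 16384 input tokens
print(config.max_decoder_position_embeddings)  # 1024 output tokens
```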

### Compute Infrastructure

Trained on HPC GPU nodes of the [Federal Institute for Materials Research and Testing (BAM)](https://www.bam.de).

#### Hardware

NVIDIA A100 (80 GB) GPU, Intel(R) Xeon(R) Gold 6342 CPU @ 2.80 GHz

#### Software

Python 3.12

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bibtex
@article{Ruehle_2025,
  title   = {Natural Language Processing for Automated Workflow and Knowledge Graph Generation in Self-Driving Labs},
  author  = {Ruehle, Bastian},
  journal = {Digital Discovery},
  year    = {2025},
  doi     = {10.1039/D5DD00063G}
}

@article{doi:10.1021/acsnano.4c17504,
  title   = {A Self-Driving Lab for Nano- and Advanced Materials Synthesis},
  author  = {Zaki, Mohammad and Prinz, Carsten and Ruehle, Bastian},
  journal = {ACS Nano},
  volume  = {19},
  number  = {9},
  pages   = {9029--9041},
  year    = {2025},
  doi     = {10.1021/acsnano.4c17504},
  note    = {PMID: 39995288},
  url     = {https://doi.org/10.1021/acsnano.4c17504}
}
```

**APA:**

Ruehle, B. (2025). Natural Language Processing for Automated Workflow and Knowledge Graph Generation in Self-Driving Labs. Digital Discovery. doi:10.1039/D5DD00063G

Zaki, M., Prinz, C., & Ruehle, B. (2025). A Self-Driving Lab for Nano- and Advanced Materials Synthesis. ACS Nano, 19(9), 9029-9041. doi:10.1021/acsnano.4c17504

## Model Card Authors

Bastian Ruehle

## Model Card Contact

[email protected]