---
license: mit
language:
- en
base_model:
- allenai/led-base-16384
pipeline_tag: text2text-generation
tags:
- chemistry
- materials
- workflow
- MAPs
- SDLs
- action-graphs
datasets:
- bruehle/SynthesisProcedures2ActionGraphs
---
# Model Card for LED-Base-16384_Chemtagger
<!-- Provide a quick summary of what the model is/does. -->
This model is part of [this](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G) publication. It translates chemical synthesis procedures given in natural language (English) into "action graphs", i.e., a simple markup language that lists synthesis actions from a pre-defined controlled vocabulary along with their process parameters.
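For illustration, an input/output pair might look as follows. This sketch is purely illustrative: the actual controlled vocabulary, action names, and parameter syntax of the action graphs are defined in the publication and the accompanying repository.
```text
Input (natural language):
The mixture was stirred at 80 °C for 2 h, cooled to room temperature, and the
precipitate was collected by filtration and washed with ethanol (3x 10 mL).

Hypothetical action graph (not the actual vocabulary or syntax):
STIR(temperature=80 °C, duration=2 h) -> COOL(target=room temperature)
-> FILTER() -> WASH(solvent=ethanol, volume=10 mL, repetitions=3)
```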
## Model Details
### Model Description
The model was fine-tuned on a dataset containing chemical synthesis procedures from the patent literature as input, and automatically generated annotations (action graphs) as output. The annotations were created using [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), followed by post-processing and (semi-)automated cleanup.
- **Developed by:** Bastian Ruehle
- **Funded by:** [Federal Institute for Materials Research and Testing (BAM)](https://www.bam.de)
- **Model type:** LED (Longformer Encoder-Decoder)
- **Language(s) (NLP):** en
- **License:** [MIT](https://opensource.org/license/mit)
- **Finetuned from model:** allenai/led-base-16384
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** The repository accompanying this model can be found [here](https://github.com/BAMresearch/MAPz_at_BAM/tree/main/Minerva-Workflow-Generator)
- **Paper:** The papers accompanying this model can be found [here](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G) and [here](https://pubs.acs.org/doi/full/10.1021/acsnano.4c17504)
## Uses
The model is integrated into a [node editor app](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G) for generating workflows from synthesis procedures given in natural language for the Self-Driving Lab platform [Minerva](https://pubs.acs.org/doi/full/10.1021/acsnano.4c17504).
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
Even though it is not the intended way of using the model, it can be used "stand-alone" for creating action graphs from chemical synthesis procedures given in natural language (see below for a usage example).
### Downstream Use
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
The model was intended to be used with the [node editor app](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G) for the Self-Driving Lab platform [Minerva](https://pubs.acs.org/doi/full/10.1021/acsnano.4c17504).
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model works best on synthesis procedures written in a style similar to that of the patent literature and the experimental sections of scientific journal articles in the general fields of chemistry (organic, inorganic, and materials science).
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The model might produce inaccurate results for procedures from other fields, or for procedures that cross-reference other procedures, generic recipes, etc.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should always check the feasibility of the produced output before further processing it and running a chemical reaction based on the output.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
import torch
import re


def preprocess(rawtext: str) -> str:
    # Normalize spacing around brackets and punctuation, and collapse line breaks, tabs, and double spaces
    rawtext = rawtext.replace('( ', '(').replace(' )', ')').replace('[ ', '[').replace(' ]', ']').replace(' . ', '. ').replace(' , ', ', ').replace(' : ', ': ').replace(' ; ', '; ').replace('\r', ' ').replace('\n', ' ').replace('\t', '').replace('  ', ' ')
    # Normalize unicode micro/mu signs and multiplication signs
    rawtext = rawtext.replace('µ', 'u').replace('μ', 'u').replace('× ', 'x').replace('×', 'x')
    # Remove the stray space in patterns like "3x 250" -> "3x250"
    for m in re.finditer(r'[0-9]x\s[0-9]', rawtext):
        rawtext = rawtext.replace(m.group(), m.group().replace(' ', ''))
    return rawtext


if __name__ == '__main__':
    rawtext = """<Insert your Synthesis Procedure here>"""

    # model_id = 'bruehle/BigBirdPegasus_Llama'
    # model_id = 'bruehle/LED-Base-16384_Llama'
    # model_id = 'bruehle/BigBirdPegasus_Chemtagger'
    model_id = 'bruehle/LED-Base-16384_Chemtagger'  # or use any of the other models

    # The maximum output length depends on the model architecture
    if 'BigBirdPegasus' in model_id:
        max_length = 512
    elif 'LED-Base-16384' in model_id:
        max_length = 1024

    model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map='auto')
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    pipe = pipeline('text2text-generation', model=model, tokenizer=tokenizer)
    print(pipe(preprocess(rawtext), max_new_tokens=max_length, do_sample=False, temperature=None, top_p=None)[0]['generated_text'])
```
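The `preprocess` function normalizes whitespace, punctuation spacing, and unicode characters (e.g., µ and ×), presumably to match the formatting of the training data. Decoding is deterministic (`do_sample=False`), which is generally preferable for structured output such as action graphs, and `max_new_tokens` caps the length of the generated action graph for each model family.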
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
Models were trained on A100-80GB GPUs for 885,225 steps (5 epochs) on the training split, using a batch size of 8, an initial learning rate of 5×10⁻⁵ with a warmup ratio of 0.05, and a cosine learning rate decay. All other hyperparameters used the default values.
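For reference, a minimal sketch of how these settings map onto the Hugging Face `Seq2SeqTrainingArguments` API is shown below. This is a reconstruction under stated assumptions, not the exact training script: the output directory and the dataset/tokenization handling are placeholders, and the cosine decay is interpreted as the learning-rate scheduler.
```python
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

# Hypothetical reconstruction of the reported training setup; the actual
# training script may differ (see the accompanying publication/repository).
base_model_id = 'allenai/led-base-16384'
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id)

training_args = Seq2SeqTrainingArguments(
    output_dir='led-base-16384-action-graphs',  # placeholder name
    per_device_train_batch_size=8,              # batch size of 8
    learning_rate=5e-5,                         # initial learning rate 5x10^-5
    warmup_ratio=0.05,                          # warmup ratio of 0.05
    lr_scheduler_type='cosine',                 # cosine learning rate decay
    num_train_epochs=5,                         # 5 epochs (885,225 steps)
    fp16=False,                                 # fp32 training regime
    bf16=False,
)

# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=tokenized_train_split,  # tokenized training split of
#                                           # bruehle/SynthesisProcedures2ActionGraphs
#     tokenizer=tokenizer,
# )
# trainer.train()
```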
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
#### Preprocessing
More information on data pre- and postprocessing can be found [here](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G).
#### Training Hyperparameters
- **Training regime:** fp32 <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
Example outputs for experimental procedures from the domains of materials science, organic chemistry, and inorganic chemistry, as well as a patent, none of which were part of the training or evaluation dataset, can be found [here](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G).
## Technical Specifications
### Model Architecture and Objective
Longformer Encoder-Decoder Model for Text2Text/Seq2Seq Generation.
### Compute Infrastructure
Trained on HPC GPU nodes of the [Federal Institute for Materials Research and Testing (BAM)](https://www.bam.de).
#### Hardware
NVIDIA A100 (80 GB) GPU, Intel(R) Xeon(R) Gold 6342 CPU @ 2.80 GHz
#### Software
Python 3.12
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
@article{Ruehle_2025, title={Natural Language Processing for Automated Workflow and Knowledge Graph Generation in Self-Driving Labs}, DOI={10.1039/D5DD00063G}, journal={Digital Discovery}, author={Ruehle, Bastian}, year={2025}}

@article{doi:10.1021/acsnano.4c17504, author = {Zaki, Mohammad and Prinz, Carsten and Ruehle, Bastian}, title = {A Self-Driving Lab for Nano- and Advanced Materials Synthesis}, journal = {ACS Nano}, volume = {19}, number = {9}, pages = {9029-9041}, year = {2025}, doi = {10.1021/acsnano.4c17504}, note = {PMID: 39995288}, URL = {https://doi.org/10.1021/acsnano.4c17504}, eprint = {https://doi.org/10.1021/acsnano.4c17504}}
**APA:**
Ruehle, B. (2025). Natural Language Processing for Automated Workflow and Knowledge Graph Generation in Self-Driving Labs. Digital Discovery. doi:10.1039/D5DD00063G

Zaki, M., Prinz, C., & Ruehle, B. (2025). A Self-Driving Lab for Nano- and Advanced Materials Synthesis. ACS Nano, 19(9), 9029-9041. doi:10.1021/acsnano.4c17504
## Model Card Authors
Bastian Ruehle
## Model Card Contact
[email protected] |