---
license: mit
language:
- en
base_model:
- allenai/led-base-16384
pipeline_tag: text2text-generation
tags:
- chemistry
- materials
- workflow
- MAPs
- SDLs
- action-graphs
datasets:
- bruehle/SynthesisProcedures2ActionGraphs
---
# Model Card for LED-Base-16384_Chemtagger

<!-- Provide a quick summary of what the model is/does. -->

This model is part of [this](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G) publication. It translates chemical synthesis procedures given in natural language (English) into "action graphs", i.e., a simple markup language that lists synthesis actions from a pre-defined controlled vocabulary along with their process parameters.

## Model Details

### Model Description

The model was fine-tuned on a dataset containing chemical synthesis procedures from the patent literature as input, and automatically generated annotations (action graphs) as output. The annotations were created using [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), followed by post-processing and (semi-)automated cleanup.
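The underlying data is available as the dataset linked in the metadata above (`bruehle/SynthesisProcedures2ActionGraphs`). As a minimal sketch for inspecting it with the `datasets` library (assuming the default configuration; check the returned `DatasetDict` for the actual split and column names):

```python
from datasets import load_dataset

# Download the annotated synthesis procedures from the Hugging Face Hub.
# Split and column names are not assumed here; print the DatasetDict to inspect them.
ds = load_dataset('bruehle/SynthesisProcedures2ActionGraphs')
print(ds)
```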


- **Developed by:** Bastian Ruehle
- **Funded by:** [Federal Institute for Materials Research and Testing (BAM)](https://www.bam.de)
- **Model type:** LED (Longformer Encoder-Decoder)
- **Language(s) (NLP):** en
- **License:** [MIT](https://opensource.org/license/mit)
- **Finetuned from model:** allenai/led-base-16384

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** The repository accompanying this model can be found [here](https://github.com/BAMresearch/MAPz_at_BAM/tree/main/Minerva-Workflow-Generator)
- **Paper:** The papers accompanying this model can be found [here](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G) and [here](https://pubs.acs.org/doi/full/10.1021/acsnano.4c17504)

## Uses

The model is integrated into a [node editor app](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G) that generates workflows for the Self-Driving Lab platform [Minerva](https://pubs.acs.org/doi/full/10.1021/acsnano.4c17504) from synthesis procedures given in natural language.

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

Even though this is not the intended way of using the model, it can also be used "stand-alone" for creating action graphs from chemical synthesis procedures given in natural language (see the usage example below).

### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

The model was intended to be used with the [node editor app](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G) for the Self-Driving Lab platform [Minerva](https://pubs.acs.org/doi/full/10.1021/acsnano.4c17504).

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

The model works best on synthesis procedures written in a style similar to that of patents and the experimental sections of scientific journal articles from the general fields of chemistry (organic, inorganic, materials science).

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

The model might produce inaccurate results for procedures from other fields, or for procedures that cross-reference other procedures, generic recipes, etc.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should always check the feasibility of the produced output before further processing it and running a chemical reaction based on the output.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
import torch
import re


def preprocess(rawtext: str) -> str:
    # Remove stray spaces around brackets and punctuation and normalize whitespace.
    rawtext = rawtext.replace('( ', '(').replace(' )', ')').replace('[ ', '[').replace(' ]', ']').replace(' . ', '. ').replace(' , ', ', ').replace(' : ', ': ').replace(' ; ', '; ').replace('\r', ' ').replace('\n', ' ').replace('\t', '').replace('  ', ' ')
    # Normalize the micro sign (U+00B5) and the Greek mu (U+03BC) as well as multiplication signs.
    rawtext = rawtext.replace('µ', 'u').replace('μ', 'u').replace('× ', 'x').replace('×', 'x')
    # Remove the space in expressions like "3x 5" -> "3x5" (strip() would not touch internal spaces).
    for m in re.finditer(r'[0-9]x\s[0-9]', rawtext):
        rawtext = rawtext.replace(m.group(), m.group().replace(' ', ''))
    return rawtext


if __name__ == '__main__':
    rawtext = """<Insert your Synthesis Procedure here>"""

    # model_id = 'bruehle/BigBirdPegasus_Llama'
    # model_id = 'bruehle/LED-Base-16384_Llama'
    # model_id = 'bruehle/BigBirdPegasus_Chemtagger'
    model_id = 'bruehle/LED-Base-16384_Chemtagger'  # or use any of the other models

    # Cap the number of generated tokens depending on the model family.
    if 'BigBirdPegasus' in model_id:
        max_length = 512
    elif 'LED-Base-16384' in model_id:
        max_length = 1024
    
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map='auto')
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    pipe = pipeline('text2text-generation', model=model, tokenizer=tokenizer)

    print(pipe(preprocess(rawtext), max_new_tokens=max_length, do_sample=False, temperature=None, top_p=None)[0]['generated_text'])
```
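The example uses greedy decoding (`do_sample=False`, with `temperature` and `top_p` set to `None` to suppress sampling-related warnings), so the generated action graph is deterministic for a given input; `max_new_tokens` caps the length of the generated output.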

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

Models were trained on A100-80GB GPUs for 885,225 steps (5 epochs) on the training split, using a batch size of 8, an initial learning rate of 5×10⁻⁵ with a warmup ratio of 0.05, and a cosine decay schedule. All other hyperparameters used their default values.
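For orientation, these settings roughly correspond to the following Hugging Face `Seq2SeqTrainingArguments` (a minimal sketch, not the original training script; the output directory is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the hyperparameters described above (fp32 training, see "Training regime" below).
training_args = Seq2SeqTrainingArguments(
    output_dir='./led-base-16384-chemtagger',  # placeholder
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    warmup_ratio=0.05,
    lr_scheduler_type='cosine',
    fp16=False,
    bf16=False,
)
```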

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing

More information on data pre- and postprocessing can be found [here](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G).


#### Training Hyperparameters

- **Training regime:** fp32 <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

Example outputs for experimental procedures from the domains of materials science, organic chemistry, and inorganic chemistry, as well as from a patent, none of which were part of the training or evaluation dataset, can be found [here](https://pubs.rsc.org/en/Content/ArticleLanding/2025/DD/D5DD00063G).

## Technical Specifications

### Model Architecture and Objective

Longformer Encoder-Decoder Model for Text2Text/Seq2Seq Generation.
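The maximum input and output lengths of the architecture can be read from the model configuration (a minimal sketch; the values in the comments are those of the `allenai/led-base-16384` base model):

```python
from transformers import AutoConfig

# Inspect the length limits of the LED architecture used by this checkpoint.
config = AutoConfig.from_pretrained('bruehle/LED-Base-16384_Chemtagger')
print(config.model_type)                       # 'led'
print(config.max_encoder_position_embeddings)  # 16384 input tokens
print(config.max_decoder_position_embeddings)  # 1024 output tokens
```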

### Compute Infrastructure

Trained on HPC GPU nodes of the [Federal Institute for Materials Research and Testing (BAM)](https://www.bam.de).

#### Hardware

NVIDIA A100 (80 GB) GPU, Intel(R) Xeon(R) Gold 6342 CPU @ 2.80 GHz

#### Software

Python 3.12

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bibtex
@article{Ruehle_2025,
  title   = {Natural Language Processing for Automated Workflow and Knowledge Graph Generation in Self-Driving Labs},
  author  = {Ruehle, Bastian},
  journal = {Digital Discovery},
  year    = {2025},
  doi     = {10.1039/D5DD00063G}
}

@article{doi:10.1021/acsnano.4c17504,
  title   = {A Self-Driving Lab for Nano- and Advanced Materials Synthesis},
  author  = {Zaki, Mohammad and Prinz, Carsten and Ruehle, Bastian},
  journal = {ACS Nano},
  volume  = {19},
  number  = {9},
  pages   = {9029--9041},
  year    = {2025},
  doi     = {10.1021/acsnano.4c17504},
  note    = {PMID: 39995288},
  url     = {https://doi.org/10.1021/acsnano.4c17504}
}
```

**APA:**

Ruehle, B. (2025). Natural Language Processing for Automated Workflow and Knowledge Graph Generation in Self-Driving Labs. Digital Discovery. doi:10.1039/D5DD00063G

Zaki, M., Prinz, C., & Ruehle, B. (2025). A Self-Driving Lab for Nano- and Advanced Materials Synthesis. ACS Nano, 19(9), 9029-9041. doi:10.1021/acsnano.4c17504

## Model Card Authors

Bastian Ruehle

## Model Card Contact

[email protected]