---
license: apache-2.0
pipeline_tag: text-generation
widget:
- text: "<|endoftext|>"
inference:
  parameters:
    top_k: 950
    repetition_penalty: 1.2
---

# **GPepT: A Language Model for Peptides and Peptidomimetics**

GPepT is a cutting-edge language model designed to understand and generate sequences in the specialized domain of peptides and peptidomimetics. It serves as a powerful tool for _de novo_ protein design and engineering. As demonstrated in our research, incorporating peptidomimetics significantly broadens the chemical space accessible through generated sequences, enabling innovative approaches to peptide-based therapeutics.

## **Model Overview**

GPepT builds upon the GPT-2 Transformer architecture, comprising 36 layers with a model dimensionality of 1280, for a total of 738 million parameters. This decoder-only model was pre-trained on a curated dataset of peptides and peptidomimetics mined from bioactivity-labeled chemical structures in ChEMBL. To leverage GPepT's pre-trained weights, input molecules must first be converted into a standardized, sequence-like representation of peptidomimetics using **Monomerizer** (available on GitHub). Detailed insights into the training process and datasets are provided in our accompanying publication.

Unlike traditional protein design models, GPepT is trained in a self-supervised manner on raw sequence data without explicit annotation. This design enables the model to generalize across diverse sequence spaces, producing functional antimicrobial peptidomimetics upon fine-tuning.

---

## **Using GPepT for Sequence Generation**

GPepT is fully compatible with the Hugging Face Transformers Python library. Installation instructions can be found [here](https://huggingface.co/docs/transformers/installation). The model excels at generating peptidomimetic sequences in a zero-shot fashion, but it can also be fine-tuned on custom datasets to generate sequences tailored to specific requirements.

### **Example 1: Zero-Shot Sequence Generation**

GPepT generates sequences that extend from a specified input token (e.g., `<|endoftext|>`). If no input is provided, it selects the start token automatically and generates likely sequences. Here is a Python example:

```python
from transformers import pipeline

# Initialize GPepT for text generation
GPepT = pipeline('text-generation', model="Playingyoyo/GPepT")

# Generate sequences (lengths are expressed in tokens; one token covers
# ~4 amino acids on average)
sequences = GPepT("<|endoftext|>",
                  max_length=25,
                  do_sample=True,
                  top_k=950,
                  repetition_penalty=1.5,
                  num_return_sequences=5,
                  eos_token_id=0)

# Print the generated sequences
for seq in sequences:
    print(seq['generated_text'])
```

Sample output:

```
<|endoftext|>R K A L E Z1649
<|endoftext|>G K A L Z341
<|endoftext|>G V A G K X4097 V A P
```

---

### **Example 2: Fine-Tuning for Directed Sequence Generation**

Fine-tuning enables GPepT to generate sequences with user-defined properties. To prepare training data:

1. `git clone https://github.com/tsudalab/Monomerizer`
2. `cd Monomerizer`
3. `python3 Monomerizer/run_pipeline.py --input_file path_to_your_smiles_file.txt` (check the repository for the required input format)
4. Step 3 monomerizes the SMILES and splits the resulting sequences into training (`output/datetime/for_GPepT/train90.txt`) and validation (`output/datetime/for_GPepT/val10.txt`) files; a quick sanity check of these files is sketched below.
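Before moving on to fine-tuning, a quick look at the prepared files can catch formatting problems early. The following is a minimal sketch, not part of the Monomerizer pipeline: it assumes each line of `train90.txt` and `val10.txt` holds one whitespace-separated monomer sequence, and the directory path is a placeholder mirroring step 4.

```python
from pathlib import Path

# Sanity-check the Monomerizer output files (adjust the path to match
# your actual output/<datetime>/for_GPepT directory).
out_dir = Path("output/datetime/for_GPepT")
for split in ("train90.txt", "val10.txt"):
    lines = [ln for ln in (out_dir / split).read_text().splitlines() if ln.strip()]
    lengths = [len(ln.split()) for ln in lines]  # monomers per sequence
    print(f"{split}: {len(lines)} sequences, "
          f"monomer count min/max = {min(lengths)}/{max(lengths)}")
```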
To fine-tune the model, use the `run_clm.py` causal language modeling script from the Transformers examples:

```bash
python run_clm.py --model_name_or_path Playingyoyo/GPepT \
    --train_file path_to_train90.txt \
    --validation_file path_to_val10.txt \
    --tokenizer_name Playingyoyo/GPepT \
    --do_train \
    --do_eval \
    --output_dir ./output \
    --learning_rate 1e-5
```

The fine-tuned model is saved in the `./output` directory, ready to generate tailored sequences.

---

## **Selecting Valid Sequences**

While GPepT generates diverse peptidomimetic sequences, not all of them are chemically valid. For example:

- **Invalid sequences:** those in which terminal-modification tokens (e.g., `Z`-prefixed tokens) are embedded within the sequence rather than placed at its ends.
- **Valid sequences:** those that adhere to standard peptidomimetic rules, with terminal modifications appearing only at the termini.

By filtering out invalid sequences, users can ensure that only high-quality candidates move on to further study; a minimal filtering sketch is given at the end of this card.

---

GPepT stands as a powerful tool for researchers at the forefront of peptide and peptidomimetic innovation, enabling both exploration and application across vast chemical and biological spaces.
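---

As a complement to the filtering rule above, the following is a minimal sketch, not part of the official pipeline. It assumes generated sequences are whitespace-separated token strings (as in the sample output in Example 1) and that terminal-modification tokens are those beginning with `Z`; the second example sequence is hypothetical, constructed to illustrate an invalid placement. Adapt the pattern to the full token grammar of your data.

```python
import re

# Assumed pattern for terminal-modification tokens (e.g., "Z1649");
# adjust it to match the actual token vocabulary.
TERMINAL_MOD = re.compile(r"^Z\d*$")

def is_valid(sequence: str) -> bool:
    """Return True if terminal-modification tokens occur only at the ends."""
    tokens = sequence.replace("<|endoftext|>", "").split()
    # Interior tokens must not be terminal modifications.
    return not any(TERMINAL_MOD.match(tok) for tok in tokens[1:-1])

generated = [
    "<|endoftext|>R K A L E Z1649",  # valid: Z-token at the terminus
    "<|endoftext|>G K Z341 A L",     # invalid: Z-token embedded mid-sequence
]
print([seq for seq in generated if is_valid(seq)])
# ['<|endoftext|>R K A L E Z1649']
```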