---
license: apache-2.0
pipeline_tag: text-generation
widget:
- text: "<|endoftext|>"
inference:
  parameters:
    top_k: 950
    repetition_penalty: 1.2
---

# **GPepT: A Language Model for Peptides and Peptidomimetics**

GPepT is a cutting-edge language model designed to understand and generate sequences in the specialized domain of peptides and peptidomimetics. It serves as a powerful tool for _de novo_ protein design and engineering. As demonstrated in our research, incorporating peptidomimetics significantly broadens the chemical space accessible through generated sequences, enabling innovative approaches to peptide-based therapeutics.

## **Model Overview**

GPepT builds upon the GPT-2 Transformer architecture, comprising 36 layers with a model dimensionality of 1280, for a total of 738 million parameters. This decoder-only model was pre-trained on a curated dataset of peptides and peptidomimetics mined from bioactivity-labeled chemical structures in ChEMBL. To leverage GPepT's pre-trained weights, input molecules must first be converted into a standardized, sequence-like representation of peptidomimetics using **Monomerizer** (available on GitHub). Detailed insights into the training process and datasets are provided in our accompanying publication.

Unlike traditional protein design models, GPepT is trained in a self-supervised manner on raw sequence data without explicit annotation. This design enables the model to generalize across diverse sequence spaces, producing functional antimicrobial peptidomimetics upon fine-tuning.

---

## **Using GPepT for Sequence Generation**

GPepT is fully compatible with the Hugging Face Transformers Python library. Installation instructions can be found [here](https://huggingface.co/docs/transformers/installation). The model excels at generating peptidomimetic sequences in a zero-shot fashion, but it can also be fine-tuned on custom datasets to generate sequences tailored to specific requirements.

### **Example 1: Zero-Shot Sequence Generation**

GPepT generates sequences that extend from a specified input token (e.g., `<|endoftext|>`). If no input is provided, it selects the start token automatically and generates likely sequences. Here is a Python example:

```python
from transformers import pipeline

# Initialize GPepT for text generation
GPepT = pipeline('text-generation', model="Playingyoyo/GPepT")

# Generate sequences (lengths are expressed in tokens; one token covers
# ~4 amino acids on average)
sequences = GPepT("<|endoftext|>",
                  max_length=25,
                  do_sample=True,
                  top_k=950,
                  repetition_penalty=1.5,
                  num_return_sequences=5,
                  eos_token_id=0)

# Print the generated sequences
for seq in sequences:
    print(seq['generated_text'])
```

Sample output:

```
<|endoftext|>R K A L E Z1649
<|endoftext|>G K A L Z341
<|endoftext|>G V A G K X4097 V A P
```

---

### **Example 2: Fine-Tuning for Directed Sequence Generation**

Fine-tuning enables GPepT to generate sequences with user-defined properties. To prepare training data:

1. `git clone https://github.com/tsudalab/Monomerizer`
2. `cd Monomerizer`
3. `python3 Monomerizer/run_pipeline.py --input_file path_to_your_smiles_file.txt` (check the repository for the required input format)
4. Step 3 monomerizes the SMILES and splits the resulting sequences into training (`output/datetime/for_GPepT/train90.txt`) and validation (`output/datetime/for_GPepT/val10.txt`) files; a quick sanity check of these files is sketched below.
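Before moving on to fine-tuning, a quick look at the prepared files can catch formatting problems early. The following is a minimal sketch, not part of the Monomerizer pipeline: it assumes each line of `train90.txt` and `val10.txt` holds one whitespace-separated monomer sequence, and the directory path is a placeholder mirroring step 4.

```python
from pathlib import Path

# Sanity-check the Monomerizer output files (adjust the path to match
# your actual output/<datetime>/for_GPepT directory).
out_dir = Path("output/datetime/for_GPepT")
for split in ("train90.txt", "val10.txt"):
    lines = [ln for ln in (out_dir / split).read_text().splitlines() if ln.strip()]
    lengths = [len(ln.split()) for ln in lines]  # monomers per sequence
    print(f"{split}: {len(lines)} sequences, "
          f"monomer count min/max = {min(lengths)}/{max(lengths)}")
```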
To fine-tune the model, use the `run_clm.py` causal language modeling script from the Transformers examples:

```bash
python run_clm.py --model_name_or_path Playingyoyo/GPepT \
    --train_file path_to_train90.txt \
    --validation_file path_to_val10.txt \
    --tokenizer_name Playingyoyo/GPepT \
    --do_train \
    --do_eval \
    --output_dir ./output \
    --learning_rate 1e-5
```

The fine-tuned model is saved in the `./output` directory, ready to generate tailored sequences.

---

## **Selecting Valid Sequences**

While GPepT generates diverse peptidomimetic sequences, not all of them are chemically valid. For example:

- **Invalid sequences:** those in which terminal-modification tokens (e.g., `Z`-prefixed tokens) are embedded within the sequence rather than placed at its ends.
- **Valid sequences:** those that adhere to standard peptidomimetic rules, with terminal modifications appearing only at the termini.

By filtering out invalid sequences, users can ensure that only high-quality candidates move on to further study; a minimal filtering sketch is given at the end of this card.

---

GPepT stands as a powerful tool for researchers at the forefront of peptide and peptidomimetic innovation, enabling both exploration and application across vast chemical and biological spaces.
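---

As a complement to the filtering rule above, the following is a minimal sketch, not part of the official pipeline. It assumes generated sequences are whitespace-separated token strings (as in the sample output in Example 1) and that terminal-modification tokens are those beginning with `Z`; the second example sequence is hypothetical, constructed to illustrate an invalid placement. Adapt the pattern to the full token grammar of your data.

```python
import re

# Assumed pattern for terminal-modification tokens (e.g., "Z1649");
# adjust it to match the actual token vocabulary.
TERMINAL_MOD = re.compile(r"^Z\d*$")

def is_valid(sequence: str) -> bool:
    """Return True if terminal-modification tokens occur only at the ends."""
    tokens = sequence.replace("<|endoftext|>", "").split()
    # Interior tokens must not be terminal modifications.
    return not any(TERMINAL_MOD.match(tok) for tok in tokens[1:-1])

generated = [
    "<|endoftext|>R K A L E Z1649",  # valid: Z-token at the terminus
    "<|endoftext|>G K Z341 A L",     # invalid: Z-token embedded mid-sequence
]
print([seq for seq in generated if is_valid(seq)])
# ['<|endoftext|>R K A L E Z1649']
```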