Kittawere's Chat with Protein version 1 (no-separation) - 3B param model (LoRA)

👋 Hi there! This is the second release in my project to chat with proteins.

In this new version, amino acids are no longer space-separated. You simply input the raw protein sequence as a continuous string of letters.

What Does This Model Do?

✅ Predicts the phylogenetic origin of a eukaryotic protein, classifying it as one of three kingdoms:

  • Plant (Viridiplantae)
  • Animal (Metazoa)
  • Fungus (Fungi)

Why Remove Amino Acid Separation?

My first model, V1-separated, used space-separated amino acids (e.g. <seq> T P P A G P D V G P R <seq>) to force the tokenizer to treat each residue as a separate token.

However, further testing showed:

  • No significant accuracy difference: the new model reaches ~80.57% accuracy versus ~79.2% for the separated version, a gap that is not statistically significant on the held-out test set.
  • Non-separated sequences use fewer tokens and avoid context-length issues for longer proteins (see the token-count sketch below).

So my original rationale for separation turned out to be unnecessary for classification. However, I still plan to explore separated inputs in future work on protein generation, where residue-level control might be beneficial.
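
As a rough illustration of the token savings, here is a minimal sketch comparing token counts for the two formats with this model's tokenizer (exact counts depend on the tokenizer's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kittawere/Llama-KW-CwP-V1-3B-notseparated")

seq = "TPPAGPDVGPR"
raw = f"<seq> {seq} <seq>"                  # non-separated (this model)
separated = f"<seq> {' '.join(seq)} <seq>"  # space-separated (V1-separated)

print(len(tokenizer(raw)["input_ids"]), "tokens, raw format")
print(len(tokenizer(separated)["input_ids"]), "tokens, separated format")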

Training Data

  • Dataset: Entire Swiss-Prot database
  • Data processing:
    • Balanced samples of animals, plants, and fungi
    • 80% training / 20% test split (sketched below)
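
The exact preprocessing code isn't included here; below is a minimal sketch of how such a balanced split could be built, assuming a hypothetical table with sequence and kingdom columns:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input: one row per Swiss-Prot entry, columns "sequence" and "kingdom"
df = pd.read_csv("swissprot_eukaryotes.csv")

# Downsample each kingdom to the size of the smallest one to balance the classes
n = df["kingdom"].value_counts().min()
balanced = df.groupby("kingdom", group_keys=False).sample(n=n, random_state=42)

# Stratified 80% train / 20% test split
train_df, test_df = train_test_split(
    balanced, test_size=0.2, stratify=balanced["kingdom"], random_state=42
)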

Performance

  • Accuracy: ~80.57% on held-out test set
  • Baseline (random guess over three balanced classes): ~33%

This demonstrates that LLMs can work directly with protein sequences in a natural-language context.

Input Format

Now you can simply paste the raw amino acid sequence into your prompt:

Example input:

<seq> TPPAGPDVGPR <seq> What is the taxonomic classification of the protein?
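
In code, the prompt is plain string formatting around the raw sequence:

sequence = "TPPAGPDVGPR"  # raw residues, no spaces
prompt = f"<seq> {sequence} <seq> What is the taxonomic classification of the protein?"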

Limitations

  • Phylogenetic predictions remain approximate; proteins may be shared across kingdoms.
  • Very long sequences may exceed the model's context window (see the length check below).
  • Model is trained only on eukaryotic sequences (plants, animals, fungi).
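
Before generating, you can check the prompt length against the context window. A minimal sketch (max_context is an assumption here; substitute the configured window of the base model you use):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kittawere/Llama-KW-CwP-V1-3B-notseparated")

sequence = "M" * 20000  # deliberately oversized, for illustration
prompt = f"<seq> {sequence} <seq> What is the taxonomic classification of the protein?"

max_context = 8192  # assumption: check your base model's actual context window
n_tokens = len(tokenizer(prompt)["input_ids"])
if n_tokens > max_context:
    raise ValueError(f"Prompt is {n_tokens} tokens, over the {max_context}-token window")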

License

Apache 2.0 for the LoRA adapter; refer to Meta's Llama license for the base weights.

Inference

Example usage:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kittawere/Llama-KW-CwP-V1-3B-notseparated"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model and move to GPU
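# (loading a LoRA adapter repository directly requires the `peft` package)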
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"                # Automatically puts model on GPU(s)/CPU(s)
)

prompt = "<seq> TPPAGPDVGPR <seq> What is the taxonomic classification of the protein?"

def inference(model, tokenizer, prompt):
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # The chat template already adds the special tokens, so skip adding them again
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
    )

    # Get only the newly generated tokens
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    answer = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    return answer

print(inference(model, tokenizer, prompt))
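
Using the inference helper above, a simple accuracy check could look like this (test_set is a hypothetical list of (sequence, kingdom) pairs; the label matching is an assumption about how the model phrases its answers):

LABELS = ("Viridiplantae", "Metazoa", "Fungi")

def predict_kingdom(model, tokenizer, sequence):
    # Run inference and map the free-text answer to one of the three kingdoms
    answer = inference(
        model, tokenizer,
        f"<seq> {sequence} <seq> What is the taxonomic classification of the protein?",
    )
    return next((label for label in LABELS if label.lower() in answer.lower()), None)

correct = sum(predict_kingdom(model, tokenizer, s) == k for s, k in test_set)
print(f"Accuracy: {correct / len(test_set):.2%}")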

Want to help? Reach out to me or join my Discord.
