Spaces: fneurociencias / GeneForgeLang

ManMenGon committed on Apr 23

Commit 1291f55 (verified) · 1 Parent(s): 6ef5528

Upload 7 files


Files changed (5)

README.md +106 -81
app.py +51 -134
generar_desde_frase_input_v2.py +52 -0
generar_interactivo.py +35 -0
requirements.txt +3 -3

README.md CHANGED

@@ -1,81 +1,106 @@
----
-title: GeneForgeLang
-emoji: 🧬
-colorFrom: indigo
-colorTo: blue
-sdk: gradio
-sdk_version: "3.50.2"
-app_file: app.py
-pinned: true
----
-# 🧬 GeneForgeLang: Symbolic-to-Sequence & Cross-Modality Biomolecular Design Toolkit
-**GeneForgeLang** is a symbolic, generative language that allows scientists to design and interpret DNA, RNA, and protein sequences with unified syntax and AI support.
-This toolkit enables:
-- Generation of realistic proteins from symbolic design
-- Translation of symbolic phrases across DNA ↔ RNA ↔ Protein
-- Structured, human-readable and AI-trainable syntax
-- Semantic equivalence across molecular layers
----
-## 🚀 Features
-| Module                      | Description |
-|----------------------------|-------------|
-| 🧠 Phrase → Protein         | Generate realistic protein sequences from symbolic phrases |
-| 🔁 Transcode Across Molecules | Translate GeneForgeLang phrases between DNA, RNA, and Protein |
-| 📚 Universal Grammar        | One structure to rule them all: motifs, domains, PTMs, splicing |
-| 🧬 Compact Notation         | Prefixes, accents, and structural markers for efficiency |
-| 🧠 AI-Ready Output          | Compatible with transformer-based models like ProtGPT2 |
----
-## 🧪 Example Input Phrases
-### DNA → RNA
-```
-~d:Prom[TATA]-Exon1-Intr1-Exon2
-↓
-:r:Cap5'-Ex1-Ex2-UTR3'
-```
-### RNA → Protein
-```
-:r:Ex1-Ex2
-↓
-^p:Dom(Kin)-Mot(NLS)
-```
----
-## ▶️ How to Use
-1. Launch this app on Hugging Face or locally
-2. Choose your input phrase and source/target modality
-3. Get your translated output or generated protein
----
-## 📁 Files
-| File                  | Description |
-|-----------------------|-------------|
-| `app.py`              | Full Gradio app (all tabs) |
-| `semillas.json`       | Seed dictionary |
-| `transcoder.py`       | Script for DNA/RNA/protein conversion |
-| `requirements.txt`    | Python dependencies |
-| `README.md`           | This file |
----
-## 🧠 Developed by
-**Fundación de Neurociencias**
-MIT License
-Join us in shaping symbolic bio-AI.

+Thank you for sharing the current `README.md`. It is fairly well structured, but we can improve it so that it reflects **all the modules actually included** and their practical use, and also strengthens the project's appeal to visitors and potential contributors.
+Below is a revised and expanded version, ready to replace the current content:
+---
+```markdown
+---
+title: GeneForgeLang
+emoji: 🧬
+colorFrom: indigo
+colorTo: blue
+sdk: gradio
+sdk_version: "3.50.2"
+app_file: app.py
+pinned: true
+---
+# 🧬 GeneForgeLang: Symbolic-to-Sequence & Cross-Modality Biomolecular Design Toolkit
+**GeneForgeLang** is a symbolic and generative language for cross-modality biomolecular design.
+It enables unified AI-powered workflows to **design, interpret and translate DNA, RNA, and protein sequences** using a compact, human-readable grammar.
+This project provides:
+- **A symbolic language** spanning all biological layers (genomic, transcriptomic, proteomic)
+- **Realistic sequence generation** via AI models like ProtGPT2
+- **Scientific interpretation** of symbolic phrases in natural language
+- **Cross-modality transcoders** (e.g., DNA → RNA → Protein and vice versa)
+- **An interactive Gradio-based UI** for easy use and integration
+---
+## 🚀 Key Features
+| Module                      | Description |
+|----------------------------|-------------|
+| 🧠 Phrase → Sequence        | Generate DNA, RNA, or protein from symbolic design |
+| 🔁 Transcode Phrases        | Translate GeneForgeLang phrases across modalities |
+| 📖 Phrase → Description     | Generate scientific English descriptions of symbolic inputs |
+| 🔄 Sequence → Phrase        | Infer functional phrases from real sequences |
+| 🧬 Mutate Sequence (WIP)    | Generate variants for symbolic seeds (under development) |
+| 📦 Export to FASTA (WIP)    | Save generated sequences to .fasta (to be implemented) |
+| 📊 Analyze Sequence (WIP)   | Visualize amino acid composition or base content |
+---
+## 🧪 Example Input Phrases
+```text
+~d:Prom[TATA]-Exon1-Intr1-Exon2
+↓
+:r:Cap5'-Ex1-Ex2-UTR3'
+↓
+^p:Dom(Kin)-Mot(NLS)*AcK@147
+```
+---
+## ▶️ How to Use Locally
+1. Clone this repo:
+```bash
+git clone https://github.com/Fundacion-de-Neurociencias/GeneForgeLang.git
+cd GeneForgeLang
+```
+2. Install dependencies:
+```bash
+pip install -r requirements.txt
+```
+3. Launch the interface:
+```bash
+python app.py
+```
+4. Navigate to:
+[http://127.0.0.1:7860](http://127.0.0.1:7860)
+---
+## 📁 File Structure
+| File                          | Description |
+|------------------------------|-------------|
+| `app.py`                     | Full Gradio app (4 tabs) |
+| `semillas.json`              | Phrase-to-seed dictionary |
+| `generate_from_phrase.py`    | Symbolic-to-sequence generator |
+| `describe_phrase.py`         | Phrase interpreter to scientific English |
+| `translate_to_geneforgelang.py` | Sequence-to-symbolic phrase translation |
+| `transcoder.py`              | Modality switcher (DNA ↔ RNA ↔ Protein) |
+| `requirements.txt`           | Python dependencies |
+| `README.md`                  | This file |
+---
+## 🧠 Developed by
+**Fundación de Neurociencias**
+Licensed under the MIT License
+> Join us in shaping the future of symbolic bio-AI. Contributions welcome!
+```
+---
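
The example phrases in the README combine a modality prefix (`~d:`, `:r:`, `^p:`) with elements joined by `-`. As a rough illustration of that structure, here is a minimal sketch of a splitter. `PREFIXES` and `split_phrase` are hypothetical names that do not exist in the repository; the prefix meanings are read off the README examples and `transcode_phrase`, and splitting on `-` mirrors what the removed `phrase_to_description` did. It does not try to interpret modifiers such as `*AcK@147`.

```python
# Hypothetical helper, not part of the repository. The prefix table is inferred
# from the README examples; splitting on "-" mirrors the removed
# phrase_to_description() in the old app.py.
PREFIXES = {"~d:": "DNA", ":r:": "RNA", "^p:": "Protein"}


def split_phrase(phrase):
    """Return (modality, elements) for a GeneForgeLang phrase."""
    for prefix, modality in PREFIXES.items():
        if phrase.startswith(prefix):
            body = phrase[len(prefix):]
            # Modifiers such as "*AcK@147" stay attached to their element.
            return modality, [part for part in body.split("-") if part]
    raise ValueError(f"unrecognised modality prefix in {phrase!r}")


print(split_phrase("~d:Prom[TATA]-Exon1-Intr1-Exon2"))
print(split_phrase("^p:Dom(Kin)-Mot(NLS)*AcK@147"))
```

A phrase without one of the three prefixes raises `ValueError` in this sketch; whether the real grammar should be that strict is an open design choice.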

app.py CHANGED

@@ -1,141 +1,58 @@
 import gradio as gr
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
-import json
-import re
-import tempfile
-# Load symbolic phrase dictionary
-with open("semillas.json", "r", encoding="utf-8") as f:
-    diccionario_semillas = json.load(f)
-def phrase_to_seed(phrase):
-    phrase = phrase.lower()
-    for key, seed in diccionario_semillas.items():
-        if key.lower() in phrase:
-            return seed
-    return "M"
-tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2", do_lower_case=False)
-tokenizer.pad_token = tokenizer.eos_token
 model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
-def generate_protein_and_props(phrase):
-    seed = phrase_to_seed(phrase)
-    inputs = tokenizer(seed, return_tensors="pt", padding=True)
-    input_ids = inputs["input_ids"]
-    attention_mask = inputs.get("attention_mask", torch.ones_like(input_ids))
-    with torch.no_grad():
-        output = model.generate(
-            input_ids=input_ids,
-            attention_mask=attention_mask,
-            max_length=100,
-            min_length=20,
-            do_sample=True,
-            top_k=50,
-            temperature=0.9,
-            pad_token_id=tokenizer.eos_token_id,
-            num_return_sequences=1
-        )
-    seq = tokenizer.decode(output[0], skip_special_tokens=True)
-    # Calculate properties
-    length = len(seq)
-    aa_count = {aa: seq.count(aa) for aa in "ACDEFGHIKLMNPQRSTVWY"}
-    charge = sum([aa_count.get(a, 0) for a in "KR"]) - sum([aa_count.get(a, 0) for a in "DE"])
-    mw = sum([aa_count[a]*w for a, w in {
-        "A": 89.1, "C": 121.2, "D": 133.1, "E": 147.1, "F": 165.2,
-        "G": 75.1, "H": 155.2, "I": 131.2, "K": 146.2, "L": 131.2,
-        "M": 149.2, "N": 132.1, "P": 115.1, "Q": 146.2, "R": 174.2,
-        "S": 105.1, "T": 119.1, "V": 117.1, "W": 204.2, "Y": 181.2
-    }.items()])
-    props = f"🧪 Seed: {seed}\n🧬 Protein: {seq}\n\n🔬 Properties:\n- Length: {length} aa\n- Charge: {charge}\n- MW: {mw:.1f} Da"
-    # Save to FASTA
-    with tempfile.NamedTemporaryFile(delete=False, suffix=".fasta", mode="w", encoding="utf-8") as f:
-        f.write(f">Generated_Protein\n{seq}\n")
-        fasta_path = f.name
-    return props, fasta_path
-def sequence_to_phrase(seq):
-    seq = seq.upper()
-    tags = []
-    if re.search(r"^M*K{3,}", seq):
-        tags.append("Dom(Kin)")
-    if re.search(r"[RK]{3,}", seq):
-        tags.append("Mot(NLS)")
-    if len(re.findall(r"E", seq)) >= 5 or "DEG" in seq:
-        tags.append("Mot(PEST)")
-    if re.search(r"KQAK|QAK", seq):
-        tags.append("*AcK@X")
-    if re.search(r"[RST]P", seq):
-        tags.append("*P@X")
-    if "PRKRK" in seq or "PKKKRKV" in seq:
-        tags.append("Localize(Nucleus)")
-    if re.search(r"(AILFL|LAGGAV|LVLL|AAVL)", seq):
-        tags.append("Localize(Membrane)")
-    return "^p:" + "-".join(sorted(set(tags))) if tags else "// No symbolic motifs found"
-def phrase_to_description(phrase):
-    phrase = phrase.replace("^p:", "")
-    fragments = phrase.split("-")
-    translation = {
-        "Dom(Kin)": "a kinase domain",
-        "Mot(NLS)": "a nuclear localization signal",
-        "Mot(PEST)": "a PEST motif indicating protein degradation",
-        "*AcK@X": "lysine acetylation at a specific position",
-        "*P@X": "a phosphorylation site",
-        "Localize(Nucleus)": "localizes to the cell nucleus",
-        "Localize(Membrane)": "localizes to the cell membrane"
-    }
-    phrases = [translation.get(tag, tag) for tag in fragments if tag]
-    if not phrases:
-        return "No interpretable symbolic elements found."
-    return "This protein contains " + ", ".join(phrases[:-1]) + (
-        f", and {phrases[-1]}." if len(phrases) > 1 else f"{phrases[0]}.")
 with gr.Blocks() as demo:
-    gr.Markdown("# 🧬 GeneForgeLang AI Tools")
-    gr.Markdown("Design, interpret, describe, and export proteins using symbolic language and AI.")
-    with gr.Tab("🧠 Phrase → Protein"):
-        inp = gr.Textbox(label="GeneForgeLang Phrase", placeholder="^p:Dom(Kin)-Mot(NLS)*AcK@147")
-        out = gr.Textbox(label="Protein + Properties")
-        fasta = gr.File(label="Download FASTA")
-        btn = gr.Button("Generate")
-        btn.click(fn=generate_protein_and_props, inputs=inp, outputs=[out, fasta])
-    with gr.Tab("🧪 Protein → Phrase"):
-        inp2 = gr.Textbox(label="Protein Sequence", placeholder="MKKKPRRRDEEGEK...")
-        out2 = gr.Textbox(label="Interpreted GeneForgeLang")
-        btn2 = gr.Button("Translate")
-        btn2.click(fn=sequence_to_phrase, inputs=inp2, outputs=out2)
-    with gr.Tab("🧬 Mutate Protein"):
-        inp4 = gr.Textbox(label="GeneForgeLang Phrase", placeholder="^p:Dom(Kin)-Mot(NLS)*AcK@147")
-        out4 = gr.Textbox(label="Mutated Protein")
-        btn4 = gr.Button("Mutate")
-        btn4.click(fn=generate_protein_and_props, inputs=inp4, outputs=[out4, gr.File(visible=False)])
-    with gr.Tab("📊 Analyze Protein"):
-        inp5 = gr.Textbox(label="Protein Sequence", placeholder="Paste sequence to analyze")
-        out5 = gr.Image(label="Amino Acid Composition")
-        btn5 = gr.Button("Analyze")
-        def analyze_graph(seq): return generar_composicion_grafico(seq)
-        btn5.click(fn=analyze_graph, inputs=inp5, outputs=out5)
-    with gr.Tab("📖 Phrase → Natural Language"):
-        inp3 = gr.Textbox(label="GeneForgeLang Phrase", placeholder="^p:Dom(Kin)-Mot(NLS)*AcK@147")
-        out3 = gr.Textbox(label="Scientific Description")
-        btn3 = gr.Button("Describe")
-        btn3.click(fn=phrase_to_description, inputs=inp3, outputs=out3)
-if __name__ == "__main__":
-    demo.launch()

 import gradio as gr
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
+# Load the model only once
 model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
+tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
+tokenizer.pad_token = tokenizer.eos_token
+# Translation between molecules
+def transcode_phrase(phrase, src, dst):
+    if src == dst:
+        return "⚠️ Source and target are the same."
+    if src == "DNA" and dst == "RNA":
+        return phrase.replace("~d:", ":r:").replace("Exon", "Ex").replace("Intr", "removed")
+    elif src == "RNA" and dst == "Protein":
+        return phrase.replace(":r:", "^p:").replace("Ex1", "Dom(Kin)").replace("Ex2", "Mot(NLS)")
+    elif src == "Protein" and dst == "DNA":
+        return phrase.replace("^p:", "~d:").replace("Dom(Kin)", "Exon1").replace("Mot(NLS)", "Exon2")
+    else:
+        return "❌ Translation not implemented."
+# Generate a protein from a phrase
+semillas = {
+    "^p:Dom(Kin)-Mot(NLS)*AcK@147=Localize(Nucleus)": "MKKK",
+    "^p:Mot(NLS)-Mot(PEST)*P@120": "MKSP",
+    "^p:Dom(ZnF)-Mot(NLS)*UbK@42": "MKHG",
+}
+def generar_desde_frase(frase):
+    semilla = semillas.get(frase, "MKKK")
+    inputs = tokenizer(semilla, return_tensors="pt", padding=True)
+    outputs = model.generate(**inputs, max_length=200, do_sample=True, top_k=950, temperature=1.5, num_return_sequences=1)
+    secuencia = tokenizer.decode(outputs[0], skip_special_tokens=True)
+    return f"🧪 Seed: {semilla}\n🧬 Generated Protein:\n{secuencia}"
+# Gradio interface
 with gr.Blocks() as demo:
+    with gr.Tab("Phrase → Protein"):
+        gr.Markdown("### Generate Protein Sequence from GeneForgeLang Phrase")
+        input_frase = gr.Textbox(label="Input Phrase")
+        output_prot = gr.Textbox(label="Generated Protein")
+        boton_gen = gr.Button("Generate")
+        boton_gen.click(fn=generar_desde_frase, inputs=input_frase, outputs=output_prot)
+    with gr.Tab("Transcode Across Molecules"):
+        gr.Markdown("### Convert between DNA, RNA, and Protein symbolic phrases")
+        input_phrase = gr.Textbox(label="Input GeneForgeLang Phrase")
+        src_select = gr.Radio(choices=["DNA", "RNA", "Protein"], label="Translate From", value="DNA")
+        dst_select = gr.Radio(choices=["DNA", "RNA", "Protein"], label="Translate To", value="RNA")
+        output = gr.Textbox(label="Translated Phrase")
+        trans_btn = gr.Button("Translate")
+        trans_btn.click(fn=transcode_phrase, inputs=[input_phrase, src_select, dst_select], outputs=output)
+demo.launch()
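
To see what the new `transcode_phrase` actually returns, the sketch below copies the function from the added `app.py` lines and runs it on the README example phrases, without loading Gradio or the model. Only the three `print` calls are additions. Because the function works by plain string replacement, the DNA → RNA result keeps `Prom[TATA]` and turns `Intr1` into `removed1`, so it does not reproduce the `:r:Cap5'-Ex1-Ex2-UTR3'` form shown in the README.

```python
# transcode_phrase copied from the new app.py; only the demo calls are new.
def transcode_phrase(phrase, src, dst):
    if src == dst:
        return "⚠️ Source and target are the same."
    if src == "DNA" and dst == "RNA":
        return phrase.replace("~d:", ":r:").replace("Exon", "Ex").replace("Intr", "removed")
    elif src == "RNA" and dst == "Protein":
        return phrase.replace(":r:", "^p:").replace("Ex1", "Dom(Kin)").replace("Ex2", "Mot(NLS)")
    elif src == "Protein" and dst == "DNA":
        return phrase.replace("^p:", "~d:").replace("Dom(Kin)", "Exon1").replace("Mot(NLS)", "Exon2")
    else:
        return "❌ Translation not implemented."


print(transcode_phrase("~d:Prom[TATA]-Exon1-Intr1-Exon2", "DNA", "RNA"))
# → :r:Prom[TATA]-Ex1-removed1-Ex2
print(transcode_phrase(":r:Cap5'-Ex1-Ex2-UTR3'", "RNA", "Protein"))
# → ^p:Cap5'-Dom(Kin)-Mot(NLS)-UTR3'
print(transcode_phrase(":r:Ex1-Ex2", "RNA", "DNA"))
# → ❌ Translation not implemented.
```

Only three of the six directed pairs are handled; RNA → DNA, Protein → RNA, and DNA → Protein fall through to the "not implemented" message.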

generar_desde_frase_input_v2.py ADDED

@@ -0,0 +1,52 @@

+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+import sys
+def frase_a_semilla(frase):
+    frase = frase.lower()
+    if "dom(kin)" in frase:
+        return "MKKK"
+    elif "mot(nls)" in frase:
+        return "MPRRR"
+    elif "mot(pest)" in frase:
+        return "MDGQL"
+    elif "tf(gata1)" in frase:
+        return "MKTFG"
+    elif "*ack" in frase or "*ac" in frase:
+        return "MKQAK"
+    elif "*p" in frase or "*phos" in frase:
+        return "MKRP"
+    elif "localize(nucleus)" in frase:
+        return "MPKRK"
+    elif "localize(membrane)" in frase:
+        return "MAIFL"
+    else:
+        return "M"
+if __name__ == "__main__":
+    frase = sys.argv[1] if len(sys.argv) > 1 else "^p:Dom(Kin)'-Mot(NLS)*AcK@147=Localize(Nucleus)"
+    semilla = frase_a_semilla(frase)
+    print("🧪 Seed generated from the phrase:", semilla)
+    tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2", do_lower_case=False)
+    model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
+    inputs = tokenizer(semilla, return_tensors="pt", padding=True)
+    input_ids = inputs["input_ids"]
+    attention_mask = inputs.get("attention_mask", torch.ones_like(input_ids))
+    with torch.no_grad():
+        salida = model.generate(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            max_length=100,
+            min_length=20,
+            do_sample=True,
+            top_k=50,
+            temperature=0.9,
+            pad_token_id=tokenizer.eos_token_id,
+            num_return_sequences=1
+        )
+    print("🧬 Generated protein:")
+    print(tokenizer.decode(salida[0], skip_special_tokens=True))
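
The `if`/`elif` chain in `frase_a_semilla` returns the seed for the first matching tag, so a phrase with several tags is reduced to one seed. The sketch below expresses the same rules as an ordered table to make that precedence explicit. `SEED_RULES` and the loop are an illustrative restructuring, not code from the commit; the substrings and seeds are copied from the chain, with the redundant `"*ack"` and `"*phos"` checks folded into `"*ac"` and `"*p"`, which already match them.

```python
# Ordered (substring, seed) rules copied from the if/elif chain in
# generar_desde_frase_input_v2.py. SEED_RULES is a hypothetical name.
SEED_RULES = [
    ("dom(kin)", "MKKK"),
    ("mot(nls)", "MPRRR"),
    ("mot(pest)", "MDGQL"),
    ("tf(gata1)", "MKTFG"),
    ("*ac", "MKQAK"),  # also matches "*ack"
    ("*p", "MKRP"),    # also matches "*phos"
    ("localize(nucleus)", "MPKRK"),
    ("localize(membrane)", "MAIFL"),
]


def frase_a_semilla(frase):
    """Return the seed of the first rule whose substring occurs in the phrase."""
    frase = frase.lower()
    for needle, seed in SEED_RULES:
        if needle in frase:
            return seed
    return "M"  # fallback used by the original chain


print(frase_a_semilla("^p:Dom(Kin)-Mot(NLS)*AcK@147=Localize(Nucleus)"))  # → MKKK
print(frase_a_semilla("^p:Mot(NLS)*AcK@147"))                             # → MPRRR
```

Reordering the table changes which tag wins, which is easier to see here than in the chained conditionals.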

generar_interactivo.py ADDED

@@ -0,0 +1,35 @@

+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+def frase_a_semilla(frase):
+    semilla = "M"
+    if "Dom(Kin)" in frase:
+        semilla += "KKK"
+    if "Mot(NLS)" in frase:
+        semilla += "RRRR"
+    if "TF(GATA)" in frase:
+        semilla += "TFG"
+    if "*AcK@" in frase:
+        semilla += "AK"
+    return semilla
+frase = input("🔤 Enter your GeneForgeLang phrase: ")
+semilla = frase_a_semilla(frase)
+print("🧪 Generated seed:", semilla)
+tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2", do_lower_case=False)
+model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
+inputs = tokenizer(semilla, return_tensors="pt")
+with torch.no_grad():
+    salida = model.generate(
+        inputs["input_ids"],
+        max_length=100,
+        do_sample=True,
+        top_k=50,
+        temperature=0.8,
+        num_return_sequences=1
+    )
+print("🧬 Generated protein:")
+print(tokenizer.decode(salida[0], skip_special_tokens=True))
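
Unlike the first-match lookup in `generar_desde_frase_input_v2.py`, the `frase_a_semilla` in this script builds the seed cumulatively: every matching tag appends residues to an initial `M`. The sketch below copies the function from the added lines and exercises it in isolation; only the `print` calls are additions. Matching here is case-sensitive, and `TF(GATA)` does not match `TF(GATA1)` because the closing parenthesis is part of the substring.

```python
# frase_a_semilla copied from generar_interactivo.py; only the demo calls are new.
def frase_a_semilla(frase):
    semilla = "M"
    if "Dom(Kin)" in frase:
        semilla += "KKK"
    if "Mot(NLS)" in frase:
        semilla += "RRRR"
    if "TF(GATA)" in frase:
        semilla += "TFG"
    if "*AcK@" in frase:
        semilla += "AK"
    return semilla


print(frase_a_semilla("^p:Dom(Kin)-Mot(NLS)*AcK@147"))  # → MKKKRRRRAK
print(frase_a_semilla("^p:dom(kin)"))                   # → M (case-sensitive)
print(frase_a_semilla("^p:TF(GATA1)"))                  # → M ("TF(GATA)" is not a substring)
```

The same phrase therefore yields different seeds in the two scripts (`MKKK` versus `MKKKRRRRAK` for the first example).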

requirements.txt CHANGED

@@ -1,3 +1,3 @@
-gradio==3.50.2
-transformers
-torch

+gradio==3.50.2
+transformers
+torch