ManMenGon committed on
Commit 1291f55 · verified · 1 Parent(s): 6ef5528

Upload 7 files

Files changed (5)
  1. README.md +106 -81
  2. app.py +51 -134
  3. generar_desde_frase_input_v2.py +52 -0
  4. generar_interactivo.py +35 -0
  5. requirements.txt +3 -3
README.md CHANGED
@@ -1,81 +1,106 @@
- ---
- title: GeneForgeLang
- emoji: 🧬
- colorFrom: indigo
- colorTo: blue
- sdk: gradio
- sdk_version: "3.50.2"
- app_file: app.py
- pinned: true
- ---
-
- # 🧬 GeneForgeLang: Symbolic-to-Sequence & Cross-Modality Biomolecular Design Toolkit
-
- **GeneForgeLang** is a symbolic, generative language that allows scientists to design and interpret DNA, RNA, and protein sequences with unified syntax and AI support.
-
- This toolkit enables:
- - Generation of realistic proteins from symbolic design
- - Translation of symbolic phrases across DNA ↔ RNA ↔ Protein
- - Structured, human-readable and AI-trainable syntax
- - Semantic equivalence across molecular layers
-
- ---
-
- ## 🚀 Features
-
- | Module | Description |
- |----------------------------|-------------|
- | 🧠 Phrase → Protein | Generate realistic protein sequences from symbolic phrases |
- | 🔁 Transcode Across Molecules | Translate GeneForgeLang phrases between DNA, RNA, and Protein |
- | 📚 Universal Grammar | One structure to rule them all: motifs, domains, PTMs, splicing |
- | 🧬 Compact Notation | Prefixes, accents, and structural markers for efficiency |
- | 🧠 AI-Ready Output | Compatible with transformer-based models like ProtGPT2 |
-
- ---
-
- ## 🧪 Example Input Phrases
-
- ### DNA → RNA
-
- ```
- ~d:Prom[TATA]-Exon1-Intr1-Exon2
-
- :r:Cap5'-Ex1-Ex2-UTR3'
- ```
-
- ### RNA → Protein
-
- ```
- :r:Ex1-Ex2
-
- ^p:Dom(Kin)-Mot(NLS)
- ```
-
- ---
-
- ## ▶️ How to Use
-
- 1. Launch this app on Hugging Face or locally
- 2. Choose your input phrase and source/target modality
- 3. Get your translated output or generated protein
-
- ---
-
- ## 📁 Files
-
- | File | Description |
- |-----------------------|-------------|
- | `app.py` | Full Gradio app (all tabs) |
- | `semillas.json` | Seed dictionary |
- | `transcoder.py` | Script for DNA/RNA/protein conversion |
- | `requirements.txt` | Python dependencies |
- | `README.md` | This file |
-
- ---
-
- ## 🧠 Developed by
-
- **Fundación de Neurociencias**
- MIT License
-
- Join us in shaping symbolic bio-AI.
+ Thank you for sharing the current `README.md`. It is fairly well structured, but we can improve it so that it reflects **all the modules actually included** and their practical usefulness, and also strengthen the project's appeal to visitors and potential contributors.
+
+ Below is a revised and expanded version, ready to replace the current content:
+
+ ---
+
+ ```markdown
+ ---
+ title: GeneForgeLang
+ emoji: 🧬
+ colorFrom: indigo
+ colorTo: blue
+ sdk: gradio
+ sdk_version: "3.50.2"
+ app_file: app.py
+ pinned: true
+ ---
+
+ # 🧬 GeneForgeLang: Symbolic-to-Sequence & Cross-Modality Biomolecular Design Toolkit
+
+ **GeneForgeLang** is a symbolic and generative language for cross-modality biomolecular design.
+ It enables unified AI-powered workflows to **design, interpret and translate DNA, RNA, and protein sequences** using a compact, human-readable grammar.
+
+ This project provides:
+ - **A symbolic language** spanning all biological layers (genomic, transcriptomic, proteomic)
+ - **Realistic sequence generation** via AI models like ProtGPT2
+ - **Scientific interpretation** of symbolic phrases in natural language
+ - **Cross-modality transcoders** (e.g., DNA → RNA → Protein and vice versa)
+ - **An interactive Gradio-based UI** for easy use and integration
+
+ ---
+
+ ## 🚀 Key Features
+
+ | Module | Description |
+ |----------------------------|-------------|
+ | 🧠 Phrase → Sequence | Generate DNA, RNA, or protein from symbolic design |
+ | 🔁 Transcode Phrases | Translate GeneForgeLang phrases across modalities |
+ | 📖 Phrase → Description | Generate scientific English descriptions of symbolic inputs |
+ | 🔄 Sequence → Phrase | Infer functional phrases from real sequences |
+ | 🧬 Mutate Sequence (WIP) | Generate variants for symbolic seeds (under development) |
+ | 📦 Export to FASTA (WIP) | Save generated sequences to .fasta (to be implemented) |
+ | 📊 Analyze Sequence (WIP) | Visualize amino acid composition or base content |
+
+ ---
+
+ ## 🧪 Example Input Phrases
+
+ ```text
+ ~d:Prom[TATA]-Exon1-Intr1-Exon2
+
+ :r:Cap5'-Ex1-Ex2-UTR3'
+
+ ^p:Dom(Kin)-Mot(NLS)*AcK@147
+ ```
+
+ ---
+
+ ## ▶️ How to Use Locally
+
+ 1. Clone this repo:
+ ```bash
+ git clone https://github.com/Fundacion-de-Neurociencias/GeneForgeLang.git
+ cd GeneForgeLang
+ ```
+
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. Launch the interface:
+ ```bash
+ python app.py
+ ```
+
+ 4. Navigate to:
+ [http://127.0.0.1:7860](http://127.0.0.1:7860)
+
+ ---
+
+ ## 📁 File Structure
+
+ | File | Description |
+ |------------------------------|-------------|
+ | `app.py` | Full Gradio app (4 tabs) |
+ | `semillas.json` | Phrase-to-seed dictionary |
+ | `generate_from_phrase.py` | Symbolic-to-sequence generator |
+ | `describe_phrase.py` | Phrase interpreter to scientific English |
+ | `translate_to_geneforgelang.py` | Sequence-to-symbolic phrase translation |
+ | `transcoder.py` | Modality switcher (DNA ↔ RNA ↔ Protein) |
+ | `requirements.txt` | Python dependencies |
+ | `README.md` | This file |
+
+ ---
+
+ ## 🧠 Developed by
+
+ **Fundación de Neurociencias**
+ Licensed under the MIT License
+
+ > Join us in shaping the future of symbolic bio-AI. Contributions welcome!
+
+ ```
+
+ ---
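The example phrases in the revised README each begin with a modality prefix (`~d:`, `:r:`, `^p:`) followed by hyphen-separated elements. As a rough illustration only, the sketch below splits a phrase along those lines. The helper name `split_phrase` and the `PREFIXES` table are hypothetical, and the prefix-to-modality mapping is inferred from the README examples and from `transcode_phrase` in `app.py`, not from a published grammar.

```python
# Hypothetical helper: the prefix table is an assumption inferred from the
# README examples and from transcode_phrase, not an official grammar.
PREFIXES = {"~d:": "DNA", ":r:": "RNA", "^p:": "Protein"}


def split_phrase(phrase: str) -> tuple[str, list[str]]:
    """Return (modality, elements) for a GeneForgeLang-style phrase."""
    for prefix, modality in PREFIXES.items():
        if phrase.startswith(prefix):
            body = phrase[len(prefix):]
            # Elements are assumed to be separated by single hyphens.
            return modality, [part for part in body.split("-") if part]
    raise ValueError(f"unknown modality prefix in {phrase!r}")
```

Modifiers such as `*AcK@147` stay attached to the element they follow, because this sketch only splits on hyphens.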
app.py CHANGED
@@ -1,141 +1,58 @@
+
  import gradio as gr
  from transformers import AutoTokenizer, AutoModelForCausalLM
  import torch
- import json
- import re
- import tempfile
-
- # Load symbolic phrase dictionary
- with open("semillas.json", "r", encoding="utf-8") as f:
-     diccionario_semillas = json.load(f)
-
- def phrase_to_seed(phrase):
-     phrase = phrase.lower()
-     for key, seed in diccionario_semillas.items():
-         if key.lower() in phrase:
-             return seed
-     return "M"
 
- tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2", do_lower_case=False)
- tokenizer.pad_token = tokenizer.eos_token
+ # Load the model only once
  model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
+ tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
+ tokenizer.pad_token = tokenizer.eos_token
 
- def generate_protein_and_props(phrase):
-     seed = phrase_to_seed(phrase)
-     inputs = tokenizer(seed, return_tensors="pt", padding=True)
-     input_ids = inputs["input_ids"]
-     attention_mask = inputs.get("attention_mask", torch.ones_like(input_ids))
-
-     with torch.no_grad():
-         output = model.generate(
-             input_ids=input_ids,
-             attention_mask=attention_mask,
-             max_length=100,
-             min_length=20,
-             do_sample=True,
-             top_k=50,
-             temperature=0.9,
-             pad_token_id=tokenizer.eos_token_id,
-             num_return_sequences=1
-         )
-
-     seq = tokenizer.decode(output[0], skip_special_tokens=True)
-
-     # Calculate properties
-     length = len(seq)
-     aa_count = {aa: seq.count(aa) for aa in "ACDEFGHIKLMNPQRSTVWY"}
-     charge = sum([aa_count.get(a, 0) for a in "KR"]) - sum([aa_count.get(a, 0) for a in "DE"])
-     mw = sum([aa_count[a]*w for a, w in {
-         "A": 89.1, "C": 121.2, "D": 133.1, "E": 147.1, "F": 165.2,
-         "G": 75.1, "H": 155.2, "I": 131.2, "K": 146.2, "L": 131.2,
-         "M": 149.2, "N": 132.1, "P": 115.1, "Q": 146.2, "R": 174.2,
-         "S": 105.1, "T": 119.1, "V": 117.1, "W": 204.2, "Y": 181.2
-     }.items()])
-
-     props = f"🧪 Seed: {seed}\n🧬 Protein: {seq}\n\n🔬 Properties:\n- Length: {length} aa\n- Charge: {charge}\n- MW: {mw:.1f} Da"
-
-     # Save to FASTA
-     with tempfile.NamedTemporaryFile(delete=False, suffix=".fasta", mode="w", encoding="utf-8") as f:
-         f.write(f">Generated_Protein\n{seq}\n")
-         fasta_path = f.name
-
-     return props, fasta_path
-
- def sequence_to_phrase(seq):
-     seq = seq.upper()
-     tags = []
-     if re.search(r"^M*K{3,}", seq):
-         tags.append("Dom(Kin)")
-     if re.search(r"[RK]{3,}", seq):
-         tags.append("Mot(NLS)")
-     if len(re.findall(r"E", seq)) >= 5 or "DEG" in seq:
-         tags.append("Mot(PEST)")
-     if re.search(r"KQAK|QAK", seq):
-         tags.append("*AcK@X")
-     if re.search(r"[RST]P", seq):
-         tags.append("*P@X")
-     if "PRKRK" in seq or "PKKKRKV" in seq:
-         tags.append("Localize(Nucleus)")
-     if re.search(r"(AILFL|LAGGAV|LVLL|AAVL)", seq):
-         tags.append("Localize(Membrane)")
-     return "^p:" + "-".join(sorted(set(tags))) if tags else "// No symbolic motifs found"
-
- def phrase_to_description(phrase):
-     phrase = phrase.replace("^p:", "")
-     fragments = phrase.split("-")
-     translation = {
-         "Dom(Kin)": "a kinase domain",
-         "Mot(NLS)": "a nuclear localization signal",
-         "Mot(PEST)": "a PEST motif indicating protein degradation",
-         "*AcK@X": "lysine acetylation at a specific position",
-         "*P@X": "a phosphorylation site",
-         "Localize(Nucleus)": "localizes to the cell nucleus",
-         "Localize(Membrane)": "localizes to the cell membrane"
-     }
-     phrases = [translation.get(tag, tag) for tag in fragments if tag]
-     if not phrases:
-         return "No interpretable symbolic elements found."
-     return "This protein contains " + ", ".join(phrases[:-1]) + (
-         f", and {phrases[-1]}." if len(phrases) > 1 else f"{phrases[0]}.")
-
+ # Translation between molecules
+ def transcode_phrase(phrase, src, dst):
+     if src == dst:
+         return "⚠️ Source and target are the same."
+     if src == "DNA" and dst == "RNA":
+         return phrase.replace("~d:", ":r:").replace("Exon", "Ex").replace("Intr", "removed")
+     elif src == "RNA" and dst == "Protein":
+         return phrase.replace(":r:", "^p:").replace("Ex1", "Dom(Kin)").replace("Ex2", "Mot(NLS)")
+     elif src == "Protein" and dst == "DNA":
+         return phrase.replace("^p:", "~d:").replace("Dom(Kin)", "Exon1").replace("Mot(NLS)", "Exon2")
+     else:
+         return "❌ Translation not implemented."
+
+ # Generate a protein from a phrase
+ semillas = {
+     "^p:Dom(Kin)-Mot(NLS)*AcK@147=Localize(Nucleus)": "MKKK",
+     "^p:Mot(NLS)-Mot(PEST)*P@120": "MKSP",
+     "^p:Dom(ZnF)-Mot(NLS)*UbK@42": "MKHG",
+ }
+
+ def generar_desde_frase(frase):
+     semilla = semillas.get(frase, "MKKK")
+     inputs = tokenizer(semilla, return_tensors="pt", padding=True)
+     outputs = model.generate(**inputs, max_length=200, do_sample=True, top_k=950, temperature=1.5, num_return_sequences=1)
+     secuencia = tokenizer.decode(outputs[0], skip_special_tokens=True)
+     return (f"🧪 Seed: {semilla}\n"
+             "🧬 Generated Protein:\n"
+             f"{secuencia}")
+
+ # Gradio interface
  with gr.Blocks() as demo:
-     gr.Markdown("# 🧬 GeneForgeLang AI Tools")
-     gr.Markdown("Design, interpret, describe, and export proteins using symbolic language and AI.")
-
-     with gr.Tab("🧠 Phrase → Protein"):
-         inp = gr.Textbox(label="GeneForgeLang Phrase", placeholder="^p:Dom(Kin)-Mot(NLS)*AcK@147")
-         out = gr.Textbox(label="Protein + Properties")
-         fasta = gr.File(label="Download FASTA")
-         btn = gr.Button("Generate")
-         btn.click(fn=generate_protein_and_props, inputs=inp, outputs=[out, fasta])
-
-     with gr.Tab("🧪 Protein → Phrase"):
-         inp2 = gr.Textbox(label="Protein Sequence", placeholder="MKKKPRRRDEEGEK...")
-         out2 = gr.Textbox(label="Interpreted GeneForgeLang")
-         btn2 = gr.Button("Translate")
-         btn2.click(fn=sequence_to_phrase, inputs=inp2, outputs=out2)
-
-
-     with gr.Tab("🧬 Mutate Protein"):
-         inp4 = gr.Textbox(label="GeneForgeLang Phrase", placeholder="^p:Dom(Kin)-Mot(NLS)*AcK@147")
-         out4 = gr.Textbox(label="Mutated Protein")
-         btn4 = gr.Button("Mutate")
-         btn4.click(fn=generate_protein_and_props, inputs=inp4, outputs=[out4, gr.File(visible=False)])
-
-
-     with gr.Tab("📊 Analyze Protein"):
-         inp5 = gr.Textbox(label="Protein Sequence", placeholder="Paste sequence to analyze")
-         out5 = gr.Image(label="Amino Acid Composition")
-         btn5 = gr.Button("Analyze")
-         def analyze_graph(seq): return generar_composicion_grafico(seq)
-         btn5.click(fn=analyze_graph, inputs=inp5, outputs=out5)
-
-     with gr.Tab("📖 Phrase → Natural Language"):
-
-         inp3 = gr.Textbox(label="GeneForgeLang Phrase", placeholder="^p:Dom(Kin)-Mot(NLS)*AcK@147")
-         out3 = gr.Textbox(label="Scientific Description")
-         btn3 = gr.Button("Describe")
-         btn3.click(fn=phrase_to_description, inputs=inp3, outputs=out3)
-
- if __name__ == "__main__":
-     demo.launch()
+     with gr.Tab("Phrase → Protein"):
+         gr.Markdown("### Generate Protein Sequence from GeneForgeLang Phrase")
+         input_frase = gr.Textbox(label="Input Phrase")
+         output_prot = gr.Textbox(label="Generated Protein")
+         boton_gen = gr.Button("Generate")
+         boton_gen.click(fn=generar_desde_frase, inputs=input_frase, outputs=output_prot)
+
+     with gr.Tab("Transcode Across Molecules"):
+         gr.Markdown("### Convert between DNA, RNA, and Protein symbolic phrases")
+         input_phrase = gr.Textbox(label="Input GeneForgeLang Phrase")
+         src_select = gr.Radio(choices=["DNA", "RNA", "Protein"], label="Translate From", value="DNA")
+         dst_select = gr.Radio(choices=["DNA", "RNA", "Protein"], label="Translate To", value="RNA")
+         output = gr.Textbox(label="Translated Phrase")
+         trans_btn = gr.Button("Translate")
+         trans_btn.click(fn=transcode_phrase, inputs=[input_phrase, src_select, dst_select], outputs=output)
+
+ demo.launch()
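The new `transcode_phrase` encodes each supported direction as a chain of `str.replace` calls. One possible alternative, shown here only as a sketch, is to keep the same substitutions in a lookup table keyed by `(src, dst)`. The names `RULES` and `transcode` are illustrative and not part of the commit; the substitution pairs are copied from the function in this diff.

```python
# Illustrative rule table; the (old, new) pairs mirror the chained
# str.replace calls of transcode_phrase in this commit.
RULES = {
    ("DNA", "RNA"): [("~d:", ":r:"), ("Exon", "Ex"), ("Intr", "removed")],
    ("RNA", "Protein"): [(":r:", "^p:"), ("Ex1", "Dom(Kin)"), ("Ex2", "Mot(NLS)")],
    ("Protein", "DNA"): [("^p:", "~d:"), ("Dom(Kin)", "Exon1"), ("Mot(NLS)", "Exon2")],
}


def transcode(phrase, src, dst):
    """Apply the substitutions for one direction, in order."""
    if src == dst:
        return "⚠️ Source and target are the same."
    rules = RULES.get((src, dst))
    if rules is None:
        return "❌ Translation not implemented."
    for old, new in rules:
        phrase = phrase.replace(old, new)
    return phrase
```

With either form, `Intr1` becomes `removed1` rather than being dropped, because the rules are plain substring replacements.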
generar_desde_frase_input_v2.py ADDED
@@ -0,0 +1,52 @@
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+ import sys
+
+ def frase_a_semilla(frase):
+     frase = frase.lower()
+     if "dom(kin)" in frase:
+         return "MKKK"
+     elif "mot(nls)" in frase:
+         return "MPRRR"
+     elif "mot(pest)" in frase:
+         return "MDGQL"
+     elif "tf(gata1)" in frase:
+         return "MKTFG"
+     elif "*ack" in frase or "*ac" in frase:
+         return "MKQAK"
+     elif "*p" in frase or "*phos" in frase:
+         return "MKRP"
+     elif "localize(nucleus)" in frase:
+         return "MPKRK"
+     elif "localize(membrane)" in frase:
+         return "MAIFL"
+     else:
+         return "M"
+
+ if __name__ == "__main__":
+     frase = sys.argv[1] if len(sys.argv) > 1 else "^p:Dom(Kin)'-Mot(NLS)*AcK@147=Localize(Nucleus)"
+     semilla = frase_a_semilla(frase)
+     print("🧪 Seed generated from the phrase:", semilla)
+
+     tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2", do_lower_case=False)
+     model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
+
+     inputs = tokenizer(semilla, return_tensors="pt", padding=True)
+     input_ids = inputs["input_ids"]
+     attention_mask = inputs.get("attention_mask", torch.ones_like(input_ids))
+
+     with torch.no_grad():
+         salida = model.generate(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             max_length=100,
+             min_length=20,
+             do_sample=True,
+             top_k=50,
+             temperature=0.9,
+             pad_token_id=tokenizer.eos_token_id,
+             num_return_sequences=1
+         )
+
+     print("🧬 Generated protein:")
+     print(tokenizer.decode(salida[0], skip_special_tokens=True))
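`frase_a_semilla` returns the seed for the first tag it recognizes, so a phrase carrying several tags only uses the earliest match in the `elif` chain. The sketch below expresses the same priority order as an ordered table, which may be easier to extend. `SEED_RULES` and `seed_for` are illustrative names, not part of the commit; the substrings and seeds are copied from the function above.

```python
# Ordered (substrings, seed) pairs in the same priority order as the
# elif chain in frase_a_semilla; names here are illustrative only.
SEED_RULES = [
    (("dom(kin)",), "MKKK"),
    (("mot(nls)",), "MPRRR"),
    (("mot(pest)",), "MDGQL"),
    (("tf(gata1)",), "MKTFG"),
    (("*ack", "*ac"), "MKQAK"),
    (("*p", "*phos"), "MKRP"),
    (("localize(nucleus)",), "MPKRK"),
    (("localize(membrane)",), "MAIFL"),
]


def seed_for(frase):
    """Return the seed of the first rule whose substring occurs in the phrase."""
    frase = frase.lower()
    for needles, seed in SEED_RULES:
        if any(needle in frase for needle in needles):
            return seed
    return "M"  # fallback seed, as in the original function
```

Because matching stops at the first hit, `^p:Mot(NLS)-Mot(PEST)*P@120` yields the NLS seed and the later tags are ignored.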
generar_interactivo.py ADDED
@@ -0,0 +1,35 @@
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ def frase_a_semilla(frase):
+     semilla = "M"
+     if "Dom(Kin)" in frase:
+         semilla += "KKK"
+     if "Mot(NLS)" in frase:
+         semilla += "RRRR"
+     if "TF(GATA)" in frase:
+         semilla += "TFG"
+     if "*AcK@" in frase:
+         semilla += "AK"
+     return semilla
+
+ frase = input("🔤 Enter your GeneForgeLang phrase: ")
+ semilla = frase_a_semilla(frase)
+ print("🧪 Generated seed:", semilla)
+
+ tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2", do_lower_case=False)
+ model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
+
+ inputs = tokenizer(semilla, return_tensors="pt")
+ with torch.no_grad():
+     salida = model.generate(
+         inputs["input_ids"],
+         max_length=100,
+         do_sample=True,
+         top_k=50,
+         temperature=0.8,
+         num_return_sequences=1
+     )
+
+ print("🧬 Generated protein:")
+ print(tokenizer.decode(salida[0], skip_special_tokens=True))
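Unlike the first-match lookup in `generar_desde_frase_input_v2.py`, this script concatenates a fragment for every tag it finds, and it matches case-sensitively. A compact sketch of that behavior follows; `FRAGMENTS` and `build_seed` are hypothetical names, and the tag and fragment pairs are copied from `frase_a_semilla` in this file.

```python
# (tag, fragment) pairs in the order frase_a_semilla checks them;
# the names FRAGMENTS and build_seed are illustrative only.
FRAGMENTS = [
    ("Dom(Kin)", "KKK"),
    ("Mot(NLS)", "RRRR"),
    ("TF(GATA)", "TFG"),
    ("*AcK@", "AK"),
]


def build_seed(frase):
    """Start from 'M' and append the fragment of every tag present."""
    return "M" + "".join(frag for tag, frag in FRAGMENTS if tag in frase)
```

Note that `TF(GATA1)`, the tag the other script recognizes, does not match `TF(GATA)` here, and a lowercase phrase matches nothing, so both fall back to the bare seed `M`.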
requirements.txt CHANGED
@@ -1,3 +1,3 @@
- gradio==3.50.2
- transformers
- torch
+ gradio==3.50.2
+ transformers
+ torch