Commit 01554a7 by AstraBert · 1 Parent(s): 53c5633

first commit

Files changed (7)
  1. .gitignore +1 -0
  2. README.md +104 -4
  3. app.py +69 -0
  4. load_model.py +6 -0
  5. model.py +281 -0
  6. predict.py +40 -0
  7. requirements.txt +8 -0
.gitignore ADDED
@@ -0,0 +1 @@
+ SacCerML.joblib
README.md CHANGED
@@ -1,13 +1,113 @@
  ---
  title: Saccharomyces Pythia
- emoji: 📈
  colorFrom: purple
  colorTo: gray
  sdk: gradio
- sdk_version: 4.26.0
  app_file: app.py
- pinned: false
  license: apache-2.0
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
  title: Saccharomyces Pythia
+ emoji: 🍄
  colorFrom: purple
  colorTo: gray
  sdk: gradio
+ sdk_version: 4.25.0
  app_file: app.py
+ pinned: true
  license: apache-2.0
  ---

+ # saccharomyces-pythia: an ML/AI-integrated *Saccharomyces cerevisiae* assistant
+
+ ## Table of Contents
+ 1. [Introduction](#introduction)
+ 2. [SacCerML: the base ML model](#saccerml-the-base-ml-model)
+     - [Training](#training)
+         * [Data and preprocessing](#data-and-preprocessing)
+         * [Validation](#validation)
+     - [Testing](#testing)
+ 3. [saccharomyces-pythia: gene calling and AI integration](#saccharomyces-pythia-gene-calling-and-ai-integration)
+ 4. [Try it out!](#try-it-out)
+ 5. [References](#references)
+ 6. [License](#license)
+
+ ## Introduction
+ **saccharomyces-pythia** is the new, rebranded v1.0.0 of SacCerML. Initially conceived as a Python script that leveraged machine learning and bioinformatics tools to predict genes in *Saccharomyces cerevisiae* (baker's yeast) genomic sequences, it is now a complete, AI-integrated tool that can help researchers both as a chatbot and as an ORF predictor.
+
+ ## SacCerML: the base ML model
+
+ ### Training
+
+ #### Data and preprocessing
+ All the annotated coding DNA sequences for *S. cerevisiae* (strain S288C) were downloaded from the Saccharomyces Genome Database.
+
+ These genetic sequences were split according to their ORF classification (verified, dubious, uncharacterized, pseudogene and transposable element), and for each of them the following parameters were calculated:
+
+ - Codon Adaptation Index
+ - Checksum
+
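The two DNA-level features can be illustrated with a short standard-library sketch. It assumes the textbook definition of the Codon Adaptation Index (the geometric mean of per-codon relative adaptiveness weights) and a CRC32 checksum, which is what `model.py` imports from Biopython. The weight table `TOY_WEIGHTS` and both function names are hypothetical, and the project itself relies on Biopython for these values, so the numbers produced here may differ from the ones used in training.

```python
import math
import zlib

# Hypothetical relative-adaptiveness weights (0 < w <= 1), for illustration only.
TOY_WEIGHTS = {"GCT": 1.0, "GCC": 0.5, "AAA": 0.25, "AAG": 1.0}


def codon_adaptation_index(dna, weights):
    """Geometric mean of the weights of the codons that appear in `weights`."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


def crc32_checksum(dna):
    """CRC32 of the sequence bytes, returned as an unsigned integer."""
    return zlib.crc32(dna.encode("ascii")) & 0xFFFFFFFF


cai = codon_adaptation_index("GCTGCCAAAAAG", TOY_WEIGHTS)
checksum = crc32_checksum("GCTGCCAAAAAG")
```

The checksum carries no biological meaning; it only gives each sequence a numeric fingerprint that can be stored alongside the other features.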
+ After that, the DNA was translated into amino acids and further descriptors were computed:
+
+ - Hydrophobicity
+ - Isoelectric point
+ - Aromaticity
+ - Instability
+ - Molecular weight
+ - Secondary structure percentage (helix, turn and sheet)
+ - Molar extinction (both oxidized and reduced)
+
+ All the computed data were stored in a CSV file, which was used to train a supervised ML model: a Voting Classifier (implemented in the scikit-learn package) made up of a HistGradientBoosting Classifier, a Decision Tree Classifier and an Extra Trees Classifier.
+
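As a rough illustration of how a hard-voting ensemble combines its members (`model.py` builds the classifier with `voting='hard'`), the sketch below takes a majority vote over made-up predictions. The `hard_vote` helper and the example labels are hypothetical, and the tie-breaking rule is a simplification that may not match scikit-learn's.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label chosen by most classifiers (first seen wins ties)."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical per-classifier predictions for three ORFs.
per_classifier = {
    "hist_gradient_boosting": ["verified", "dubious", "uncharacterized"],
    "decision_tree": ["verified", "verified", "uncharacterized"],
    "extra_trees": ["dubious", "dubious", "verified"],
}

# zip(*...) regroups the predictions so that each tuple holds the votes for one ORF.
ensemble = [hard_vote(votes) for votes in zip(*per_classifier.values())]
```

With three members, a label wins as soon as two classifiers agree on it.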
+ #### Validation
+ The resulting machine-learning model (called SacCerML) was then evaluated on the entire training set, yielding 99.93% accuracy. A key component of the training was k-fold cross-validation: SacCerML was trained on increasingly large fractions of the training data and tested on the remainder. It yielded a high accuracy (>84%) in all the tests, and the same holds for the recall, F1 and precision scores. The classification reports already showed a slight bias towards predicting verified and dubious ORFs, with more difficulty in predicting uncharacterized ORFs.
+
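For readers unfamiliar with k-fold cross-validation, here is a minimal sketch of how the train/test index splits can be generated. It is not the project's validation code: the `k_fold_indices` helper is hypothetical, and it neither shuffles nor stratifies the samples, which a real evaluation on imbalanced ORF classes would probably want.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
```

Each sample ends up in exactly one test fold, so every ORF is predicted once by a model that never saw it during training.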
+ ### Testing
+ Data were collected from ORFs of 10 *Saccharomyces cerevisiae* strains, different from the one used for training:
+
+ - AWRI1631
+ - BC187
+ - BY4741
+ - CBS7960
+ - FL100
+ - g833-1B
+ - Kyokai7
+ - LalvinQA23
+ - Vin13
+ - YS9
+
+ A total of 54452 transcripts were collected and processed into a CSV file by extracting the previously mentioned features. The model performed well: its overall accuracy, F1, precision and recall scores were always above 86%. Nevertheless, the slight bias towards verified and dubious ORFs was confirmed, although uncharacterized ORFs were also well detected in several tests.
+
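The metrics named above can be computed per class as in the following sketch. The labels are invented for illustration and are not results from the project; `classification_scores` is a hypothetical helper, not part of the repository.

```python
def classification_scores(y_true, y_pred):
    """Return overall accuracy plus per-class precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_class = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, per_class


# Made-up labels: one uncharacterized ORF is mistaken for a verified one.
y_true = ["verified", "verified", "dubious", "uncharacterized", "uncharacterized"]
y_pred = ["verified", "verified", "dubious", "verified", "uncharacterized"]
accuracy, per_class = classification_scores(y_true, y_pred)
```

In this toy case the error lowers the precision of the verified class and the recall of the uncharacterized class, which is the kind of per-class asymmetry a classification report makes visible.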
+ ## saccharomyces-pythia: gene calling and AI integration
+
+ SacCerML has now reached a new stage of its development (v1.0.0), where it has been rebranded as **saccharomyces-pythia**. You can now enjoy the following upgrades, which make it user-friendly and easy to install:
+
+ - [Gradio](https://www.gradio.app/) chatbot interface running completely locally on your computer
+ - Gene calling with automated ORF detection thanks to [orfipy](https://pypi.org/project/orfipy/): no need to preprocess your reads, just upload one or more FASTA files with *S. cerevisiae* DNA sequences to the chatbot.
+ - AI assistant, built upon [EleutherAI/pythia-160m-deduped-v0](https://huggingface.co/EleutherAI/pythia-160m-deduped-v0) finetuned on *Saccharomyces cerevisiae and its industrial applications* (Parapouli et al., 2020): this is a text-generation model that replies to researchers' questions (still a beta feature; it will become more stable in future releases).
+ - Docker image to download and run the application on your computer
+
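To give an idea of what automated ORF detection involves, the sketch below scans both strands and all three frames for `ATG`...stop stretches. It is a deliberately naive illustration and not orfipy's algorithm; `find_orfs` and its helpers are hypothetical names, and the 45-nucleotide minimum simply mirrors the `minlen` default used in `model.py`.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_len=45):
    """Return (strand, frame, sequence) for each ATG...stop ORF of at least min_len nt."""
    found = []
    forward = dna.upper()
    for strand, seq in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == START:
                    start = i  # first start codon of a new candidate ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        found.append((strand, frame, seq[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return found


# Toy sequence: a 48-nt ORF in forward frame 2, flanked by two bases on each side.
example = "CC" + "ATG" + "GCT" * 14 + "TAA" + "GG"
orfs_found = find_orfs(example)
```

Each detected ORF would then go through the feature extraction described in the training section before being classified.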
+ ## Try it out!
+ Use the following commands to run **saccharomyces-pythia** on your computer:
+
+ ```bash
+ docker pull ghcr.io/astrabert/saccharomyces-pythia:latest
+ docker run -p 7860:7860 ghcr.io/astrabert/saccharomyces-pythia:latest
+ ```
+ After 30 seconds to 1 minute, the app should be running at 0.0.0.0:7860 (Linux-based) or localhost:7860 (Windows-based).
+
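Because the container needs some time to start, a small polling helper can tell you when the interface is ready. This is only a sketch: `wait_for_app` and its parameters are hypothetical, and the default URL assumes the port mapping from the `docker run` command above.

```python
import time
import urllib.error
import urllib.request


def wait_for_app(url="http://localhost:7860", timeout=90.0, interval=2.0, probe=None):
    """Poll `url` until it answers or `timeout` seconds elapse; return True on success."""

    def http_probe(target):
        # Any successful HTTP response counts as "the app is up".
        try:
            with urllib.request.urlopen(target, timeout=5):
                return True
        except (urllib.error.URLError, OSError):
            return False

    probe = probe or http_probe  # `probe` can be swapped out, e.g. for testing
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False
```

Calling `wait_for_app()` right after `docker run` returns `True` once the Gradio page responds, or `False` if nothing answers within the timeout.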
+ ## References
+
+ * Saccharomyces Genome Database: <https://www.yeastgenome.org/>
+ * Biopython: <https://biopython.org/>
+ * Scikit-learn: <https://scikit-learn.org/stable/>
+ * Gradio: <https://www.gradio.app/>
+ * orfipy: <https://pypi.org/project/orfipy/>
+ * EleutherAI/pythia-160m-deduped-v0: <https://huggingface.co/EleutherAI/pythia-160m-deduped-v0>
+ * Parapouli et al., 2020: <https://doi.org/10.3934/microbiol.2020001>
+
+ Additionally, the following libraries and packages were used in the development of the machine learning model:
+
+ * NumPy: <https://numpy.org/>
+ * Pandas: <https://pandas.pydata.org/>
+
+ These libraries and packages were used for data manipulation, analysis, and model training.
+
+ ## License
+ The project is hereby provided under the MIT license.
+
+ If you are using saccharomyces-pythia for your work, please consider citing its author, [Astra Bertelli](https://astrabert.vercel.app)
+
+ *How was this README generated? Leveraging the power of AI with reAIdme, a HuggingChat assistant based on meta-llama/Llama-2-70b-chat-hf. Go and give it a try at this link: <https://hf.co/chat/assistant/660d9a4f590a7924eed02a32>! 🤖*
app.py ADDED
@@ -0,0 +1,69 @@
+ import gradio as gr
+ import os
+ import time
+ from transformers import pipeline
+ from predict import *
+ from load_model import *
+
+ def print_like_dislike(x: gr.LikeData):
+     print(x.index, x.value, x.liked)
+
+ def add_message(history, message):
+     if len(message["files"]) > 0:
+         history.append((message["files"], None))
+     if message["text"] is not None and message["text"] != "":
+         history.append((message["text"], None))
+     return history, gr.MultimodalTextbox(value=None, interactive=False)
+
+
+ def bot(history):
+     # Text messages go to the language model, uploaded files to gene prediction
+     if type(history[-1][0]) != tuple:
+         try:
+             pipe = pipeline("text-generation", tokenizer=tokenizer, model=model)
+             response = pipe(history[-1][0])[0]
+             response = response["generated_text"]
+         except Exception as e:
+             response = f"Sorry, the error '{e}' occurred while generating the response; check [troubleshooting documentation](https://astrabert.github.io/everything-rag/#troubleshooting) for more"
+         history[-1][1] = ""
+         for character in response:
+             history[-1][1] += character
+             time.sleep(0.05)
+             yield history
+     if type(history[-1][0]) == tuple:
+         filelist = []
+         for i in history[-1][0]:
+             filelist.append(i)
+         if len(filelist) > 1:
+             finalfasta = merge_fastas(filelist)
+         else:
+             finalfasta = filelist[0]
+         response = predict_genes(finalfasta)
+         history[-1][1] = ""
+         for character in response:
+             history[-1][1] += character
+             time.sleep(0.05)
+             yield history
+
+ with gr.Blocks() as demo:
+     chatbot = gr.Chatbot(
+         [[None, " Welcome to Saccharomyces-Pythia, your helpful assistant for all things Saccharomyces cerevisiae! I am here to provide you with fascinating facts about this important model organism, as well as aid in the prediction of open reading frames (ORFs) and their corresponding types from any S. cerevisiae genetic sequence you may have. Simply upload your FASTA file, and let me work my magic. Rest assured, accuracy and efficiency are at the core of my design. Prepare to be enlightened on the wonders of yeast genomics and beyond. Let's get started!"]],
+         label="Saccharomyces-Pythia",
+         elem_id="chatbot",
+         bubble_full_width=False,
+     )
+
+     chat_input = gr.MultimodalTextbox(interactive=True, file_types=[".fasta", ".fna", ".fas", ".fa"], placeholder="Enter message or upload file...", show_label=False)  # accept the uncompressed FASTA extensions handled by load_data
+
+     chat_msg = chat_input.submit(add_message, [chatbot, chat_input], [chatbot, chat_input])
+     bot_msg = chat_msg.then(bot, chatbot, chatbot, api_name="bot_response")
+     bot_msg.then(lambda: gr.MultimodalTextbox(interactive=True), None, [chat_input])
+
+     chatbot.like(print_like_dislike, None, None)
+     clear = gr.ClearButton(chatbot)
+
+ demo.queue()
+ if __name__ == "__main__":
+     demo.launch()
+
+
load_model.py ADDED
@@ -0,0 +1,6 @@
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+
+ model_checkpoint = "as-cle-bert/saccharomyces-pythia-v1"
+ model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
+ tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model.py ADDED
@@ -0,0 +1,281 @@
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+ from sklearn.ensemble import VotingClassifier
+ from sklearn.ensemble import HistGradientBoostingClassifier, ExtraTreesClassifier
+ from sklearn.tree import DecisionTreeClassifier
+ from Bio.SeqUtils.ProtParam import ProteinAnalysis
+ from Bio.SeqUtils.CheckSum import crc32
+ from Bio.SeqUtils.CodonUsage import CodonAdaptationIndex
+ from Bio.SeqUtils.CodonUsageIndices import SharpEcoliIndex
+ from Bio.SeqUtils import six_frame_translations
+ from Bio.Seq import Seq
+ from Bio import SeqIO
+ import gzip
+ from math import floor
+ from sklearn.metrics import accuracy_score
+ from orfipy_core import orfs
+ import sys
+ import matplotlib.pyplot as plt
+
+
+ def load_data(infile):
+     """Load data from infile if it is in fasta format (after having unzipped it, if it is zipped)"""
+     if infile.endswith(".gz"):  # If file is gzipped, unzip it
+         y = gzip.open(infile, "rt", encoding="latin-1")
+         # Read file as fasta if it is fasta
+         if (
+             infile.endswith(".fasta.gz")
+             or infile.endswith(".fna.gz")
+             or infile.endswith(".fas.gz")
+             or infile.endswith(".fa.gz")
+         ):
+             records = SeqIO.parse(y, "fasta")
+             sequences = {}
+             for record in records:
+                 sequences.update({str(record.id): str(record.seq)})
+             y.close()
+             return sequences
+         else:
+             y.close()
+             raise ValueError("File is the wrong format")
+     # Read file directly as fasta if it is a not zipped fasta: handle also more uncommon extensions :-)
+     elif (
+         infile.endswith(".fasta")
+         or infile.endswith(".fna")
+         or infile.endswith(".fas")
+         or infile.endswith(".fa")
+     ):
+         with open(infile, "r") as y:
+             records = SeqIO.parse(y, "fasta")
+             sequences = {}
+             for record in records:
+                 sequences.update({str(record.id): str(record.seq)})
+             y.close()
+         return sequences
+     else:
+         raise ValueError("File is the wrong format")
+
+
+ def calculate_cai(dna, index=SharpEcoliIndex):
+     cai = CodonAdaptationIndex()
+     cai.set_cai_index(index)
+     if len(dna) % 3 == 0:
+         a = cai.cai_for_gene(dna)
+     else:
+         six_translated = six_frame_translations(dna)
+         n = six_translated.split("\n")
+         frames = {
+             "0;F": n[5],
+             "1;F": n[6],
+             "2;F": n[7],
+             "0;R": n[12],
+             "1;R": n[11],
+             "2;R": n[10],
+         }
+         ind = 0
+         for i in list(frames.keys()):
+             k = frames[i].replace(" ", "")
+             if "M" in k and "*" in k:
+                 if i.split(";")[0] == "F" and k.index("M") < k.index("*"):
+                     if len(k) <= len(dna) / 3:
+                         ind = int(i.split(";")[0])
+                         break
+                 elif i.split(";")[0] == "R" and k.index("M") > k.index("*"):
+                     if len(k) <= len(dna) / 3:
+                         ind = len(dna) - int(i.split(";")[0])
+                         break
+         if ind == 0:
+             cods = 3 * floor(len(dna) / 3)
+             dna = dna[:cods]
+             a = cai.cai_for_gene(dna)
+         elif 1 <= ind <= 2:
+             if len(dna[ind:]) % 3 == 0:
+                 dna = dna[ind:]
+             else:
+                 cods = 3 * floor((len(dna) - ind) / 3)
+                 dna = dna[ind : cods + ind]
+             a = cai.cai_for_gene(dna)
+         else:
+             if len(dna[:ind]) % 3 == 0:
+                 dna = dna[ind:]
+             else:
+                 cods = 3 * floor((len(dna) - ind) / 3)
+                 dna = dna[:cods]
+             a = cai.cai_for_gene(dna)
+     return a
+
+
+ def checksum(dna):
+     return crc32(dna)
+
+
+ def hidrophobicity(dna):
+     protein_sequence = str(Seq(dna).translate())
+     protein_sequence = protein_sequence.replace("*", "")
+     hydrophobicity_score = ProteinAnalysis(protein_sequence).gravy()
+     return hydrophobicity_score
+
+
+ def isoelectric_pt(dna):
+     protein_sequence = str(Seq(dna).translate())
+     protein_sequence = protein_sequence.replace("*", "")
+     isoelectric = ProteinAnalysis(protein_sequence).isoelectric_point()
+     return isoelectric
+
+
+ def aromatic(dna):
+     protein_sequence = str(Seq(dna).translate())
+     protein_sequence = protein_sequence.replace("*", "")
+     arom = ProteinAnalysis(protein_sequence).aromaticity()
+     return arom
+
+
+ def instable(dna):
+     protein_sequence = str(Seq(dna).translate())
+     protein_sequence = protein_sequence.replace("*", "")
+     inst = ProteinAnalysis(protein_sequence).instability_index()
+     return inst
+
+
+ def weight(dna):
+     protein_sequence = str(Seq(dna).translate())
+     protein_sequence = protein_sequence.replace("*", "")
+     wgt = ProteinAnalysis(protein_sequence).molecular_weight()
+     return wgt
+
+
+ def sec_struct(dna):
+     protein_sequence = str(Seq(dna).translate())
+     protein_sequence = protein_sequence.replace("*", "")
+     second_struct = ProteinAnalysis(protein_sequence).secondary_structure_fraction()
+     return ",".join([str(s) for s in second_struct])
+
+
+ def mol_ext(dna):
+     protein_sequence = str(Seq(dna).translate())
+     protein_sequence = protein_sequence.replace("*", "")
+     molar_ext = ProteinAnalysis(protein_sequence).molar_extinction_coefficient()
+     return ",".join([str(s) for s in molar_ext])
+
+
+ def longest_orf(coding):
+     keys_M_starting = [
+         key
+         for key in list(coding.keys())
+         if str(Seq(coding[key]).translate()).startswith("M")
+     ]
+     M_starting = [
+         seq
+         for seq in list(coding.values())
+         if str(Seq(seq).translate()).startswith("M")
+     ]
+     lengths = [len(seq) for seq in M_starting]
+     max_ind = lengths.index(max(lengths))
+     return {keys_M_starting[max_ind]: M_starting[max_ind]}
+
+
+ def predict_orf(seq, minlen=45, maxlen=18000, longest_M_starting_orf_only=True):
+     ls = orfs(seq, minlen=minlen, maxlen=maxlen)
+     coding = {}
+     count = 0
+     for start, stop, strand, description in ls:
+         count += 1
+         coding.update({f"ORF.{count}": seq[int(start) : int(stop)]})
+     if longest_M_starting_orf_only:
+         print(
+             "\n---------------------------\nWarning: option longest_M_starting_orf_only is set to True and thus you will get only the longest M-starting ORF; to get all the ORFs, set it to False\n---------------------------\n",
+             file=sys.stderr,
+         )
+         return longest_orf(coding)
+     return coding
+
+
+ def process_dna(fasta_file):
+     fas = load_data(fasta_file)
+     seqs = [seq for seq in list(fas.values())]
+     heads = [seq for seq in list(fas.keys())]
+     data = {}
+     proteins = {}
+     for i in range(len(seqs)):
+         coding = predict_orf(seqs[i])
+         open_reading_frames = list(coding.keys())
+         for key in open_reading_frames:
+             head = f"{heads[i]}.{key}"
+             proteins.update({head: str(Seq(coding[key]).translate())})
+             cai = calculate_cai(coding[key])
+             cksm = checksum(coding[key])
+             hydr = hidrophobicity(coding[key])
+             isl = isoelectric_pt(coding[key])
+             arm = aromatic(coding[key])
+             inst = instable(coding[key])
+             mw = weight(coding[key])
+             se_st = sec_struct(coding[key]).split(",")
+             se_st1 = se_st[0]
+             se_st2 = se_st[1]
+             se_st3 = se_st[2]
+             me = mol_ext(coding[key]).split(",")
+             me1 = me[0]
+             me2 = me[1]
+             n = pd.DataFrame(
+                 {
+                     "CAI": [cai],
+                     "CHECKSUM": [cksm],
+                     "HIDROPHOBICITY": [hydr],
+                     "ISOELECTRIC": [isl],
+                     "AROMATIC": [arm],
+                     "INSTABLE": [inst],
+                     "MW": [mw],
+                     "HELIX": [se_st1],
+                     "TURN": [se_st2],
+                     "SHEET": [se_st3],
+                     "MOL_EXT_RED": [me1],
+                     "MOL_EXT_OX": [me2],
+                 }
+             )
+             data.update({head: n})
+     return data, proteins
+
+ if __name__ == "__main__":
+     print("Loading data...")
+     # Load the data from the CSV file
+     data = pd.read_csv("../../data/scerevisiae.csv")
+     print("Loaded data")
+
+     print("Generating training and test data...")
+     # Features
+     X = data.iloc[:, 1:]
+
+     # Labels
+     y = data["ORF_TYPE"]
+
+
+     # Split the data into training and testing sets (the split is not used below)
+     X_train, X_test, y_train, y_test = train_test_split(
+         X, y, test_size=0.2, random_state=42
+     )
+     print("Generated training and test data")
+
+     print("Building and training the model...")
+     # Create the hard-voting ensemble of three tree-based classifiers
+     clf4 = DecisionTreeClassifier()
+     clf7 = HistGradientBoostingClassifier()
+     clf8 = ExtraTreesClassifier()
+     classifier = VotingClassifier([('dt', clf4), ('hgb', clf7), ('etc', clf8)], voting='hard')
+
+     model = classifier.fit(X, y)  # Train on the full dataset
+
+
+     # Make predictions on the full dataset (the same data used for training)
+     y_pred = model.predict(X)
+
+     # Evaluate the training accuracy of the model
+     accuracy = accuracy_score(y, y_pred)
+     print(f"Accuracy: {accuracy}")
+
+     from joblib import dump
+
+     print("Saving model...")
+     dump(model, "SacCerML.joblib")
+     print("Saved")
+
+     print("All done")
predict.py ADDED
@@ -0,0 +1,40 @@
+ from joblib import load
+ from model import process_dna
+
+
+
+ loaded_model = load("SacCerML.joblib")
+
+ def merge_fastas(fileslist):
+     finale = []
+     finalfile = fileslist[-1].split(".")[0]+"_mergedfastas.fasta"
+     for fl in fileslist:
+         f = open(fl, "r")
+         lines = f.readlines()
+         f.close()
+         for line in lines:
+             finale.append(line)
+     fnlfl = open(finalfile, "w")
+     for l in finale:
+         if l.endswith("\n"):
+             fnlfl.write(l)
+         else:
+             fnlfl.write(l+"\n")
+     fnlfl.close()
+     return finalfile
+
+ def predict_genes(infile, model=loaded_model):
+     X, proteins = process_dna(infile)
+     headers = list(X.keys())
+     predictions = []
+     for x in list(X.values()):
+         p = model.predict(x)
+         predictions.append(p)
+     msg = []
+     for i in range(len(predictions)):
+         msg.append(
+             f"{headers[i]} protein sequence is\n{proteins[headers[i]]}\nand is predicted as {predictions[i][0]}\n"
+         )
+     message = "".join(msg)
+     return message
+
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ biopython==1.81
+ orfipy==0.0.4
+ scikit-learn==1.2.2
+ pandas==2.0.3
+ gradio==4.25.0
+ transformers
+ trl
+ peft