AstraBert committed
Commit · 01554a7 · 1 Parent(s): 53c5633
first commit

Browse files
- .gitignore +1 -0
- README.md +104 -4
- app.py +69 -0
- load_model.py +6 -0
- model.py +281 -0
- predict.py +40 -0
- requirements.txt +8 -0

.gitignore ADDED
@@ -0,0 +1 @@
SacCerML.joblib

README.md CHANGED
@@ -1,13 +1,113 @@
---
title: Saccharomyces Pythia
-emoji:
+emoji: 🍄
colorFrom: purple
colorTo: gray
sdk: gradio
-sdk_version: 4.
+sdk_version: 4.25.0
app_file: app.py
-pinned:
+pinned: true
license: apache-2.0
---

# saccharomyces-pythia: an ML/AI-integrated *Saccharomyces cerevisiae* assistant

## Table of Contents

1. [Introduction](#introduction)
2. [SacCerML: the base ML model](#saccerml-the-base-ml-model)
   - [Training](#training)
     * [Data and preprocessing](#data-and-preprocessing)
     * [Validation](#validation)
   - [Testing](#testing)
3. [saccharomyces-pythia: gene calling and AI integration](#saccharomyces-pythia-gene-calling-and-ai-integration)
4. [Try it out!](#try-it-out)
5. [References](#references)
6. [License](#license)

## Introduction

**saccharomyces-pythia** is the new, rebranded v1.0.0 of SacCerML. Initially conceived as a Python script that leveraged machine learning and bioinformatics tools to predict genes in *Saccharomyces cerevisiae* (baker's yeast) genomic sequences, it is now a complete, AI-integrated tool that can assist researchers both as a chatbot and as an ORF predictor.

## SacCerML: the base ML model

### Training

#### Data and preprocessing

All the annotated coding DNA sequences for *S. cerevisiae* (strain S288C) were downloaded from the Saccharomyces Genome Database.

These genetic sequences were split according to their ORF classification (verified, dubious, uncharacterized, pseudogene and transposable element), and for each of them the following parameters were calculated:

- Codon Adaptation Index
- Checksum

After that, the DNA was translated into amino acids and further protein-level descriptors were retrieved (a minimal extraction sketch follows the list):

- Hydrophobicity
- Isoelectric point
- Aromaticity
- Instability
- Molecular weight
- Secondary structure percentage (helix, turn and sheet)
- Molar extinction coefficient (both oxidized and reduced)
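
A minimal sketch (not part of the original codebase) of how these protein descriptors can be computed with Biopython, assuming `dna` holds an in-frame coding DNA string; it mirrors the helper functions defined in `model.py` below, and the `protein_descriptors` helper and its column names are illustrative only:

```python
from Bio.Seq import Seq
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio.SeqUtils.CheckSum import crc32


def protein_descriptors(dna: str) -> dict:
    # Illustrative helper: translate the coding sequence (assumed in-frame) and
    # drop stop symbols before running Biopython's protein analysis
    protein = str(Seq(dna).translate()).replace("*", "")
    analysis = ProteinAnalysis(protein)
    helix, turn, sheet = analysis.secondary_structure_fraction()
    reduced, oxidized = analysis.molar_extinction_coefficient()
    return {
        "CHECKSUM": crc32(dna),
        "HIDROPHOBICITY": analysis.gravy(),        # hydrophobicity (GRAVY score)
        "ISOELECTRIC": analysis.isoelectric_point(),
        "AROMATIC": analysis.aromaticity(),
        "INSTABLE": analysis.instability_index(),
        "MW": analysis.molecular_weight(),
        "HELIX": helix, "TURN": turn, "SHEET": sheet,
        "MOL_EXT_RED": reduced, "MOL_EXT_OX": oxidized,
    }
```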

All the computed data were stored in a CSV file, which was used to train a supervised ML model: a Voting Classifier (implemented in the scikit-learn package) made up of a HistGradientBoosting Classifier, a Decision Tree Classifier and an Extra Trees Classifier.
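
As a rough sketch, the ensemble described above can be assembled with scikit-learn along these lines (the `ORF_TYPE` label column and estimator choice follow `model.py` in this commit; the `scerevisiae.csv` path is illustrative and hyperparameters are left at their defaults):

```python
import pandas as pd
from sklearn.ensemble import VotingClassifier, HistGradientBoostingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

# Load the feature table; as in model.py, features are all columns after the
# first and the class label lives in the ORF_TYPE column
data = pd.read_csv("scerevisiae.csv")
X, y = data.iloc[:, 1:], data["ORF_TYPE"]

# Hard-voting ensemble of the three tree-based classifiers
ensemble = VotingClassifier(
    [("dt", DecisionTreeClassifier()),
     ("hgb", HistGradientBoostingClassifier()),
     ("etc", ExtraTreesClassifier())],
    voting="hard",
)
model = ensemble.fit(X, y)
```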

#### Validation

The resulting machine-learning model (called SacCerML) was then evaluated on the entire training set, yielding 99.93% accuracy. A key component of the training was k-fold cross-validation: SacCerML was trained on increasingly wider portions of the training data and tested on the remainder, yielding high accuracy (>84%) in all the tests, with similarly high recall, F1 and precision scores. The classification reports already showed a slight bias towards predicting verified and dubious ORFs, with more difficulty in predicting uncharacterized ORFs.
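
A minimal sketch of this kind of check with scikit-learn, reusing `ensemble`, `X` and `y` from the previous snippet; the number of folds and the training fractions shown here are assumptions, not the exact protocol used for SacCerML:

```python
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report

# k-fold cross-validation accuracy (5 folds assumed)
scores = cross_val_score(ensemble, X, y, cv=5)
print("CV accuracy per fold:", scores)

# Train on a growing share of the data and inspect per-class behaviour on the held-out rest
for train_frac in (0.5, 0.7, 0.9):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=train_frac, random_state=42)
    fitted = ensemble.fit(X_tr, y_tr)
    print(f"train fraction {train_frac}:")
    print(classification_report(y_te, fitted.predict(X_te)))
```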

### Testing

Data were collected from ORFs of 10 *Saccharomyces cerevisiae* strains, different from the one used for training:

- AWRI1631
- BC187
- BY4741
- CBS7960
- FL100
- g833-1B
- Kyokai7
- LalvinQA23
- Vin13
- YS9

A total of 54452 transcripts were collected and processed into a CSV file by extracting the previously mentioned features. The model performed well, with overall accuracy, F1, precision and recall scores always above 86%. Nevertheless, the slight bias towards verified and dubious ORFs was confirmed, though uncharacterized ORFs were also well detected in several tests.

## saccharomyces-pythia: gene calling and AI integration

SacCerML has now reached a new stage of its development (v1.0.0), where it has been rebranded as **saccharomyces-pythia**. You can now enjoy the following upgrades, which make it user-friendly and easy to install:

- [Gradio](https://www.gradio.app/) chatbot interface running completely locally on your computer
- Gene calling with automated ORF detection thanks to [orfipy](https://pypi.org/project/orfipy/) (see the sketch after this list): no need to preprocess your reads, just upload one or more FASTA files with *S. cerevisiae* DNA sequences to the chatbot.
- AI assistant, built upon [EleutherAI/pythia-160m-deduped-v0](https://huggingface.co/EleutherAI/pythia-160m-deduped-v0) finetuned on *Saccharomyces cerevisiae and its industrial applications* (Parapouli et al., 2020): this is a text-generation model that will reply to researchers' questions (still a beta feature that will become more stable in future releases).
- Docker image to download and run the application on your computer
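
A minimal sketch of the orfipy-based ORF detection, following the `orfs` call used in `model.py`; the `detect_orfs` helper name is illustrative, while the 45/18000 bp length bounds mirror that script's defaults rather than orfipy's:

```python
from orfipy_core import orfs


def detect_orfs(seq: str, minlen: int = 45, maxlen: int = 18000) -> dict:
    # orfipy yields (start, stop, strand, description) tuples for each candidate ORF
    coding = {}
    for count, (start, stop, strand, description) in enumerate(
        orfs(seq, minlen=minlen, maxlen=maxlen), start=1
    ):
        coding[f"ORF.{count}"] = seq[int(start):int(stop)]
    return coding
```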

## Try it out!

Use the following commands to run **saccharomyces-pythia** on your computer:

```bash
docker pull ghcr.io/astrabert/saccharomyces-pythia:latest
docker run -p 7860:7860 ghcr.io/astrabert/saccharomyces-pythia:latest
```

Just wait 30 s to 1 min: the app should then be reachable at 0.0.0.0:7860 (on Linux) or localhost:7860 (on Windows).

## References

* Saccharomyces Genome Database: <https://www.yeastgenome.org/>
* Biopython: <https://biopython.org/>
* Scikit-learn: <https://scikit-learn.org/stable/>
* Gradio: <https://www.gradio.app/>
* orfipy: <https://pypi.org/project/orfipy/>
* EleutherAI/pythia-160m-deduped-v0: <https://huggingface.co/EleutherAI/pythia-160m-deduped-v0>
* Parapouli et al., 2020: <https://doi.org/10.3934/microbiol.2020001>

Additionally, the following libraries and packages were used in the development of the machine learning model:

* NumPy: <https://numpy.org/>
* Pandas: <https://pandas.pydata.org/>

These libraries and packages were used for data manipulation, analysis, and model training.

## License

The project is hereby provided under the MIT license.

If you are using saccharomyces-pythia for your work, please consider citing its author, [Astra Bertelli](https://astrabert.vercel.app).

*How was this README generated? By leveraging the power of AI with reAIdme, a HuggingChat assistant based on meta-llama/Llama-2-70b-chat-hf. Go and give it a try at this link: <https://hf.co/chat/assistant/660d9a4f590a7924eed02a32>! 🤖*

app.py ADDED
@@ -0,0 +1,69 @@
import gradio as gr
import os
import time
from transformers import pipeline
from predict import *
from load_model import *


def print_like_dislike(x: gr.LikeData):
    print(x.index, x.value, x.liked)


def add_message(history, message):
    # Queue uploaded files and/or the text message as new chat turns
    if len(message["files"]) > 0:
        history.append((message["files"], None))
    if message["text"] is not None and message["text"] != "":
        history.append((message["text"], None))
    return history, gr.MultimodalTextbox(value=None, interactive=False)


def bot(history):
    if type(history[-1][0]) != tuple:
        # Plain text message: answer with the finetuned text-generation model
        try:
            pipe = pipeline("text-generation", tokenizer=tokenizer, model=model)
            response = pipe(history[-1][0])[0]
            response = response["generated_text"]
        except Exception as e:
            response = f"Sorry, the error '{e}' occurred while generating the response; check [troubleshooting documentation](https://astrabert.github.io/everything-rag/#troubleshooting) for more"
        history[-1][1] = ""
        for character in response:
            history[-1][1] += character
            time.sleep(0.05)
            yield history
    if type(history[-1][0]) == tuple:
        # File upload: merge the FASTA files (if more than one) and predict ORF types
        filelist = []
        for i in history[-1][0]:
            filelist.append(i)
        if len(filelist) > 1:
            finalfasta = merge_fastas(filelist)
        else:
            finalfasta = filelist[0]
        response = predict_genes(finalfasta)
        history[-1][1] = ""
        for character in response:
            history[-1][1] += character
            time.sleep(0.05)
            yield history


with gr.Blocks() as demo:
    chatbot = gr.Chatbot(
        [[None, "Welcome to Saccharomyces-Pythia, your helpful assistant for all things Saccharomyces cerevisiae! I am here to provide you with fascinating facts about this important model organism, as well as aid in the prediction of open reading frames (ORFs) and their corresponding types from any S. cerevisiae genetic sequence you may have. Simply upload your FASTA file, and let me work my magic. Rest assured, accuracy and efficiency are at the core of my design. Prepare to be enlightened on the wonders of yeast genomics and beyond. Let's get started!"]],
        label="Saccharomyces-Pythia",
        elem_id="chatbot",
        bubble_full_width=False,
    )

    # Accept FASTA uploads (the downstream parser only handles FASTA files)
    chat_input = gr.MultimodalTextbox(interactive=True, file_types=[".fasta", ".fna", ".fas", ".fa"], placeholder="Enter message or upload file...", show_label=False)

    chat_msg = chat_input.submit(add_message, [chatbot, chat_input], [chatbot, chat_input])
    bot_msg = chat_msg.then(bot, chatbot, chatbot, api_name="bot_response")
    bot_msg.then(lambda: gr.MultimodalTextbox(interactive=True), None, [chat_input])

    chatbot.like(print_like_dislike, None, None)
    clear = gr.ClearButton(chatbot)

demo.queue()
if __name__ == "__main__":
    demo.launch()

load_model.py ADDED
@@ -0,0 +1,6 @@
from transformers import AutoModelForCausalLM, AutoTokenizer


model_checkpoint = "as-cle-bert/saccharomyces-pythia-v1"
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

model.py ADDED
@@ -0,0 +1,281 @@
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier, HistGradientBoostingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio.SeqUtils.CheckSum import crc32
from Bio.SeqUtils.CodonUsage import CodonAdaptationIndex
from Bio.SeqUtils.CodonUsageIndices import SharpEcoliIndex
from Bio.SeqUtils import six_frame_translations
from Bio.Seq import Seq
from Bio import SeqIO
import gzip
from math import floor
from sklearn.metrics import accuracy_score
from orfipy_core import orfs
import sys
import matplotlib.pyplot as plt


def load_data(infile):
    """Load sequences from infile if it is in FASTA format (after unzipping it, if it is gzipped)."""
    if infile.endswith(".gz"):  # If file is gzipped, unzip it
        y = gzip.open(infile, "rt", encoding="latin-1")
        # Read file as fasta if it is fasta
        if (
            infile.endswith(".fasta.gz")
            or infile.endswith(".fna.gz")
            or infile.endswith(".fas.gz")
            or infile.endswith(".fa.gz")
        ):
            records = SeqIO.parse(y, "fasta")
            sequences = {}
            for record in records:
                sequences.update({str(record.id): str(record.seq)})
            y.close()
            return sequences
        else:
            y.close()
            raise ValueError("File is the wrong format")
    # Read file directly as fasta if it is a non-zipped fasta: handle also more uncommon extensions :-)
    elif (
        infile.endswith(".fasta")
        or infile.endswith(".fna")
        or infile.endswith(".fas")
        or infile.endswith(".fa")
    ):
        with open(infile, "r") as y:
            records = SeqIO.parse(y, "fasta")
            sequences = {}
            for record in records:
                sequences.update({str(record.id): str(record.seq)})
            return sequences
    else:
        raise ValueError("File is the wrong format")


def calculate_cai(dna, index=SharpEcoliIndex):
    """Compute the Codon Adaptation Index, trimming the sequence to an in-frame multiple of three if needed."""
    cai = CodonAdaptationIndex()
    cai.set_cai_index(index)
    if len(dna) % 3 == 0:
        a = cai.cai_for_gene(dna)
    else:
        six_translated = six_frame_translations(dna)
        n = six_translated.split("\n")
        frames = {
            "0;F": n[5],
            "1;F": n[6],
            "2;F": n[7],
            "0;R": n[12],
            "1;R": n[11],
            "2;R": n[10],
        }
        ind = 0
        for i in list(frames.keys()):
            k = frames[i].replace(" ", "")
            if "M" in k and "*" in k:
                if i.split(";")[1] == "F" and k.index("M") < k.index("*"):
                    if len(k) <= len(dna) / 3:
                        ind = int(i.split(";")[0])
                        break
                elif i.split(";")[1] == "R" and k.index("M") > k.index("*"):
                    if len(k) <= len(dna) / 3:
                        ind = len(dna) - int(i.split(";")[0])
                        break
        if ind == 0:
            cods = 3 * floor(len(dna) / 3)
            dna = dna[:cods]
            a = cai.cai_for_gene(dna)
        elif 1 <= ind <= 2:
            # Forward-strand frame offset: drop the leading out-of-frame bases
            if len(dna[ind:]) % 3 == 0:
                dna = dna[ind:]
            else:
                cods = 3 * floor((len(dna) - ind) / 3)
                dna = dna[ind : cods + ind]
            a = cai.cai_for_gene(dna)
        else:
            # Reverse-strand frame: keep the in-frame prefix, trimmed to a multiple of three
            if len(dna[:ind]) % 3 == 0:
                dna = dna[:ind]
            else:
                cods = 3 * floor(ind / 3)
                dna = dna[:cods]
            a = cai.cai_for_gene(dna)
    return a


def checksum(dna):
    return crc32(dna)


def hidrophobicity(dna):
    protein_sequence = str(Seq(dna).translate())
    protein_sequence = protein_sequence.replace("*", "")
    hydrophobicity_score = ProteinAnalysis(protein_sequence).gravy()
    return hydrophobicity_score


def isoelectric_pt(dna):
    protein_sequence = str(Seq(dna).translate())
    protein_sequence = protein_sequence.replace("*", "")
    isoelectric = ProteinAnalysis(protein_sequence).isoelectric_point()
    return isoelectric


def aromatic(dna):
    protein_sequence = str(Seq(dna).translate())
    protein_sequence = protein_sequence.replace("*", "")
    arom = ProteinAnalysis(protein_sequence).aromaticity()
    return arom


def instable(dna):
    protein_sequence = str(Seq(dna).translate())
    protein_sequence = protein_sequence.replace("*", "")
    inst = ProteinAnalysis(protein_sequence).instability_index()
    return inst


def weight(dna):
    protein_sequence = str(Seq(dna).translate())
    protein_sequence = protein_sequence.replace("*", "")
    wgt = ProteinAnalysis(protein_sequence).molecular_weight()
    return wgt


def sec_struct(dna):
    protein_sequence = str(Seq(dna).translate())
    protein_sequence = protein_sequence.replace("*", "")
    second_struct = ProteinAnalysis(protein_sequence).secondary_structure_fraction()
    return ",".join([str(s) for s in second_struct])


def mol_ext(dna):
    protein_sequence = str(Seq(dna).translate())
    protein_sequence = protein_sequence.replace("*", "")
    molar_ext = ProteinAnalysis(protein_sequence).molar_extinction_coefficient()
    return ",".join([str(s) for s in molar_ext])


def longest_orf(coding):
    """Return the longest methionine-starting ORF as a one-entry dict."""
    keys_M_starting = [
        key
        for key in list(coding.keys())
        if str(Seq(coding[key]).translate()).startswith("M")
    ]
    M_starting = [
        seq
        for seq in list(coding.values())
        if str(Seq(seq).translate()).startswith("M")
    ]
    lengths = [len(seq) for seq in M_starting]
    max_ind = lengths.index(max(lengths))
    return {keys_M_starting[max_ind]: M_starting[max_ind]}


def predict_orf(seq, minlen=45, maxlen=18000, longest_M_starting_orf_only=True):
    ls = orfs(seq, minlen=minlen, maxlen=maxlen)
    coding = {}
    count = 0
    for start, stop, strand, description in ls:
        count += 1
        coding.update({f"ORF.{count}": seq[int(start) : int(stop)]})
    if longest_M_starting_orf_only:
        print(
            "\n---------------------------\nWarning: option longest_M_starting_orf_only is set to True and thus you will get only the longest M-starting ORF; to get all the ORFs, set it to False\n---------------------------\n",
            file=sys.stderr,
        )
        return longest_orf(coding)
    return coding


def process_dna(fasta_file):
    """Extract the SacCerML feature table (and translated proteins) for each predicted ORF in a FASTA file."""
    fas = load_data(fasta_file)
    seqs = [seq for seq in list(fas.values())]
    heads = [seq for seq in list(fas.keys())]
    data = {}
    proteins = {}
    for i in range(len(seqs)):
        coding = predict_orf(seqs[i])
        open_reading_frames = list(coding.keys())
        for key in open_reading_frames:
            head = f"{heads[i]}.{key}"
            proteins.update({head: str(Seq(coding[key]).translate())})
            cai = calculate_cai(coding[key])
            cksm = checksum(coding[key])
            hydr = hidrophobicity(coding[key])
            isl = isoelectric_pt(coding[key])
            arm = aromatic(coding[key])
            inst = instable(coding[key])
            mw = weight(coding[key])
            se_st = sec_struct(coding[key]).split(",")
            se_st1 = se_st[0]
            se_st2 = se_st[1]
            se_st3 = se_st[2]
            me = mol_ext(coding[key]).split(",")
            me1 = me[0]
            me2 = me[1]
            n = pd.DataFrame(
                {
                    "CAI": [cai],
                    "CHECKSUM": [cksm],
                    "HIDROPHOBICITY": [hydr],
                    "ISOELECTRIC": [isl],
                    "AROMATIC": [arm],
                    "INSTABLE": [inst],
                    "MW": [mw],
                    "HELIX": [se_st1],
                    "TURN": [se_st2],
                    "SHEET": [se_st3],
                    "MOL_EXT_RED": [me1],
                    "MOL_EXT_OX": [me2],
                }
            )
            data.update({head: n})
    return data, proteins


if __name__ == "__main__":
    print("Loading data...")
    # Load the data from the CSV file
    data = pd.read_csv("../../data/scerevisiae.csv")
    print("Loaded data")

    print("Generating training and test data...")
    # Features
    X = data.iloc[:, 1:]

    # Labels
    y = data["ORF_TYPE"]

    # Split the data into training and testing sets (not used below: the final model is fit on the full dataset)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print("Generated training and test data")

    print("Building and training the model...")
    # Create the voting ensemble (decision tree + histogram gradient boosting + extra trees)
    clf4 = DecisionTreeClassifier()
    clf7 = HistGradientBoostingClassifier()
    clf8 = ExtraTreesClassifier()
    classifier = VotingClassifier([('dt', clf4), ('hgb', clf7), ('etc', clf8)], voting='hard')

    model = classifier.fit(X, y)  # Train the ensemble on the full dataset

    # Make predictions on the full (training) dataset
    y_pred = model.predict(X)

    # Evaluate the accuracy of the model
    accuracy = accuracy_score(y, y_pred)
    print(f"Accuracy: {accuracy}")

    from joblib import dump

    print("Saving model...")
    dump(model, "SacCerML.joblib")
    print("Saved")

    print("All done")

predict.py ADDED
@@ -0,0 +1,40 @@
from joblib import load
from model import process_dna


loaded_model = load("SacCerML.joblib")


def merge_fastas(fileslist):
    # Concatenate several FASTA files into one merged file and return its path
    finale = []
    finalfile = fileslist[-1].split(".")[0] + "_mergedfastas.fasta"
    for fl in fileslist:
        f = open(fl, "r")
        lines = f.readlines()
        f.close()
        for line in lines:
            finale.append(line)
    fnlfl = open(finalfile, "w")
    for l in finale:
        if l.endswith("\n"):
            fnlfl.write(l)
        else:
            fnlfl.write(l + "\n")
    fnlfl.close()
    return finalfile


def predict_genes(infile, model=loaded_model):
    # Extract features for each predicted ORF and classify it with the SacCerML model
    X, proteins = process_dna(infile)
    headers = list(X.keys())
    predictions = []
    for x in list(X.values()):
        p = model.predict(x)
        predictions.append(p)
    msg = []
    for i in range(len(predictions)):
        msg.append(
            f"{headers[i]} protein sequence is\n{proteins[headers[i]]}\nand is predicted as {predictions[i][0]}\n"
        )
    message = "".join(msg)
    return message

requirements.txt ADDED
@@ -0,0 +1,8 @@
biopython==1.81
orfipy==0.0.4
scikit-learn==1.2.2
pandas==2.0.3
gradio==4.25.0
transformers
trl
peft