---
tags:
- text-classification
- transformers
- biobert
- miRNA
- biomedical
- LoRA
- fine-tuning
library_name: transformers
datasets:
- custom-biomedical-dataset
license: apache-2.0
---

# 🧬 miRNA-BioBERT: Fine-Tuned BioBERT for miRNA Sentence Classification

**Fine-tuned BioBERT model for classifying miRNA-related sentences in biomedical research papers.**

---

## 📌 Overview

**miRNA-BioBERT** is a fine-tuned version of [BioBERT](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1), trained specifically to **classify sentences** as **miRNA-related (relevant) or not (irrelevant)**. The model is useful for **automating literature reviews**, **extracting relevant sentences**, and **identifying key insights** in genomic research.

✔ **Base Model**: `dmis-lab/biobert-base-cased-v1.1`
✔ **Fine-tuning Method**: **LoRA (Low-Rank Adaptation)**
✔ **Dataset**: **Curated biomedical text corpus containing labeled miRNA-relevant and non-relevant sentences**
✔ **Task**: **Binary classification (1 = functional/relevant, 0 = non-functional/irrelevant)**
✔ **Trained on**: **NVIDIA RTX A6000 GPU (5 epochs, batch size 32, learning rate 2e-5)**

## 🚀 How to Use the Model

### 1️⃣ Install Dependencies

```bash
pip install transformers torch
```

### 2️⃣ Run Inference

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the model and tokenizer
model_name = "debjit20504/miRNA-biobert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Move the model to MPS (Apple Silicon), CUDA, or CPU, whichever is available
device = torch.device(
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)
model.to(device)
model.eval()

def classify_text(text):
    """Classify one sentence as functional (miRNA-relevant) or non-functional."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        output = model(**inputs)
    label = torch.argmax(output.logits, dim=1).item()
    return "functional" if label == 1 else "non-functional"

# Example test
sample_text = "The results showed that miR-223-3p decreased in glioblastoma tissue but NLRP3 increased."
print(f"Classification: {classify_text(sample_text)}")
```

A batch variant of `classify_text` for screening many sentences at once is sketched at the end of this card.

## 📊 Training Details

- **Dataset**: Biomedical text dataset with 429,785 relevant sentences and 87,966 irrelevant sentences.
- **Fine-tuning method**: LoRA (Low-Rank Adaptation) for parameter-efficient training.
- **Training hardware**: NVIDIA RTX A6000 GPU.
- **Training settings**:
  - Batch size: 32
  - Learning rate: 2e-5
  - Optimizer: AdamW
  - Warmup steps: 1000
  - Epochs: 5
  - Mixed precision (fp16): ✅ Enabled for efficiency.

A sketch of how this LoRA setup could be reproduced is included at the end of this card.

---

## 📖 Model Applications

✅ **Biomedical NLP** – Extracting meaningful information from biomedical literature.
✅ **miRNA Research** – Identifying sentences discussing miRNA mechanisms.
✅ **Automated Literature Review** – Filtering relevant studies efficiently.
✅ **Genomics & Bioinformatics** – Enhancing data retrieval from scientific texts.

---

## 📬 Contact

For questions or collaborations, reach out via:

**📧 Email**: debjit.pramanik@postgrad.manchester.ac.uk
**🔗 LinkedIn**: https://www.linkedin.com/in/debjit-pramanik-88a837171/
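
---

## 🧪 Batch Classification (sketch)

When filtering whole papers or corpora, classifying one sentence at a time is slow. Below is a minimal batched variant of `classify_text`; it assumes the `model`, `tokenizer`, and `device` objects from the inference snippet above are already in scope. The `batch_size` and `max_length` values are illustrative defaults, not settings prescribed by this card.

```python
def classify_batch(texts, batch_size=32):
    """Classify a list of sentences; returns one label per input sentence."""
    labels = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        # Pad/truncate so sentences of different lengths fit in one tensor
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512,
        ).to(device)
        with torch.no_grad():
            logits = model(**inputs).logits
        preds = torch.argmax(logits, dim=1).tolist()
        labels.extend("functional" if p == 1 else "non-functional" for p in preds)
    return labels

# Example usage
sentences = [
    "miR-21 overexpression promoted tumor cell proliferation.",
    "Samples were stored at -80 °C until further processing.",
]
print(classify_batch(sentences))
```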
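
## 🛠️ Reproducing the LoRA Fine-Tune (sketch)

For reference, here is a minimal sketch of how the LoRA fine-tuning described in Training Details could be set up with the 🤗 `peft` and `transformers` Trainer APIs. The hyperparameters (batch size 32, learning rate 2e-5, 5 epochs, 1000 warmup steps, fp16, AdamW) come from this card; the LoRA rank, alpha, target modules, and the `train_dataset` variable are assumptions for illustration, not the exact training script behind this checkpoint.

```python
# pip install transformers peft datasets
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model

base = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# LoRA adapter config; r/alpha/target_modules are illustrative assumptions
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT attention projections
)
model = get_peft_model(model, lora_config)

# Hyperparameters as stated in the Training Details section
args = TrainingArguments(
    output_dir="miRNA-biobert-lora",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=5,
    warmup_steps=1000,
    fp16=True,
    optim="adamw_torch",
)

# `train_dataset` is assumed: a tokenized dataset with `labels` in {0, 1}
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

With LoRA, only the low-rank adapter weights are updated during training, which is what makes fine-tuning a BERT-sized model practical on a single GPU.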