# SODA-VEC Negative Sampling: Biomedical Sentence Embeddings

## Model Overview

**SODA-VEC Negative Sampling** is a specialized sentence embedding model trained on 26.5M biomedical text pairs using the MultipleNegativesRankingLoss from sentence-transformers. It is optimized for biomedical and life-sciences applications, providing high-quality semantic representations of scientific literature.

## Key Features

- 🧬 **Biomedical Specialization**: Trained exclusively on PubMed abstracts and titles
- 🔬 **Large Scale**: 26.5M training pairs from the complete PubMed baseline (July 2024)
- ⚡ **Modern Architecture**: Based on ModernBERT-embed-base with 768-dimensional embeddings
- 🎯 **Negative Sampling**: Uses the standard MultipleNegativesRankingLoss for robust contrastive learning
- 📊 **Production Ready**: Optimized training with FP16, gradient clipping, and cosine scheduling

## Model Details

### Base Model
- **Architecture**: ModernBERT-embed-base (nomic-ai/modernbert-embed-base)
- **Embedding Dimension**: 768
- **Max Sequence Length**: 768 tokens
- **Parameters**: ~110M

### Training Configuration
- **Loss Function**: MultipleNegativesRankingLoss (sentence-transformers); see the training sketch below
- **Training Data**: 26,473,900 biomedical text pairs
- **Epochs**: 3
- **Effective Batch Size**: 256 (32 per GPU × 4 GPUs × 2 gradient accumulation steps)
- **Learning Rate**: 1e-5 with cosine scheduling
- **Optimizer**: AdamW with weight decay (0.01)
- **Precision**: FP16 for efficiency
- **Hardware**: 4x Tesla V100-DGXS-32GB

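The configuration above maps naturally onto the classic sentence-transformers `fit` API. The snippet below is a minimal sketch under stated assumptions, not the actual training script: `load_pubmed_pairs` is a hypothetical helper for the (title, abstract) pairs, the output path is illustrative, and multi-GPU launch plus gradient accumulation are handled outside this snippet.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from the base encoder and cap the sequence length used during training.
model = SentenceTransformer("nomic-ai/modernbert-embed-base")
model.max_seq_length = 768

# Hypothetical loader yielding (title, abstract) pairs from the PubMed baseline.
train_examples = [InputExample(texts=[title, abstract])
                  for title, abstract in load_pubmed_pairs()]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)  # per-GPU batch size

# In-batch negatives: every other pair in the batch acts as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    optimizer_params={"lr": 1e-5},
    weight_decay=0.01,
    scheduler="warmupcosine",
    max_grad_norm=5.0,   # gradient clipping
    use_amp=True,        # FP16 mixed precision
    output_path="soda-vec-negative-sampling",  # illustrative path
)
```
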
## Dataset

### Source Data
- **Origin**: Complete PubMed baseline (July 2024)
- **Content**: Scientific abstracts and titles from the biomedical literature
- **Quality**: 99.7% retention after filtering (abstracts of 128–6,000 characters)
- **Splits**: 99.6% train / 0.2% validation / 0.2% test

### Data Processing
- Error-pattern removal and quality filtering
- Balanced train/validation/test splits
- Character-length filtering (128–6,000 characters; see the sketch below)
- Duplicate detection and removal

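The filtering and split figures above can be reproduced with a few `datasets` operations. This is a minimal sketch, assuming the pairs live in a Hugging Face dataset with an `abstract` column; the dataset path, column name, and exact-match deduplication are assumptions, not the actual preprocessing pipeline.

```python
from datasets import load_dataset

# Hypothetical dataset path and column name; the real pipeline may differ.
ds = load_dataset("EMBO/soda-vec-pubmed-pairs", split="train")

# Keep abstracts between 128 and 6,000 characters (the filter described above).
ds = ds.filter(lambda ex: 128 <= len(ex["abstract"]) <= 6000)

# Simple exact-duplicate removal on the abstract text.
seen = set()
def is_new(ex):
    if ex["abstract"] in seen:
        return False
    seen.add(ex["abstract"])
    return True
ds = ds.filter(is_new)

# 99.6% train / 0.2% validation / 0.2% test.
split = ds.train_test_split(test_size=0.004, seed=42)
val_test = split["test"].train_test_split(test_size=0.5, seed=42)
train, validation, test = split["train"], val_test["train"], val_test["test"]
```
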
## Performance & Use Cases

### Intended Applications
- **Literature Search**: Semantic search across biomedical publications
- **Research Discovery**: Finding related papers and concepts
- **Knowledge Mining**: Extracting relationships from scientific text
- **Document Classification**: Categorizing biomedical documents
- **Similarity Analysis**: Comparing research abstracts and papers

### Biomedical Domains
- Molecular Biology
- Clinical Medicine
- Pharmacology
- Genetics & Genomics
- Biochemistry
- Neuroscience
- Public Health

## Usage

### Installation
```bash
pip install sentence-transformers
```

### Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('EMBO/soda-vec-negative-sampling')

# Encode biomedical texts
texts = [
    "CRISPR-Cas9 gene editing in human embryos",
    "mRNA vaccine efficacy against COVID-19 variants",
    "Protein folding mechanisms in neurodegenerative diseases"
]

embeddings = model.encode(texts)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 768)
```

### Semantic Search
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Reuses the `model` loaded in the Basic Usage example above

# Query and corpus
query = "Alzheimer's disease biomarkers"
corpus = [
    "Tau protein aggregation in neurodegeneration",
    "COVID-19 vaccine development strategies",
    "Beta-amyloid plaques in dementia patients"
]

# Encode
query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

# Find the most similar document
similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
best_match = np.argmax(similarities)
print(f"Best match: {corpus[best_match]} (similarity: {similarities[best_match]:.3f})")
```

## Training Details

### Loss Function
The model uses **MultipleNegativesRankingLoss** (illustrated in the sketch below), which:
- Treats all other samples in a batch as negatives for each anchor
- Optimizes for high similarity between related texts
- Provides robust contrastive learning without explicitly mined hard negatives
- Is well established in the sentence-transformers ecosystem

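For intuition, here is a toy, self-contained sketch of that in-batch objective (an illustration only, not the library's internal implementation): scaled cosine similarities between anchors and positives form a score matrix, and cross-entropy pushes each anchor toward its own positive on the diagonal.

```python
import torch
import torch.nn.functional as F

# Toy batch: 4 anchors and their 4 paired positives (768-dim, as in this model).
anchors = F.normalize(torch.randn(4, 768), dim=-1)
positives = F.normalize(torch.randn(4, 768), dim=-1)

# Cosine similarity of every anchor against every positive in the batch,
# scaled as in MultipleNegativesRankingLoss (default scale = 20).
scores = anchors @ positives.T * 20.0

# For anchor i, positive i is the "correct class"; the other batch entries
# act as in-batch negatives.
labels = torch.arange(scores.size(0))
loss = F.cross_entropy(scores, labels)
print(loss.item())
```
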
### Training Process
- **Duration**: ~4 days on 4x V100 GPUs
- **Steps**: 310,239 total training steps
- **Evaluation**: Every 1,000 steps (310 evaluations, ~1.8% overhead)
- **Monitoring**: Real-time TensorBoard logging
- **Checkpointing**: Model saved at the end of each epoch

### Optimization Features
- Gradient clipping (max_norm=5.0) for training stability
- Weight-decay regularization for generalization
- Cosine learning-rate scheduling
- Loss-only evaluation for efficiency
- Reproducible training (seed=42)

## Technical Specifications

### Hardware Requirements
- **Training**: 4x Tesla V100-DGXS-32GB (recommended)
- **Inference**: Any GPU with 4GB+ VRAM, or CPU
- **Memory**: ~2GB of GPU memory for inference (see the batched-encoding sketch below)

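As a practical note on the memory figure above, large corpora are best encoded in batches; the snippet below is a sketch where `abstracts`, the batch size, and the device are illustrative values, not fixed requirements.

```python
# Batched encoding keeps peak GPU memory roughly constant regardless of corpus size.
embeddings = model.encode(
    abstracts,                  # list[str] of documents to embed (illustrative variable)
    batch_size=64,              # tune to the available VRAM
    device="cuda",              # or "cpu"
    convert_to_numpy=True,
    normalize_embeddings=True,  # unit-length vectors, so dot product == cosine similarity
    show_progress_bar=True,
)
```
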
### Software Dependencies
- sentence-transformers >= 2.0.0
- transformers >= 4.20.0
- torch >= 1.12.0
- Python >= 3.8

## Comparison with SODA-VEC (VICReg)

| Feature | SODA-VEC (VICReg) | SODA-VEC Negative Sampling |
|---------|-------------------|----------------------------|
| Loss Function | VICReg (custom biomedical) | MultipleNegativesRankingLoss |
| Optimization | Empirically tuned coefficients | Standard contrastive learning |
| Training Data | Same (26.5M pairs) | Same (26.5M pairs) |
| Use Case | Biomedical research focus | General semantic similarity |
| Framework | Custom implementation | sentence-transformers standard |

## Limitations

- **Domain Specificity**: Optimized for biomedical text; may not generalize to other domains
- **Language**: English-only training data
- **Recency**: Training data cutoff is July 2024
- **Bias**: May reflect biases present in the PubMed literature

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{soda-vec-negative-sampling-2024,
  title={SODA-VEC Negative Sampling: Biomedical Sentence Embeddings},
  author={EMBO},
  year={2024},
  url={https://huggingface.co/EMBO/soda-vec-negative-sampling},
  note={Trained on 26.5M PubMed text pairs using MultipleNegativesRankingLoss}
}
```

## License

This model is released under the same license as the base ModernBERT model. Please refer to the original model card for licensing details.

## Acknowledgments

- **Base Model**: nomic-ai/modernbert-embed-base
- **Training Framework**: sentence-transformers
- **Data Source**: PubMed/MEDLINE database
- **Infrastructure**: EMBO computational resources

## Model Card Contact

For questions about this model, please contact EMBO or open an issue in the associated repository.

---

**Last Updated**: August 2024
**Model Version**: 1.0
**Training Completion**: In Progress (ETA: 4 days)