---
tags:
- sentence-transformers
- sparse-encoder
- sparse
- splade
- generated_from_trainer
- loss:SpladeLoss
- loss:SparseMultipleNegativesRankingLoss
- loss:FlopsLoss
base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- active_dims
- sparsity_ratio
model-index:
- name: SPLADE Sparse Encoder
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      type: pubmed-similarity
      name: PubMed Similarity
    metrics:
    - type: pearson_cosine
      value: 0.9422980731390805
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8870061609483617
      name: Spearman Cosine
    - type: active_dims
      value: 34.0018196105957
      name: Active Dims
    - type: sparsity_ratio
      value: 0.9988859897906233
      name: Sparsity Ratio
language: en
license: apache-2.0
---
# PubMedBERT SPLADE
This is a [SPLADE Sparse Encoder](https://www.sbert.net/docs/sparse_encoder/usage/usage.html) model finetuned from [PubMedBERT-base](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) using [sentence-transformers](https://www.SBERT.net). It maps sentences & paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.
The training dataset was generated using a random sample of [PubMed](https://pubmed.ncbi.nlm.nih.gov/) title-abstract pairs along with similar title pairs.
For medical literature, PubMedBERT SPLADE produces higher quality sparse embeddings than general-purpose models. Further fine-tuning on a medical subdomain should yield even better performance.
## Usage (txtai)
This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).
_Note: txtai 9.0+ is required for sparse vector scoring support_
```python
import txtai

# Create an embeddings database backed by sparse vectors
embeddings = txtai.Embeddings(
  sparse="neuml/pubmedbert-base-splade",
  content=True
)

# Index a list of (id, text, tags) tuples
embeddings.index([(0, "Statins reduce the risk of cardiovascular events", None)])

# Run a query
embeddings.search("query to run")
```
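With content storage enabled, txtai also supports SQL-style queries over the index. A short example (the query text here is illustrative):
```python
# SQL-style query combining similarity search with stored content fields
results = embeddings.search(
    "SELECT id, text, score FROM txtai WHERE similar('cardiovascular risk') LIMIT 5"
)
```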
## Usage (Sentence-Transformers)
Alternatively, the model can be loaded with [sentence-transformers](https://www.SBERT.net).
```python
from sentence_transformers import SparseEncoder

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model and encode sentences into sparse vectors
model = SparseEncoder("neuml/pubmedbert-base-splade")
embeddings = model.encode(sentences)
print(embeddings)
```
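The sparse embeddings can also be compared and inspected directly. A brief sketch, assuming the `similarity` and `decode` helpers available on `SparseEncoder` in recent sentence-transformers releases:
```python
# Score pairwise similarities between the sparse embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)

# Show the top active vocabulary tokens and weights for the first sentence
print(model.decode(embeddings[0], top_k=10))
```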
## Evaluation Results
The tables below compare the performance of this model against the top base models on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). A popular smaller model and the most downloaded PubMed similarity model on the Hugging Face Hub were also evaluated.
The following datasets were used to evaluate model performance; a loading and scoring sketch follows the list.
- [PubMed QA](https://huggingface.co/datasets/qiaojin/PubMedQA)
- Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
- Split: test, Pair: (title, text)
- [PubMed Summary](https://huggingface.co/datasets/armanc/scientific_papers)
- Subset: pubmed, Split: validation, Pair: (article, abstract)
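For illustration, here is a minimal sketch of loading one of these pair datasets and scoring it with this model; the actual benchmark harness may differ:
```python
from datasets import load_dataset
from sentence_transformers import SparseEncoder

# Load the (question, long_answer) pairs from PubMed QA
dataset = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")

model = SparseEncoder("neuml/pubmedbert-base-splade")

# Encode both sides of each pair; the diagonal holds per-pair scores
questions = model.encode(dataset["question"][:100])
answers = model.encode(dataset["long_answer"][:100])
scores = model.similarity(questions, answers).diagonal()
```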
Evaluation results are shown below. The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.
| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
| ----------------------------------------------------------------------------- | --------- | ------------- | -------------- | --------- |
| [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 90.40 | 95.92 | 94.07 | 93.46 |
| [bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5) | 91.02 | 95.82 | 94.49 | 93.78 |
| [gte-base](https://hf.co/thenlper/gte-base) | 92.97 | 96.90 | 96.24 | 95.37 |
| [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 93.27 | 97.00 | 96.58 | 95.62 |
| [**pubmedbert-base-splade**](https://hf.co/neuml/pubmedbert-base-splade) | **90.76** | **96.20** | **95.87** | **94.28** |
| [S-PubMedBert-MS-MARCO](https://hf.co/pritamdeka/S-PubMedBert-MS-MARCO) | 90.86 | 93.68 | 93.54 | 92.69 |
While this model wasn't the highest scoring model on the Pearson metric, it does well when measured by the [Spearman rank correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient).
| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
| ----------------------------------------------------------------------------- | --------- | ------------- | -------------- | --------- |
| [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 85.77 | 86.52 | 86.32 | 86.20 |
| [bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5) | 85.71 | 86.58 | 86.35 | 86.21 |
| [gte-base](https://hf.co/thenlper/gte-base) | 86.44 | 86.60 | 86.55 | 86.53 |
| [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 86.29 | 86.57 | 86.47 | 86.44 |
| [**pubmedbert-base-splade**](https://hf.co/neuml/pubmedbert-base-splade) | **86.80** | **89.12** | **88.60** | **88.17** |
| [S-PubMedBert-MS-MARCO](https://hf.co/pritamdeka/S-PubMedBert-MS-MARCO) | 85.71 | 86.37 | 86.13 | 86.07 |
This suggests that the SPLADE model may do a better job of ranking results in the correct relative order, even when its raw scores track the reference scores less linearly.
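The distinction between the two metrics can be made concrete: Pearson measures linear agreement between score values, while Spearman only measures agreement in ranking. A small example with hypothetical scores:
```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical reference scores and model scores that agree on ordering
reference = [0.1, 0.3, 0.5, 0.7, 0.9]
scores = [0.02, 0.10, 0.30, 0.85, 0.95]

print(pearsonr(reference, scores)[0])   # < 1.0: not perfectly linear
print(spearmanr(reference, scores)[0])  # 1.0: ordering matches exactly
```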
### Full Model Architecture
```
SparseEncoder(
(0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
(1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
)
```
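The `SpladePooling` module turns the masked language model's per-token logits into a single sparse vector: a log-saturated ReLU activation followed by max pooling over the sequence. A minimal PyTorch sketch of that computation (not the library's exact implementation):
```python
import torch

def splade_pooling(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size=30522) MLM outputs
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    activations = torch.log1p(torch.relu(logits)) * attention_mask.unsqueeze(-1)
    # Max pool over the sequence -> one sparse vector per input
    return activations.max(dim=1).values

logits = torch.randn(2, 8, 30522)
mask = torch.ones(2, 8)
print(splade_pooling(logits, mask).shape)  # torch.Size([2, 30522])
```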
## More Information
The training data for this model is the same as described in [this article](https://medium.com/neuml/embeddings-for-medical-literature-74dae6abf5e0). See [this article](https://huggingface.co/blog/train-sparse-encoder) for more on the training scripts.