Inference-free SPLADE distilbert-base-uncased trained on Natural-Questions tuples
This is an Asymmetric Inference-free SPLADE Sparse Encoder model finetuned from distilbert/distilbert-base-uncased using the sentence-transformers library. It maps sentences & paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.
Model Details
Model Description
- Model Type: Asymmetric Inference-free SPLADE Sparse Encoder
- Base model: distilbert/distilbert-base-uncased
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 30522 dimensions
- Similarity Function: Dot Product
- Language: en
- License: apache-2.0
Model Sources
- Documentation: Sentence Transformers Documentation
- Documentation: Sparse Encoder Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sparse Encoders on Hugging Face
Full Model Architecture
```
SparseEncoder(
  (0): Router(
    (sub_modules): ModuleDict(
      (query): Sequential(
        (0): SparseStaticEmbedding({'frozen': False}, dim=30522, tokenizer=DistilBertTokenizerFast)
      )
      (document): Sequential(
        (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'DistilBertForMaskedLM'})
        (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
      )
    )
  )
)
```
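The `Router` is what makes the model inference-free: documents are encoded by the full `MLMTransformer` + `SpladePooling` stack, while queries bypass the transformer entirely and are encoded by `SparseStaticEmbedding`, a learned per-token weight lookup. The following is a minimal sketch of the query path for intuition only, not the library's internals; `token_weights` stands in for the module's learned weight vector:

```python
import torch
from transformers import AutoTokenizer

def encode_query_inference_free(query: str, token_weights: torch.Tensor) -> torch.Tensor:
    """Illustrative query path: no transformer forward pass, just a lookup of
    one learned scalar weight per query token, scattered into a
    vocabulary-sized (30522-dim) vector."""
    tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
    token_ids = tokenizer(query, add_special_tokens=False)["input_ids"]
    embedding = torch.zeros_like(token_weights)
    embedding[token_ids] = token_weights[token_ids]
    return embedding
```

This is why `query_active_dims` in the evaluation below is tiny (roughly one active dimension per distinct query token) while documents activate hundreds of dimensions.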
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("monkeypostulate/inference-free-splade-distilbert-base-uncased-nq")

# Run inference
queries = [
    # "Is there a fitted cotton sheet available in queen size?"
    "¿Hay una sábana de algodón ajustada disponible en tamaño queen?",
]
documents = [
    'Pizuna 400 Thread Count Cotton Fitted-Sheet Queen Size White 1pc, 100% Long Staple Cotton Sateen Fitted Bed Sheet With All Around Elastic Deep Pocket Queen Sheets Fit Up to 15Inch (White Fitted Sheet)',
    'ArtSocket Shower Curtain Teal Rustic Shabby Country Chic Blue Curtains Wood Rose Home Bathroom Decor Polyester Fabric Waterproof 72 x 72 Inches Set with Hooks',
    'AFARER Case Compatible with Samsung Galaxy S7 5.1 inch, Military Grade 12ft Drop Tested Protective Case with Kickstand,Military Armor Dual Layer Protective Cover - Blue',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 30522] [3, 30522]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[13.2777, 7.2952, 2.9255]])
```
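Since each output dimension corresponds to a vocabulary token, the sparse embeddings are directly interpretable. A quick way to inspect the active tokens, assuming the `decode` helper available on `SparseEncoder` in recent sentence-transformers releases (if your version lacks it, the nonzero entries can be read off the tensor directly):

```python
# Show the highest-weighted vocabulary tokens in the first query embedding.
for token, weight in model.decode(query_embeddings[0], top_k=10):
    print(f"{token}\t{weight:.4f}")
```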
Evaluation
Metrics
Sparse Information Retrieval
- Dataset: `NanoMSMARCO`
- Evaluated with `SparseInformationRetrievalEvaluator`
| Metric                | Value    |
|:----------------------|:---------|
| dot_accuracy@1        | 0.3      |
| dot_accuracy@3        | 0.58     |
| dot_accuracy@5        | 0.66     |
| dot_accuracy@10       | 0.76     |
| dot_precision@1       | 0.3      |
| dot_precision@3       | 0.1933   |
| dot_precision@5       | 0.132    |
| dot_precision@10      | 0.076    |
| dot_recall@1          | 0.3      |
| dot_recall@3          | 0.58     |
| dot_recall@5          | 0.66     |
| dot_recall@10         | 0.76     |
| dot_ndcg@10           | 0.5302   |
| dot_mrr@10            | 0.4564   |
| dot_map@100           | 0.4675   |
| query_active_dims     | 6.38     |
| query_sparsity_ratio  | 0.9998   |
| corpus_active_dims    | 813.6909 |
| corpus_sparsity_ratio | 0.9733   |
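The sparsity ratios follow directly from the active-dimension counts and the 30522-dimensional output space, i.e. `sparsity_ratio = 1 - active_dims / 30522`:

```python
# Reproduce the reported sparsity ratios from the active-dimension counts.
VOCAB_SIZE = 30522
print(1 - 6.38 / VOCAB_SIZE)      # ≈ 0.9998 (query_sparsity_ratio)
print(1 - 813.6909 / VOCAB_SIZE)  # ≈ 0.9733 (corpus_sparsity_ratio)
```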
Training Details
Training Dataset
Unnamed Dataset
- Size: 89,000 training samples
- Columns: `query` and `document`
- Approximate statistics based on the first 1000 samples:

|         | query                                                 | document                                             |
|:--------|:------------------------------------------------------|:-----------------------------------------------------|
| type    | string                                                 | string                                                |
| details | min: 8 tokens<br>mean: 21.52 tokens<br>max: 44 tokens | min: 8 tokens<br>mean: 33.4 tokens<br>max: 93 tokens |

- Samples:

| query | document |
|:------|:---------|
| ¿Hay una lámpara de colgar con batería disponible? | Farmhouse Plug in Pendant Light with On/Off Switch Wire Caged Hanging Pendant Lamp 16ft Cord |
| ¿Hay leggings con bolsillos disponibles para mujeres? | IUGA High Waist Yoga Pants with Pockets, Tummy Control, Workout Pants for Women 4 Way Stretch Yoga Leggings with Pockets |
| ¿Cuál es la tapa de oscuridad marrón disponible? | Thicken It 100% Scalp Coverage Hair Powder - DARK BROWN - Talc-Free .32 oz. Water Resistant Hair Loss Concealer. Naturally Thicker Than Hair Fibers & Spray Concealers |

- Loss: `SpladeLoss` with these parameters:

```json
{
    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
    "document_regularizer_weight": 0.003,
    "query_regularizer_weight": 0
}
```
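For reference, this configuration corresponds roughly to the following loss construction (a sketch using the sentence-transformers v5 sparse-encoder API; verify the import paths against your installed version):

```python
from sentence_transformers.sparse_encoder.losses import (
    SparseMultipleNegativesRankingLoss,
    SpladeLoss,
)

# SpladeLoss wraps a ranking loss and adds FLOPS regularization on the
# document side. The query regularizer is disabled (weight 0) because the
# inference-free query encoder is a static lookup and stays sparse by
# construction.
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model=model, scale=1.0),
    document_regularizer_weight=0.003,
    query_regularizer_weight=0,
)
```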
Evaluation Dataset
Unnamed Dataset
- Size: 1,000 evaluation samples
- Columns: `query` and `document`
- Approximate statistics based on the first 1000 samples:

|         | query                                                 | document                                              |
|:--------|:------------------------------------------------------|:------------------------------------------------------|
| type    | string                                                 | string                                                 |
| details | min: 8 tokens<br>mean: 20.94 tokens<br>max: 40 tokens | min: 8 tokens<br>mean: 33.09 tokens<br>max: 79 tokens |

- Samples:

| query | document |
|:------|:---------|
| ¿Qué es un modelo anatómico del corazón? | Axis Scientific Heart Model, 2-Part Deluxe Life Size Human Heart Replica with 34 Anatomical Structures, Held Together with Magnets, Includes Mounted Display Base, Detailed Product Manual and Warranty |
| ¿Hay un buscador de peces portátil disponible? | HawkEye Fishtrax 1C Fish Finder with HD Color Virtuview Display, Black/Red, 2" H x 1.6" W Screen Size |
| ¿Hay un disfraz de Anna adulta de Frozen disponible para comprar? | Mitef Anime Cosplay Costume Princess Anna Fancy Dress with Shawl for Adult, L |

- Loss: `SpladeLoss` with these parameters:

```json
{
    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
    "document_regularizer_weight": 0.003,
    "query_regularizer_weight": 0
}
```
Training Hyperparameters
Non-Default Hyperparameters
- `eval_strategy`: steps
- `per_device_train_batch_size`: 256
- `per_device_eval_batch_size`: 256
- `learning_rate`: 2e-05
- `warmup_ratio`: 0.1
- `batch_sampler`: no_duplicates
- `router_mapping`: {'query': 'query', 'answer': 'document'}
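These hyperparameters map onto the sparse-encoder training API roughly as follows (a sketch assuming the `SparseEncoderTrainer` / `SparseEncoderTrainingArguments` names from sentence-transformers v5; `train_dataset` and `eval_dataset` are hypothetical placeholders for the 89k/1k query-document pair splits described above, and `loss` is the `SpladeLoss` sketched earlier):

```python
from sentence_transformers.sparse_encoder import (
    SparseEncoderTrainer,
    SparseEncoderTrainingArguments,
)
from sentence_transformers.training_args import BatchSamplers

args = SparseEncoderTrainingArguments(
    output_dir="models/inference-free-splade-distilbert-base-uncased-nq",
    num_train_epochs=3,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    eval_strategy="steps",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    # Route the "query" column through the Router's query sub-module and the
    # "answer" column through its document sub-module, as configured above.
    router_mapping={"query": "query", "answer": "document"},
)

trainer = SparseEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # hypothetical: 89,000 training pairs
    eval_dataset=eval_dataset,    # hypothetical: 1,000 evaluation pairs
    loss=loss,
)
trainer.train()
```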
All Hyperparameters
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 256
- `per_device_eval_batch_size`: 256
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 3
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional
- `router_mapping`: {'query': 'query', 'answer': 'document'}
- `learning_rate_mapping`: {}
Training Logs
| Epoch  | Step | Training Loss | NanoMSMARCO_dot_ndcg@10 |
|:-------|:-----|:--------------|:------------------------|
| 0.5747 | 200  | 3.33          | -                       |
| 1.1494 | 400  | 2.755         | -                       |
| -1     | -1   | -             | 0.5302                  |
Framework Versions
- Python: 3.9.6
- Sentence Transformers: 5.1.0
- Transformers: 4.55.0
- PyTorch: 2.8.0
- Accelerate: 1.10.0
- Datasets: 4.0.0
- Tokenizers: 0.21.4
Citation
BibTeX
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
SpladeLoss
```bibtex
@misc{formal2022distillationhardnegativesampling,
    title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
    author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
    year={2022},
    eprint={2205.04733},
    archivePrefix={arXiv},
    primaryClass={cs.IR},
    url={https://arxiv.org/abs/2205.04733},
}
```
SparseMultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
}
```
FlopsLoss
```bibtex
@article{paria2020minimizing,
    title={Minimizing flops to learn efficient sparse representations},
    author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
    journal={arXiv preprint arXiv:2004.05665},
    year={2020}
}
```