---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:9623924
- loss:MSELoss
base_model: BAAI/bge-m3
widget:
- source_sentence: That is a happy person
  sentences:
  - That is a happy dog
  - That is a very happy person
  - Today is a sunny day
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- negative_mse
model-index:
- name: SentenceTransformer based on BAAI/bge-m3
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts dev
      type: sts-dev
    metrics:
    - type: pearson_cosine
      value: 0.9691269661048901
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.9650087926361528
      name: Spearman Cosine
  - task:
      type: knowledge-distillation
      name: Knowledge Distillation
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: negative_mse
      value: -0.006388394831446931
      name: Negative Mse
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts test
      type: sts-test
    metrics:
    - type: pearson_cosine
      value: 0.9691398285942048
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.9650683134098942
      name: Spearman Cosine
---

# 8-layer distillation from BAAI/bge-m3 with 2.5x speedup

This is an embedding model distilled from [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) on a combination of public and proprietary datasets. It is an 8-layer model (instead of the original 24 layers) with 366M parameters, and it achieves a 2.5x speedup with little-to-no loss in retrieval performance.

## Motivation

We are a team that has developed several real-world semantic search and RAG applications, and no model other than `BAAI/bge-m3` has proved useful across such a variety of domains and use cases, especially in multilingual settings. However, it is large and prohibitively expensive to serve to large user groups at low latency and/or to index large volumes of data. That is why we wanted the same retrieval performance in a smaller and faster model. We composed a large and diverse dataset of 10M texts and applied a knowledge distillation technique that reduced the number of layers from 24 to 8 (a rough code sketch of this setup is given after the Future Work section below). The results were surprisingly promising: we achieved a Spearman cosine score of 0.965 and an MSE of 0.006 on the test subset, which can arguably be considered within numerical error. We did not observe any considerable degradation in our qualitative tests, either. Finally, we measured a 2.5x throughput increase (454 texts/sec instead of 175 texts/sec, measured on a T4 Colab GPU).

## Future Work

Even though our training dataset was composed of diverse texts in Turkish, the model retained considerable performance in other languages as well: we measured a Spearman cosine score of 0.938 on a collection of 10k English texts, for example. This retention motivated us to work on a second version of this distilled model, trained on a larger and multilingual dataset, as well as an even smaller distillation. Stay tuned for these updates, and feel free to reach out to us for collaboration opportunities.
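The following is a minimal sketch of the layer-truncation and MSE-based distillation described in the Motivation section, using the Sentence Transformers trainer. The layer indices, placeholder sentences, and default hyperparameters are assumptions for illustration; they are not the exact recipe or data used to train this model.

```python
import torch
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Teacher: the full 24-layer bge-m3 model.
teacher = SentenceTransformer("BAAI/bge-m3")

# Student: start from the same weights and keep only 8 of the 24 encoder layers.
# The evenly spaced indices below are an assumption, not the exact selection used here.
student = SentenceTransformer("BAAI/bge-m3")
keep = [0, 3, 6, 9, 12, 15, 18, 23]
encoder = student[0].auto_model.encoder
encoder.layer = torch.nn.ModuleList([encoder.layer[i] for i in keep])
student[0].auto_model.config.num_hidden_layers = len(keep)

# Distillation data: each sentence is paired with the teacher's embedding as its label,
# matching the `sentence` / `label` columns listed under Training Details below.
sentences = ["That is a happy person", "That is a happy dog"]  # placeholder texts
dataset = Dataset.from_dict({"sentence": sentences})
dataset = dataset.map(lambda batch: {"label": teacher.encode(batch["sentence"])}, batched=True)

# MSELoss pulls the student's embeddings toward the teacher's embeddings.
trainer = SentenceTransformerTrainer(
    model=student,
    train_dataset=dataset,
    loss=losses.MSELoss(model=student),
)
trainer.train()
```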
## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:** 10M texts from diverse domains

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("altaidevorg/bge-m3-distill-8l")
# Run inference
sentences = [
    'That is a happy person',
    'That is a happy dog',
    'That is a very happy person',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

## Evaluation

### Metrics

#### Semantic Similarity

* Datasets: `sts-dev` and `sts-test`
* Evaluated with [EmbeddingSimilarityEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | sts-dev   | sts-test   |
|:--------------------|:----------|:-----------|
| pearson_cosine      | 0.9691    | 0.9691     |
| **spearman_cosine** | **0.965** | **0.9651** |

#### Knowledge Distillation

* Evaluated with [MSEEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.MSEEvaluator)

| Metric           | Value       |
|:-----------------|:------------|
| **negative_mse** | **-0.0064** |

## Training Details

### Training Dataset

* Size: 9,623,924 training samples
* Columns: `sentence` and `label`
* Approximate statistics based on the first 1000 samples:

|         | sentence | label |
|:--------|:---------|:------|
| type    | string   | list  |
| details |          |       |
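For reference, the metrics above come from Sentence Transformers' built-in evaluators. The sketch below shows how a comparable evaluation could be run against the teacher model; the sentence pairs and gold scores are placeholders, since the actual sts-dev / sts-test data is not included in this card.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, MSEEvaluator

student = SentenceTransformer("altaidevorg/bge-m3-distill-8l")
teacher = SentenceTransformer("BAAI/bge-m3")

# Semantic similarity: correlate the student's cosine similarities with gold scores.
sts_eval = EmbeddingSimilarityEvaluator(
    sentences1=["That is a happy person", "That is a happy person"],
    sentences2=["That is a very happy person", "Today is a sunny day"],
    scores=[0.9, 0.1],  # placeholder gold similarity scores
    name="sts-dev",
)
print(sts_eval(student))

# Knowledge distillation: negative MSE between teacher and student embeddings.
mse_eval = MSEEvaluator(
    source_sentences=["That is a happy person", "Today is a sunny day"],
    target_sentences=["That is a happy person", "Today is a sunny day"],
    teacher_model=teacher,
    name="distill",
)
print(mse_eval(student))
```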
Iryna", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing", month = "11", year = "2020", publisher = "Association for Computational Linguistics", url = "https://arxiv.org/abs/2004.09813", } ``` #### bge-m3 ```bibtex @misc{bge-m3, title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation}, author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu}, year={2024}, eprint={2402.03216}, archivePrefix={arXiv}, primaryClass={cs.CL} } ```