---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:9623924
- loss:MSELoss
base_model: BAAI/bge-m3
widget:
- source_sentence: That is a happy person
  sentences:
  - That is a happy dog
  - That is a very happy person
  - Today is a sunny day
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- negative_mse
model-index:
- name: SentenceTransformer based on BAAI/bge-m3
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts dev
      type: sts-dev
    metrics:
    - type: pearson_cosine
      value: 0.9691269661048901
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.9650087926361528
      name: Spearman Cosine
  - task:
      type: knowledge-distillation
      name: Knowledge Distillation
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: negative_mse
      value: -0.006388394831446931
      name: Negative Mse
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts test
      type: sts-test
    metrics:
    - type: pearson_cosine
      value: 0.9691398285942048
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.9650683134098942
      name: Spearman Cosine
---

# 8-layer distillation from BAAI/bge-m3 with 2.5x speedup

This is an embedding model distilled from [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) on a combination of public and proprietary datasets. It is an 8-layer model (instead of the original 24 layers) with 366M parameters, and it achieves a 2.5x speedup with little-to-no loss in retrieval performance.

## Motivation

We are a team that has developed several real-world semantic search and RAG applications, and `BAAI/bge-m3` is the only model that has proved useful across such a variety of domains and use cases, especially in multilingual settings. However, it is a large model, and it is prohibitively expensive to serve it to large user groups at low latency or to index large volumes of data with it. That is why we wanted the same retrieval performance in a smaller, faster model.

We composed a large and diverse dataset of 10M texts and applied a knowledge distillation technique that reduced the number of layers from 24 to 8 (a minimal sketch of this setup is given below). The results were surprisingly promising: we achieved a Spearman Cosine score of 0.965 and an MSE of 0.006 on the test subset, which is arguably within numerical error range. We did not observe any considerable degradation in our qualitative tests, either. Finally, we measured a 2.5x throughput increase (454 texts/sec instead of 175 texts/sec, measured on a T4 Colab GPU).

## Future Work

Even though our training dataset was composed of diverse texts in Turkish, the model retained considerable performance in other languages as well: we measured a Spearman Cosine score of 0.938 on a collection of 10k English texts, for example. This retention motivated us to work on a second version of this model, trained on a larger, multilingual dataset, as well as on an even smaller distillation. Stay tuned for these updates, and feel free to reach out to us for collaboration options.
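## Distillation Sketch

The exact training code is not published with this card, so the snippet below is only a minimal sketch of the layer-reduction distillation described above, built on the standard Sentence Transformers `MSELoss` recipe. The choice of which 8 layers to keep, the tiny placeholder dataset, and all hyperparameters are illustrative assumptions, not our actual configuration.

```python
import torch
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MSELoss

# Teacher: the full 24-layer bge-m3; student: a copy that we shrink to 8 layers.
teacher = SentenceTransformer("BAAI/bge-m3")
student = SentenceTransformer("BAAI/bge-m3")

# Keep 8 of the 24 transformer layers (every third layer here; an assumption).
auto_model = student._first_module().auto_model
keep = {0, 3, 6, 9, 12, 15, 18, 21}
auto_model.encoder.layer = torch.nn.ModuleList(
    layer for i, layer in enumerate(auto_model.encoder.layer) if i in keep
)
auto_model.config.num_hidden_layers = len(keep)

# Each training sample pairs a text with the teacher's embedding as its label.
texts = ["That is a happy person", "That is a happy dog"]  # ~10M texts in practice
labels = teacher.encode(texts)
train_dataset = Dataset.from_dict({"sentence": texts, "label": labels.tolist()})

# MSELoss regresses the student's embeddings onto the teacher's.
trainer = SentenceTransformerTrainer(
    model=student,
    train_dataset=train_dataset,
    loss=MSELoss(model=student),
)
trainer.train()
student.save_pretrained("bge-m3-distilled-8-layers")
```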
## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
- **License:** Proprietary

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'That is a happy person',
    'That is a happy dog',
    'That is a very happy person',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

## Evaluation

### Metrics

#### Semantic Similarity

* Datasets: `sts-dev` and `sts-test`
* Evaluated with [EmbeddingSimilarityEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | sts-dev   | sts-test   |
|:--------------------|:----------|:-----------|
| pearson_cosine      | 0.9691    | 0.9691     |
| **spearman_cosine** | **0.965** | **0.9651** |

#### Knowledge Distillation

* Evaluated with [MSEEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.MSEEvaluator)

| Metric           | Value       |
|:-----------------|:------------|
| **negative_mse** | **-0.0064** |
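The evaluation sets themselves are not published with this card, so the snippet below is only a sketch of how metrics of this kind can be produced with the two evaluators linked above; the sentence pairs and gold scores are placeholders.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, MSEEvaluator

student = SentenceTransformer("sentence_transformers_model_id")
teacher = SentenceTransformer("BAAI/bge-m3")

# Semantic similarity: correlate cosine similarities with gold scores (placeholders).
sentences1 = ["That is a happy person", "That is a happy person"]
sentences2 = ["That is a very happy person", "Today is a sunny day"]
gold_scores = [0.95, 0.15]
sts_evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="sts-dev")
print(sts_evaluator(student))  # Pearson / Spearman cosine correlations

# Knowledge distillation: (negative) MSE between teacher and student embeddings.
texts = sentences1 + sentences2
mse_evaluator = MSEEvaluator(source_sentences=texts, target_sentences=texts, teacher_model=teacher)
print(mse_evaluator(student))  # negative_mse (closer to 0 is better)
```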
## Training Details

### Training Dataset

* Size: 9,623,924 training samples
* Columns: `sentence` and `label`
* Approximate statistics based on the first 1000 samples:
  |         | sentence | label |
  |:--------|:---------|:------|
  | type    | string   | list  |
  | details |          |       |
* Samples:
  | sentence | label |
  |:---------|:------|
  | NBA tarihinde bu ödülü en çok kaç kez kim kazanmıştır? | [-0.027497457340359688, -0.024517377838492393, -0.013820995576679707, 0.00024465256137773395, -0.020534219220280647, ...] |
  | Romero ve yapımcı Richard P. Rubinstein, yeni bir proje için herhangi bir yerli yatırımcılara temin koyamadıklarını söyledi. Romero Şans eseri, İtalyan korku yönetmeni Dario Argento'ya ulaştı. bu film Yaşayan Ölülerin Gecesi filmin'in kritik savunucusudur, Argento filmin korku klasik arasında yer almasına yardımcı olmak için istekliydi. uluslararası dağıtım hakları karşılığında finansman sağlamak için, Romero ve Rubinstein bir araya geldi. Senaryoyu yazarken bir sahnede değişiklik yapmak için Argento Roma'yı Romero filme davet etti. İkisi de daha sonra arsa gelişmelerini tartışmak için bir olabilirdi. Romero Monroeville Mall'ın durumunun yanı sıra Oxford Kalkınma'da alışveriş merkezi sahipleri ile bağlantıları ile ek bir güvenli finansman başardı. Döküm tamamlandıktan sonra, başlıca çekim tarihinin 13 Kasım, 1977 tarihinde film'in Pensilvanya'da başlaması planlanıyordu. | [-0.02431895025074482, -0.03177526593208313, -0.010546382516622543, 0.0393124595284462, -0.03390512242913246, ...] |
  | Evet, Nasuhlar ismi Adapazarı, Kandıra ve Yenipazar ilçelerinde farklı yer isimlerine aittir. | [0.0020795632153749466, -0.013080586679279804, -0.018256550654768944, 0.022429518401622772, -0.03087380714714527, ...] |
* Loss: [MSELoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#mseloss)
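Given the `MSELoss` objective, each label vector is presumably the raw bge-m3 embedding of its sentence. The following sketch checks the first sample from the table above under that assumption; the tolerance is arbitrary, and the check may fail if the labels were produced with different encoding settings.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer("BAAI/bge-m3")

# First sample from the table above; only the first three label values are shown there.
sentence = "NBA tarihinde bu ödülü en çok kaç kez kim kazanmıştır?"
label_prefix = [-0.027497457340359688, -0.024517377838492393, -0.013820995576679707]

embedding = teacher.encode(sentence)  # 1024-dim, L2-normalized vector
# Should print True if the labels are indeed raw teacher embeddings.
print(np.allclose(embedding[:3], label_prefix, atol=1e-4))
```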
## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MSELoss

```bibtex
@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}
```

#### bge-m3

```bibtex
@misc{bge-m3,
    title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
    author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
    year={2024},
    eprint={2402.03216},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```