---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:CachedGISTEmbedLoss
base_model: Snowflake/snowflake-arctic-embed-l-v2.0
license: apache-2.0
language:
- ko
- en
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# SentenceTransformer based on Snowflake/snowflake-arctic-embed-l-v2.0

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-l-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0) on clustered datasets. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity and semantic search.

The **Snowflake/snowflake-arctic-embed-l-v2.0** model has been further trained with Korean data to enhance its performance on **Korean retrieval tasks**. It achieves **state-of-the-art (SOTA) performance across multiple Korean retrieval benchmarks**.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [Snowflake/snowflake-arctic-embed-l-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Datasets:**
  - AI Hub Dataset
    - 행정 문서 대상 기계 독해 (machine reading comprehension on administrative documents)
    - 기계 독해 (machine reading comprehension)
    - 뉴스 기사 기계독해 (machine reading comprehension on news articles)
    - 도서 자료 기계독해 (machine reading comprehension on book materials)
    - 숫자 연산 기계독해 (machine reading comprehension with numerical operations)
    - 금융 법률 문서 기계독해 (machine reading comprehension on financial and legal documents)
- **Languages:** Korean, English

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

First install the Sentence Transformers and xformers libraries:

```bash
pip install -U sentence-transformers
pip install xformers
```

Then you can load this model and run inference.

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Load the model
model_name = 'dragonkue/snowflake-arctic-embed-l-v2.0-ko'
model = SentenceTransformer(model_name)

# Define the queries and documents
queries = ['대한민국의 수도는 어디인가?', '한글을 만든 사람은 누구인가?']
documents = ['대한민국의 수도는 서울이다.', '한글은 세종대왕이 창제하였다.']

# Compute embeddings: use `prompt_name="query"` to encode queries!
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute cosine similarity scores
scores = model.similarity(query_embeddings, document_embeddings)

# Output the results
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
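As a follow-up to the example above, here is a short, illustrative sketch of top-k semantic search over a small corpus using `sentence_transformers.util.semantic_search`; the corpus and the `top_k` value are placeholders, not part of the model itself.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('dragonkue/snowflake-arctic-embed-l-v2.0-ko')

# Illustrative corpus of Korean passages
corpus = [
    '대한민국의 수도는 서울이다.',
    '한글은 세종대왕이 창제하였다.',
    '김치는 한국의 전통 발효 음식이다.',
]
corpus_embeddings = model.encode(corpus)

# Remember to encode queries with the "query" prompt
queries = ['대한민국의 수도는 어디인가?']
query_embeddings = model.encode(queries, prompt_name="query")

# Retrieve the top-2 passages per query by cosine similarity
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=2)
for query, query_hits in zip(queries, hits):
    print("Query:", query)
    for hit in query_hits:
        print(round(hit["score"], 4), corpus[hit["corpus_id"]])
```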
### Using Huggingface Transformers

You can also use the `transformers` package directly, as shown below. For optimal retrieval quality, use the CLS token to embed each text and apply the query prefix below to queries only.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'dragonkue/snowflake-arctic-embed-l-v2.0-ko'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, add_pooling_layer=False)
model.eval()

query_prefix = 'query: '
queries = ['대한민국의 수도는 어디인가?', '한글을 만든 사람은 누구인가?']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=8192)

documents = ['대한민국의 수도는 서울이다.', '한글은 세종대왕이 창제하였다.']
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=8192)

# Compute token embeddings (CLS token)
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

# Normalize embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))

for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```

## Evaluation

- This evaluation follows the setup of the [KURE GitHub repository](https://github.com/nlpai-lab/KURE).
- We evaluated on all **Korean retrieval benchmarks** registered in [MTEB](https://github.com/embeddings-benchmark/mteb).

### Korean Retrieval Benchmarks

- [Ko-StrategyQA](https://huggingface.co/datasets/taeminlee/Ko-StrategyQA): A Korean **ODQA multi-hop retrieval dataset**, translated from StrategyQA.
- [AutoRAGRetrieval](https://huggingface.co/datasets/yjoonjang/markers_bm): A **Korean document retrieval dataset** constructed by parsing PDFs from five domains: **finance, public, medical, legal, and commerce**.
- [MIRACLRetrieval](https://huggingface.co/datasets/miracl/miracl): A **Korean document retrieval dataset** based on Wikipedia.
- [PublicHealthQA](https://huggingface.co/datasets/xhluca/publichealth-qa): A **retrieval dataset** focused on the **medical and public health domains** in Korean.
- [BelebeleRetrieval](https://huggingface.co/datasets/facebook/belebele): A **Korean document retrieval dataset** based on FLORES-200.
- [MrTidyRetrieval](https://huggingface.co/datasets/mteb/mrtidy): A **Wikipedia-based Korean document retrieval dataset**.
- [MultiLongDocRetrieval](https://huggingface.co/datasets/Shitao/MLDR): A **long-document retrieval dataset** covering various domains in Korean.
- [XPQARetrieval](https://huggingface.co/datasets/jinaai/xpqa): A **cross-domain Korean document retrieval dataset**.

### Metrics

- Standard metric: NDCG@10
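As a rough reference, here is a minimal sketch of how these Korean retrieval tasks could be re-run with the `mteb` package. The task names and the `kor` language filter are assumed to match the current MTEB registry and may need adjusting to your installed version.

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dragonkue/snowflake-arctic-embed-l-v2.0-ko")

# Korean retrieval tasks listed above, as registered in MTEB (names assumed)
tasks = mteb.get_tasks(
    tasks=[
        "Ko-StrategyQA",
        "AutoRAGRetrieval",
        "MIRACLRetrieval",
        "PublicHealthQA",
        "BelebeleRetrieval",
        "MrTidyRetrieval",
        "MultiLongDocRetrieval",
        "XPQARetrieval",
    ],
    languages=["kor"],
)

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/snowflake-arctic-embed-l-v2.0-ko")
```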
### Information Retrieval

- Achieves state-of-the-art (SOTA) performance across various benchmarks.
- For each benchmark, the **highest score** is highlighted in bold and the _second-highest score_ is italicized.

| Model | MrTidyRetrieval | MIRACLRetrieval | XPQARetrieval | BelebeleRetrieval | PublicHealthQA | AutoRAGRetrieval | Ko-StrategyQA | Average |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | 0.57121 | 0.66846 | **0.4436** | **0.95177** | 0.83374 | **0.90927** | _0.80498_ | **0.740433** |
| dragonkue/BGE-m3-ko | 0.60992 | 0.68331 | 0.38131 | _0.95027_ | 0.81545 | _0.87379_ | 0.7959 | _0.729993_ |
| nlpai-lab/KURE-v1 | 0.59092 | 0.68157 | 0.38158 | 0.95019 | 0.81925 | 0.87076 | 0.7999 | 0.727739 |
| BAAI/bge-m3 | **0.64708** | _0.70146_ | 0.36075 | 0.93164 | 0.80412 | 0.83008 | 0.79405 | 0.724169 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.59071 | 0.66077 | _0.43018_ | 0.9271 | 0.81679 | 0.83863 | 0.80455 | 0.724104 |
| intfloat/multilingual-e5-large | _0.64211_ | 0.66486 | 0.3571 | 0.94499 | 0.82534 | 0.81337 | 0.80348 | 0.721607 |
| nlpai-lab/KoE5 | 0.58411 | 0.62347 | 0.35086 | 0.94251 | 0.83507 | 0.84339 | 0.80008 | 0.711356 |
| BAAI/bge-multilingual-gemma2 | 0.47521 | **0.70315** | 0.37446 | 0.95001 | _0.87102_ | 0.76535 | 0.79072 | 0.704274 |
| jinaai/jina-embeddings-v3 | 0.55759 | 0.63716 | 0.41272 | 0.91203 | 0.83059 | 0.76104 | 0.79807 | 0.701314 |
| intfloat/multilingual-e5-large-instruct | 0.52877 | 0.59914 | 0.39712 | 0.936 | 0.84967 | 0.77996 | 0.79793 | 0.69837 |
| nomic-ai/nomic-embed-text-v2-moe | 0.53766 | 0.65913 | 0.36871 | 0.93636 | 0.78448 | 0.80682 | 0.76325 | 0.693773 |
| intfloat/multilingual-e5-base | 0.58082 | 0.6227 | 0.3607 | 0.92868 | 0.77203 | 0.79752 | 0.76355 | 0.689429 |
| intfloat/e5-mistral-7b-instruct | 0.52444 | 0.58709 | 0.39159 | 0.92403 | **0.88733** | 0.67849 | 0.79317 | 0.683734 |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 0.46571 | 0.53375 | 0.37866 | 0.94808 | 0.85844 | 0.76682 | **0.8108** | 0.680323 |
| Alibaba-NLP/gte-multilingual-base | 0.56464 | 0.62697 | 0.30702 | 0.8796 | 0.74584 | 0.77108 | 0.75121 | 0.663766 |
| openai/text-embedding-3-large | 0.44728 | 0.56248 | 0.37423 | 0.89451 | 0.85617 | 0.76466 | 0.73634 | 0.662239 |
| upskyy/bge-m3-korean | 0.55011 | 0.59892 | 0.31695 | 0.8731 | 0.77559 | 0.72946 | 0.75277 | 0.6567 |
| Salesforce/SFR-Embedding-2_R | 0.40347 | 0.55798 | 0.37371 | 0.91747 | 0.8605 | 0.70782 | 0.77042 | 0.65591 |
| ibm-granite/granite-embedding-278m-multilingual | nan | 0.59216 | 0.23058 | 0.83231 | 0.77668 | 0.70226 | 0.71762 | 0.641935 |
| jhgan/ko-sroberta-multitask | 0.29475 | 0.36698 | 0.27961 | 0.81636 | 0.69212 | 0.58332 | 0.65097 | 0.526301 |

#### Capabilities Beyond Benchmarks

This model is designed to handle various retrieval scenarios that are not directly measured by the benchmarks:

1. Supports phrase-based queries in addition to full-sentence queries.
   - Example: "What products does Samsung sell?" or "Samsung's products"
2. Trained to handle diverse query formats, regardless of phrasing variations.
   - Example: "Tell me about Samsung.", "I'm curious about Samsung.", "What is Samsung?"
3. Optimized for Markdown table search, allowing retrieval of answers embedded within tables when present in documents.
4. Efficient clustering without hard negatives (see the sketch after this list):
   - Samples within the same batch are clustered together.
   - To keep clustering efficient, embeddings from the Snowflake/snowflake-arctic-embed-l-v2.0 model are truncated to 256 dimensions.
   - The clustering approach is inspired by the findings in the following papers:
     - *Embedding And Clustering Your Data Can Improve Contrastive Pretraining*
     - *Contextual Document Embeddings*
5. Strong performance across different domains:
   - The *Arctic-Embed 2.0: Multilingual Retrieval Without Compromise* paper states: *"While models like mE5, mGTE, and BGE-M3 excel on MIRACL, their performance on CLEF is notably weaker compared to ours and closed-source offerings, suggesting the potential of overfitting to MIRACL or its Wikipedia-based domain."*
   - In my own experience, **Snowflake/snowflake-arctic-embed-l-v2.0** has consistently outperformed **BGE-M3** across different domains, further validating this observation.
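Below is a small, hypothetical sketch of the batch-clustering idea from item 4: corpus embeddings from the base model are truncated to 256 dimensions, re-normalized, and grouped with k-means so that training batches can later be drawn from within a cluster. The corpus, the cluster count, and the use of scikit-learn's `KMeans` are illustrative choices, not the exact training pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

# Hypothetical corpus of training passages
passages = [
    "대한민국의 수도는 서울이다.",
    "한글은 세종대왕이 창제하였다.",
    "김치는 한국의 전통 발효 음식이다.",
    "서울은 대한민국에서 인구가 가장 많은 도시이다.",
]

# Embed with the base model, then truncate to 256 dimensions for cheap clustering
base = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")
emb = base.encode(passages, normalize_embeddings=True)
emb_256 = emb[:, :256]
emb_256 = emb_256 / np.linalg.norm(emb_256, axis=1, keepdims=True)  # re-normalize after truncation

# Group similar passages; drawing batches within a cluster makes in-batch negatives
# semantically closer (harder) without explicit hard-negative mining.
n_clusters = 2  # placeholder; in practice this scales with corpus size and batch size
cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(emb_256)
print(cluster_ids)
```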
## Bias, Risks and Limitations

To prevent excessive GPU usage costs, the model was trained with a maximum sequence length of **1300** tokens. As a result, its performance may degrade on long-document benchmarks such as MultiLongDocRetrieval (MLDR).

The previous model, **BGE-m3-ko**, was trained with a token length of **1024**, which limited its MLDR benchmark performance.

For **snowflake-arctic-embed-l-v2.0-ko**, if your documents exceed **1300** tokens or roughly **2500** characters, consider one of the following models instead.

| Model | MultiLongDocRetrieval (NDCG@10) |
|:---|---:|
| Alibaba-NLP/gte-multilingual-base | **0.48402** |
| nlpai-lab/KURE-v1 | _0.47528_ |
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | 0.4459 |
| BAAI/bge-m3 | 0.43011 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.40401 |
| dragonkue/BGE-m3-ko | 0.40135 |
| openai/text-embedding-3-large | 0.31108 |
| BAAI/bge-multilingual-gemma2 | 0.31021 |
| nlpai-lab/KoE5 | 0.30869 |
| jinaai/jina-embeddings-v3 | 0.30512 |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 0.30313 |
| intfloat/multilingual-e5-large-instruct | 0.27973 |
| nomic-ai/nomic-embed-text-v2-moe | 0.27135 |
| intfloat/e5-mistral-7b-instruct | 0.2583 |
| intfloat/multilingual-e5-large | 0.24596 |
| Salesforce/SFR-Embedding-2_R | 0.24346 |
| intfloat/multilingual-e5-base | 0.23766 |
| upskyy/bge-m3-korean | 0.21968 |
| ibm-granite/granite-embedding-278m-multilingual | 0.20781 |
| jhgan/ko-sroberta-multitask | 0.20416 |
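If you are unsure whether your documents fit within the trained length, a quick check with the model's tokenizer is enough. A minimal sketch follows; the 1300-token threshold comes from the note above, and the helper name and example documents are purely illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dragonkue/snowflake-arctic-embed-l-v2.0-ko")

MAX_TRAINED_TOKENS = 1300  # training-time maximum sequence length noted above


def exceeds_trained_length(text: str) -> bool:
    """Return True if the document is longer than the length seen during training."""
    n_tokens = len(tokenizer(text, add_special_tokens=True)["input_ids"])
    return n_tokens > MAX_TRAINED_TOKENS


docs = ["짧은 문서입니다.", "아주 긴 문서입니다. " * 500]
for doc in docs:
    if exceeds_trained_length(doc):
        # For very long documents, a long-document model such as
        # Alibaba-NLP/gte-multilingual-base (top of the MLDR table above) may be a better fit.
        print("Consider a long-document model for:", doc[:20], "...")
```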
## Training Details

- Loss: [CachedGISTEmbedLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedgistembedloss)

### Training Hyperparameters

#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 20000
- `per_device_eval_batch_size`: 4096
- `learning_rate`: 2e-05
- `num_train_epochs`: 2
- `lr_scheduler_type`: warmup_stable_decay
- `lr_scheduler_kwargs`: {'num_decay_steps': 160}
- `warmup_ratio`: 0.05
- `bf16`: True
- `batch_sampler`: no_duplicates

#### All Hyperparameters

<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 10000
- `per_device_eval_batch_size`: 4096
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 2
- `max_steps`: -1
- `lr_scheduler_type`: warmup_stable_decay
- `lr_scheduler_kwargs`: {'num_decay_steps': 160}
- `warmup_ratio`: 0.05
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: True
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional

</details>
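For reference, here is a minimal, hypothetical sketch of how a comparable fine-tuning run could be wired up with `sentence-transformers` and `CachedGISTEmbedLoss`, mirroring the non-default hyperparameters above. The dataset file, its column layout, and the choice of guide model are placeholders, not the exact training setup used for this model.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedGISTEmbedLoss
from sentence_transformers.training_args import BatchSamplers

# Base model to fine-tune
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")

# The guide model scores in-batch pairs so the loss can ignore likely false negatives.
# Using the base model as its own guide is an assumption here.
guide = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")

# Hypothetical (query, positive passage) pairs; the actual AI Hub training data is not bundled here.
train_dataset = load_dataset("csv", data_files="train_pairs.csv")["train"]

# Gradient caching lets very large batches fit in memory via small mini-batches.
loss = CachedGISTEmbedLoss(model, guide=guide, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="snowflake-arctic-embed-l-v2.0-ko",
    num_train_epochs=2,
    per_device_train_batch_size=20000,  # large in-batch negatives, enabled by the cached loss
    learning_rate=2e-5,
    warmup_ratio=0.05,
    lr_scheduler_type="warmup_stable_decay",
    lr_scheduler_kwargs={"num_decay_steps": 160},
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```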
### Framework Versions

- Python: 3.10.12
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0+cu124
- Accelerate: 1.4.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### KURE

```bibtex
@misc{KURE,
    author = {Youngjoon Jang and Junyoung Son and Taemin Lee},
    year = {2024},
    url = {https://github.com/nlpai-lab/KURE},
}
```

#### Arctic-Embed 2.0

```bibtex
@article{yu2024arcticembed,
    title = "Arctic-Embed 2.0: Multilingual Retrieval Without Compromise",
    author = "Puxuan Yu and Luke Merrick and Gaurav Nuti and Daniel Campos",
    journal = "arXiv preprint arXiv:2412.04506",
    year = "2024",
    url = "https://arxiv.org/abs/2412.04506",
}
```

#### Embedding And Clustering Your Data Can Improve Contrastive Pretraining

```bibtex
@article{merrick2024embedding,
    title = "Embedding And Clustering Your Data Can Improve Contrastive Pretraining",
    author = "Luke Merrick",
    journal = "arXiv preprint arXiv:2407.18887",
    year = "2024",
    url = "https://arxiv.org/abs/2407.18887",
}
```

#### Contextual Document Embeddings

```bibtex
@article{morris2024contextual,
    title = "Contextual Document Embeddings",
    author = "John X. Morris and Alexander M. Rush",
    journal = "arXiv preprint arXiv:2410.02525",
    year = "2024",
    url = "https://arxiv.org/abs/2410.02525",
}
```

## License

Arctic is licensed under the **Apache-2.0** license, and this model is released under the same terms. The released models can be used for commercial purposes free of charge.