---
license: apache-2.0
datasets:
- bigcode/the-stack-v2
- tiiuae/falcon-refinedweb
library_name: transformers
language:
- code
- en
---
## SageLite-s
### Model Description
SageLite is a new family of open embedding models with an encoder architecture that supports a wide range of tasks in both code and text. SageLite went through three stages of training:
1. **MLM Pretraining**: Standard masked language model (MLM) pretraining on mixed code and text data ([The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)).
2. **Contrastive Pre-Finetuning**: Learning from a large amount of positive pairs mined from web data and GitHub.
3. **Contrastive Fine-Tuning**: Fine-tuning on a small amount of synthetic data (see the loss sketch after this list).
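
The contrastive stages learn by pulling paired (query, positive) embeddings together while pushing apart the other examples in the batch. Below is a minimal sketch of an InfoNCE-style objective with in-batch negatives; the pooling, temperature, and exact loss used for SageLite are not specified here and are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor, temperature: float = 0.05):
    """InfoNCE-style contrastive loss with in-batch negatives.

    query_emb, pos_emb: (batch, dim) embeddings of paired examples; each row of
    pos_emb is the positive for the same row of query_emb, and every other row
    acts as a negative. The temperature is an illustrative value, not the
    released SageLite training configuration.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                     # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```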
---
### **Training Data**
This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages are English and nine programming languages: C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.
---
### **How to Use**
This checkpoint is an 80M-parameter encoder that produces 768-dimensional embeddings. It can be loaded with the Hugging Face Transformers library and uses the [StarCoder tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
```python
from transformers import AutoModel, AutoTokenizer
# Specify the checkpoint
checkpoint = "SageLite/SageLite-s"
device = "cuda" # Use "cpu" if GPU is unavailable
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
# Example usage
code_snippet = "def print_hello_world():\tprint('Hello World!')"
inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
embedding = model(inputs)[0] # Extract the embedding
```
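
For retrieval, the resulting embeddings are typically compared with cosine similarity. The sketch below builds on the snippet above with hypothetical candidate snippets; it assumes `model(inputs)[0]` returns one vector per input sequence, as used above (if it instead returns token-level states, add a pooling step first).

```python
import torch
import torch.nn.functional as F

def embed(text: str) -> torch.Tensor:
    # Reuses tokenizer/model/device from the snippet above; assumes model(...)[0]
    # yields a single embedding per sequence. Add pooling if it returns token states.
    ids = tokenizer.encode(text, return_tensors="pt").to(device)
    with torch.no_grad():
        return F.normalize(model(ids)[0], dim=-1)

# Hypothetical candidate pool for a tiny code-to-code search.
candidates = [
    "def add(a, b):\n    return a + b",
    "def greet(name):\n    print(f'Hello {name}!')",
]
query = embed("def say_hi():\n    print('hi')")
scores = torch.cat([query @ embed(c).T for c in candidates], dim=-1)
print(candidates[scores.argmax().item()])  # closest candidate by cosine similarity
```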
### **Code Retrieval Performance**
#### 1. Code2Code Search
| Model Name | # Params | Embed Dim | Python | Java | JS | TS | C# | C | Ruby | PHP | Go | Avg |
|---------------------|----------|----------|--------|-------|-------|--------|--------|--------|--------|--------|--------|--------|
| OpenAI-Code-01 | NA | 3072 | 21.92 | 8.90 | 4.90 | 5.70 | 3.15 | 11.58 | 26.25 | 16.60 | 9.40 | 12.04 |
| OpenAI-Text-3-Small | NA | 1536 | 25.18 | 12.61 | 8.00 | 9.44 | 5.46 | 15.86 | 30.70 | 23.33 | 11.20 | 15.57 |
| OpenAI-Text-3-Large | NA | 3072 | 40.57 | 25.33 | 20.09 | 22.00 | 11.84 | 31.90 | 42.54 | 41.84 | 21.75 | 28.65 |
| CodeSage-v2-Small | 130M | 1024 | 45.60 | 33.65 | 39.96 | 47.78 | 19.19 | 30.55 | 40.12 | 55.39 | 30.96 | 38.13 |
| CodeSage-v2-Base | 356M | 1024 | 55.86 | 42.89 | 45.29 | 54.58 | 23.90 | 38.52 | 56.02 | 64.56 | 42.88 | 47.17 |
| CodeSage-v2-Large | 1.3B | 2048 | 61.11 | 47.09 | 51.18 | 60.67 | 28.04 | 43.40 | 60.74 | 67.87 | 43.86 | 51.55 |
| SageLite-s | 80M | 768 | 47.93 | 30.83 | 35.15 | 37.64 | 18.14 | 30.53 | 42.89 | 50.70 | 21.69 | 35.06 |
| SageLite-l | 850M | 1536 | 64.46 | 45.53 | 50.80 | 54.71 | 30.66 | 47.46 | 61.01 | 68.68 | 39.25 | 51.40 |
#### 2. NL2Code Search
| Model Name | # Params | CoSQA | AdvTest | Python | Java | JS | PHP | Go | Ruby | Avg |
|---------------------|----------|-------|---------|--------|-------|-------|--------|--------|--------|--------|
| OpenAI-Code-01 | NA | 52.20 | 36.03 | 63.13 | 67.85 | 62.30 | 57.47 | 85.22 | 69.28 | 61.69 |
| OpenAI-Text-3-Small | NA | 52.48 | 34.10 | 62.62 | 65.87 | 60.28 | 54.85 | 81.96 | 67.57 | 59.97 |
| OpenAI-Text-3-Large | NA | 55.21 | 46.83 | 70.81 | 72.89 | 68.12 | 59.58 | 87.60 | 75.22 | 67.03 |
| CodeSage-v2-Small | 130M | 52.39 | 47.28 | 68.79 | 68.13 | 65.77 | 60.20 | 80.26 | 72.46 | 64.41 |
| CodeSage-v2-Base | 356M | 50.74 | 52.00 | 70.46 | 70.89 | 69.61 | 62.81 | 82.37 | 73.71 | 66.57 |
| CodeSage-v2-Large | 1.3B | 53.18 | 56.31 | 74.18 | 72.33 | 72.49 | 65.26 | 84.67 | 76.61 | 69.38 |
| SageLite-s | 80M | 56.49 | 42.32 | 67.59 | 66.62 | 62.32 | 58.87 | 79.36 | 70.75 | 63.04 |
| SageLite-l | 850M | 59.76 | 55.55 | 74.25 | 71.76 | 69.35 | 61.62 | 84.09 | 77.14 | 69.19 |
---
### **Text Retrieval Performance ([MTEB Retrieval](https://huggingface.co/spaces/mteb/leaderboard))**
| Task | SageLite-s | SageLite-l |
|-------------------------------|------------|------------|
| ArguAna | 57.75 | 60.71 |
| CQADupstackWordpressRetrieval | 32.42 | 38.63 |
| FiQA2018 | 34.85 | 46.73 |
| NFCorpus | 29.97 | 33.70 |
| QuoraRetrieval | 85.35 | 87.50 |
| SCIDOCS | 18.99 | 21.38 |
| SciFact | 68.43 | 69.05 |
| Touche2020 | 24.41 | 21.43 |
| TRECCOVID | 70.88 | 76.08 |
| FEVER | 71.72 | 73.64 |
| HotpotQA | 58.81 | 62.96 |
| NQ | 48.26 | 54.48 |
| DBPedia | 34.83 | 40.69 |
| ClimateFEVER | 25.69 | 26.20 |
| MSMARCO | 35.01 | 36.55 |
| Average | 46.49 | 49.98 |
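
The text-retrieval scores above come from MTEB retrieval tasks. Below is a minimal sketch of running a subset with the `mteb` package, assuming the checkpoint loads through sentence-transformers; the task choice and output folder are illustrative (otherwise wrap the Transformers model in an object exposing `encode()`).

```python
import mteb
from sentence_transformers import SentenceTransformer

# Assumption: the checkpoint is loadable via sentence-transformers with remote code.
model = SentenceTransformer("SageLite/SageLite-s", trust_remote_code=True)

tasks = mteb.get_tasks(tasks=["NFCorpus", "SciFact"])  # illustrative subset of the tasks above
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/SageLite-s")
```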
---