---
license: apache-2.0
datasets:
- bigcode/the-stack-v2
- tiiuae/falcon-refinedweb
library_name: transformers
language:
- code
- en
---
## SageLite-s
### Model Description
SageLite is a new family of open embedding models with an encoder architecture that supports a wide range of tasks in both code and text. SageLite went through three stages of training:
1. **MLM Pretraining**: Standard masked language model (MLM) pretraining on mixed code and text data ([The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)).
2. **Contrastive Pre-Finetuning**: Learning from a large amount of positive pairs mined from web data and GitHub.
3. **Contrastive Fine-Tuning**: Fine-tuning on a small amount of synthetic data (see the loss sketch after this list).
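
The contrastive stages learn by pulling paired (query, positive) embeddings together while pushing apart the other examples in the batch. Below is a minimal sketch of an InfoNCE-style objective with in-batch negatives; the pooling, temperature, and exact loss used for SageLite are not specified here and are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor, temperature: float = 0.05):
    """InfoNCE-style contrastive loss with in-batch negatives.

    query_emb, pos_emb: (batch, dim) embeddings of paired examples; each row of
    pos_emb is the positive for the same row of query_emb, and every other row
    acts as a negative. The temperature is an illustrative value, not the
    released SageLite training configuration.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                     # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```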
---
### **Training Data**
This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages are English and nine programming languages: C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.
---
### **How to Use**
This checkpoint is an 80M-parameter encoder that produces 768-dimensional embeddings. It can be loaded with the Hugging Face Transformers library and uses the [StarCoder tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
```python
from transformers import AutoModel, AutoTokenizer
# Specify the checkpoint
checkpoint = "SageLite/SageLite-s"
device = "cuda" # Use "cpu" if GPU is unavailable
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
# Example usage
code_snippet = "def print_hello_world():\tprint('Hello World!')"
inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
embedding = model(inputs)[0] # Extract the embedding
```
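
For retrieval, the resulting embeddings are typically compared with cosine similarity. The sketch below builds on the snippet above with hypothetical candidate snippets; it assumes `model(inputs)[0]` returns one vector per input sequence, as used above (if it instead returns token-level states, add a pooling step first).

```python
import torch
import torch.nn.functional as F

def embed(text: str) -> torch.Tensor:
    # Reuses tokenizer/model/device from the snippet above; assumes model(...)[0]
    # yields a single embedding per sequence. Add pooling if it returns token states.
    ids = tokenizer.encode(text, return_tensors="pt").to(device)
    with torch.no_grad():
        return F.normalize(model(ids)[0], dim=-1)

# Hypothetical candidate pool for a tiny code-to-code search.
candidates = [
    "def add(a, b):\n    return a + b",
    "def greet(name):\n    print(f'Hello {name}!')",
]
query = embed("def say_hi():\n    print('hi')")
scores = torch.cat([query @ embed(c).T for c in candidates], dim=-1)
print(candidates[scores.argmax().item()])  # closest candidate by cosine similarity
```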
### **Code Retrieval Performance**
#### 1. Code2Code Search
| Model Name | # Params | Embed Dim | Python | Java | JS | TS | C# | C | Ruby | PHP | Go | Avg |
|---------------------|----------|----------|--------|-------|-------|--------|--------|--------|--------|--------|--------|--------|
| OpenAI-Code-01 | NA | 3072 | 21.92 | 8.90 | 4.90 | 5.70 | 3.15 | 11.58 | 26.25 | 16.60 | 9.40 | 12.04 |
| OpenAI-Text-3-Small | NA | 1536 | 25.18 | 12.61 | 8.00 | 9.44 | 5.46 | 15.86 | 30.70 | 23.33 | 11.20 | 15.57 |
| OpenAI-Text-3-Large | NA | 3072 | 40.57 | 25.33 | 20.09 | 22.00 | 11.84 | 31.90 | 42.54 | 41.84 | 21.75 | 28.65 |
| CodeSage-v2-Small | 130M | 1024 | 45.60 | 33.65 | 39.96 | 47.78 | 19.19 | 30.55 | 40.12 | 55.39 | 30.96 | 38.13 |
| CodeSage-v2-Base | 356M | 1024 | 55.86 | 42.89 | 45.29 | 54.58 | 23.90 | 38.52 | 56.02 | 64.56 | 42.88 | 47.17 |
| CodeSage-v2-Large | 1.3B | 2048 | 61.11 | 47.09 | 51.18 | 60.67 | 28.04 | 43.40 | 60.74 | 67.87 | 43.86 | 51.55 |
| SageLite-s | 80M | 768 | 47.93 | 30.83 | 35.15 | 37.64 | 18.14 | 30.53 | 42.89 | 50.70 | 21.69 | 35.06 |
| SageLite-l | 850M | 1536 | 64.46 | 45.53 | 50.80 | 54.71 | 30.66 | 47.46 | 61.01 | 68.68 | 39.25 | 51.40 |
#### 2. NL2Code Search
| Model Name | # Params | CoSQA | AdvTest | Python | Java | JS | PHP | Go | Ruby | Avg |
|---------------------|----------|-------|---------|--------|-------|-------|--------|--------|--------|--------|
| OpenAI-Code-01 | NA | 52.20 | 36.03 | 63.13 | 67.85 | 62.30 | 57.47 | 85.22 | 69.28 | 61.69 |
| OpenAI-Text-3-Small | NA | 52.48 | 34.10 | 62.62 | 65.87 | 60.28 | 54.85 | 81.96 | 67.57 | 59.97 |
| OpenAI-Text-3-Large | NA | 55.21 | 46.83 | 70.81 | 72.89 | 68.12 | 59.58 | 87.60 | 75.22 | 67.03 |
| CodeSage-v2-Small | 130M | 52.39 | 47.28 | 68.79 | 68.13 | 65.77 | 60.20 | 80.26 | 72.46 | 64.41 |
| CodeSage-v2-Base | 356M | 50.74 | 52.00 | 70.46 | 70.89 | 69.61 | 62.81 | 82.37 | 73.71 | 66.57 |
| CodeSage-v2-Large | 1.3B | 53.18 | 56.31 | 74.18 | 72.33 | 72.49 | 65.26 | 84.67 | 76.61 | 69.38 |
| SageLite-s | 80M | 56.49 | 42.32 | 67.59 | 66.62 | 62.32 | 58.87 | 79.36 | 70.75 | 63.04 |
| SageLite-l | 850M | 59.76 | 55.55 | 74.25 | 71.76 | 69.35 | 61.62 | 84.09 | 77.14 | 69.19 |
---
### **Text Retrieval Performance ([MTEB Retrieval](https://huggingface.co/spaces/mteb/leaderboard))**
| Task | SageLite-s | SageLite-l |
|-------------------------------|------------|------------|
| ArguAna | 57.75 | 60.71 |
| CQADupstackWordpressRetrieval | 32.42 | 38.63 |
| FiQA2018 | 34.85 | 46.73 |
| NFCorpus | 29.97 | 33.70 |
| QuoraRetrieval | 85.35 | 87.50 |
| SCIDOCS | 18.99 | 21.38 |
| SciFact | 68.43 | 69.05 |
| Touche2020 | 24.41 | 21.43 |
| TRECCOVID | 70.88 | 76.08 |
| FEVER | 71.72 | 73.64 |
| HotpotQA | 58.81 | 62.96 |
| NQ | 48.26 | 54.48 |
| DBPedia | 34.83 | 40.69 |
| ClimateFEVER | 25.69 | 26.20 |
| MSMARCO | 35.01 | 36.55 |
| Average | 46.49 | 49.98 |
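
The text-retrieval scores above come from MTEB retrieval tasks. Below is a minimal sketch of running a subset with the `mteb` package, assuming the checkpoint loads through sentence-transformers; the task choice and output folder are illustrative (otherwise wrap the Transformers model in an object exposing `encode()`).

```python
import mteb
from sentence_transformers import SentenceTransformer

# Assumption: the checkpoint is loadable via sentence-transformers with remote code.
model = SentenceTransformer("SageLite/SageLite-s", trust_remote_code=True)

tasks = mteb.get_tasks(tasks=["NFCorpus", "SciFact"])  # illustrative subset of the tasks above
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/SageLite-s")
```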
---