Kwaipilot/OASIS-code-1.5B

@@ -1,140 +1,162 @@
 ---
 tags:
 - sentence-transformers
 - sentence-similarity
 - feature-extraction
-pipeline_tag: sentence-similarity
-library_name: sentence-transformers
 ---
-# SentenceTransformer
-This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 1536-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
 ## Model Details
-### Model Description
-- **Model Type:** Sentence Transformer
-<!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
-- **Maximum Sequence Length:** 1024 tokens
-- **Output Dimensionality:** 1536 dimensions
-- **Similarity Function:** Cosine Similarity
-<!-- - **Training Dataset:** Unknown -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
-### Model Sources
-- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
-- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
-- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-### Full Model Architecture
-```
-SentenceTransformer(
-  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: Qwen2Model
-  (1): Pooling({'word_embedding_dimension': 1536, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
-)
-```
 ## Usage
-### Direct Usage (Sentence Transformers)
-First install the Sentence Transformers library:
 ```bash
 pip install -U sentence-transformers
 ```
 Then you can load this model and run inference.
 ```python
 from sentence_transformers import SentenceTransformer
 # Download from the 🤗 Hub
-model = SentenceTransformer("Kwaipilot/OASIS-code-1.5B")
 # Run inference
-sentences = [
-    'The weather is lovely today.',
-    "It's so sunny outside!",
-    'He drove to the stadium.',
-]
-embeddings = model.encode(sentences)
-print(embeddings.shape)
-# [3, 1536]
 # Get the similarity scores for the embeddings
-similarities = model.similarity(embeddings, embeddings)
-print(similarities.shape)
-# [3, 3]
 ```
-<!--
-### Direct Usage (Transformers)
-<details><summary>Click to see the direct usage in Transformers</summary>
-</details>
--->
-<!--
-### Downstream Usage (Sentence Transformers)
-You can finetune this model on your own dataset.
-<details><summary>Click to expand</summary>
-</details>
--->
-<!--
-### Out-of-Scope Use
-*List how the model may foreseeably be misused and address what users ought not to do with the model.*
--->
-<!--
-## Bias, Risks and Limitations
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-<!--
-### Recommendations
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->
-## Training Details
-### Framework Versions
-- Python: 3.9.19
-- Sentence Transformers: 3.3.1
-- Transformers: 4.47.1
-- PyTorch: 2.5.1+cu124
-- Accelerate: 1.2.1
-- Datasets: 2.21.0
-- Tokenizers: 0.21.0
-## Citation
 ### BibTeX
-<!--
-## Glossary
-*Clearly define terms in order to be accessible across audiences.*
--->
-<!--
-## Model Card Authors
-*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
--->
-<!--
-## Model Card Contact
-*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
--->

 ---
+library_name: sentence-transformers
+pipeline_tag: sentence-similarity
 tags:
 - sentence-transformers
 - sentence-similarity
 - feature-extraction
 ---
+<div align="center">
+  <img src="https://raw.githubusercontent.com/Anditty/OASIS/refs/heads/main/Group.svg" width="60%" alt="Kwaipilot" />
+</div>
+<hr>
+# Kwaipilot OASIS-1.5B
 ## Model Details
+**Model Name**: OASIS (Optimized Augmentation Strategy for Improved code Search)
+**Introduction**
+OASIS is a state-of-the-art code embedding model developed by Kwaipilot. This model incorporates unique, proprietary methods including **repository-level program analysis**, the **OASIS-instruct data synthesis** algorithm, and a **specialized fusion loss function**, setting new benchmarks in code search efficiency and accuracy.
+**Intended Use**
+This model is ideal for developers and researchers engaged in enhancing **code retrieval systems**. OASIS excels in scenarios requiring semantic understanding and retrieval of code snippets within varied programming contexts.
+**Training and Performance**
+OASIS was trained on a synthetic dataset created through repository-level analysis, which exposes it to a broad range of coding styles and languages. It has demonstrated state-of-the-art performance on recent code search benchmarks.
+## Future Directions
+Kwaipilot's upcoming initiatives include:
+- Open sourcing improved models.
+- Releasing technical reports.
+- Releasing natural language processing models.
+- ...
+## Performance
+| Model | Size | CoSQA | AdvTest | CSN-Py | CSN-Java | CSN-JS | CSN-PHP | CSN-Go | CSN-Ruby | Avg |
+|-----------------|:-----:|:------:|:---------:|:--------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|
+| OpenAI-Embedding-Ada-002 | Unknown | 0.4423 | 0.3808 | 0.6802 | 0.7149 | 0.6750 | 0.6062 | 0.8563 | **0.7472** | 0.6378 |
+| jina-embeddings-v2-base-code | 161M | **0.6837** | 0.385 | 0.6634 | 0.6803 | 0.6304 | 0.5701 | 0.8595 | 0.7095 | 0.6477 |
+| CodeSage-large | 1.3B | 0.4753 | **0.5267** | 0.7077 | 0.7021 | **0.695** | 0.6133 | 0.8371 | 0.7192 | 0.6595 |
+| CodeFuse-CGE-Small | 3.8B | 0.5619 | 0.4639 | 0.6958 | 0.6863 | 0.6564 | 0.6133 | 0.8637 | 0.7341 | 0.6594 |
+| OASIS-code-1.5B | 1.5B | 0.5532 | 0.4861 | **0.7110** | **0.7199** | 0.6727 | **0.6217** | **0.8732** | 0.7333 | **0.6713** |
 ## Usage
+### Direct Usage
+```bash
+pip install -U torch
+pip install -U transformers
+```
+Avoid PyTorch 2.5.0 when loading the model with `torch_dtype=torch.bfloat16`. For optimal performance and stability, use PyTorch 2.4.1 or earlier, or upgrade to 2.5.1 or later.
+```python
+import torch
+import torch.nn.functional as F
+from torch import Tensor
+from transformers import AutoModel, AutoTokenizer
+def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
+    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
+    if left_padding:
+        return last_hidden_states[:, -1]
+    else:
+        sequence_lengths = attention_mask.sum(dim=1) - 1
+        batch_size = last_hidden_states.shape[0]
+        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
+# Add query prompt
+def get_query_prompt(query: str):
+    query_description = 'Given a code search query, retrieve relevant code snippet that answer the query'
+    prompt = f'Instruct: {query_description}\nQuery: {query}'
+    return prompt
+query = "How to do quicksort in python?"
+code1 = """def bubble_sort(arr):
+    n = len(arr)
+    for i in range(n):
+        swapped = False
+        for j in range(1, n - i):
+            if arr[j - 1] > arr[j]:
+                arr[j - 1], arr[j] = arr[j], arr[j - 1]
+                swapped = True
+        if not swapped:
+            break
+    return arr"""
+code2 = """def quick_sort(arr):
+    if len(arr) <= 1:
+        return arr
+    else:
+        pivot = arr[0]
+        less = [x for x in arr[1:] if x <= pivot]
+        greater = [x for x in arr[1:] if x > pivot]
+        return quick_sort(less) + [pivot] + quick_sort(greater)"""
+model = AutoModel.from_pretrained("Kwaipilot/OASIS-code-1.5B", output_hidden_states=True)
+tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/OASIS-code-1.5B")
+# Tokenize and run inference (gradients are not needed for embedding)
+inputs = tokenizer([get_query_prompt(query), code1, code2], max_length=8192, padding=True, truncation=True, return_tensors='pt')
+with torch.no_grad():
+    outputs = model(**inputs)
+# Last token pooling
+embeddings = last_token_pool(outputs.hidden_states[-1], inputs['attention_mask'])
+print(embeddings.shape)
+# torch.Size([3, 1536])
+embeddings = F.normalize(embeddings, dim=1, p=2)
+similarity = embeddings @ embeddings.T
+print(similarity[0, 1:])
+# tensor([0.6495, 0.8036])
+```
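The padding check inside `last_token_pool` above can be illustrated without loading the model. The following is a minimal pure-Python sketch, not the model's actual code path: `last_token_pool_lists` and the toy inputs are hypothetical names introduced here, and nested lists stand in for tensors.

```python
from typing import List

Vector = List[float]


def last_token_pool_lists(hidden: List[List[Vector]], mask: List[List[int]]) -> List[Vector]:
    """Pick the hidden state of the last real token in each sequence.

    hidden is indexed [batch][position][dim]; mask is indexed [batch][position],
    with 1 for real tokens and 0 for padding.
    """
    # If every row ends with a real token, the batch is left-padded (or unpadded),
    # so the final position is always the last real token.
    left_padded = all(row[-1] == 1 for row in mask)
    if left_padded:
        return [seq[-1] for seq in hidden]
    # Otherwise assume right padding: the last real token sits at index sum(mask) - 1.
    return [seq[sum(row) - 1] for seq, row in zip(hidden, mask)]


# Toy batch: 2 sequences, 3 positions, 2-dimensional hidden states.
hidden = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]],
]
right_mask = [[1, 1, 0], [1, 1, 1]]  # first sequence is padded on the right
left_mask = [[0, 1, 1], [1, 1, 1]]   # first sequence is padded on the left

print(last_token_pool_lists(hidden, right_mask))  # [[2.0, 2.0], [6.0, 6.0]]
print(last_token_pool_lists(hidden, left_mask))   # [[3.0, 3.0], [6.0, 6.0]]
```

With right padding the pooled vector for the first sequence comes from position 1, the last unmasked token; with left padding both sequences are read at the final position.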
+### Sentence Transformers
+First install the Sentence Transformers library:
 ```bash
 pip install -U sentence-transformers
 ```
 Then you can load this model and run inference.
 ```python
 from sentence_transformers import SentenceTransformer
 # Download from the 🤗 Hub
+# Optionally add model_kwargs={"torch_dtype": torch.bfloat16} (requires `import torch`)
+model = SentenceTransformer("Kwaipilot/OASIS-code-1.5B")
+query = "How to do quicksort in python?"
+code1 = """def bubble_sort(arr):
+    n = len(arr)
+    for i in range(n):
+        swapped = False
+        for j in range(1, n - i):
+            if arr[j - 1] > arr[j]:
+                arr[j - 1], arr[j] = arr[j], arr[j - 1]
+                swapped = True
+        if not swapped:
+            break
+    return arr"""
+code2 = """def quick_sort(arr):
+    if len(arr) <= 1:
+        return arr
+    else:
+        pivot = arr[0]
+        less = [x for x in arr[1:] if x <= pivot]
+        greater = [x for x in arr[1:] if x > pivot]
+        return quick_sort(less) + [pivot] + quick_sort(greater)"""
 # Run inference
+query_embedding = model.encode([query], prompt_name="query")
+code_embeddings = model.encode([code1, code2])
+print(code_embeddings.shape)
+# (2, 1536)
 # Get the similarity scores for the embeddings
+print(model.similarity(query_embedding[0], code_embeddings[0]))
+print(model.similarity(query_embedding[0], code_embeddings[1]))
+# tensor([[0.6495]])
+# tensor([[0.8036]])
 ```
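With more than two candidate snippets, the pairwise similarity calls above generalize to a ranking. The following is a minimal sketch under stated assumptions: embeddings are plain lists of floats, and `l2_normalize`, `rank_by_cosine`, and the toy vectors are hypothetical names introduced here for illustration. In practice the vectors would come from `model.encode`.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_by_cosine(query_vec: Sequence[float],
                   code_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) pairs, most similar candidate first."""
    q = l2_normalize(query_vec)
    scored = []
    for idx, vec in enumerate(code_vecs):
        c = l2_normalize(vec)
        # The dot product of two unit vectors is their cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, c))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional "embeddings" standing in for real model output.
toy_query = [1.0, 0.0]
toy_codes = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

ranking = rank_by_cosine(toy_query, toy_codes)
print([idx for idx, _ in ranking])  # [2, 1, 0]
```

Normalizing first keeps the score in the range [-1, 1], the same cosine measure the `F.normalize` plus matrix product computes in the direct-usage example.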
 ### BibTeX
+```bibtex
+@misc{kwaipilotoasis,
+  title = {Optimized Augmentation Strategy for Improved code Search},
+  author = {Kwaipilot team},
+  year = {2024},
+}
+```