Zuchen committed on
Commit
8df147f
·
verified ·
1 Parent(s): 34934b5

Copy readme from 1.3B

  1. README.md +132 -110
README.md CHANGED
@@ -1,140 +1,162 @@
1
  ---
 
 
2
  tags:
3
  - sentence-transformers
4
  - sentence-similarity
5
  - feature-extraction
6
- pipeline_tag: sentence-similarity
7
- library_name: sentence-transformers
8
  ---
 
 
 
 
9
 
10
- # SentenceTransformer
11
-
12
- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 1536-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
13
 
14
  ## Model Details
 
15
 
16
- ### Model Description
17
- - **Model Type:** Sentence Transformer
18
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
19
- - **Maximum Sequence Length:** 1024 tokens
20
- - **Output Dimensionality:** 1536 dimensions
21
- - **Similarity Function:** Cosine Similarity
22
- <!-- - **Training Dataset:** Unknown -->
23
- <!-- - **Language:** Unknown -->
24
- <!-- - **License:** Unknown -->
25
 
26
- ### Model Sources
27
 
28
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
29
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
30
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
31
 
32
- ### Full Model Architecture
33
 
34
- ```
35
- SentenceTransformer(
36
- (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: Qwen2Model
37
- (1): Pooling({'word_embedding_dimension': 1536, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
38
- )
39
- ```
40
 
41
  ## Usage
42
 
43
- ### Direct Usage (Sentence Transformers)
44
 
45
- First install the Sentence Transformers library:
46
 
47
  ```bash
48
  pip install -U sentence-transformers
49
  ```
50
-
51
  Then you can load this model and run inference.
52
  ```python
53
  from sentence_transformers import SentenceTransformer
54
-
55
  # Download from the 🤗 Hub
56
- model = SentenceTransformer("Kwaipilot/OASIS-code-1.5B")
57
  # Run inference
58
- sentences = [
59
- 'The weather is lovely today.',
60
- "It's so sunny outside!",
61
- 'He drove to the stadium.',
62
- ]
63
- embeddings = model.encode(sentences)
64
- print(embeddings.shape)
65
- # [3, 1536]
66
-
67
  # Get the similarity scores for the embeddings
68
- similarities = model.similarity(embeddings, embeddings)
69
- print(similarities.shape)
70
- # [3, 3]
 
71
  ```
72
-
73
- <!--
74
- ### Direct Usage (Transformers)
75
-
76
- <details><summary>Click to see the direct usage in Transformers</summary>
77
-
78
- </details>
79
- -->
80
-
81
- <!--
82
- ### Downstream Usage (Sentence Transformers)
83
-
84
- You can finetune this model on your own dataset.
85
-
86
- <details><summary>Click to expand</summary>
87
-
88
- </details>
89
- -->
90
-
91
- <!--
92
- ### Out-of-Scope Use
93
-
94
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
95
- -->
96
-
97
- <!--
98
- ## Bias, Risks and Limitations
99
-
100
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
101
- -->
102
-
103
- <!--
104
- ### Recommendations
105
-
106
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
107
- -->
108
-
109
- ## Training Details
110
-
111
- ### Framework Versions
112
- - Python: 3.9.19
113
- - Sentence Transformers: 3.3.1
114
- - Transformers: 4.47.1
115
- - PyTorch: 2.5.1+cu124
116
- - Accelerate: 1.2.1
117
- - Datasets: 2.21.0
118
- - Tokenizers: 0.21.0
119
-
120
- ## Citation
121
-
122
  ### BibTeX
123
-
124
- <!--
125
- ## Glossary
126
-
127
- *Clearly define terms in order to be accessible across audiences.*
128
- -->
129
-
130
- <!--
131
- ## Model Card Authors
132
-
133
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
134
- -->
135
-
136
- <!--
137
- ## Model Card Contact
138
-
139
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
140
- -->
 
1
  ---
2
+ library_name: sentence-transformers
3
+ pipeline_tag: sentence-similarity
4
  tags:
5
  - sentence-transformers
6
  - sentence-similarity
7
  - feature-extraction
 
 
8
  ---
9
+ <div align="center">
10
+ <img src="https://raw.githubusercontent.com/Anditty/OASIS/refs/heads/main/Group.svg" width="60%" alt="Kwaipilot" />
11
+ </div>
12
+ <hr>
13
 
14
+ # Kwaipilot OASIS-1.5B
 
 
15
 
16
  ## Model Details
17
+ **Model Name**: OASIS (Optimized Augmentation Strategy for Improved code Search)
18
 
19
+ **Introduction**
 
 
 
 
 
 
 
 
20
 
21
+ OASIS is a state-of-the-art code embedding model developed by Kwaipilot. This model incorporates unique, proprietary methods including **repository-level program analysis**, the **OASIS-instruct data synthesis** algorithm, and a **specialized fusion loss function**, setting new benchmarks in code search efficiency and accuracy.
22
 
23
+ **Intended Use**
 
 
24
 
25
+ This model is ideal for developers and researchers engaged in enhancing **code retrieval systems**. OASIS excels in scenarios requiring semantic understanding and retrieval of code snippets within varied programming contexts.
26
 
27
+ **Training and Performance**
28
+
29
+ OASIS was trained on a synthetic dataset created through repository-level analysis, giving it broad coverage of coding styles and programming languages. It achieves state-of-the-art performance on recent code search benchmarks (see the Performance table).
30
+
31
+ ## Future Directions
32
+ Kwaipilot's upcoming initiatives include:
33
+
34
+ - Open sourcing improved models.
35
+ - Releasing technical reports.
36
+ - Releasing natural language processing models.
37
+ - ...
38
+
39
+
40
+ ## Performance
41
+
42
+ | Model | Size | CoSQA | AdvTest | CSN-Py | CSN-Java | CSN-JS | CSN-PHP | CSN-Go | CSN-Ruby | Avg |
43
+ |-----------------|:-----:|:------:|:---------:|:--------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|
44
+ |Openai-Embedding-Ada-002 | Unknown | 0.4423| 0.3808 | 0.6802 | 0.7149| 0.6750| 0.6062| 0.8563| **0.7472**|0.6378|
45
+ |jina-embeddings-v2-base-code | 161M |**0.6837** |0.385 | 0.6634 | 0.6803| 0.6304| 0.5701| 0.8595| 0.7095|0.6477|
46
+ | CodeSage-large | 1.3B | 0.4753| **0.5267** | 0.7077 | 0.7021| **0.695** | 0.6133| 0.8371| 0.7192|0.6595|
47
+ | CodeFuse-CGE-Small | 3.8B | 0.5619| 0.4639 | 0.6958 | 0.6863| 0.6564| 0.6133| 0.8637| 0.7341|0.6594|
48
+ | OASIS-code-1.5B | 1.5B | 0.5532| 0.4861 | **0.7110** | **0.7199**| 0.6727| **0.6217**| **0.8732**| 0.7333|**0.6713**|
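As a quick sanity check on the table above, the `Avg` column appears to be the unweighted mean of the eight benchmark columns in each row. The sketch below recomputes it for two rows under that assumption; the dictionary and function names are illustrative, and last-digit differences from the published averages are expected because the per-benchmark scores are already rounded.

```python
from statistics import mean

# Per-benchmark scores copied from the table, left to right
# (CoSQA through CSN-Ruby, eight columns per row).
SCORES = {
    "CodeSage-large": [0.4753, 0.5267, 0.7077, 0.7021, 0.695, 0.6133, 0.8371, 0.7192],
    "OASIS-code-1.5B": [0.5532, 0.4861, 0.7110, 0.7199, 0.6727, 0.6217, 0.8732, 0.7333],
}


def average_score(scores):
    """Unweighted mean over the benchmark columns (assumed definition of Avg)."""
    return mean(scores)


for name, scores in SCORES.items():
    print(f"{name}: {average_score(scores):.4f}")
```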
49
 
50
  ## Usage
51
 
52
+ ### Direct Usage
53
 
54
+ ```bash
55
+ pip install -U torch
56
+ pip install -U transformers
57
+ ```
58
 
59
+ Avoid torch 2.5.0 when loading the model with `torch_dtype=torch.bfloat16`. For stability, use PyTorch 2.4.1 or earlier, or upgrade to 2.5.1 or later.
60
+ ```python
61
+ import torch
62
+ import torch.nn.functional as F
63
+ from torch import Tensor
64
+ from transformers import AutoModel, AutoTokenizer
65
+ def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
66
+     left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
67
+     if left_padding:
68
+         return last_hidden_states[:, -1]
69
+     else:
70
+         sequence_lengths = attention_mask.sum(dim=1) - 1
71
+         batch_size = last_hidden_states.shape[0]
72
+         return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
73
+ # Add query prompt
74
+ def get_query_prompt(query: str):
75
+     query_description = 'Given a code search query, retrieve relevant code snippet that answer the query'
76
+     prompt = f'Instruct: {query_description}\nQuery: {query}'
77
+     return prompt
78
+ query = "How to do quicksort in python?"
79
+
80
+ code1 = """def bubble_sort(arr):
81
+     n = len(arr)
82
+     for i in range(n):
83
+         swapped = False
84
+         for j in range(1, n - i):
85
+             if arr[j - 1] > arr[j]:
86
+                 arr[j - 1], arr[j] = arr[j], arr[j - 1]
87
+                 swapped = True
88
+         if not swapped:
89
+             break
90
+     return arr"""
91
+ code2 = """def quick_sort(arr):
92
+     if len(arr) <= 1:
93
+         return arr
94
+     else:
95
+         pivot = arr[0]
96
+         less = [x for x in arr[1:] if x <= pivot]
97
+         greater = [x for x in arr[1:] if x > pivot]
98
+         return quick_sort(less) + [pivot] + quick_sort(greater)"""
99
+ model = AutoModel.from_pretrained("Kwaipilot/OASIS-code-1.3B", output_hidden_states=True)
100
+ tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/OASIS-code-1.3B")
101
+
102
+ # Tokenize and inference
103
+ inputs = tokenizer([get_query_prompt(query), code1, code2], max_length=8192, padding=True, truncation=True, return_tensors='pt')
104
+ outputs = model(**inputs)
105
+ # Last token pooling
106
+ embeddings = last_token_pool(outputs.hidden_states[-1], inputs['attention_mask'])
107
+ print(embeddings.shape)
108
+ # torch.Size([3, 2048])
109
+ embeddings = F.normalize(embeddings, dim=1, p=2)
110
+ similarity = embeddings @ embeddings.T
111
+ print(similarity[0, 1:])
112
+ # tensor([0.6495, 0.8036])
113
+ ```
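To make the pooling step above concrete without downloading the model, here is a minimal pure-Python sketch of the same idea on toy data: take the hidden state of the last non-padding token in each sequence, L2-normalize it, and compare vectors with a dot product. Plain nested lists stand in for tensors, and the helper names (`last_token_pool_lists`, `l2_normalize`, `cosine`) are illustrative, not part of `transformers` or `torch`.

```python
import math


def last_token_pool_lists(hidden_states, attention_mask):
    """Return the hidden state of the last attended token for each sequence.

    hidden_states: [batch][seq_len][dim] nested lists
    attention_mask: [batch][seq_len] lists of 0/1
    """
    # If every sequence attends to its final position, the batch is left-padded
    # and the final position is the last real token for all rows.
    left_padding = all(mask[-1] == 1 for mask in attention_mask)
    if left_padding:
        return [seq[-1] for seq in hidden_states]
    # Otherwise (right padding) the last real token sits at index sum(mask) - 1.
    return [seq[sum(mask) - 1] for seq, mask in zip(hidden_states, attention_mask)]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    # Dot product of unit-length vectors equals cosine similarity.
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Toy batch: 2 sequences, 3 positions, 2-dim hidden states, right-padded.
hidden = [
    [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]],  # no padding
]
mask = [
    [1, 1, 0],
    [1, 1, 1],
]

pooled = last_token_pool_lists(hidden, mask)
print(pooled)                        # [[0.0, 2.0], [0.0, 3.0]]
print(cosine(pooled[0], pooled[1]))  # 1.0 (the two pooled vectors are parallel)
```

The padding vector `[9.0, 9.0]` never reaches the output, which is the point of indexing by the attention mask rather than always taking position `-1`.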
114
+ ### Sentence Transformers
115
+ First install the Sentence Transformers library:
116
  ```bash
117
  pip install -U sentence-transformers
118
  ```
 
119
  Then you can load this model and run inference.
120
  ```python
121
  from sentence_transformers import SentenceTransformer
 
122
  # Download from the 🤗 Hub
123
+ model = SentenceTransformer("Kwaipilot/OASIS-code-1.3B")  # optional: model_kwargs={"torch_dtype": torch.bfloat16} (requires `import torch`)
124
+ query = "How to do quicksort in python?"
125
+ code1 = """def bubble_sort(arr):
126
+     n = len(arr)
127
+     for i in range(n):
128
+         swapped = False
129
+         for j in range(1, n - i):
130
+             if arr[j - 1] > arr[j]:
131
+                 arr[j - 1], arr[j] = arr[j], arr[j - 1]
132
+                 swapped = True
133
+         if not swapped:
134
+             break
135
+     return arr"""
136
+ code2 = """def quick_sort(arr):
137
+     if len(arr) <= 1:
138
+         return arr
139
+     else:
140
+         pivot = arr[0]
141
+         less = [x for x in arr[1:] if x <= pivot]
142
+         greater = [x for x in arr[1:] if x > pivot]
143
+         return quick_sort(less) + [pivot] + quick_sort(greater)"""
144
  # Run inference
145
+ query_embedding = model.encode([query], prompt_name="query")
146
+ code_embeddings = model.encode([code1, code2])
147
+ print(code_embeddings.shape)
148
+ # (2, 2048)
 
 
 
 
 
149
  # Get the similarity scores for the embeddings
150
+ print(model.similarity(query_embedding[0], code_embeddings[0]))
151
+ print(model.similarity(query_embedding[0], code_embeddings[1]))
152
+ # tensor([[0.6495]])
153
+ # tensor([[0.8036]])
154
  ```
155
  ### BibTeX
156
+ ```bibtex
157
+ @misc{kwaipilotoasis,
158
+ title = {Optimized Augmentation Strategy for Improved code Search},
159
+ author = {Kwaipilot team},
160
+ year = {2024},
161
+ }
162
+ ```