Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


SeaLLMs-v3-7B-Chat - GGUF
- Model creator: https://huggingface.co/SeaLLMs/
- Original model: https://huggingface.co/SeaLLMs/SeaLLMs-v3-7B-Chat/

| Name | Quant method | Size |
| ---- | ---- | ---- |
| [SeaLLMs-v3-7B-Chat.Q2_K.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q2_K.gguf) | Q2_K | 2.11GB |
| [SeaLLMs-v3-7B-Chat.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.IQ3_XS.gguf) | IQ3_XS | 3.12GB |
| [SeaLLMs-v3-7B-Chat.IQ3_S.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.IQ3_S.gguf) | IQ3_S | 0.68GB |
| [SeaLLMs-v3-7B-Chat.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q3_K_S.gguf) | Q3_K_S | 3.25GB |
| [SeaLLMs-v3-7B-Chat.IQ3_M.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.IQ3_M.gguf) | IQ3_M | 3.33GB |
| [SeaLLMs-v3-7B-Chat.Q3_K.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q3_K.gguf) | Q3_K | 0.65GB |
| [SeaLLMs-v3-7B-Chat.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q3_K_M.gguf) | Q3_K_M | 0.34GB |
| [SeaLLMs-v3-7B-Chat.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q3_K_L.gguf) | Q3_K_L | 0.13GB |
| [SeaLLMs-v3-7B-Chat.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.IQ4_XS.gguf) | IQ4_XS | 1.66GB |
| [SeaLLMs-v3-7B-Chat.Q4_0.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q4_0.gguf) | Q4_0 | 0.69GB |
| [SeaLLMs-v3-7B-Chat.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.IQ4_NL.gguf) | IQ4_NL | 0.17GB |
| [SeaLLMs-v3-7B-Chat.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q4_K_S.gguf) | Q4_K_S | 0.01GB |
| [SeaLLMs-v3-7B-Chat.Q4_K.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q4_K.gguf) | Q4_K | 0.0GB |
| [SeaLLMs-v3-7B-Chat.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q4_K_M.gguf) | Q4_K_M | 0.0GB |
| [SeaLLMs-v3-7B-Chat.Q4_1.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q4_1.gguf) | Q4_1 | 0.0GB |
| [SeaLLMs-v3-7B-Chat.Q5_0.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q5_0.gguf) | Q5_0 | 0.0GB |
| [SeaLLMs-v3-7B-Chat.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q5_K_S.gguf) | Q5_K_S | 0.0GB |
| [SeaLLMs-v3-7B-Chat.Q5_K.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q5_K.gguf) | Q5_K | 0.0GB |
| [SeaLLMs-v3-7B-Chat.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q5_K_M.gguf) | Q5_K_M | 0.0GB |
| [SeaLLMs-v3-7B-Chat.Q5_1.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q5_1.gguf) | Q5_1 | 0.0GB |
| [SeaLLMs-v3-7B-Chat.Q6_K.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q6_K.gguf) | Q6_K | 0.0GB |
| [SeaLLMs-v3-7B-Chat.Q8_0.gguf](https://huggingface.co/RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf/blob/main/SeaLLMs-v3-7B-Chat.Q8_0.gguf) | Q8_0 | 0.0GB |
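
As a sketch of how one of these quantized files can be run locally, assuming the `huggingface_hub` and `llama-cpp-python` packages are installed (the Q4_K_M file is just one choice from the table above):

```python
# Download one of the quantized files listed above and run it with llama-cpp-python.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="RichardErkhov/SeaLLMs_-_SeaLLMs-v3-7B-Chat-gguf",
    filename="SeaLLMs-v3-7B-Chat.Q4_K_M.gguf",
)

llm = Llama(model_path=model_path, n_ctx=4096)  # context length is an arbitrary choice here
out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Can you speak Indonesian?"},
    ],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```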



Original model description:
---
license: other
license_name: seallms
license_link: https://huggingface.co/SeaLLMs/SeaLLM-13B-Chat/blob/main/LICENSE
language:
- en
- zh
- id
- vi
- th
- ms
tags:
- sea
- multilingual
---

# *SeaLLMs-v3* - Large Language Models for Southeast Asia

<p align="center">
<a href="https://damo-nlp-sg.github.io/SeaLLMs/" target="_blank" rel="noopener">Website</a>
&nbsp;&nbsp;
<a href="https://huggingface.co/SeaLLMs/SeaLLMs-v3-7B-Chat" target="_blank" rel="noopener">Model</a>
&nbsp;&nbsp;
<a href="https://huggingface.co/spaces/SeaLLMs/SeaLLM-Chat" target="_blank" rel="noopener">🤗 DEMO</a>
&nbsp;&nbsp;
<a href="https://github.com/DAMO-NLP-SG/SeaLLMs" target="_blank" rel="noopener">Github</a>
&nbsp;&nbsp;
<a href="https://arxiv.org/pdf/2407.19672" target="_blank" rel="noopener">[NEW] Technical Report</a>
</p>

We introduce **SeaLLMs-v3**, the latest series in the SeaLLMs (Large Language Models for Southeast Asian languages) family. It achieves state-of-the-art performance among models of similar size, excelling across a diverse array of tasks such as world knowledge, mathematical reasoning, translation, and instruction following. At the same time, it was specifically enhanced to be more trustworthy, exhibiting reduced hallucination and providing safe responses, particularly for queries closely related to Southeast Asian culture.

## 🔥 Highlights
- State-of-the-art performance compared to open-source models of similar sizes, evaluated across various dimensions such as human exam questions, instruction-following, mathematics, and translation.
- Significantly enhanced instruction-following capability, especially in multi-turn settings.
- Ensures safety in usage with significantly reduced instances of hallucination and sensitivity to local contexts.

## Uses

SeaLLMs is tailored for handling a wide range of languages spoken in the SEA region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.

This page introduces the **SeaLLMs-v3-7B-Chat** model, which is fine-tuned to follow human instructions effectively for task completion, making it directly applicable to your applications.

You may also refer to the [SeaLLMs-v3-1.5B-Chat](https://huggingface.co/SeaLLMs/SeaLLMs-v3-1.5B-Chat) model, which requires far less compute and can easily be loaded locally.


### Get started with `Transformers`

To quickly try the model, the snippet below shows how to run inference with `transformers`. Make sure you have a recent version of `transformers` (>4.40) installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "SeaLLMs/SeaLLMs-v3-7B-Chat",  # can change to "SeaLLMs/SeaLLMs-v3-1.5B-Chat" if your resources are limited
    torch_dtype=torch.bfloat16,
    device_map=device
)
tokenizer = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLMs-v3-7B-Chat")

# prepare messages for the model
prompt = "Hiii How are you?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
print(f"Formatted text:\n {text}")
print(f"Model input:\n {model_inputs}")

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True, eos_token_id=tokenizer.eos_token_id)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print(f"Response:\n {response[0]}")
```

You can also use the following snippet, which wraps generation with `TextStreamer` so that the model can keep conversing with you:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "SeaLLMs/SeaLLMs-v3-7B-Chat",  # can change to "SeaLLMs/SeaLLMs-v3-1.5B-Chat" if your resources are limited
    torch_dtype=torch.bfloat16,
    device_map=device
)
tokenizer = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLMs-v3-7B-Chat")

# start the conversation with a system message
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
]

while True:
    prompt = input("User:")
    messages.append({"role": "user", "content": prompt})
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, streamer=streamer)
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    messages.append({"role": "assistant", "content": response})
```

### Inference with `vllm`

You can also run inference with [vllm](https://docs.vllm.ai/en/stable/index.html), a fast and easy-to-use library for LLM inference and serving. To use vllm, first install the latest version via `pip install vllm`.

```python
from vllm import LLM, SamplingParams

prompts = [
    "Who is the president of US?",
    "Can you speak Indonesian?"
]

ckpt_path = "SeaLLMs/SeaLLMs-v3-7B-Chat"  # hub id or local path of the model checkpoint
llm = LLM(ckpt_path, dtype="bfloat16")
sparams = SamplingParams(temperature=0.1, max_tokens=512)
outputs = llm.generate(prompts, sparams)

# print out the model responses
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}\nResponse: {generated_text}\n\n")
```
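
Because this is a chat model, you will generally get better results by applying the chat template before generation. Below is a minimal sketch that combines the tokenizer-based templating from the `transformers` examples above with vllm; this combination is our own illustration, not part of the original card:

```python
# Apply the chat template first, then generate with vllm.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLMs-v3-7B-Chat")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Can you speak Indonesian?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM("SeaLLMs/SeaLLMs-v3-7B-Chat", dtype="bfloat16")
outputs = llm.generate([prompt], SamplingParams(temperature=0.1, max_tokens=512))
print(outputs[0].outputs[0].text)
```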

### Bias, Risks, and Limitations
<blockquote style="color:red">
<p><strong style="color: red">Terms of Use and License</strong>:
By using our released weights, codes, and demos, you agree to and comply with the terms and conditions specified in our <a href="https://huggingface.co/SeaLLMs/SeaLLM-Chat-13b/edit/main/LICENSE" target="_blank" rel="noopener">SeaLLMs Terms Of Use</a>.</p>
</blockquote>

> **Disclaimer**:
> We must note that even though the weights, codes, and demos are released in an open manner, similar to other pre-trained language models, and despite our best efforts in red teaming and safety fine-tuning and enforcement, our models come with potential risks, including but not limited to inaccurate, misleading or potentially harmful generation.
> Developers and stakeholders should perform their own red teaming and provide related security measures before deployment, and they must abide by and comply with local governance and regulations.
> In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights, codes, or demos.


## Evaluation

We conduct our evaluation along two dimensions:

1. **Model Capability**: We assess the model's performance on human exam questions, its ability to follow instructions, its proficiency in mathematics, and its translation accuracy.
2. **Model Trustworthiness**: We evaluate the model's safety and tendency to hallucinate, particularly in the context of Southeast Asia.

### Model Capability

#### Multilingual World Knowledge - M3Exam
[M3Exam](https://arxiv.org/abs/2306.05179) consists of local exam questions collected from each country. It reflects the model's world knowledge (e.g., with language or social science subjects) and reasoning abilities (e.g., with mathematics or natural science subjects).

| Model | en | zh | id | th | vi | avg | avg_sea |
|:-----------------|-----:|------:|-----:|-----:|-----:|------:|----------:|
| Sailor-7B-Chat | 0.66 | 0.652 | 0.475 | 0.462 | 0.513 | 0.552 | 0.483 |
| gemma-7b | 0.732 | 0.519 | 0.475 | 0.46 | 0.594 | 0.556 | 0.510 |
| SeaLLM-7B-v2.5 | 0.758 | 0.581 | 0.499 | 0.502 | 0.622 | 0.592 | 0.541 |
| Qwen2-7B | 0.815 | 0.874 | 0.53 | 0.479 | 0.628 | 0.665 | 0.546 |
| Qwen2-7B-Instruct | 0.809 | 0.88 | 0.558 | 0.555 | 0.624 | 0.685 | 0.579 |
| Sailor-14B | 0.748 | 0.84 | 0.536 | 0.528 | 0.621 | 0.655 | 0.562 |
| Sailor-14B-Chat | 0.749 | 0.843 | 0.553 | 0.566 | 0.637 | 0.67 | 0.585 |
| SeaLLMs-v3-7B | 0.809 | 0.863 | 0.545 | 0.530 | 0.628 | 0.675 | 0.568 |
| **SeaLLMs-v3-7B-Chat** | 0.809 | 0.874 | 0.558 | 0.569 | 0.649 | 0.692 | **0.592** |


#### Multilingual Instruction-following Capability - SeaBench
SeaBench consists of multi-turn human instructions spanning various task types. It evaluates chat-based models on their ability to follow human instructions in both single- and multi-turn settings and assesses their performance across different task types. The dataset and corresponding evaluation code will be released soon!

| model | id<br>turn1 | id<br>turn2 | id<br>avg | th<br>turn1 | th<br>turn2 | th<br>avg | vi<br>turn1 | vi<br>turn2 | vi<br>avg | avg |
|:----------------|------------:|------------:|---------:|------------:|------------:|---------:|------------:|------------:|---------:|------:|
| Qwen2-7B-Instruct | 5.93 | 5.84 | 5.89 | 5.47 | 5.20 | 5.34 | 6.17 | 5.60 | 5.89 | 5.70 |
| SeaLLM-7B-v2.5 | 6.27 | 4.96 | 5.62 | 5.79 | 3.82 | 4.81 | 6.02 | 4.02 | 5.02 | 5.15 |
| Sailor-14B-Chat | 5.26 | 5.53 | 5.40 | 4.62 | 4.36 | 4.49 | 5.31 | 4.74 | 5.03 | 4.97 |
| Sailor-7B-Chat | 4.60 | 4.04 | 4.32 | 3.94 | 3.17 | 3.56 | 4.82 | 3.62 | 4.22 | 4.03 |
| **SeaLLMs-v3-7B-Chat** | 6.73 | 6.59 | 6.66 | 6.48 | 5.90 | 6.19 | 6.34 | 5.79 | 6.07 | **6.31** |


#### Multilingual Math
We evaluate multilingual math capability using the MGSM dataset. MGSM originally contains Chinese and Thai test sets only, so we use Google Translate to translate the same English questions into the other SEA languages. Note that we follow each country's convention for writing numbers: in Indonesian and Vietnamese, for example, dots are used as thousands separators and commas as decimal separators, the opposite of the English convention.

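To make the separator convention concrete, here is a tiny illustration (the numbers are our own toy examples, not MGSM data):

```python
# The same quantity written under the two conventions:
english_style = "1,234.5"   # comma = thousands separator, dot = decimal separator
indo_vi_style = "1.234,5"   # dot = thousands separator, comma = decimal separator

# Normalizing the Indonesian/Vietnamese form back to a Python float:
value = float(indo_vi_style.replace(".", "").replace(",", "."))
print(value)  # 1234.5
```
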
| MGSM | en | id | ms | th | vi | zh | avg |
|:--------------------------|------:|------:|------:|------:|------:|------:|------:|
| Sailor-7B-Chat | 33.6 | 22.4 | 22.4 | 21.6 | 25.2 | 29.2 | 25.7 |
| Meta-Llama-3-8B-Instruct | 77.6 | 48 | 57.6 | 56 | 46.8 | 58.8 | 57.5 |
| glm-4-9b-chat | 72.8 | 53.6 | 53.6 | 34.8 | 52.4 | 70.8 | 56.3 |
| Qwen1.5-7B-Chat | 64 | 34.4 | 38.4 | 25.2 | 36 | 53.6 | 41.9 |
| Qwen2-7B-instruct | 82 | 66.4 | 62.4 | 58.4 | 64.4 | 76.8 | 68.4 |
| aya-23-8B | 28.8 | 16.4 | 14.4 | 2 | 16 | 12.8 | 15.1 |
| gemma-1.1-7b-it | 58.8 | 32.4 | 34.8 | 31.2 | 39.6 | 35.2 | 38.7 |
| SeaLLM-7B-v2.5 | 79.6 | 69.2 | 70.8 | 61.2 | 66.8 | 62.4 | 68.3 |
| **SeaLLMs-v3-7B-Chat** | 74.8 | 71.2 | 70.8 | 71.2 | 71.2 | 79.6 | **73.1** |


#### Translation
We use the test sets from Flores-200 for evaluation and report zero-shot chrF scores for translation between every pair of languages. Each row in the table below presents the average result of translating from various source languages into the target language. The last column displays the overall average of translating from any language into any other language for each model.

| model | en | id | jv | km | lo | ms | my | ta | th | tl | vi | zh | avg |
|:-----------------------------------------------|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| Meta-Llama-3-8B-Instruct | 51.54 | 49.03 | 22.46 | 15.34 | 5.42 | 46.72 | 21.24 | 32.09 | 35.75 | 40.8 | 39.31 | 14.87 | 31.22 |
| Qwen2-7B-Instruct | 50.36 | 47.55 | 29.36 | 19.26 | 11.06 | 42.43 | 19.33 | 20.04 | 36.07 | 37.91 | 39.63 | 22.87 | 31.32 |
| Sailor-7B-Chat | 49.4 | 49.78 | 28.33 | 2.68 | 6.85 | 47.75 | 5.35 | 18.23 | 38.92 | 29 | 41.76 | 20.87 | 28.24 |
| SeaLLM-7B-v2.5 | 55.09 | 53.71 | 18.13 | 18.09 | 15.53 | 51.33 | 19.71 | 26.1 | 40.55 | 45.58 | 44.56 | 24.18 | 34.38 |
| **SeaLLMs-v3-7B-Chat** | 54.68 | 52.52 | 29.86 | 27.3 | 26.34 | 45.04 | 21.54 | 31.93 | 41.52 | 38.51 | 43.78 | 26.1 | **36.52** |

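chrF scores like these can be computed with the `sacrebleu` package; a minimal sketch with placeholder sentences (not the actual Flores-200 data):

```python
# Corpus-level chrF with sacrebleu (placeholder sentences, not Flores-200 data).
from sacrebleu.metrics import CHRF

hypotheses = ["Selamat pagi, apa kabar?", "Terima kasih banyak."]
references = [["Selamat pagi, bagaimana kabarmu?", "Terima kasih banyak."]]  # one reference stream

chrf = CHRF()
print(chrf.corpus_score(hypotheses, references))  # e.g. "chrF2 = ..."
```
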

### Model Trustworthiness

#### Hallucination
We measure whether a model can refuse to answer questions about non-existent entities; the table below reports the F1 score, with refusal as the positive label. Our test set consists of ~1k samples per language. Each unanswerable question is generated by GPT-4o, and the ratio of answerable to unanswerable questions is 1:1. We define keywords to automatically detect whether a model-generated response is a refusal.

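As a rough illustration of this detection scheme, here is a minimal sketch with hypothetical keywords and toy labels (the actual keyword lists used for the evaluation are not published here):

```python
# A toy sketch of keyword-based refusal detection and Refusal-F1.
from sklearn.metrics import f1_score

REFUSAL_KEYWORDS = ["i don't know", "not aware of", "no information", "does not exist"]

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(kw in text for kw in REFUSAL_KEYWORDS)

# gold[i] is True when question i is unanswerable, i.e. the model *should* refuse
gold = [True, False, True, False]
responses = [
    "I'm sorry, I don't know of any such place.",        # correct refusal
    "It is the capital city of Indonesia.",              # correct answer
    "That person does not exist as far as I can tell.",  # correct refusal
    "Sure! It was founded in 1911.",                     # correct answer
]
pred = [is_refusal(r) for r in responses]
print(f1_score(gold, pred, pos_label=True))  # 1.0 on this toy data
```
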
| Refusal-F1 Scores | en | zh | vi | th | id | avg |
|:---------------------|------:|------:|------:|------:|------:|-------:|
| Qwen1.5-7B-Instruct | 53.85 | 51.70 | 52.85 | 35.50 | 58.40 | 50.46 |
| Qwen2-7B-Instruct | 58.79 | 33.08 | 56.21 | 44.60 | 55.98 | 49.73 |
| SeaLLM-7B-v2.5 | 12.90 | 0.77 | 2.45 | 19.42 | 0.78 | 7.26 |
| Sailor-7B-Chat | 33.49 | 18.82 | 5.19 | 9.68 | 16.42 | 16.72 |
| glm-4-9b-chat | 44.48 | 37.89 | 18.66 | 4.27 | 1.97 | 21.45 |
| Llama-3-8B-Instruct | 72.08 | 0.00 | 1.23 | 0.80 | 3.91 | 15.60 |
| gemma-1.1-7b-it | 52.39 | 27.74 | 23.96 | 22.97 | 31.72 | 31.76 |
| **SeaLLMs-v3-7B-Chat** | 71.36 | 78.39 | 77.93 | 61.31 | 68.95 | **71.59** |

#### Safety
The MultiJail dataset consists of harmful prompts in multiple languages. We take the prompts in SEA languages and report the safe-response rate (higher is better).

| Model | en | jv | th | vi | zh | avg |
|:------------------------|-------:|-------:|-------:|-------:|------:|-------:|
| Qwen2-7B-Instruct | 88.57 | 43.81 | 63.81 | 73.02 | 87.30 | 71.30 |
| Sailor-7B-Chat | 78.73 | 54.92 | 62.22 | 67.62 | 76.19 | 67.94 |
| Meta-Llama-3-8B-Instruct | 88.25 | 26.35 | 71.11 | 69.84 | 77.14 | 66.54 |
| Sailor-14B-Chat | 86.98 | 30.48 | 53.65 | 60.95 | 72.70 | 60.95 |
| glm-4-9b-chat | 77.14 | 21.27 | 30.16 | 60.63 | 74.92 | 52.82 |
| **SeaLLMs-v3-7B-Chat** | 88.89 | 60.00 | 73.33 | 83.81 | 92.70 | **79.75** |

## Acknowledgement to Our Linguists
We would like to express our special thanks to our professional and native linguists, Tantong Champaiboon, Nguyen Ngoc Yen Nhi, and Tara Devina Putri, who helped build, evaluate, and fact-check our sampled pretraining and SFT datasets, and who evaluated our models across different aspects, especially safety.


## Citation

If you find our project useful, we hope you will kindly star our repo and cite our work as follows:
```bibtex
@article{damonlp2024seallm3,
  author = {Wenxuan Zhang*, Hou Pong Chan*, Yiran Zhao*, Mahani Aljunied*,
            Jianyu Wang*, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu,
            Yew Ken Chia, Xin Li, Lidong Bing},
  title = {SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages},
  year = {2024},
  url = {https://arxiv.org/abs/2407.19672}
}
```
Corresponding Author: [email protected]