Update README.md
- Source Mistral 7B model:<br/>
https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/
- This model was converted from the bfloat16 data type to the int8 data type with the conversion tool from:<br/>
https://github.com/ggerganov/llama.cpp
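For reference, a conversion along these lines can be reproduced with the llama.cpp conversion script. This is only a sketch: the script name, requirements file, and flags are assumptions and vary between llama.cpp versions.
```
# Sketch only: script names and flags differ between llama.cpp versions.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
# Convert the downloaded bfloat16 HF checkpoint to a Q8_0 GGUF file.
python convert.py /path/to/Mistral-7B-Instruct-v0.2 \
    --outtype q8_0 \
    --outfile /path/to/models/mistral-7B-instruct-v0.2-q8.gguf
```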
- Deployment on CPU:<br/>
Pull the ready-made llama.cpp container:
```
docker pull ghcr.io/ggerganov/llama.cpp:server
```
Assuming the mistral-7B-instruct-v0.2-q8.gguf file has been downloaded to the /path/to/models directory on the local machine, run the container serving the model with:
```
docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server -m /models/mistral-7B-instruct-v0.2-q8.gguf --port 8000 --host 0.0.0.0 -n 512
```
- Test the deployment by accessing the model in a browser at http://localhost:8000
- The llama.cpp server also provides an OpenAI-compatible API (see the example request below)
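As a minimal example, the OpenAI-compatible chat completions endpoint can be exercised with curl once the container is running; the prompt and parameters are illustrative, and the model name is only a placeholder for the single loaded model:
```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral-7B-instruct-v0.2-q8",
        "messages": [{"role": "user", "content": "Summarize what a GGUF file is in one sentence."}],
        "max_tokens": 128
      }'
```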
- Deployment on CUDA GPU:<br/>
Pull the CUDA-enabled llama.cpp container:
```
docker pull ghcr.io/ggerganov/llama.cpp:server-cuda
```
Run the container with GPU access, offloading model layers to the GPU:
```
docker run --gpus all -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server-cuda -m /models/mistral-7B-instruct-v0.2-q8.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 50
```
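If the CUDA container cannot see the GPU, it is worth confirming that the NVIDIA Container Toolkit is set up correctly; the CUDA image tag below is only an example:
```
# Quick sanity check that Docker can reach the GPU (image tag is an example).
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
```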
- If a CUDA GPU with 16GB of VRAM is available, the float16 version of the model may be of interest; it is available in this repo:<br/>
https://huggingface.co/itod/mistral-7B-instruct-v0.2-f16
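As a sketch, the GGUF file can be fetched with the Hugging Face CLI; the exact filename inside the repo is an assumption, so check the repo's file list first:
```
pip install -U "huggingface_hub[cli]"
# Filename is assumed; verify it against the files listed in the repo.
huggingface-cli download itod/mistral-7B-instruct-v0.2-f16 \
    mistral-7B-instruct-v0.2-f16.gguf \
    --local-dir /path/to/models
```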
- More details about usage are available in the llama.cpp server documentation:<br/>
https://github.com/ggerganov/llama.cpp/tree/master/examples/server