itod committed
Commit b094d8a · verified · Parent: 06403c3

Update README.md


- Source Mistral 7B model:<br/>
https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/

- This model was converted from the bfloat16 data type to the int8 data type with the conversion tools from:<br/>
https://github.com/ggerganov/llama.cpp
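
  A rough sketch of that conversion, assuming an older llama.cpp layout where the scripts are named convert.py and quantize (newer releases rename them to convert_hf_to_gguf.py and llama-quantize), that the source checkpoint sits in /path/to/Mistral-7B-Instruct-v0.2, and that the int8 file corresponds to llama.cpp's Q8_0 quantization:
```
# Convert the Hugging Face checkpoint to a float16 GGUF file
python3 convert.py /path/to/Mistral-7B-Instruct-v0.2 --outtype f16 --outfile mistral-7B-instruct-v0.2-f16.gguf
# Quantize the float16 GGUF to 8-bit (Q8_0)
./quantize mistral-7B-instruct-v0.2-f16.gguf mistral-7B-instruct-v0.2-q8.gguf Q8_0
```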

- Deployment on CPU:<br/>
Pull the ready-made llama.cpp container:
```
docker pull ghcr.io/ggerganov/llama.cpp:server
```
Assuming the mistral-7B-instruct-v0.2-q8.gguf file has been downloaded to the /path/to/models directory on the local machine, run the container and serve the model with:
```
docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server -m /models/mistral-7B-instruct-v0.2-q8.gguf --port 8000 --host 0.0.0.0 -n 512
```
- Test the deployment by accessing the model in a browser at http://localhost:8000
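
  The deployment can also be checked from the command line; a minimal sketch against the server's native /completion endpoint (the prompt text and token count are only examples):
```
curl http://localhost:8000/completion -H "Content-Type: application/json" -d '{"prompt": "What is the capital of France?", "n_predict": 64}'
```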
- The llama.cpp server also provides an OpenAI-compatible API
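
  A minimal sketch of a chat request to the OpenAI-style /v1/chat/completions endpoint (the model name is largely ignored by the server and the API key can be any placeholder):
```
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d '{"model": "mistral-7B-instruct-v0.2-q8", "messages": [{"role": "user", "content": "Write a haiku about llamas."}]}'
```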
- Deployment on CUDA GPU:<br/>
```
docker pull ghcr.io/ggerganov/llama.cpp:server-cuda
```
```
docker run --gpus all -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server-cuda -m /models/mistral-7B-instruct-v0.2-q8.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 50
```
- If a CUDA GPU with 16 GB of memory is available, the float16 version of the model, available in this repo, may be of interest:<br/>
https://huggingface.co/itod/mistral-7B-instruct-v0.2-f16
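
  As an illustrative sketch (the f16 filename below is an assumption based on the repo name), the CUDA invocation is the same apart from the model file:
```
docker run --gpus all -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server-cuda -m /models/mistral-7B-instruct-v0.2-f16.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 50
```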
- More details about usage are available in the llama.cpp server documentation:<br/>
https://github.com/ggerganov/llama.cpp/tree/master/examples/server

Files changed (1): README.md (+27, −2)
README.md CHANGED
@@ -1,6 +1,31 @@
The diff removes the old README body ("# Original Mistral 7B model:" followed by https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/) and adds the deployment notes reproduced in the commit description above; the license: mit front matter is unchanged.