skywa1ker77 committed on
Commit
673a675
1 Parent(s): 9ee6e3a

Add Intel IPEX-LLM setup under deploy_local_llm (#1269)


### What problem does this PR solve?

It adds a setup guide for using Intel IPEX-LLM with Ollama to docs/guides/deploy_local_llm.md

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [x] Other (please describe): adds a setup guide for using Intel IPEX-LLM with Ollama to docs/guides/deploy_local_llm.md

Files changed (1)
  1. docs/guides/deploy_local_llm.md +129 -1
docs/guides/deploy_local_llm.md CHANGED
@@ -156,4 +156,132 @@ Click on your logo **>** **Model Providers** **>** **System Model Settings** to
 
  Update your chat model accordingly in **Chat Configuration**:
 
- > If your local model is an embedding model, update it on the configruation page of your knowledge base.
+ > If your local model is an embedding model, update it on the configuration page of your knowledge base.
+
+ ## Deploy a local model using IPEX-LLM
+
+ [IPEX-LLM](https://github.com/intel-analytics/ipex-llm) is a PyTorch library for running LLMs on Intel CPUs and GPUs (e.g., a local PC with an iGPU, or a discrete GPU such as Arc, Flex, or Max) with very low latency.
+
+ To deploy a local model, e.g., **Qwen2**, using IPEX-LLM, follow the steps below:
+
+ ### 1. Check firewall settings
+
+ Ensure that your host machine's firewall allows inbound connections on port 11434. For example:
+
+ ```bash
+ sudo ufw allow 11434/tcp
+ ```
+
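+ On Windows, the rough equivalent is to add an inbound firewall rule with `netsh` (run in an administrator prompt; the rule name used here is arbitrary):
+
+ ```cmd
+ rem Allow inbound TCP traffic on Ollama's default port 11434
+ netsh advfirewall firewall add rule name="Ollama" dir=in action=allow protocol=TCP localport=11434
+ ```
+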
+ ### 2. Install and Start Ollama serve using IPEX-LLM
+
+ #### 2.1 Install IPEX-LLM for Ollama
+
+ IPEX-LLM's support for `ollama` is now available on both Linux and Windows.
+
+ Visit the [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md), follow the instructions in the [Prerequisites](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md#0-prerequisites) section to set up your environment, and then follow the [Install IPEX-LLM cpp](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md#1-install-ipex-llm-for-llamacpp) section to install IPEX-LLM with the Ollama binaries.
+
+ **After the installation, you should have created a conda environment, named `llm-cpp` for instance, for running `ollama` commands with IPEX-LLM.**
+
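+ For reference, the installation in the linked guide typically boils down to the following; treat this as a sketch and defer to the guide for the exact, current commands:
+
+ ```bash
+ # Create and activate a conda environment for the IPEX-LLM Ollama/llama.cpp binaries
+ conda create -n llm-cpp python=3.11
+ conda activate llm-cpp
+ # Install IPEX-LLM with the llama.cpp/Ollama extras (pre-release channel, per the guide)
+ pip install --pre --upgrade ipex-llm[cpp]
+ ```
+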
+ #### 2.2 Initialize Ollama
+
+ Activate the `llm-cpp` conda environment and initialize Ollama by executing the commands below. A symbolic link to `ollama` will appear in your current directory.
+
+ - For **Linux users**:
+
+ ```bash
+ conda activate llm-cpp
+ init-ollama
+ ```
+
+ - For **Windows users**:
+
+ Please run the following commands with **administrator privileges in a Miniforge Prompt**:
+
+ ```cmd
+ conda activate llm-cpp
+ init-ollama.bat
+ ```
+
+ > [!NOTE]
+ > If you have installed a newer version of `ipex-llm[cpp]` and want to upgrade your Ollama binary, don't forget to remove the old binary files first and initialize again with `init-ollama` or `init-ollama.bat`.
+
+ **Now you can use this executable file just as you would a standard `ollama` binary.**
+
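+ To confirm that the initialization worked, you can check the resulting symlink and binary from the same directory (Linux shown; the path is whatever directory you ran `init-ollama` in):
+
+ ```bash
+ # init-ollama links the ollama executable into the current directory
+ ls -l ./ollama
+ ./ollama --version
+ ```
+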
+ #### 2.3 Run Ollama Serve
+
+ You may launch the Ollama service as shown below:
+
+ - For **Linux users**:
+
+ ```bash
+ # Run all model layers on the Intel GPU (see the note below)
+ export OLLAMA_NUM_GPU=999
+ # Make sure traffic to localhost bypasses any configured proxy
+ export no_proxy=localhost,127.0.0.1
+ # Enable Level Zero Sysman so GPU device information can be queried
+ export ZES_ENABLE_SYSMAN=1
+ # Load the Intel oneAPI environment
+ source /opt/intel/oneapi/setvars.sh
+ # Persist the compiled SYCL kernel cache across runs
+ export SYCL_CACHE_PERSISTENT=1
+
+ ./ollama serve
+ ```
+
+ - For **Windows users**:
+
+ Please run the following commands in a Miniforge Prompt:
+
+ ```cmd
+ set OLLAMA_NUM_GPU=999
+ set no_proxy=localhost,127.0.0.1
+ set ZES_ENABLE_SYSMAN=1
+ set SYCL_CACHE_PERSISTENT=1
+
+ ollama serve
+ ```
+
+ > [!NOTE]
+ > Set the environment variable `OLLAMA_NUM_GPU` to `999` to make sure all layers of your model run on the Intel GPU; otherwise, some layers may run on the CPU.
+
+ > [!TIP]
+ > If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionally set the following environment variable for optimal performance before executing `ollama serve`:
+ >
+ > ```bash
+ > export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+ > ```
+
+ > [!NOTE]
+ > To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`.
+
+ The console will display messages similar to the following:
+
+ <a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_serve.png" target="_blank">
+ <img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_serve.png" width="100%" />
+ </a>
+
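+ From another terminal, you can verify that the service is reachable before pulling models; this assumes the default port 11434:
+
+ ```bash
+ # The root endpoint replies "Ollama is running"; /api/tags lists models available locally
+ curl http://localhost:11434
+ curl http://localhost:11434/api/tags
+ ```
+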
+ ### 3. Pull and Run Ollama Model
+
+ Keep the Ollama service running, open another terminal, and run `./ollama pull <model_name>` on Linux (`ollama.exe pull <model_name>` on Windows) to pull a model, e.g., `qwen2:latest`:
+
+ <a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_pull.png" target="_blank">
+ <img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_pull.png" width="100%" />
+ </a>
+
+ #### Run Ollama Model
+
+ - For **Linux users**:
+
+ ```bash
+ ./ollama run qwen2:latest
+ ```
+
+ - For **Windows users**:
+
+ ```cmd
+ ollama run qwen2:latest
+ ```
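+
+ As an optional sanity check before configuring RAGFlow, you can query the model directly through Ollama's HTTP API (default port 11434, model tag as pulled above):
+
+ ```bash
+ # Request a short completion from the locally served model; "stream": false returns one JSON object
+ curl http://localhost:11434/api/generate -d '{
+   "model": "qwen2:latest",
+   "prompt": "Say hello in one sentence.",
+   "stream": false
+ }'
+ ```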
+
+ ### 4. Configure RAGFlow to use IPEX-LLM-accelerated Ollama
+
+ The configuration follows the same steps as for a standard Ollama deployment:
+
+ - Section 4: [Add Ollama](#4-add-ollama)
+ - Section 5: [Complete basic Ollama settings](#5-complete-basic-ollama-settings)
+ - Section 6: [Update System Model Settings](#6-update-system-model-settings)
+ - Section 7: [Update Chat Configuration](#7-update-chat-configuration)
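+
+ When RAGFlow itself runs in Docker while the IPEX-LLM-accelerated Ollama service runs on the host, the base URL you enter must be reachable from inside the container, so `localhost` will not work there. A quick connectivity check (the `host.docker.internal` hostname is an assumption; substitute your host's LAN IP if it does not resolve):
+
+ ```bash
+ # Run from inside the RAGFlow container; a JSON model list confirms the Ollama endpoint is reachable
+ curl http://host.docker.internal:11434/api/tags
+ ```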