deploy_local_llm.mdx
Overview
This documentation page provides a comprehensive user guide for deploying and running large language models (LLMs) locally using various frameworks integrated with RAGFlow. It focuses on configuring local model deployment within RAGFlow via four supported frameworks:
Ollama
Xinference
IPEX-LLM accelerated Ollama
jina
The purpose of this file is to guide users through the steps required to:
Set up local model serving infrastructure.
Configure RAGFlow to connect to these local model servers.
Enable GPU/CUDA or other hardware acceleration options where applicable.
Use local models for both chat and embedding tasks within RAGFlow.
This documentation assumes users have basic knowledge of containerization, firewall/network configuration, and model serving concepts. It is not an installation guide for the underlying frameworks but focuses on integration and configuration inside RAGFlow.
Detailed Sections and Instructions
1. Deploy Local Models Using Ollama
Ollama is a framework that packages open-source LLMs for local deployment, bundling the model weights, configs, and runtime into a single manageable entity. It supports hardware acceleration with GPU/CUDA and exposes an HTTP API for interaction.
Steps to Deploy with Ollama:
1.1 Deploy Ollama using Docker
sudo docker run --name ollama -p 11434:11434 ollama/ollama
This runs Ollama in a Docker container exposing port
11434for API access.Verify Ollama is listening on all IP addresses:
sudo ss -tunlp | grep 11434
1.2 Pull Models
Examples:
sudo docker exec ollama ollama pull llama3.2 # 3B chat model
sudo docker exec ollama ollama pull bge-m3 # 567M embedding model
1.3 Find Ollama URL and Verify Accessibility
If RAGFlow runs in Docker, use
http://host.docker.internal:11434/.If RAGFlow and Ollama run on the same host, use
http://localhost:11434/.For remote Ollama, use its IP address URL.
Check connectivity with:
curl http://<ollama-url>:11434/
# Expected response: "Ollama is running"
1.4 Add and Configure Ollama in RAGFlow
Navigate in RAGFlow UI: Logo (top right) > Model providers > Add Ollama.
Fill in:
Model name matching pulled model (e.g.,
llama3.2).Model type (
chatorembedding).Base URL: e.g.,
http://host.docker.internal:11434/v1.Optional: Enable Vision support toggle if applicable.
Warning: Improper base URL causes connection refusal errors.
1.5 Update System Model Settings
Update chat and embedding models to the newly added Ollama models in System Model Settings.
1.6 Update Chat Configuration
Adjust chat configuration to use the new local Ollama models.
2. Deploy Local Models Using Xinference
Xinference is an inference server supporting AI models with acceleration. It provides HTTP endpoints for local serving.
Steps to Deploy with Xinference:
2.1 Check Firewall
Allow inbound on port 9997.
sudo ufw allow 9997/tcp
2.2 Start Xinference Server
xinference-local --host 0.0.0.0 --port 9997
2.3 Launch Model (e.g., Mistral)
xinference launch -u mistral --model-name mistral-v0.1 --size-in-billions 7 --model-format pytorch --quantization ${quantization}
2.4 Add Xinference Provider in RAGFlow
Navigate: Logo > Model providers > Add Xinference.
Enter base URL: e.g.,
http://<xinference-host>:9997/v1.For rerank models, use
http://<xinference-host>:9997/v1/rerank.
2.5 Update System Model Settings
Select the Xinference model from dropdowns under Chat model or Embedding model.
2.6 Update Chat or Embedding Configuration
Assign the local Xinference model in the chat or embedding setup of RAGFlow.
3. Deploy Local Models Using IPEX-LLM Accelerated Ollama
IPEX-LLM is a PyTorch extension optimized for Intel CPUs/GPUs allowing low latency LLM inference, integrated here with Ollama serving.
Steps to Deploy with IPEX-LLM:
3.1 Firewall Configuration
Allow port 11434 inbound.
sudo ufw allow 11434/tcp
3.2 Install and Initialize IPEX-LLM for Ollama
Follow official IPEX-LLM Quickstart guide.
Create and activate Conda environment (e.g.,
llm-cpp).Run:
conda activate llm-cpp
init-ollama # Linux
or
conda activate llm-cpp
init-ollama.bat # Windows (Admin Prompt)
3.3 Launch Ollama Service with IPEX-LLM
Set environment variables for GPU usage and performance optimizations, then run Ollama serve:
Linux example:
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
./ollama serve
Windows example (Miniforge Prompt):
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
ollama serve
Tip: Use OLLAMA_HOST=0.0.0.0 ./ollama serve to listen on all interfaces.
3.4 Pull and Run Ollama Models
In a new terminal:
Linux:
./ollama pull qwen2:latest
./ollama run qwen2:latest
Windows:
ollama.exe pull qwen2:latest
ollama run qwen2:latest
3.5 Configure RAGFlow
Follow the same RAGFlow integration steps as for Ollama:
Add Ollama provider.
Complete settings with correct model names and URLs.
Update system model settings.
Update chat configurations.
4. Deploy Local Models Using jina
jina is a neural search framework that can serve models locally through its server script.
Steps to Deploy with jina:
4.1 Firewall Configuration
Allow port 12345 inbound.
sudo ufw allow 12345/tcp
4.2 Install jina Package
pip install jina
4.3 Run jina Server
Navigate to rag/svr directory:
cd rag/svr
Run the server specifying the model (supports only Hugging Face models):
python jina_server.py --model_name gpt2
Important Implementation Details
Model Types: RAGFlow distinguishes between chat models and embedding models. Local deployments must specify the correct type during configuration.
Network Accessibility: Proper firewall and network configurations are essential to allow RAGFlow (which might run inside Docker or on a separate host) to communicate with local model serving endpoints.
Hardware Acceleration: Ollama supports GPU/CUDA acceleration; IPEX-LLM optimizes inference on Intel hardware. Proper environment variables and runtime configurations enhance performance.
Docker Considerations: When running RAGFlow and Ollama in Docker, the hostname
host.docker.internalis used to access host services from inside containers.Configuration Management: The RAGFlow UI provides sections for adding model providers and updating system-wide and chat-specific model settings, enabling seamless switching between local and remote models.
Interaction with Other System Components
RAGFlow UI: This file documents how users configure local model providers via the RAGFlow user interface, specifically under Model providers, System Model Settings, and Chat Configuration.
Docker Containers: The document describes considerations when running both RAGFlow and local model servers inside Docker containers, focusing on network accessibility.
Model Serving Backends: Ollama, Xinference, IPEX-LLM, and jina act as local backend servers providing RESTful APIs for inference. RAGFlow acts as a client consuming these APIs.
Firewall and Network: The document emphasizes network setup to allow communication between RAGFlow and local model servers.
Usage Examples
Example: Adding Ollama Chat Model to RAGFlow
Pull model in Ollama Docker:
sudo docker exec ollama ollama pull llama3.2
Verify Ollama endpoint:
curl http://host.docker.internal:11434/
# Output: Ollama is running
In RAGFlow UI:
Add Ollama as model provider.
Set base URL
http://host.docker.internal:11434/v1.Select model name
llama3.2and typechat.Save and update system models.
Use
llama3.2in chat configuration.
Visual Diagram: Flowchart of Local Model Deployment and Configuration Workflow
flowchart TD
A[Start: User wants to deploy local LLM] --> B{Choose Framework}
B -->|Ollama| C[Deploy Ollama Service]
B -->|Xinference| D[Deploy Xinference Server]
B -->|IPEX-LLM Ollama| E[Install IPEX-LLM & Launch Ollama]
B -->|jina| F[Run jina Server]
C --> G[Pull required models]
D --> H[Launch model(s)]
E --> I[Pull & Run models]
F --> J[Run jina_server.py with model]
G --> K[Configure RAGFlow: Add Ollama provider]
H --> L[Configure RAGFlow: Add Xinference provider]
I --> K
J --> M[Configure RAGFlow: Add jina provider]
K --> N[Set base URL, Model Name, and Type]
L --> N
M --> N
N --> O[Update System Model Settings]
O --> P[Update Chat/Embedding Configuration]
P --> Q[Local model ready for inference in RAGFlow]
style A fill:#f9f,stroke:#333,stroke-width:2px
style Q fill:#9f6,stroke:#333,stroke-width:2px
Summary
This documentation file is a crucial resource for users wanting to leverage local LLM deployments integrated into RAGFlow. It provides detailed procedural steps for four major frameworks, highlighting network, environment, and configuration needs, enabling efficient use of local models for chat and embedding tasks.
End of deploy_local_llm.mdx documentation.