deploy_local_llm.mdx

Overview

This page is a user guide for deploying and running large language models (LLMs) locally and connecting them to RAGFlow. It covers local model deployment through four supported frameworks:

- Ollama
- Xinference
- IPEX-LLM accelerated Ollama
- jina

The purpose of this file is to guide users through the steps required to:

- Set up local model serving infrastructure.
- Configure RAGFlow to connect to these local model servers.
- Enable GPU/CUDA or other hardware acceleration options where applicable.
- Use local models for both chat and embedding tasks within RAGFlow.

This documentation assumes users have basic knowledge of containerization, firewall/network configuration, and model serving concepts. It is not an installation guide for the underlying frameworks but focuses on integration and configuration inside RAGFlow.

Detailed Sections and Instructions

1. Deploy Local Models Using Ollama

Ollama is a framework that packages open-source LLMs for local deployment, bundling the model weights, configs, and runtime into a single manageable entity. It supports hardware acceleration with GPU/CUDA and exposes an HTTP API for interaction.

Steps to Deploy with Ollama:

1.1 Deploy Ollama using Docker

sudo docker run --name ollama -p 11434:11434 ollama/ollama

This runs Ollama in a Docker container exposing port 11434 for API access.
Verify Ollama is listening on all IP addresses:

sudo ss -tunlp | grep 11434
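
The check above can be scripted. The following is a minimal sketch, not part of Ollama or RAGFlow: `listens_on_all` is a hypothetical helper that reads `ss -tunlp` output on standard input and succeeds only if a socket is bound to the given port on a wildcard address (`0.0.0.0`, `*`, or `[::]`). It assumes the local address is the fifth whitespace-separated column, which may differ on your system.

```shell
# Hypothetical helper: succeed if stdin (output of `ss -tunlp`) shows a
# socket bound to port $1 on a wildcard address.
# Assumption: the local address:port is the fifth column.
listens_on_all() {
    awk -v port="$1" '
        $5 == ("0.0.0.0:" port) || $5 == ("*:" port) || $5 == ("[::]:" port) { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage:
#   sudo ss -tunlp | listens_on_all 11434 && echo "Ollama is reachable on all interfaces"
```

A socket bound only to `127.0.0.1:11434` makes the helper fail, which is the case where RAGFlow in a container or on another host cannot connect.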

1.2 Pull Models

Examples:

sudo docker exec ollama ollama pull llama3.2    # 3B chat model
sudo docker exec ollama ollama pull bge-m3      # 567M embedding model
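
To confirm that both pulls succeeded before configuring RAGFlow, the output of `ollama list` can be checked. The sketch below assumes that the first column of that output is the model name with its tag and that an untagged name means `:latest`. `has_model` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if `ollama list` output on stdin contains the
# model named in $1. A name without a tag is treated as "<name>:latest".
has_model() {
    case "$1" in
        *:*) want="$1" ;;
        *)   want="$1:latest" ;;
    esac
    # Skip the header row and compare the NAME column exactly.
    awk -v want="$want" 'NR > 1 && $1 == want { found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage:
#   sudo docker exec ollama ollama list | has_model llama3.2 || echo "llama3.2 is missing"
#   sudo docker exec ollama ollama list | has_model bge-m3   || echo "bge-m3 is missing"
```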

1.3 Find Ollama URL and Verify Accessibility

- If RAGFlow runs in Docker, use http://host.docker.internal:11434/.
- If RAGFlow and Ollama run on the same host, use http://localhost:11434/.
- If Ollama runs on a different machine, use http://<IP_OF_OLLAMA_MACHINE>:11434/.

Check connectivity with:

curl http://<ollama-host>:11434/
# Expected response: "Ollama is running"
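
The three cases above and the connectivity check can be combined in a small script. This is a sketch under stated assumptions: the function names and the `docker`/`local`/`remote` labels are hypothetical, and the check relies only on the "Ollama is running" response described above.

```shell
# Hypothetical helper: print the Ollama root URL for a deployment layout.
#   docker : RAGFlow runs in Docker
#   local  : RAGFlow and Ollama run on the same host
#   remote : Ollama runs on another machine; pass its IP or hostname as $2
ollama_root_url() {
    case "$1" in
        docker) echo "http://host.docker.internal:11434" ;;
        local)  echo "http://localhost:11434" ;;
        remote)
            [ -n "${2:-}" ] || { echo "remote layout needs a host" >&2; return 1; }
            echo "http://$2:11434" ;;
        *)      echo "unknown layout: $1" >&2; return 1 ;;
    esac
}

# Succeed only if the root endpoint answers with the expected banner.
ollama_is_up() {
    [ "$(curl -fsS --max-time 5 "${1%/}/")" = "Ollama is running" ]
}

# Usage:
#   ollama_is_up "$(ollama_root_url local)" && echo "OK"
```

Run the check from the environment where RAGFlow itself runs, because a URL that works on the host may not resolve inside a container.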

1.4 Add and Configure Ollama in RAGFlow

Navigate in RAGFlow UI: Logo (top right) > Model providers > Add Ollama.
Fill in:
- Model name matching pulled model (e.g., llama3.2).
- Model type (chat or embedding).
- Base URL: e.g., http://host.docker.internal:11434/v1.
- Optional: Enable Vision support toggle if applicable.

Warning: An incorrect or unreachable base URL causes connection-refused errors when RAGFlow tries to reach Ollama.
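
One way to catch a bad base URL before saving the provider is to call the endpoint directly. The sketch below assumes that the `/v1` path is Ollama's OpenAI-compatible API and that it serves `POST /v1/chat/completions`. The helper names are hypothetical and nothing here comes from RAGFlow itself.

```shell
# Hypothetical helper: turn an Ollama root URL into the /v1 base URL,
# tolerating a trailing slash and an already-present /v1 suffix.
ollama_v1_base() {
    url="${1%/}"        # drop one trailing slash
    url="${url%/v1}"    # drop an existing /v1 so it is not doubled
    echo "$url/v1"
}

# Send a one-message chat request and print the raw JSON reply.
# $1 = root or base URL, $2 = model name exactly as pulled (for example llama3.2)
ollama_chat_smoke_test() {
    curl -fsS --max-time 60 "$(ollama_v1_base "$1")/chat/completions" \
        -H "Content-Type: application/json" \
        -d "{\"model\": \"$2\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]}"
}

# Usage:
#   ollama_chat_smoke_test http://host.docker.internal:11434/ llama3.2
```

If this call fails with a connection error from where RAGFlow runs, the same base URL is likely to fail in the RAGFlow UI as well.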

1.5 Update System Model Settings

Update chat and embedding models to the newly added Ollama models in System Model Settings.

1.6 Update Chat Configuration

Adjust chat configuration to use the new local Ollama models.

2. Deploy Local Models Using Xinference

Xinference (Xorbits Inference) is an inference server for language, embedding, and rerank models. It serves them locally over HTTP endpoints.

Steps to Deploy with Xinference:

2.1 Check Firewall

Allow inbound connections on port 9997.

sudo ufw allow 9997/tcp

2.2 Start Xinference Server

xinference-local --host 0.0.0.0 --port 9997

2.3 Launch Model (e.g., Mistral)

xinference launch -u mistral --model-name mistral-v0.1 --size-in-billions 7 --model-format pytorch --quantization ${quantization}
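
The command above leaves `${quantization}` for you to set; if the variable is empty, the flag is passed without a usable value. A minimal sketch of a guard follows. `launch_mistral` is a hypothetical wrapper, and which quantization values are valid depends on the model, format, and Xinference version, so none is assumed here.

```shell
# Hypothetical wrapper around the launch command above. It refuses to run
# with an empty quantization instead of passing an empty flag value.
launch_mistral() {
    quantization="${1:-}"
    if [ -z "$quantization" ]; then
        echo "usage: launch_mistral <quantization>" >&2
        return 2
    fi
    xinference launch -u mistral --model-name mistral-v0.1 \
        --size-in-billions 7 --model-format pytorch \
        --quantization "$quantization"
}

# Usage (the value is only an example; check what your model supports):
#   launch_mistral 4-bit
```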

2.4 Add Xinference Provider in RAGFlow

Navigate: Logo > Model providers > Add Xinference.
Enter base URL: e.g., http://<xinference-host>:9997/v1.
For rerank models, use http://<xinference-host>:9997/v1/rerank.
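
The two URL forms above differ only in the `/rerank` suffix. The sketch below builds both from a host name and lists the models the server is serving. It assumes Xinference exposes an OpenAI-compatible `GET /v1/models` endpoint; the helper names are hypothetical.

```shell
# Hypothetical helpers: build the URLs entered in RAGFlow from a Xinference
# host ($1) and an optional port ($2, default 9997).
xinference_base_url()   { echo "http://$1:${2:-9997}/v1"; }
xinference_rerank_url() { echo "$(xinference_base_url "$1" "${2:-9997}")/rerank"; }

# List the models the server currently serves; a launched model should appear.
# Assumption: the server answers GET /v1/models.
xinference_list_models() {
    curl -fsS --max-time 10 "$(xinference_base_url "$1" "${2:-9997}")/models"
}

# Usage:
#   xinference_base_url 192.168.1.20      # for chat and embedding models
#   xinference_rerank_url 192.168.1.20    # for rerank models
#   xinference_list_models 192.168.1.20
```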

2.5 Update System Model Settings

Select the Xinference model from the dropdown under Chat model or Embedding model, depending on its type.

2.6 Update Chat or Embedding Configuration

Assign the local Xinference model in the chat or embedding setup of RAGFlow.

3. Deploy Local Models Using IPEX-LLM Accelerated Ollama

IPEX-LLM is a PyTorch library for running LLMs on Intel CPUs and GPUs with low latency. Here it accelerates the Ollama service.

Steps to Deploy with IPEX-LLM:

3.1 Firewall Configuration

Allow inbound connections on port 11434.

sudo ufw allow 11434/tcp

3.2 Install and Initialize IPEX-LLM for Ollama

Follow the official IPEX-LLM Quickstart guide.
Create and activate a Conda environment (e.g., llm-cpp).
Then run the initialization command for your platform:

conda activate llm-cpp
init-ollama    # Linux

conda activate llm-cpp
init-ollama.bat    # Windows (Admin Prompt)

3.3 Launch Ollama Service with IPEX-LLM

Set environment variables for GPU usage and performance optimizations, then run Ollama serve:

Linux example:

export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
./ollama serve

Windows example (Miniforge Prompt):

set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
ollama serve

Tip: Use OLLAMA_HOST=0.0.0.0 ./ollama serve to listen on all interfaces.

3.4 Pull and Run Ollama Models

In a new terminal:

Linux:

./ollama pull qwen2:latest
./ollama run qwen2:latest

Windows:

ollama.exe pull qwen2:latest
ollama run qwen2:latest

3.5 Configure RAGFlow

Follow the same RAGFlow integration steps as for Ollama:

- Add Ollama provider.
- Complete settings with correct model names and URLs.
- Update system model settings.
- Update chat configurations.

4. Deploy Local Models Using jina

jina is a neural search framework. Here it serves a model locally through the jina_server.py script in the rag/svr directory.

Steps to Deploy with jina:

4.1 Firewall Configuration

Allow inbound connections on port 12345.

sudo ufw allow 12345/tcp

4.2 Install jina Package

pip install jina

4.3 Run jina Server

Navigate to rag/svr directory:

cd rag/svr

Run the server and specify the model. Only models available on Hugging Face are supported:

python jina_server.py --model_name gpt2

Important Implementation Details

- Model Types: RAGFlow distinguishes between chat models and embedding models. Specify the correct type when you add a local model.
- Network Accessibility: RAGFlow may run inside Docker or on a separate host, so firewall and network settings must allow it to reach the local model serving endpoints.
- Hardware Acceleration: Ollama supports GPU/CUDA acceleration; IPEX-LLM optimizes inference on Intel hardware. The environment variables set in section 3.3 enable GPU use and performance optimizations.
- Docker Considerations: When RAGFlow runs in Docker, it reaches services on the Docker host, such as a published Ollama port, through the hostname host.docker.internal.
- Configuration Management: The RAGFlow UI has sections for adding model providers and for updating system-wide and chat-specific model settings, so you can switch between local and remote models.
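
Regarding the Docker consideration above: on Docker Engine for Linux, `host.docker.internal` is typically not defined unless the container is started with a mapping such as `--add-host=host.docker.internal:host-gateway`. The sketch below checks whether the name resolves inside a container. The helper name is hypothetical, and the sketch assumes the container image provides `getent`.

```shell
# Hypothetical helper: succeed if host.docker.internal resolves inside the
# container named in $1. Assumes `getent` exists in the container image.
container_sees_host() {
    docker exec "$1" getent hosts host.docker.internal >/dev/null
}

# Usage (the container name "ragflow-server" is an assumption; use the name
# that `docker ps` reports, and prefix docker with sudo if your setup needs it):
#   container_sees_host ragflow-server || echo "host.docker.internal does not resolve"
```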

Interaction with Other System Components

- RAGFlow UI: Users configure local model providers in the RAGFlow user interface under Model providers, System Model Settings, and Chat Configuration.
- Docker Containers: When RAGFlow or the local model servers run inside Docker containers, network accessibility between them needs attention.
- Model Serving Backends: Ollama (optionally accelerated by IPEX-LLM), Xinference, and jina act as local backend servers that expose network APIs for inference. RAGFlow acts as a client consuming these APIs.
- Firewall and Network: The firewall must allow communication between RAGFlow and the local model servers on their ports.

Usage Examples

Example: Adding Ollama Chat Model to RAGFlow

Pull the model in the Ollama container:

sudo docker exec ollama ollama pull llama3.2

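Verify the Ollama endpoint from the environment where RAGFlow runs (host.docker.internal resolves inside Docker containers):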

curl http://host.docker.internal:11434/
# Output: Ollama is running

In RAGFlow UI:

- Add Ollama as model provider.
- Set base URL http://host.docker.internal:11434/v1.
- Select model name llama3.2 and type chat.
- Save and update system models.
- Use llama3.2 in chat configuration.

Visual Diagram: Flowchart of Local Model Deployment and Configuration Workflow

flowchart TD
    A[Start: User wants to deploy local LLM] --> B{Choose Framework}
    B -->|Ollama| C[Deploy Ollama Service]
    B -->|Xinference| D[Deploy Xinference Server]
    B -->|IPEX-LLM Ollama| E[Install IPEX-LLM & Launch Ollama]
    B -->|jina| F[Run jina Server]

    C --> G[Pull required models]
    D --> H["Launch model(s)"]
    E --> I[Pull & Run models]
    F --> J[Run jina_server.py with model]

    G --> K[Configure RAGFlow: Add Ollama provider]
    H --> L[Configure RAGFlow: Add Xinference provider]
    I --> K
    J --> M[Configure RAGFlow: Add jina provider]

    K --> N[Set base URL, Model Name, and Type]
    L --> N
    M --> N

    N --> O[Update System Model Settings]
    O --> P[Update Chat/Embedding Configuration]
    P --> Q[Local model ready for inference in RAGFlow]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style Q fill:#9f6,stroke:#333,stroke-width:2px

Summary

This page gives step-by-step procedures for deploying local LLMs with four frameworks (Ollama, Xinference, IPEX-LLM accelerated Ollama, and jina) and connecting them to RAGFlow. It highlights the network, environment, and configuration requirements for using local models for chat and embedding tasks.
