deploy_local_llm.mdx


Overview

This page is a user guide for deploying and running large language models (LLMs) locally and connecting them to RAGFlow. It covers four supported frameworks: Ollama, Xinference, IPEX-LLM (accelerated Ollama), and jina.

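For each framework, the guide walks through deploying the server, pulling or launching a model, adding the provider in RAGFlow, and updating the system model settings and the chat or embedding configuration.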

This documentation assumes users have basic knowledge of containerization, firewall/network configuration, and model serving concepts. It is not an installation guide for the underlying frameworks but focuses on integration and configuration inside RAGFlow.


Detailed Sections and Instructions


1. Deploy Local Models Using Ollama

Ollama is a framework that packages open-source LLMs for local deployment, bundling the model weights, configs, and runtime into a single manageable entity. It supports hardware acceleration with GPU/CUDA and exposes an HTTP API for interaction.


Steps to Deploy with Ollama:

1.1 Deploy Ollama using Docker
sudo docker run --name ollama -p 11434:11434 ollama/ollama
sudo ss -tunlp | grep 11434

1.2 Pull Models

Examples:

sudo docker exec ollama ollama pull llama3.2    # 3B chat model
sudo docker exec ollama ollama pull bge-m3      # 567M embedding model

1.3 Find Ollama URL and Verify Accessibility

Check connectivity with:

curl http://<ollama-url>:11434/
# Expected response: "Ollama is running"

1.4 Add and Configure Ollama in RAGFlow

Warning: An incorrect base URL causes connection-refused errors.
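Before entering the base URL in RAGFlow, it can help to confirm that the exact URL answers with the expected banner. The following is a minimal sketch, not part of RAGFlow or Ollama: the function names `ollama_base_url`, `is_ollama_banner`, and `check_ollama` are hypothetical, and the only assumptions are the default port 11434 and the "Ollama is running" reply shown in step 1.3.

```shell
# Hypothetical helpers for checking an Ollama base URL before using it in RAGFlow.

# Build the base URL from a host and an optional port (default 11434, as in step 1.1).
ollama_base_url() {
  local host="${1:?usage: ollama_base_url <host> [port]}"
  local port="${2:-11434}"
  printf 'http://%s:%s' "$host" "$port"
}

# Succeed only if the response body is the banner shown in step 1.3.
is_ollama_banner() {
  [ "$1" = "Ollama is running" ]
}

# Probe the endpoint. --max-time keeps a blocked port from hanging the shell.
check_ollama() {
  local url body
  url="$(ollama_base_url "$@")" || return 1
  if ! body="$(curl --silent --max-time 5 "$url/")"; then
    echo "unreachable: $url" >&2
    return 1
  fi
  if is_ollama_banner "$body"; then
    echo "ok: $url"
  else
    echo "unexpected reply from $url" >&2
    return 1
  fi
}
```

For example, `check_ollama host.docker.internal` probes the URL used in the usage example further down. Run the check from the machine or container where RAGFlow itself runs, because that is the network path RAGFlow will use.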

1.5 Update System Model Settings
1.6 Update Chat Configuration

2. Deploy Local Models Using Xinference

Xinference is an inference server supporting AI models with acceleration. It provides HTTP endpoints for local serving.


Steps to Deploy with Xinference:

2.1 Check Firewall

Allow inbound on port 9997.

sudo ufw allow 9997/tcp

2.2 Start Xinference Server
xinference-local --host 0.0.0.0 --port 9997

2.3 Launch Model (e.g., Mistral)
xinference launch -u mistral --model-name mistral-v0.1 --size-in-billions 7 --model-format pytorch --quantization ${quantization}
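The launch command above leaves `${quantization}` for the user to fill in. As a rough sketch, the hypothetical helper below (`xinference_launch_cmd` is not part of Xinference) prints the command from step 2.3 with a chosen value so it can be reviewed before running. The value `8-bit` in the usage note is only an example; which values are accepted depends on the model format and the installed Xinference version, so treat it as an assumption.

```shell
# Hypothetical helper: print the launch command from step 2.3 with the
# quantization filled in, instead of executing it directly.
xinference_launch_cmd() {
  local quantization="${1:?usage: xinference_launch_cmd <quantization>}"
  printf '%s\n' "xinference launch -u mistral --model-name mistral-v0.1 --size-in-billions 7 --model-format pytorch --quantization ${quantization}"
}
```

Calling `xinference_launch_cmd 8-bit` prints the full command; piping that output to `sh` runs it once the value has been checked.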
2.4 Add Xinference Provider in RAGFlow
2.5 Update System Model Settings
2.6 Update Chat or Embedding Configuration

3. Deploy Local Models Using IPEX-LLM Accelerated Ollama

IPEX-LLM is a PyTorch extension optimized for Intel CPUs and GPUs that enables low-latency LLM inference. Here it is used to accelerate an Ollama server.


Steps to Deploy with IPEX-LLM:

3.1 Firewall Configuration

Allow port 11434 inbound.

sudo ufw allow 11434/tcp

3.2 Install and Initialize IPEX-LLM for Ollama
conda activate llm-cpp
init-ollama    # Linux

or

conda activate llm-cpp
init-ollama.bat    # Windows (Admin Prompt)

3.3 Launch Ollama Service with IPEX-LLM

Set environment variables for GPU usage and performance optimizations, then run Ollama serve:

Linux example:

export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
./ollama serve

Windows example (Miniforge Prompt):

set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
ollama serve

Tip: Use OLLAMA_HOST=0.0.0.0 ./ollama serve to listen on all interfaces.
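The Linux steps above can be collected into a small wrapper so the same environment is set on every start. This is a sketch under stated assumptions: the function names `ipex_ollama_env` and `start_ipex_ollama` are hypothetical, the oneAPI script is sourced only if it exists at the path given above, and the bind address defaults to 127.0.0.1 unless another one is passed.

```shell
# Hypothetical wrapper around the Linux commands from step 3.3.

# Export the variables listed in step 3.3.
ipex_ollama_env() {
  export OLLAMA_NUM_GPU=999
  export no_proxy=localhost,127.0.0.1
  export ZES_ENABLE_SYSMAN=1
  # Source the oneAPI environment only when the script is present.
  local setvars=/opt/intel/oneapi/setvars.sh
  if [ -f "$setvars" ]; then
    . "$setvars"
  fi
  export SYCL_CACHE_PERSISTENT=1
}

# Start the server; pass 0.0.0.0 as the first argument to listen on all interfaces.
start_ipex_ollama() {
  ipex_ollama_env || return 1
  OLLAMA_HOST="${1:-127.0.0.1}" ./ollama serve
}
```

Running `start_ipex_ollama 0.0.0.0` from the directory that contains the `ollama` binary corresponds to the tip above.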

3.4 Pull and Run Ollama Models

In a new terminal:

Linux:

./ollama pull qwen2:latest
./ollama run qwen2:latest

Windows:

ollama.exe pull qwen2:latest
ollama run qwen2:latest

3.5 Configure RAGFlow

Follow the same RAGFlow integration steps as for Ollama (steps 1.4 to 1.6).


4. Deploy Local Models Using jina

jina is a neural search framework that can serve models locally through its server script.


Steps to Deploy with jina:

4.1 Firewall Configuration

Allow port 12345 inbound.

sudo ufw allow 12345/tcp

4.2 Install jina Package
pip install jina

4.3 Run jina Server

Navigate to rag/svr directory:

cd rag/svr

Run the server specifying the model (supports only Hugging Face models):

python jina_server.py --model_name gpt2
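The firewall steps in sections 2.1, 3.1, and 4.1 differ only in the port number. As an illustrative sketch, the hypothetical helpers below (`framework_port` and `ufw_allow_cmd` are not part of any of these tools) map each framework to the port used in this guide and print the matching `ufw` command rather than running it. The Ollama port is taken from the `-p 11434:11434` mapping in step 1.1.

```shell
# Hypothetical lookup of the ports used in this guide.
framework_port() {
  case "$1" in
    ollama|ipex-llm) echo 11434 ;;
    xinference)      echo 9997 ;;
    jina)            echo 12345 ;;
    *) echo "unknown framework: $1" >&2; return 1 ;;
  esac
}

# Print (not run) the ufw command for the given framework.
ufw_allow_cmd() {
  local port
  port="$(framework_port "$1")" || return 1
  echo "sudo ufw allow ${port}/tcp"
}
```

Printing the command first keeps the `sudo` call visible and deliberate; for instance, `ufw_allow_cmd xinference` prints the command from step 2.1.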

Important Implementation Details


Interaction with Other System Components


Usage Examples

Example: Adding Ollama Chat Model to RAGFlow

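  1. Pull the model in the Ollama Docker container:

sudo docker exec ollama ollama pull llama3.2

  2. Verify the Ollama endpoint:

curl http://host.docker.internal:11434/
# Output: Ollama is running

  3. In the RAGFlow UI, add the Ollama provider and set the base URL, model name, and model type.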


Visual Diagram: Flowchart of Local Model Deployment and Configuration Workflow

flowchart TD
    A[Start: User wants to deploy local LLM] --> B{Choose Framework}
    B -->|Ollama| C[Deploy Ollama Service]
    B -->|Xinference| D[Deploy Xinference Server]
    B -->|IPEX-LLM Ollama| E[Install IPEX-LLM & Launch Ollama]
    B -->|jina| F[Run jina Server]

    C --> G[Pull required models]
    D --> H["Launch model(s)"]
    E --> I[Pull & Run models]
    F --> J[Run jina_server.py with model]

    G --> K[Configure RAGFlow: Add Ollama provider]
    H --> L[Configure RAGFlow: Add Xinference provider]
    I --> K
    J --> M[Configure RAGFlow: Add jina provider]

    K --> N[Set base URL, Model Name, and Type]
    L --> N
    M --> N

    N --> O[Update System Model Settings]
    O --> P[Update Chat/Embedding Configuration]
    P --> Q[Local model ready for inference in RAGFlow]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style Q fill:#9f6,stroke:#333,stroke-width:2px

Summary

This page gives step-by-step procedures for running local LLMs with Ollama, Xinference, IPEX-LLM, and jina and connecting them to RAGFlow. For each framework it covers the network, environment, and configuration requirements for using local models in chat and embedding tasks.

