A privacy-first, local RAG (Retrieval Augmented Generation) system that enables natural-language, AI-powered search across your documents through a REST API.
- Hybrid Search (BM25 + Vector + RRF): 48% improvement in retrieval quality
  - Combines sparse (BM25) keyword search with dense (vector) semantic search
  - Reciprocal Rank Fusion (k=60) for optimal result merging
  - Excels at both exact term matching and semantic understanding
- Contextual Retrieval (Anthropic Method): 49% reduction in retrieval failures
  - LLM-generated document context prepended to chunks
  - Zero query-time overhead (context embedded once at indexing)
  - 67% reduction in failures when combined with reranking
- Natural Language Search: Ask questions in plain English and get AI-powered answers
- Advanced Document Processing: Powered by Docling with superior PDF/DOCX parsing, table extraction, and layout understanding
- Multi-Format Support: Process txt, md, pdf, docx, pptx, xlsx, html, and more
- Intelligent Chunking: Structural chunking that preserves document hierarchy (headings, sections, tables)
- Two-Stage Retrieval: Hybrid search (BM25 + Vector) → cross-encoder reranking
- Conversational Memory: Redis-backed chat history that persists across restarts with session management
- Data Protection: Automated ChromaDB backups with restore capability
- Local Deployment: All data stays on your machine
- Async Document Processing: Celery + Redis for background uploads with real-time progress tracking
- REST API: Clean API for integration with any frontend or application
- RAG Server: Python + FastAPI (REST API)
- Vector Database: ChromaDB with LlamaIndex integration
- LLM: Ollama (gemma3:4b for generation and evaluation)
- Embeddings: nomic-embed-text:latest via LlamaIndex OllamaEmbedding
- Document Processing: Docling + LlamaIndex DoclingReader/DoclingNodeParser
- Chunking: Docling structural chunking (preserves document hierarchy)
- Hybrid Search: BM25 (sparse) + Vector (dense) with Reciprocal Rank Fusion
- Reranking: SentenceTransformer cross-encoder (ms-marco-MiniLM-L-6-v2)
- Orchestration: Docker Compose
- Package Management: uv
- Task Queue: Celery + Redis (async document processing, chat history persistence, progress tracking)
┌─────────────────┐
│   Client Apps   │  (Your web app, CLI, etc.)
│    (Future)     │
└────────┬────────┘
         │ HTTP/REST
         │
┌────────▼────────┐
│   RAG Server    │  (Port 8001, Public API)
│    (FastAPI)    │
│                 │
│  ┌──────────┐   │
│  │ Docling  │   │  Document parsing
│  │    +     │   │  + Contextual Retrieval
│  │LlamaIndex│   │  + Hybrid Search (BM25+Vector)
│  └──────────┘   │
└────────┬────────┘
         │
    ┌────┴─────┬──────────┬───────────┐
    │          │          │           │
┌───▼────┐ ┌───▼─────┐ ┌──▼────┐ ┌────▼───┐
│ChromaDB│ │ Ollama  │ │ Redis │ │ Celery │
│ (8000) │ │ (11434) │ │(6379) │ │ Worker │
└────────┘ └─────────┘ └───────┘ └────────┘
  Vector     LLM +        Chat      Async
  Storage   Embeddings   Memory     Tasks
- Docker & Docker Compose
  - Docker Desktop or Docker Engine
  - Docker Compose v2+
- Ollama (running on host)

  # Install Ollama
  curl https://ollama.ai/install.sh | sh

  # Pull required models
  ollama pull gemma3:4b          # LLM for generation
  ollama pull nomic-embed-text   # Embeddings

- Python 3.13+ (for local development/testing)

  # Install uv package manager
  curl -LsSf https://astral.sh/uv/install.sh | sh
cd /path/to/rag-docling
# Create secrets directory
mkdir -p secrets
# Add Ollama configuration if needed
echo "OLLAMA_HOST=http://host.docker.internal:11434" > secrets/ollama_config.env
docker compose up -d
This will start:
- RAG Server API on http://localhost:8001
- ChromaDB (internal)
- Redis (internal)
- Celery Worker (internal)
# Check health
curl http://localhost:8001/health
# Check models
curl http://localhost:8001/models/info
# Upload single file
curl -X POST http://localhost:8001/upload \
-F "files=@/path/to/document.pdf"
# Upload multiple files
curl -X POST http://localhost:8001/upload \
-F "files=@document1.pdf" \
-F "files=@document2.docx"
# Response includes batch_id for tracking
{
"status": "queued",
"batch_id": "abc-123",
"tasks": [
{"task_id": "task-1", "filename": "document1.pdf"}
]
}
# Get batch status
curl http://localhost:8001/tasks/{batch_id}/status
# Simple query (auto-generates session_id)
curl -X POST http://localhost:8001/query \
-H "Content-Type: application/json" \
-d '{
"query": "What is the main topic?",
"n_results": 5
}'
# Conversational query with session
curl -X POST http://localhost:8001/query \
-H "Content-Type: application/json" \
-d '{
"query": "Tell me more about that",
"session_id": "user-123",
"n_results": 5
}'
curl http://localhost:8001/documents
curl -X DELETE http://localhost:8001/documents/{document_id}
# Get conversation history
curl http://localhost:8001/chat/history/{session_id}
# Clear chat history
curl -X POST http://localhost:8001/chat/clear \
-H "Content-Type: application/json" \
-d '{"session_id": "user-123"}'
cd services/rag_server
uv sync
.venv/bin/pytest -v
rag-docling/
├── docker-compose.yml # Service orchestration
├── secrets/ # Configuration secrets
├── services/
│ └── rag_server/ # RAG API backend
│ ├── main.py # FastAPI app
│ ├── celery_app.py # Celery configuration
│ ├── tasks.py # Async document processing
│ ├── core_logic/ # RAG components
│ │ ├── embeddings.py # OllamaEmbedding
│ │ ├── document_processor.py # Docling + Contextual Retrieval
│ │ ├── chroma_manager.py # VectorStoreIndex
│ │ ├── llm_handler.py # LLM + prompts
│ │ ├── rag_pipeline.py # Query pipeline
│ │ ├── hybrid_retriever.py # BM25 + Vector + RRF
│ │ ├── chat_memory.py # Redis-backed memory
│ │ └── progress_tracker.py # Upload progress
│ ├── tests/ # Test suite (33 core + 27 evaluation)
│ └── pyproject.toml # Dependencies
├── docs/ # Documentation
│ ├── PHASE1_IMPLEMENTATION_SUMMARY.md
│ ├── PHASE2_IMPLEMENTATION_SUMMARY.md
│ ├── RAG_ACCURACY_IMPROVEMENT_PLAN_2025.md
│ ├── CONVERSATIONAL_RAG.md
│ └── evaluation/
└── README.md
Upload documents for indexing (async via Celery).
Request:
curl -X POST http://localhost:8001/upload \
-F "files=@document.pdf"
Response:
{
"status": "queued",
"batch_id": "abc-123",
"tasks": [
{
"task_id": "task-1",
"filename": "document.pdf"
}
]
}
Supported Formats: .txt, .md, .pdf, .docx, .pptx, .xlsx, .html, .htm, .asciidoc, .adoc
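For reference, a hedged Python equivalent of the curl upload above, using the requests library with the documented multipart field name (files); the file name and content type are just examples:

import requests

# Upload a single document; the server queues it for async processing.
with open("document.pdf", "rb") as handle:
    response = requests.post(
        "http://localhost:8001/upload",
        files=[("files", ("document.pdf", handle, "application/pdf"))],
    )
batch = response.json()
print(batch["status"], batch["batch_id"])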
Get upload batch progress.
Response:
{
"batch_id": "abc-123",
"total": 2,
"completed": 1,
"total_chunks": 25,
"completed_chunks": 10,
"tasks": {
"task-1": {
"status": "completed",
"filename": "document.pdf",
"chunks": 5
},
"task-2": {
"status": "processing",
"filename": "document2.pdf"
}
}
}
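A hedged example of polling this endpoint from Python until the batch finishes (simplified: it treats completed >= total as done and does not distinguish failed tasks):

import time
import requests

BASE_URL = "http://localhost:8001"

def wait_for_batch(batch_id: str, poll_seconds: float = 2.0) -> dict:
    """Poll the batch status endpoint until every file has been processed."""
    while True:
        status = requests.get(f"{BASE_URL}/tasks/{batch_id}/status").json()
        print(f"{status['completed']}/{status['total']} files processed")
        if status["completed"] >= status["total"]:
            return status
        time.sleep(poll_seconds)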
Search documents and get AI-generated answer.
Request:
{
"query": "What is the main topic?",
"session_id": "user-123",
"n_results": 5
}
Response:
{
"answer": "Based on the documents...",
"sources": [
{
"document_name": "file.pdf",
"excerpt": "The main topic is...",
"full_text": "...",
"path": "/docs/file.pdf",
"distance": 0.15
}
],
"query": "What is the main topic?",
"session_id": "user-123"
}
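The same query can be issued from Python; a minimal example with the requests library against the documented /query endpoint:

import requests

response = requests.post(
    "http://localhost:8001/query",
    json={
        "query": "What is the main topic?",
        "session_id": "user-123",  # optional; omit to auto-generate one
        "n_results": 5,
    },
)
result = response.json()
print(result["answer"])
for source in result["sources"]:
    print(f"- {source['document_name']}: {source['excerpt']}")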
List all indexed documents (grouped by document_id).
Response:
{
"documents": [
{
"id": "doc-123",
"file_name": "document.pdf",
"file_type": ".pdf",
"path": "/docs",
"chunks": 5,
"file_size_bytes": 102400
}
]
}
Delete a document and all its chunks.
Response:
{
"status": "success",
"message": "Document deleted successfully",
"deleted_chunks": 5
}
Get conversation history for a session.
Response:
{
"session_id": "user-123",
"messages": [
{
"role": "user",
"content": "What is the main topic?"
},
{
"role": "assistant",
"content": "Based on the documents..."
}
]
}
Clear chat history for a session.
Request:
{
"session_id": "user-123"
}
Health check endpoint.
Get current model configuration.
Response:
{
"llm_model": "gemma3:4b",
"embedding_model": "nomic-embed-text:latest",
"reranker_model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
"ollama_url": "http://host.docker.internal:11434",
"enable_reranker": true,
"enable_hybrid_search": true,
"enable_contextual_retrieval": true
}
RAG Server (docker-compose.yml):
Core Settings:
- CHROMADB_URL: ChromaDB endpoint (default: http://chromadb:8000)
- OLLAMA_URL: Ollama endpoint (default: http://host.docker.internal:11434)
- EMBEDDING_MODEL: Embedding model (default: nomic-embed-text:latest)
- LLM_MODEL: LLM model (default: gemma3:4b)
- REDIS_URL: Redis endpoint (default: redis://redis:6379/0)
Retrieval Configuration:
- RETRIEVAL_TOP_K: Number of nodes to retrieve before reranking (default: 10)
- ENABLE_RERANKER: Enable cross-encoder reranking (default: true)
- RERANKER_MODEL: Reranker model (default: cross-encoder/ms-marco-MiniLM-L-6-v2)
Phase 2 Features:
- ENABLE_HYBRID_SEARCH: Enable BM25 + Vector search (default: true)
- RRF_K: Reciprocal Rank Fusion k parameter (default: 60)
- ENABLE_CONTEXTUAL_RETRIEVAL: Enable Anthropic contextual retrieval (default: true)
Logging:
- LOG_LEVEL: Logging level (default: DEBUG for rag-server, INFO for celery-worker)
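As an illustrative sketch only (variable names and defaults mirror the list above; the actual configuration loading lives in the rag_server code), these settings could be read in Python like this:

import os

# Hedged sketch: names and defaults taken from the documentation above.
CHROMADB_URL = os.getenv("CHROMADB_URL", "http://chromadb:8000")
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "nomic-embed-text:latest")
LLM_MODEL = os.getenv("LLM_MODEL", "gemma3:4b")
REDIS_URL = os.getenv("REDIS_URL", "redis://redis:6379/0")
RETRIEVAL_TOP_K = int(os.getenv("RETRIEVAL_TOP_K", "10"))
RRF_K = int(os.getenv("RRF_K", "60"))
ENABLE_RERANKER = os.getenv("ENABLE_RERANKER", "true").lower() == "true"
ENABLE_HYBRID_SEARCH = os.getenv("ENABLE_HYBRID_SEARCH", "true").lower() == "true"
ENABLE_CONTEXTUAL_RETRIEVAL = os.getenv("ENABLE_CONTEXTUAL_RETRIEVAL", "true").lower() == "true"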
Docker networks:
- public: RAG server (exposed to host on port 8001)
- private: ChromaDB, Redis, Celery Worker (internal only)
Combines sparse (keyword) and dense (semantic) retrieval for a 48% improvement in retrieval quality:
- BM25 Retriever: Excels at exact keywords, IDs, names, abbreviations
- Vector Retriever: Excels at semantic understanding, contextual meaning
- RRF Fusion: Reciprocal Rank Fusion with k=60 (optimal per research)
  - Formula: score = 1/(rank + k)
  - No hyperparameter tuning required
How it works:
- Query runs through both BM25 and Vector retrievers
- Results merged using RRF (k=60)
- Top-k combined results passed to reranker
- Final top-n results returned as context
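As a hedged illustration (not the project's hybrid_retriever.py, which builds on LlamaIndex retrievers), the RRF merge step boils down to a few lines:

from collections import defaultdict

def rrf_merge(bm25_ranked, vector_ranked, k=60, top_k=10):
    """Merge two ranked lists of node IDs with Reciprocal Rank Fusion.

    Each input list is ordered best-first; a node's score is the sum of
    1/(rank + k) across retrievers, then results are re-sorted. Sketch only.
    """
    scores = defaultdict(float)
    for ranked in (bm25_ranked, vector_ranked):
        for rank, node_id in enumerate(ranked, start=1):
            scores[node_id] += 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: "doc-a" ranks near the top of both lists, so it wins.
print(rrf_merge(["doc-a", "doc-b", "doc-c"], ["doc-a", "doc-c", "doc-d"]))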
Auto-initialization:
- BM25 index pre-loads at startup if documents exist
- Auto-refreshes after document uploads/deletions
Adds document-level context to chunks before embedding, for a 49% reduction in retrieval failures:
The Problem:
Original Chunk: "The three qualities are: natural aptitude, deep interest, and scope."
Query: "What makes great work?"
Result: ❌ MISSED (no direct term match)
The Solution:
Enhanced Chunk: "This section from Paul Graham's essay 'How to Do Great Work'
discusses the essential qualities for great work. The three qualities are:
natural aptitude, deep interest, and scope."
Implementation:
- For each chunk, LLM generates 1-2 sentence context
- Context prepended to original chunk text
- Enhanced chunk embedded (context embedded once at indexing time)
- Query-time: Zero overhead (context already embedded)
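A hedged sketch of the enhancement step (the real logic lives in document_processor.py; the prompt wording and LLM call shown here are assumptions):

CONTEXT_PROMPT = (
    "Here is a document:\n{document}\n\n"
    "Here is a chunk from that document:\n{chunk}\n\n"
    "Write 1-2 sentences situating this chunk within the overall document."
)

def contextualize_chunk(llm, document_text: str, chunk_text: str) -> str:
    """Prepend an LLM-generated document-level context to a chunk.

    The enhanced text is what gets embedded at indexing time, so queries
    pay no extra cost. Illustrative sketch only.
    """
    context = llm.complete(
        CONTEXT_PROMPT.format(document=document_text, chunk=chunk_text)
    ).text.strip()
    return f"{context}\n\n{chunk_text}"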
Performance:
- 49% reduction in retrieval failures
- 67% reduction when combined with reranking
See docs/PHASE2_IMPLEMENTATION_SUMMARY.md for complete details.
Issue: Document upload fails with Pydantic validation error if DoclingReader export format is not specified.
Root Cause: DoclingReader defaults to MARKDOWN export, but DoclingNodeParser requires JSON format.
Fix: Always use JSON export in document_processor.py:

reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
Issue: ChromaDB rejects complex metadata types (lists, dicts) from Docling.
Fix: Filter metadata to flat types (str, int, float, bool, None) using the clean_metadata_for_chroma() function before insertion.
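A minimal sketch of what such a filter might look like (the actual helper in the repo may handle more cases):

def clean_metadata_for_chroma(metadata: dict) -> dict:
    """Keep only flat metadata values that ChromaDB accepts.

    Sketch of the idea: lists, dicts, and other complex Docling values
    are dropped before the node is inserted into the collection.
    """
    allowed = (str, int, float, bool)
    return {
        key: value
        for key, value in metadata.items()
        if value is None or isinstance(value, allowed)
    }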
Issue: Custom hybrid retriever integration with CondensePlusContextChatEngine failing.
Fix: Pass the retriever directly to CondensePlusContextChatEngine.from_defaults():
chat_engine = CondensePlusContextChatEngine.from_defaults(
retriever=retriever, # Not query_engine
memory=memory,
node_postprocessors=create_reranker_postprocessors(),
...
)
# Verify Ollama is running
curl http://localhost:11434/api/tags
# Check models are available
ollama list
# Pull missing models
ollama pull gemma3:4b
ollama pull nomic-embed-text
# Check ChromaDB logs
docker compose logs chromadb
# Restart ChromaDB
docker compose restart chromadb
# All services
docker compose logs -f
# Specific service
docker compose logs -f rag-server
docker compose logs -f celery-worker
docker compose logs -f redis
# Stop services
docker compose down
# Remove ChromaDB volume
docker volume rm rag-docling_chroma_db_data
# Restart
docker compose up -d
Automated backup scripts are provided to protect against data loss:
# Manual backup to default location (./backups/chromadb/)
./scripts/backup_chromadb.sh
# Schedule daily backups at 2 AM (add to crontab)
crontab -e
# Add: 0 2 * * * cd /path/to/rag-docling && ./scripts/backup_chromadb.sh >> /var/log/chromadb_backup.log 2>&1
Features:
- Timestamped backups (chromadb_backup_YYYYMMDD_HHMMSS.tar.gz)
- 30-day retention (automatically removes old backups)
- Document count verification
- Health check after restore
# List available backups
ls -lh ./backups/chromadb/
# Restore from specific backup
./scripts/restore_chromadb.sh ./backups/chromadb/chromadb_backup_20251013_020000.tar.gz
Note: Restore process stops services, replaces data, and verifies health after restart.
See scripts/README.md for complete documentation.
The project follows Test-Driven Development (TDD) methodology:
- 60 total tests (all passing)
- 33 RAG server core tests (Docling processing, embeddings, LlamaIndex integration, LLM, pipeline, API)
- 27 evaluation tests (RAGAS metrics, dataset loading, report generation)
Run all tests:
# RAG Server core tests
cd services/rag_server && .venv/bin/pytest -v
# RAG Server evaluation tests
cd services/rag_server && .venv/bin/pytest tests/evaluation/ -v
- Upload: Documents uploaded via the /upload endpoint
- Async Processing: Celery worker processes each file
- Docling Parsing: DoclingReader extracts text, tables, structure
- Contextual Enhancement: LLM adds document context to each chunk
- Structural Chunking: DoclingNodeParser creates nodes preserving hierarchy
- Embedding: Each chunk embedded with context (nomic-embed-text)
- Storage: Nodes stored in ChromaDB with metadata
- BM25 Refresh: BM25 index updated with new nodes
- Query Received: User query + optional session_id
- Memory Loading: Previous conversation loaded from Redis
- Query Condensation: Standalone question created if conversational
- Hybrid Retrieval:
  - BM25 retriever finds keyword matches
  - Vector retriever finds semantic matches
  - RRF merges results (k=60)
- Reranking: Cross-encoder reranks the top-k nodes (see the sketch after this list)
- Top-n Selection: Best 5-10 nodes selected as context
- LLM Generation: Answer generated using context + memory
- Memory Update: Conversation saved to Redis (1-hour TTL)
- Response: Answer + sources returned
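For the reranking step, a hedged standalone sketch using the SentenceTransformers cross-encoder named in the tech stack (in the project itself this runs inside a LlamaIndex node postprocessor rather than as a free function):

from sentence_transformers import CrossEncoder

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    """Score (query, passage) pairs with the cross-encoder and keep the best."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, passage) for passage in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]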
- Document Structure Preservation: Maintains headings, sections, tables as separate nodes
- Hybrid Retrieval: BM25 (exact matching) + Vector (semantic understanding)
- Contextual Enhancement: Document context embedded with chunks
- Two-Stage Precision: Reranking refines hybrid search results
- Conversational Memory: Redis-backed chat history with session management
- Data Protection: Automated backups, startup persistence verification
- Async Processing: Celery handles document uploads in background
- Progress Tracking: Real-time upload progress via Redis
- Redis-backed chat memory (conversations persist across restarts)
- ChromaDB backup/restore automation
- Reranker optimization (top-n selection)
- Startup persistence verification
- Dependency updates (ChromaDB 1.1.1, FastAPI 0.118.3, Redis 6.4.0)
See docs/PHASE1_IMPLEMENTATION_SUMMARY.md for details.
- Hybrid search (BM25 + Vector + RRF) - 48% retrieval improvement
- Contextual retrieval (Anthropic method) - 49% fewer failures
- Auto-refresh BM25 after uploads/deletes
- End-to-end testing and validation
See docs/PHASE2_IMPLEMENTATION_SUMMARY.md for details.
- Parent document retrieval (sentence window)
- Query fusion (multi-query generation)
- DeepEval evaluation framework
- Expanded golden QA dataset (50+ test cases)
- Production monitoring dashboard
- Support for additional file formats (CSV, JSON)
- Multi-user support with authentication
- Export search results
See docs/RAG_ACCURACY_IMPROVEMENT_PLAN_2025.md for future plans.
MIT License
0.2.0 - Phase 2 Implementation