A privacy-first, local RAG (Retrieval Augmented Generation) system that enables natural-language, AI-powered search across your documents through a REST API.
- Hybrid Search (BM25 + Vector + RRF): 48% improvement in retrieval quality
  - Combines sparse (BM25) keyword search with dense (vector) semantic search
  - Reciprocal Rank Fusion (k=60) for optimal result merging
  - Excels at both exact term matching and semantic understanding
- Contextual Retrieval (Anthropic Method): 49% reduction in retrieval failures
  - LLM-generated document context prepended to chunks
  - Zero query-time overhead (context embedded once at indexing)
  - 67% reduction in failures when combined with reranking
- Natural Language Search: Ask questions in plain English and get AI-powered answers
- Advanced Document Processing: Powered by Docling with superior PDF/DOCX parsing, table extraction, and layout understanding
- Multi-Format Support: Process txt, md, pdf, docx, pptx, xlsx, html, and more
- Intelligent Chunking: Structural chunking that preserves document hierarchy (headings, sections, tables)
- Two-Stage Retrieval: Hybrid search (BM25 + Vector) → cross-encoder reranking
- Conversational Memory: Redis-backed chat history that persists across restarts with session management
- Data Protection: Automated ChromaDB backups with restore capability
- Local Deployment: All data stays on your machine
- Async Document Processing: Celery + Redis for background uploads with real-time progress tracking
- REST API: Clean API for integration with any frontend or application
- RAG Server: Python + FastAPI (REST API)
- Vector Database: ChromaDB with LlamaIndex integration
- LLM: Ollama (gemma3:4b for generation and evaluation)
- Embeddings: nomic-embed-text:latest via LlamaIndex OllamaEmbedding
- Document Processing: Docling + LlamaIndex DoclingReader/DoclingNodeParser
- Chunking: Docling structural chunking (preserves document hierarchy)
- Hybrid Search: BM25 (sparse) + Vector (dense) with Reciprocal Rank Fusion
- Reranking: SentenceTransformer cross-encoder (ms-marco-MiniLM-L-6-v2)
- Orchestration: Docker Compose
- Package Management: uv
- Task Queue: Celery + Redis (async document processing, chat history persistence, progress tracking)
┌─────────────────┐
│   Client Apps   │  (Your web app, CLI, etc.)
│    (Future)     │
└────────┬────────┘
         │ HTTP/REST
         │
┌────────▼────────┐
│   RAG Server    │  (Port 8001, Public API)
│    (FastAPI)    │
│                 │
│  ┌──────────┐   │
│  │ Docling  │   │  Document parsing
│  │    +     │   │  + Contextual Retrieval
│  │LlamaIndex│   │  + Hybrid Search (BM25+Vector)
│  └──────────┘   │
└────────┬────────┘
         │
    ┌────┴─────┬──────────┬───────────┐
    │          │          │           │
┌───▼────┐ ┌───▼─────┐ ┌──▼────┐ ┌────▼───┐
│ChromaDB│ │ Ollama  │ │ Redis │ │ Celery │
│ (8000) │ │ (11434) │ │(6379) │ │ Worker │
└────────┘ └─────────┘ └───────┘ └────────┘
  Vector     LLM +        Chat      Async
  Storage   Embeddings   Memory     Tasks
- Docker & Docker Compose
  - Docker Desktop or Docker Engine
  - Docker Compose v2+
- Ollama (running on host)

  # Install Ollama
  curl https://ollama.ai/install.sh | sh

  # Pull required models
  ollama pull gemma3:4b          # LLM for generation
  ollama pull nomic-embed-text   # Embeddings

- Python 3.13+ (for local development/testing)

  # Install uv package manager
  curl -LsSf https://astral.sh/uv/install.sh | sh
cd /path/to/rag-docling
# Create secrets directory
mkdir -p secrets
# Add Ollama configuration if needed
echo "OLLAMA_HOST=http://host.docker.internal:11434" > secrets/ollama_config.env
docker compose up -d
This will start:
- RAG Server API on http://localhost:8001
- ChromaDB (internal)
- Redis (internal)
- Celery Worker (internal)
# Check health
curl http://localhost:8001/health
# Check models
curl http://localhost:8001/models/info
# Upload single file
curl -X POST http://localhost:8001/upload \
-F "files=@/path/to/document.pdf"
# Upload multiple files
curl -X POST http://localhost:8001/upload \
-F "files=@document1.pdf" \
-F "files=@document2.docx"
# Response includes batch_id for tracking
{
"status": "queued",
"batch_id": "abc-123",
"tasks": [
{"task_id": "task-1", "filename": "document1.pdf"}
]
}
# Get batch status
curl http://localhost:8001/tasks/{batch_id}/status
# Simple query (auto-generates session_id)
curl -X POST http://localhost:8001/query \
-H "Content-Type: application/json" \
-d '{
"query": "What is the main topic?",
"n_results": 5
}'
# Conversational query with session
curl -X POST http://localhost:8001/query \
-H "Content-Type: application/json" \
-d '{
"query": "Tell me more about that",
"session_id": "user-123",
"n_results": 5
}'
curl http://localhost:8001/documents
curl -X DELETE http://localhost:8001/documents/{document_id}
# Get conversation history
curl http://localhost:8001/chat/history/{session_id}
# Clear chat history
curl -X POST http://localhost:8001/chat/clear \
-H "Content-Type: application/json" \
-d '{"session_id": "user-123"}'
cd services/rag_server
uv sync
.venv/bin/pytest -v
rag-docling/
├── docker-compose.yml # Service orchestration
├── secrets/ # Configuration secrets
├── services/
│ └── rag_server/ # RAG API backend
│ ├── main.py # FastAPI app
│ ├── celery_app.py # Celery configuration
│ ├── tasks.py # Async document processing
│ ├── core_logic/ # RAG components
│ │ ├── embeddings.py # OllamaEmbedding
│ │ ├── document_processor.py # Docling + Contextual Retrieval
│ │ ├── chroma_manager.py # VectorStoreIndex
│ │ ├── llm_handler.py # LLM + prompts
│ │ ├── rag_pipeline.py # Query pipeline
│ │ ├── hybrid_retriever.py # BM25 + Vector + RRF
│ │ ├── chat_memory.py # Redis-backed memory
│ │ └── progress_tracker.py # Upload progress
│ ├── tests/ # Test suite (33 core + 27 evaluation)
│ └── pyproject.toml # Dependencies
├── docs/ # Documentation
│ ├── PHASE1_IMPLEMENTATION_SUMMARY.md
│ ├── PHASE2_IMPLEMENTATION_SUMMARY.md
│ ├── RAG_ACCURACY_IMPROVEMENT_PLAN_2025.md
│ ├── CONVERSATIONAL_RAG.md
│ └── evaluation/
└── README.md
Upload documents for indexing (async via Celery).
Request:
curl -X POST http://localhost:8001/upload \
-F "files=@document.pdf"
Response:
{
"status": "queued",
"batch_id": "abc-123",
"tasks": [
{
"task_id": "task-1",
"filename": "document.pdf"
}
]
}
Supported Formats: .txt, .md, .pdf, .docx, .pptx, .xlsx, .html, .htm, .asciidoc, .adoc
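For reference, a hedged Python equivalent of the curl upload above, using the requests library with the documented multipart field name (files); the file name and content type are just examples:

import requests

# Upload a single document; the server queues it for async processing.
with open("document.pdf", "rb") as handle:
    response = requests.post(
        "http://localhost:8001/upload",
        files=[("files", ("document.pdf", handle, "application/pdf"))],
    )
batch = response.json()
print(batch["status"], batch["batch_id"])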
Get upload batch progress.
Response:
{
"batch_id": "abc-123",
"total": 2,
"completed": 1,
"total_chunks": 25,
"completed_chunks": 10,
"tasks": {
"task-1": {
"status": "completed",
"filename": "document.pdf",
"chunks": 5
},
"task-2": {
"status": "processing",
"filename": "document2.pdf"
}
}
}
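A hedged example of polling this endpoint from Python until the batch finishes (simplified: it treats completed >= total as done and does not distinguish failed tasks):

import time
import requests

BASE_URL = "http://localhost:8001"

def wait_for_batch(batch_id: str, poll_seconds: float = 2.0) -> dict:
    """Poll the batch status endpoint until every file has been processed."""
    while True:
        status = requests.get(f"{BASE_URL}/tasks/{batch_id}/status").json()
        print(f"{status['completed']}/{status['total']} files processed")
        if status["completed"] >= status["total"]:
            return status
        time.sleep(poll_seconds)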
Search documents and get AI-generated answer.
Request:
{
"query": "What is the main topic?",
"session_id": "user-123",
"n_results": 5
}
Response:
{
"answer": "Based on the documents...",
"sources": [
{
"document_name": "file.pdf",
"excerpt": "The main topic is...",
"full_text": "...",
"path": "/docs/file.pdf",
"distance": 0.15
}
],
"query": "What is the main topic?",
"session_id": "user-123"
}
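The same query can be issued from Python; a minimal example with the requests library against the documented /query endpoint:

import requests

response = requests.post(
    "http://localhost:8001/query",
    json={
        "query": "What is the main topic?",
        "session_id": "user-123",  # optional; omit to auto-generate one
        "n_results": 5,
    },
)
result = response.json()
print(result["answer"])
for source in result["sources"]:
    print(f"- {source['document_name']}: {source['excerpt']}")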
List all indexed documents (grouped by document_id).
Response:
{
"documents": [
{
"id": "doc-123",
"file_name": "document.pdf",
"file_type": ".pdf",
"path": "/docs",
"chunks": 5,
"file_size_bytes": 102400
}
]
}
Delete a document and all its chunks.
Response:
{
"status": "success",
"message": "Document deleted successfully",
"deleted_chunks": 5
}
Get conversation history for a session.
Response:
{
"session_id": "user-123",
"messages": [
{
"role": "user",
"content": "What is the main topic?"
},
{
"role": "assistant",
"content": "Based on the documents..."
}
]
}
Clear chat history for a session.
Request:
{
"session_id": "user-123"
}
Health check endpoint.
Get current model configuration.
Response:
{
"llm_model": "gemma3:4b",
"embedding_model": "nomic-embed-text:latest",
"reranker_model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
"ollama_url": "http://host.docker.internal:11434",
"enable_reranker": true,
"enable_hybrid_search": true,
"enable_contextual_retrieval": true
}
RAG Server (docker-compose.yml):
Core Settings:
- CHROMADB_URL: ChromaDB endpoint (default: http://chromadb:8000)
- OLLAMA_URL: Ollama endpoint (default: http://host.docker.internal:11434)
- EMBEDDING_MODEL: Embedding model (default: nomic-embed-text:latest)
- LLM_MODEL: LLM model (default: gemma3:4b)
- REDIS_URL: Redis endpoint (default: redis://redis:6379/0)
Retrieval Configuration:
- RETRIEVAL_TOP_K: Number of nodes to retrieve before reranking (default: 10)
- ENABLE_RERANKER: Enable cross-encoder reranking (default: true)
- RERANKER_MODEL: Reranker model (default: cross-encoder/ms-marco-MiniLM-L-6-v2)
Phase 2 Features:
- ENABLE_HYBRID_SEARCH: Enable BM25 + Vector search (default: true)
- RRF_K: Reciprocal Rank Fusion k parameter (default: 60)
- ENABLE_CONTEXTUAL_RETRIEVAL: Enable Anthropic contextual retrieval (default: true)
Logging:
- LOG_LEVEL: Logging level (default: DEBUG for rag-server, INFO for celery-worker)
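As an illustrative sketch only (variable names and defaults mirror the list above; the actual configuration loading lives in the rag_server code), these settings could be read in Python like this:

import os

# Hedged sketch: names and defaults taken from the documentation above.
CHROMADB_URL = os.getenv("CHROMADB_URL", "http://chromadb:8000")
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "nomic-embed-text:latest")
LLM_MODEL = os.getenv("LLM_MODEL", "gemma3:4b")
REDIS_URL = os.getenv("REDIS_URL", "redis://redis:6379/0")
RETRIEVAL_TOP_K = int(os.getenv("RETRIEVAL_TOP_K", "10"))
RRF_K = int(os.getenv("RRF_K", "60"))
ENABLE_RERANKER = os.getenv("ENABLE_RERANKER", "true").lower() == "true"
ENABLE_HYBRID_SEARCH = os.getenv("ENABLE_HYBRID_SEARCH", "true").lower() == "true"
ENABLE_CONTEXTUAL_RETRIEVAL = os.getenv("ENABLE_CONTEXTUAL_RETRIEVAL", "true").lower() == "true"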
Docker networks:
- public: RAG server (exposed to host on port 8001)
- private: ChromaDB, Redis, Celery Worker (internal only)
Combines sparse (keyword) and dense (semantic) retrieval for a 48% improvement in retrieval quality:
- BM25 Retriever: Excels at exact keywords, IDs, names, abbreviations
- Vector Retriever: Excels at semantic understanding, contextual meaning
- RRF Fusion: Reciprocal Rank Fusion with k=60 (optimal per research)
  - Formula: score = 1/(rank + k)
  - No hyperparameter tuning required
How it works:
- Query runs through both BM25 and Vector retrievers
- Results merged using RRF (k=60)
- Top-k combined results passed to reranker
- Final top-n results returned as context
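As a hedged illustration (not the project's hybrid_retriever.py, which builds on LlamaIndex retrievers), the RRF merge step boils down to a few lines:

from collections import defaultdict

def rrf_merge(bm25_ranked, vector_ranked, k=60, top_k=10):
    """Merge two ranked lists of node IDs with Reciprocal Rank Fusion.

    Each input list is ordered best-first; a node's score is the sum of
    1/(rank + k) across retrievers, then results are re-sorted. Sketch only.
    """
    scores = defaultdict(float)
    for ranked in (bm25_ranked, vector_ranked):
        for rank, node_id in enumerate(ranked, start=1):
            scores[node_id] += 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: "doc-a" ranks near the top of both lists, so it wins.
print(rrf_merge(["doc-a", "doc-b", "doc-c"], ["doc-a", "doc-c", "doc-d"]))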
Auto-initialization:
- BM25 index pre-loads at startup if documents exist
- Auto-refreshes after document uploads/deletions
Adds document-level context to chunks before embedding, for a 49% reduction in retrieval failures:
The Problem:
Original Chunk: "The three qualities are: natural aptitude, deep interest, and scope."
Query: "What makes great work?"
Result: ❌ MISSED (no direct term match)
The Solution:
Enhanced Chunk: "This section from Paul Graham's essay 'How to Do Great Work'
discusses the essential qualities for great work. The three qualities are:
natural aptitude, deep interest, and scope."
Implementation:
- For each chunk, LLM generates 1-2 sentence context
- Context prepended to original chunk text
- Enhanced chunk embedded (context embedded once at indexing time)
- Query-time: Zero overhead (context already embedded)
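A hedged sketch of the enhancement step (the real logic lives in document_processor.py; the prompt wording and LLM call shown here are assumptions):

CONTEXT_PROMPT = (
    "Here is a document:\n{document}\n\n"
    "Here is a chunk from that document:\n{chunk}\n\n"
    "Write 1-2 sentences situating this chunk within the overall document."
)

def contextualize_chunk(llm, document_text: str, chunk_text: str) -> str:
    """Prepend an LLM-generated document-level context to a chunk.

    The enhanced text is what gets embedded at indexing time, so queries
    pay no extra cost. Illustrative sketch only.
    """
    context = llm.complete(
        CONTEXT_PROMPT.format(document=document_text, chunk=chunk_text)
    ).text.strip()
    return f"{context}\n\n{chunk_text}"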
Performance:
- 49% reduction in retrieval failures
- 67% reduction when combined with reranking
See docs/PHASE2_IMPLEMENTATION_SUMMARY.md for complete details.
Issue: Document upload fails with Pydantic validation error if DoclingReader export format is not specified.
Root Cause: DoclingReader defaults to MARKDOWN export, but DoclingNodeParser requires JSON format.
Fix: Always use JSON export in document_processor.py:

reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
Issue: ChromaDB rejects complex metadata types (lists, dicts) from Docling.
Fix: Filter metadata to flat types (str, int, float, bool, None) using the clean_metadata_for_chroma() function before insertion.
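A minimal sketch of what such a filter might look like (the actual helper in the repo may handle more cases):

def clean_metadata_for_chroma(metadata: dict) -> dict:
    """Keep only flat metadata values that ChromaDB accepts.

    Sketch of the idea: lists, dicts, and other complex Docling values
    are dropped before the node is inserted into the collection.
    """
    allowed = (str, int, float, bool)
    return {
        key: value
        for key, value in metadata.items()
        if value is None or isinstance(value, allowed)
    }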
Issue: Custom hybrid retriever integration with CondensePlusContextChatEngine failing.
Fix: Pass the retriever directly to CondensePlusContextChatEngine.from_defaults():
chat_engine = CondensePlusContextChatEngine.from_defaults(
retriever=retriever, # Not query_engine
memory=memory,
node_postprocessors=create_reranker_postprocessors(),
...
)
# Verify Ollama is running
curl http://localhost:11434/api/tags
# Check models are available
ollama list
# Pull missing models
ollama pull gemma3:4b
ollama pull nomic-embed-text
# Check ChromaDB logs
docker compose logs chromadb
# Restart ChromaDB
docker compose restart chromadb
# All services
docker compose logs -f
# Specific service
docker compose logs -f rag-server
docker compose logs -f celery-worker
docker compose logs -f redis
# Stop services
docker compose down
# Remove ChromaDB volume
docker volume rm rag-docling_chroma_db_data
# Restart
docker compose up -d
Automated backup scripts are provided to protect against data loss:
# Manual backup to default location (./backups/chromadb/)
./scripts/backup_chromadb.sh
# Schedule daily backups at 2 AM (add to crontab)
crontab -e
# Add: 0 2 * * * cd /path/to/rag-docling && ./scripts/backup_chromadb.sh >> /var/log/chromadb_backup.log 2>&1
Features:
- Timestamped backups (chromadb_backup_YYYYMMDD_HHMMSS.tar.gz)
- 30-day retention (automatically removes old backups)
- Document count verification
- Health check after restore
# List available backups
ls -lh ./backups/chromadb/
# Restore from specific backup
./scripts/restore_chromadb.sh ./backups/chromadb/chromadb_backup_20251013_020000.tar.gz
Note: Restore process stops services, replaces data, and verifies health after restart.
See scripts/README.md for complete documentation.
The project follows Test-Driven Development (TDD) methodology:
- 60 total tests (all passing)
- 33 RAG server core tests (Docling processing, embeddings, LlamaIndex integration, LLM, pipeline, API)
- 27 evaluation tests (RAGAS metrics, dataset loading, report generation)
Run all tests:
# RAG Server core tests
cd services/rag_server && .venv/bin/pytest -v
# RAG Server evaluation tests
cd services/rag_server && .venv/bin/pytest tests/evaluation/ -v
- Upload: Documents uploaded via the /upload endpoint
- Async Processing: Celery worker processes each file
- Docling Parsing: DoclingReader extracts text, tables, structure
- Contextual Enhancement: LLM adds document context to each chunk
- Structural Chunking: DoclingNodeParser creates nodes preserving hierarchy
- Embedding: Each chunk embedded with context (nomic-embed-text)
- Storage: Nodes stored in ChromaDB with metadata
- BM25 Refresh: BM25 index updated with new nodes
- Query Received: User query + optional session_id
- Memory Loading: Previous conversation loaded from Redis
- Query Condensation: Standalone question created if conversational
- Hybrid Retrieval:
  - BM25 retriever finds keyword matches
  - Vector retriever finds semantic matches
  - RRF merges results (k=60)
- Reranking: Cross-encoder reranks the top-k nodes (see the sketch after this list)
- Top-n Selection: Best 5-10 nodes selected as context
- LLM Generation: Answer generated using context + memory
- Memory Update: Conversation saved to Redis (1-hour TTL)
- Response: Answer + sources returned
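For the reranking step, a hedged standalone sketch using the SentenceTransformers cross-encoder named in the tech stack (in the project itself this runs inside a LlamaIndex node postprocessor rather than as a free function):

from sentence_transformers import CrossEncoder

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    """Score (query, passage) pairs with the cross-encoder and keep the best."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, passage) for passage in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]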
- Document Structure Preservation: Maintains headings, sections, tables as separate nodes
- Hybrid Retrieval: BM25 (exact matching) + Vector (semantic understanding)
- Contextual Enhancement: Document context embedded with chunks
- Two-Stage Precision: Reranking refines hybrid search results
- Conversational Memory: Redis-backed chat history with session management
- Data Protection: Automated backups, startup persistence verification
- Async Processing: Celery handles document uploads in background
- Progress Tracking: Real-time upload progress via Redis
- Redis-backed chat memory (conversations persist across restarts)
- ChromaDB backup/restore automation
- Reranker optimization (top-n selection)
- Startup persistence verification
- Dependency updates (ChromaDB 1.1.1, FastAPI 0.118.3, Redis 6.4.0)
See docs/PHASE1_IMPLEMENTATION_SUMMARY.md for details.
- Hybrid search (BM25 + Vector + RRF) - 48% retrieval improvement
- Contextual retrieval (Anthropic method) - 49% fewer failures
- Auto-refresh BM25 after uploads/deletes
- End-to-end testing and validation
See docs/PHASE2_IMPLEMENTATION_SUMMARY.md for details.
- Parent document retrieval (sentence window)
- Query fusion (multi-query generation)
- DeepEval evaluation framework
- Expanded golden QA dataset (50+ test cases)
- Production monitoring dashboard
- Support for additional file formats (CSV, JSON)
- Multi-user support with authentication
- Export search results
See docs/RAG_ACCURACY_IMPROVEMENT_PLAN_2025.md for future plans.
MIT License
0.2.0 - Phase 2 Implementation