LLaMA.cpp Enhanced Docker Image for Modern GPUs

🎯 What is this?
This is an enhanced version of the official llama.cpp Docker image, specifically optimized for modern NVIDIA GPUs (RTX 30/40/50 series). It upgrades CUDA from 12.4.0 to 13.0.1 and adds RPC backend support for distributed processing.

πŸš€ Why use this instead of the official image?

  • Better RTX 40/50 series support with CUDA 13.0.1
  • RPC backend for distributed inference across multiple machines
  • Smaller, faster - only targets modern GPU architectures (no legacy bloat)
  • Same functionality as the official ghcr.io/ggml-org/llama.cpp:full-cuda image, just enhanced

πŸ“¦ Ready to use - No building required!
Available on Docker Hub: philglod/llamacpp-cuda13-modern-full:latest

πŸš€ Quick Start (Most Users Start Here!)

What You Need

  • An NVIDIA RTX 30/40/50 series GPU (compute capability 8.6+)
  • Docker with GPU support and the NVIDIA Container Toolkit (see System Requirements below)

Get Started in 2 Minutes

1. Pull the image:

docker pull philglod/llamacpp-cuda13-modern-full:latest

2. Test it works:

docker run --rm --gpus all philglod/llamacpp-cuda13-modern-full:latest --server --help | grep -i cuda

You should see your GPU detected like: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9

3. Start using it! Choose what you want to do:

🌐 Run a Web Server

docker run --rm --gpus all -p 8080:8080 \
  philglod/llamacpp-cuda13-modern-full:latest \
  --server --host 0.0.0.0 --port 8080

Then visit http://localhost:8080 for the web interface!
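Once a server is running with a model loaded, you can also check it from the command line. A minimal check, assuming the standard upstream llama.cpp server endpoints (such as /health) are exposed, which is the default behavior:

# Quick liveness check against the server's built-in health endpoint
curl http://localhost:8080/health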

πŸ“₯ Download & Convert a Model from HuggingFace

mkdir ./models
docker run --rm --gpus all -v $(pwd)/models:/models \
  philglod/llamacpp-cuda13-modern-full:latest \
  --convert --hf-repo microsoft/Phi-3-mini-4k-instruct --outtype f16
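After conversion you will usually want a quantized GGUF to cut VRAM use. A rough sketch, assuming this image exposes the same --quantize tool as the official full-cuda image and that conversion produced the f16 file named below (adjust filenames to match your output):

# Quantize the converted f16 model to Q4_K_M (filenames are assumptions - check your models/ directory)
docker run --rm --gpus all -v $(pwd)/models:/models \
  philglod/llamacpp-cuda13-modern-full:latest \
  --quantize /models/Phi-3-mini-4k-instruct-f16.gguf /models/Phi-3-mini-4k-instruct-Q4_K_M.gguf Q4_K_M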

πŸš€ Run a Complete AI Server

# After converting a model (like above), run a full server:
docker run -d --name my-ai-server --gpus all -p 8080:8080 -v $(pwd)/models:/models \
  philglod/llamacpp-cuda13-modern-full:latest \
  --server --host 0.0.0.0 --port 8080 \
  --model /models/Phi-3-mini-4k-instruct-f16.gguf \
  --ctx-size 4096 --n-gpu-layers 999

Access web UI at http://localhost:8080 or API at http://localhost:8080/v1/chat/completions
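The API follows the OpenAI-compatible schema that upstream llama-server exposes, so a request looks roughly like this (no model field is needed when the server hosts a single model):

# Example chat completion request against the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'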

πŸ’‘ What Can This Do?

This image includes everything you need for AI model work:

  • 🌐 Web Server - Run models with a web interface
  • πŸ”„ Model Conversion - Convert HuggingFace models to llama.cpp format
  • πŸ“Š Benchmarking - Test your GPU performance
  • πŸ’¬ Interactive Chat - Talk to models directly
  • πŸ”§ All Tools - Complete llama.cpp toolkit included

🎯 Who Should Use This?

βœ… Perfect for you if:

  • You have an RTX 30/40/50 series GPU
  • You want the latest CUDA performance improvements
  • You need RPC support for distributed setups
  • You want a ready-to-use solution (no building required)

❌ Not for you if:

  • You have an older GPU (GTX 10 series, RTX 20 series, Tesla K80, etc.)
  • You need to customize the build extensively
  • You're fine with the official CUDA 12.4.0 images

πŸ”„ Alternative: Use Official Images

For older GPUs or standard setups: ghcr.io/ggml-org/llama.cpp:full-cuda

πŸ“‹ More Usage Examples

Interactive Chat with a Model

docker run --rm -it --gpus all -v $(pwd)/models:/models \
  philglod/llamacpp-cuda13-modern-full:latest \
  --run -m /models/your-model.gguf -p "Hello, how are you?"
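For a back-and-forth chat rather than a single prompt, upstream llama-cli has a conversation mode flag; this sketch assumes the --run wrapper forwards extra flags to llama-cli unchanged:

# Interactive multi-turn chat (the -cnv flag comes from upstream llama-cli)
docker run --rm -it --gpus all -v $(pwd)/models:/models \
  philglod/llamacpp-cuda13-modern-full:latest \
  --run -m /models/your-model.gguf -cnv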

Benchmark Your GPU

docker run --rm --gpus all -v $(pwd)/models:/models \
  philglod/llamacpp-cuda13-modern-full:latest \
  --bench -m /models/your-model.gguf

Convert Your Own Model

docker run --rm --gpus all -v $(pwd)/my-model:/input -v $(pwd)/converted:/output \
  philglod/llamacpp-cuda13-modern-full:latest \
  --convert --outtype f16 /input/ --output-dir /output/
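Distributed Inference with the RPC Backend (sketch)

Since this build enables GGML_RPC, a model can be split across machines. This is only a rough sketch: it assumes the image contains the upstream rpc-server binary at /app/rpc-server (path not verified), that llama-server accepts the upstream --rpc flag, and that 192.168.1.50:50052 is a placeholder address for a second GPU machine.

# On each worker machine: start an RPC worker (binary path is an assumption - verify inside the image)
docker run -d --gpus all -p 50052:50052 --entrypoint /app/rpc-server \
  philglod/llamacpp-cuda13-modern-full:latest --host 0.0.0.0 --port 50052

# On the main machine: point the server at the worker(s) with --rpc
docker run -d --gpus all -p 8080:8080 -v $(pwd)/models:/models \
  philglod/llamacpp-cuda13-modern-full:latest \
  --server --host 0.0.0.0 --port 8080 \
  --model /models/your-model.gguf --n-gpu-layers 999 \
  --rpc 192.168.1.50:50052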

πŸ”§ GPU Compatibility

βœ… Supported (Modern GPUs Only)

Series    Examples                    CUDA Compute Capability
RTX 30    3060, 3070, 3080, 3090      8.6
RTX 40    4060, 4070, 4080, 4090      8.9
RTX 50    5090, etc.                  9.0

❌ Not Supported (Use Official Images Instead)

  • GTX 10 series (Pascal) and RTX 20 series (Turing)
  • Tesla K80, P100, V100 (older data center GPUs)
  • Any GPU with compute capability below 8.6
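Not sure which compute capability your card has? On the host, a recent NVIDIA driver can report it directly (the compute_cap query field is not available in very old nvidia-smi versions):

# Print GPU name and compute capability
nvidia-smi --query-gpu=name,compute_cap --format=csv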

πŸ—οΈ For Developers: Building from Source

Most users don't need this section - follow it only if you want to customize the build.

Prerequisites

  • Docker with GPU support
  • Git
  • This repository cloned locally

Build Process

# Fetch the llama.cpp source (tracked as a git submodule)
git submodule update --init --recursive

# Build the image
docker build -t my-custom-llamacpp:latest --target full -f docker/cuda-13.0.1-custom.Dockerfile .

# Test it
docker run --rm --gpus all my-custom-llamacpp:latest --help
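To double-check that the RPC backend made it into your build, you can list the bundled binaries; the /app path is an assumption based on the layout of the official full image:

# List the bundled binaries (look for rpc-server)
docker run --rm --entrypoint ls my-custom-llamacpp:latest /app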

Publishing Your Own Version

# Tag for Docker Hub
docker tag my-custom-llamacpp:latest YOUR_USERNAME/llamacpp-custom:latest

# Push to Docker Hub
docker login
docker push YOUR_USERNAME/llamacpp-custom:latest
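If you publish updates regularly, it helps to push a versioned tag alongside latest so users can pin a known-good build (the tag name below is just an example):

# Push a versioned tag in addition to :latest
docker tag my-custom-llamacpp:latest YOUR_USERNAME/llamacpp-custom:cuda13.0.1
docker push YOUR_USERNAME/llamacpp-custom:cuda13.0.1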

πŸ” Technical Details

Custom CMake Configuration

Built with optimized flags for modern GPUs:

-DGGML_CUDA=ON                    # CUDA support
-DGGML_FORCE_CUBLAS=ON            # Force cuBLAS usage
-DGGML_RPC=ON                     # RPC backend support
-DCMAKE_CUDA_ARCHITECTURES="86;89;90"  # Modern GPUs only
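For reference, the same configuration outside Docker would look roughly like this - a sketch assuming a local llama.cpp checkout with the CUDA 13 toolkit installed, mirroring the flags above:

# Configure and build llama.cpp locally with the same flags
cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_FORCE_CUBLAS=ON \
  -DGGML_RPC=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86;89;90"
cmake --build build --config Release -j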

Docker Hub Information

  • Image: philglod/llamacpp-cuda13-modern-full:latest

System Requirements

  • NVIDIA GPU with compute capability 8.6+
  • NVIDIA Container Toolkit installed (setup sketch after this list)
  • Docker with GPU support enabled
  • Sufficient VRAM for your target models
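If the NVIDIA Container Toolkit is not installed yet, the usual steps look roughly like this on Debian/Ubuntu hosts - a sketch that assumes NVIDIA's apt repository is already configured (see NVIDIA's install guide for the repository setup):

# Install the toolkit, register it with Docker, and restart the daemon
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker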

πŸ†˜ Troubleshooting

GPU Not Detected?

# Check if NVIDIA Container Toolkit is working:
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi

Image Won't Start?

Make sure you're passing the --gpus all flag and have a compatible GPU (RTX 30/40/50 series).

Performance Issues?

This image is optimized for modern GPUs. For older GPUs, use the official images instead.

πŸ“œ License & Credits

Based on the official llama.cpp project. See the llama.cpp repository for licensing terms.

Special thanks to the llama.cpp team for the excellent foundation this build enhances.
