This project uses a combination of [Uvicorn](https://www.uvicorn.org/), [FastAPI](https://fastapi.tiangolo.com/) (Python), and [Docker](https://www.docker.com/) to provide a reliable REST API for testing [Microsoft's BitNet inference framework](https://github.com/microsoft/BitNet) locally, specifically their [BitNet b1.58 2B4T](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T) model!
It supports running the inference framework, benchmarking BitNet models, and calculating BitNet model perplexity values.
Built with FastAPI and Docker, the API manages and interacts with `llama.cpp`-based BitNet model instances, allowing developers and researchers to programmatically control `llama-cli` processes for automated testing, benchmarking, and interactive chat sessions.
It serves as a backend replacement for the [Electron-BitNet](https://github.com/grctest/Electron-BitNet) project, offering enhanced performance, scalability, and persistent chat sessions.
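
As a rough illustration of what driving the API from a test script could look like, here is a minimal sketch using the `requests` library. The endpoint paths and payload fields (`/sessions/start`, `session_id`, etc.) are placeholder assumptions, not the project's documented routes; the generated Swagger UI (typically served at `/docs` by FastAPI) lists the actual API.

```python
# Hypothetical usage sketch -- endpoint paths and payload/response fields
# below are illustrative assumptions, not the project's documented API.
import requests

BASE = "http://localhost:8000"  # assumed default Uvicorn host/port

# Start a persistent llama-cli chat session (hypothetical route and fields).
session = requests.post(f"{BASE}/sessions/start", json={
    "model": "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",  # example path
    "threads": 4,
}).json()
session_id = session["session_id"]  # assumed response field

# Send a prompt and read the cleaned model response (hypothetical route).
reply = requests.post(
    f"{BASE}/sessions/{session_id}/chat",
    json={"prompt": "What is 1-bit quantization?"},
).json()
print(reply["response"])  # assumed response field

# Shut the session down when finished (hypothetical route).
requests.post(f"{BASE}/sessions/{session_id}/stop")
```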
## Key Features
* **Session Management**: Start, stop, and check the status of multiple persistent `llama-cli` and `llama-server` chat sessions.
* **Batch Operations**: Initialize, shut down, and chat with multiple instances in a single API call.
* **Interactive Chat**: Send prompts to running BitNet sessions and receive cleaned model responses.
* **Model Benchmarking**: Programmatically run benchmarks and calculate perplexity on GGUF models.
* **Resource Estimation**: Estimate maximum server capacity from available system RAM and CPU threads (see the sketch after this list).
* **VS Code Integration**: Connects directly to GitHub Copilot Chat as a tool via the Model Context Protocol.
* **Automatic API Docs**: Interactive API documentation powered by Swagger UI and ReDoc.
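
To make the resource-estimation idea concrete, here is a minimal back-of-the-envelope sketch of how such a capacity estimate could be computed from free RAM and CPU threads, using the `psutil` library for illustration. The per-instance footprint and thread count are assumed constants, not values taken from the project, which may weigh model size, context length, and other factors.

```python
# Back-of-the-envelope capacity estimate -- the per-instance constants here
# are illustrative assumptions, not values used by the project.
import os
import psutil

PER_INSTANCE_RAM = 1.5 * 1024**3  # assume ~1.5 GiB per 2B 1-bit model instance
THREADS_PER_INSTANCE = 2          # assume 2 CPU threads per llama-cli process

def estimate_max_instances() -> int:
    """Return how many instances fit in free RAM and available CPU threads."""
    ram_limit = int(psutil.virtual_memory().available // PER_INSTANCE_RAM)
    cpu_limit = (os.cpu_count() or 1) // THREADS_PER_INSTANCE
    return max(0, min(ram_limit, cpu_limit))

print(f"Estimated capacity: {estimate_max_instances()} concurrent instances")
```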