A simulator for exploring different batching strategies and load patterns in LLM inference.
# Install uv package manager
pip install uv
# Install dependencies
uv sync
In this simulator:
- A tick is the basic unit of time
prefill_time=2
means the prefill phase takes 2 ticksitl=1
(Inter-Token Latency) means generating each token takes 1 tick- Metrics are often reported per 1000 ticks for easier comparison
- Example:
460./1000.
request rate means 460 requests per 1000 ticks
Each example demonstrates different aspects of the simulator:
# Basic examples with simple configurations
uv run examples/batch_duration_demo.py
# Detailed metrics visualization
uv run examples/metrics_visualization.py
# Advanced batching strategies comparison
uv run examples/batching_strategies.py
# Queue growth analysis for long runs
uv run examples/queue_growth.py
- Multiple batching strategies (Static, In-Flight, Chunked Context)
- Various load generation patterns (Batch, Concurrent, Request Rate)
- Rich metrics visualization
- Configurable batch sizes and request parameters
- Queue growth analysis for long-running simulations
Basic batching strategy that only batches requests when all slots are empty.
# Configuration
engine = sim.Engine(
max_batch_size=4, # Maximum 4 requests in a batch
load_generator=BatchLoadGenerator(
initial_batch=100, # Send 100 requests at start
prefill_time=2, # Each prefill takes 2 ticks
itl=1, # Each token generation takes 1 tick
target_output_len_tokens=10 # Generate 10 tokens per request
),
batcher=StaticBatcher()
)
Performance:
Average E2E Latency: 58.16
Average TTFT: 52.80
Average ITL: 1.00
Requests/(1K ticks)/instance = 190.00
Tokens/(1K ticks)/instance = 1680.00
Allows mixing prefill and decode phases in the same batch.
# Configuration
engine = sim.Engine(
max_batch_size=4,
load_generator=BatchLoadGenerator(
initial_batch=100,
prefill_time=2,
itl=1,
target_output_len_tokens=10
),
batcher=IFBatcher()
)
Performance:
Average E2E Latency: 58.44
Average TTFT: 52.90
Average ITL: 1.39
Requests/(1K ticks)/instance = 267.33 # 41% improvement over Static
Tokens/(1K ticks)/instance = 2376.24
Optimizes performance by separating prefill into chunks.
# Configuration
load_generator = BatchLoadGenerator(
initial_batch=100,
prefill_time=2,
itl=1,
target_output_len_tokens=10,
total_prefill_chunks=2 # Split prefill into 2 chunks
)
engine = sim.Engine(
max_batch_size=4,
load_generator=load_generator,
batcher=IFBatcher()
)
Performance:
Average E2E Latency: 57.42
Average TTFT: 54.51
Average ITL: 1.14
Requests/(1K ticks)/instance = 310.00 # 15% improvement over basic IFB
Tokens/(1K ticks)/instance = 2730.00
Limits to one prefill request at a time for balanced compute/memory usage.
# Configuration
engine = sim.Engine(
max_batch_size=4,
load_generator=load_generator,
batcher=IFBatcherWithOnePrefillOnly()
)
Performance:
Average E2E Latency: 55.94
Average TTFT: 52.13
Average ITL: 1.00
Requests/(1K ticks)/instance = 360.00 # Best throughput
Tokens/(1K ticks)/instance = 3170.00
Maintains a target level of concurrent requests.
# Configuration
load_generator = ConcurrentLoadGenerator(
target_concurrency=6, # Maintain 6 concurrent requests
target_output_len_tokens=10,
total_prefill_chunks=2,
prefill_time=2,
itl=1
)
Performance:
Average E2E Latency: 15.14
Average TTFT: 7.87
Average ITL: 1.00
Requests/(1K ticks)/instance = 360.00
Tokens/(1K ticks)/instance = 3170.00
Generates requests at a constant rate.
# Configuration
load_generator = RequestRateLoadGenerator(
request_rate=460./1000., # 460 requests per 1000 ticks
target_output_len_tokens=10,
total_prefill_chunks=2,
prefill_time=2,
itl=1
)
Performance:
Average E2E Latency: 17.66
Average TTFT: 11.03
Average ITL: 1.00
Requests/(1K ticks)/instance = 350.00
Tokens/(1K ticks)/instance = 3060.00
Compare performance between short (100 ticks) and long (10000 ticks) runs:
Request Rate Load Generator (460 requests/1000 ticks)
ββββββββββββββββββββ³βββββββββββββ³ββββββββββββββ³βββββββββββββ
β Metric β 100 ticks β 10000 ticks β Difference β
β‘βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ©
β Final Queue Size β 6 β 1138 β 1132 β
β Average TTFT β 11.03 β 1245.77 β 1234.75 β
β Average E2E β 17.66 β 1253.78 β 1236.12 β
ββββββββββββββββββββ΄βββββββββββββ΄ββββββββββββββ΄βββββββββββββ
Concurrent Load Generator (6 concurrent requests)
ββββββββββββββββββββ³βββββββββββββ³ββββββββββββββ³βββββββββββββ
β Metric β 100 ticks β 10000 ticks β Difference β
β‘βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ©
β Final Queue Size β 2 β 2 β 0 β
β Average TTFT β 7.87 β 8.61 β 0.74 β
β Average E2E β 15.14 β 17.32 β 2.19 β
ββββββββββββββββββββ΄βββββββββββββ΄ββββββββββββββ΄βββββββββββββ
Key observations:
- Request Rate generator shows significant queue growth over time
- Concurrent Load generator maintains stable queue size and latencies
- TTFT and E2E latency increase dramatically with queue growth
- One Prefill Per Batch strategy achieves best throughput (3170 tokens/1K ticks)
- IFB improves throughput by 41% over Static Batching
- Chunked Context further improves throughput by 15% over basic IFB
- E2E Latency: End-to-end latency for request completion (in ticks)
- TTFT: Time to first token (in ticks)
- ITL: Inter-token latency (ticks between tokens)
- Throughput: Requests and tokens processed per 1K ticks per instance
- Queue Size: Number of requests waiting to be processed
Note: This simulator currently models the core disaggregated inference flow of NVIDIA Dynamo (separate prefill and decode workers, global prefill queue, and request flow through both phases). However, it does not yet implement advanced features such as conditional disaggregation (router logic), KV transfer simulation, prefix cache hit logic, dynamic worker scaling, or distributed infrastructure (NATS/ETCD). These are planned for future development.
- Add conditional disaggregation (router) to decide if prefill is done locally or remotely
- Simulate KV cache transfer time and bandwidth between prefill and decode phases
- Implement prefix cache hit logic to sometimes skip or reduce prefill phase
- Allow dynamic scaling (add/remove) of prefill and decode workers during simulation
- (Advanced) Simulate KV descriptor/layout management and block merging
- (Advanced) Model distributed infrastructure (NATS/ETCD) for realistic coordination and scaling