
Conversation

@noobpwnftw commented Sep 11, 2025

Purpose

Avoids busy polling and removes wake-up latency from sleeps.

Test Plan

None

Test Result

Works


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request replaces a busy-polling mechanism with a more efficient event-driven approach using ZMQ for local shared memory communication. This is a good improvement to reduce CPU usage when idle. The implementation is mostly correct, but there is an issue in acquire_read where the poll timeout does not respect the function's overall timeout parameter, which could lead to unnecessary delays.
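For illustration, a minimal sketch of the kind of fix the bot is pointing at, assuming a hypothetical acquire_read that waits on a zmq.Poller (this is not the PR's actual code): the poll interval is derived from the remaining budget so it never exceeds the caller's overall timeout.

```python
import time
import zmq

def acquire_read(poller: zmq.Poller, timeout: float | None = None):
    """Hypothetical sketch: block for a wake event without exceeding `timeout` seconds."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        if deadline is None:
            poll_ms = None  # no overall deadline: block until an event arrives
        else:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return []  # timed out before anything became available
            poll_ms = int(remaining * 1000) + 1  # never outlive the caller's deadline
        events = poller.poll(poll_ms)
        if events:
            return events
```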

@noobpwnftw force-pushed the zmq_tick branch 5 times, most recently from b8b5133 to 8a087e6, on September 12, 2025 at 11:37
@noobpwnftw (Author)

This is a better fix for #16226.

@chaunceyjiang @p12tic

@chaunceyjiang (Collaborator) left a comment

Could you provide a benchmark test?
It seems that this PR causes performance degradation under high throughput conditions.

@noobpwnftw (Author)

> Could you provide a benchmark test? It seems that this PR causes performance degradation under high throughput conditions.

I don’t think that’s the case.

First, this code path is not hot — it runs once per worker schedule/output in MultiprocExecutor. The busy-poll existed only for local IPC (e.g. TP>1 on the same node). For remote sockets it’s already orders of magnitude slower, yet that’s not raised as a concern.

Second, the original code falls back to ZMQ for payloads larger than the shm buffer. In practice (e.g. a 16-GPU node), small chunks go through shm to avoid 16 redundant copies, while large chunks are still copied wholesale if they exceed VLLM_MQ_MAX_CHUNK_BYTES_MB=16. The supposed “optimization” is either negligible or backwards — and in any case, such behavior is unchanged.

Third, under true high-throughput conditions: if shm is ready and the thread wins the ring-buffer race, there’s zero extra polling. The only cost is a single pipe read/write per call, ~20 µs. That’s not even comparable to sched_yield(), let alone the actual overheads of inference and device communication.

Lastly, the old approach injected up to 100 ms latency during idle → active transitions due to time.sleep. This PR removes that cliff entirely.

Given all this, I’m not seeing how “performance degradation” applies here. Could you clarify the specific workload or measurements where you observed it? Without that context, the evidence points to this change improving latency without impacting throughput.
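For readers comparing the two approaches, here is a deliberately simplified sketch (not vLLM's actual MessageQueue code): the old pattern spins briefly and then sleeps, which is where the idle-to-active latency cliff comes from, while the proposed pattern blocks on a ZMQ socket that the writer ticks after each shm handoff.

```python
import time
import zmq

def wait_busy_poll(shm_ready, spin_s=0.001, sleep_s=0.1):
    # Old pattern (simplified): spin on the shm readiness flag, then back off with
    # time.sleep; waking from that sleep is the latency cliff mentioned above.
    start = time.monotonic()
    while not shm_ready():
        if time.monotonic() - start > spin_s:
            time.sleep(sleep_s)

def wait_zmq_tick(sub: zmq.Socket, shm_ready):
    # Proposed pattern (simplified): block on a wake tick that the writer sends
    # after publishing to the shm ring buffer; no spinning, no sleep cliff.
    while not shm_ready():
        sub.recv()
```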

@noobpwnftw (Author)

For completeness: vLLM currently routes small items through SHM and large items through ZMQ — which is inverted from the more typical design (ZMQ for small control messages, SHM for large bulk transfers). My change doesn’t alter that; it only replaces busy-poll/sleep with a level-triggered wake. Any throughput concerns around very large broadcasts are therefore a separate threshold/transport discussion, orthogonal to this PR. That said, I believe you’d be better off with pure ZMQ IPC here, or at least flipping the policy so that small items use ZMQ and large payloads use SHM.
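To make the routing being described concrete, a rough sketch of the size-threshold policy (the environment variable name and 16 MB default come from the discussion above; the ring-buffer API here is hypothetical, and the real implementation differs in detail):

```python
import os
import pickle

MAX_INLINE_BYTES = int(os.getenv("VLLM_MQ_MAX_CHUNK_BYTES_MB", "16")) * 1024 * 1024

def enqueue(obj, shm_ring, zmq_pub):
    # Small payloads go through the shared-memory ring buffer so N local readers
    # avoid N redundant copies; oversized payloads fall back to the ZMQ socket.
    payload = pickle.dumps(obj)
    if len(payload) < MAX_INLINE_BYTES:
        shm_ring.write(payload)  # hypothetical ring-buffer API
    else:
        zmq_pub.send(payload)
```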

@noobpwnftw force-pushed the zmq_tick branch 3 times, most recently from 50fb107 to d1c72ad, on September 15, 2025 at 05:51
@chaunceyjiang (Collaborator)

> For completeness: vLLM currently routes small items through SHM and large items through ZMQ — which is inverted from the more typical design (ZMQ for small control messages, SHM for large bulk transfers). My change doesn’t alter that; it only replaces busy-poll/sleep with a level-triggered wake. Any throughput concerns around very large broadcasts are therefore a separate threshold/transport discussion, orthogonal to this PR. That said, however, I believe you’d be better off with pure ZMQ IPC here, or at least flipping the policy so that small items use ZMQ and large payloads use SHM.

Thank you for the clarification~

/cc @njhill PTAL.

@noobpwnftw force-pushed the zmq_tick branch 2 times, most recently from b7f63e7 to dedfbd9, on September 15, 2025 at 06:14
Collaborator

I remember that during execute_model, it calls broadcast_tensor_dict to broadcast the tensor. I’m not sure if this affects anything.

@noobpwnftw (Author)

The actual tensors are transferred via Torch C bindings, while the Python side only broadcasts metadata through SHM and later gathers outputs post-inference.

You raise a good point about potential impact on end-to-end latency. However, this path is not inside the per-token inference loop, and the added cost in Python is negligible (on the order of ~100 µs ≈ 0.1 ms, which is comparable to ~50 lines of Python). Meanwhile, CPU utilization drops from constant ~100% to ~8%.

Given that, I can’t construct a realistic scenario where this overhead becomes a bottleneck. For it to matter, scheduler–worker IPC would have to dominate performance over both device communication and Python interpreter overhead, which doesn’t seem plausible.
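A schematic of the split being described, assuming the usual torch.distributed pattern rather than vLLM's exact broadcast_tensor_dict code: metadata rides the Python-side message queue (the shm/ZMQ path this PR touches), while tensor payloads go through the collective backend. The message-queue API here is hypothetical.

```python
import torch
import torch.distributed as dist

def broadcast_tensor_dict_sketch(tensor_dict, mq, src=0):
    # Sender: broadcast keys/shapes/dtypes via the message queue, tensor bytes via NCCL.
    if dist.get_rank() == src:
        metadata = {k: (v.shape, v.dtype) for k, v in tensor_dict.items()}
        mq.broadcast(metadata)  # hypothetical message-queue API
        for tensor in tensor_dict.values():
            dist.broadcast(tensor, src=src)
        return tensor_dict
    # Receiver: read metadata from the queue, then receive each tensor collectively.
    metadata = mq.recv()  # hypothetical
    out = {}
    for key, (shape, dtype) in metadata.items():
        tensor = torch.empty(shape, dtype=dtype, device="cuda")
        dist.broadcast(tensor, src=src)
        out[key] = tensor
    return out
```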

@njhill (Member) left a comment

Thanks @noobpwnftw, I agree that we should do something along these lines. I had also thought it's kind of the inverse of how shm would usually be used.

The reason for using shm in this way is to minimize latency in the multi-worker case. It would be good to benchmark TP=8 with a relatively small model and low concurrency to see if there's any impact from this change. I will do that too if I get a chance.

Member

Why suppress the error here?

@noobpwnftw (Author)

Timing out before anything becomes available is returned as an error.

Member

No, it just returns zero in this case.

@njhill (Member) left a comment

I'm good with this if the benchmark shows it's no worse...

@njhill changed the title from "Reuse ZMQ to level trigger local ShmRingBuffer events." to "[Core] Reuse ZMQ to level trigger local ShmRingBuffer events." on Sep 17, 2025
@njhill (Member) commented Sep 18, 2025

@noobpwnftw I ran some benchmarks of a very low-latency case; unfortunately, this does perform quite a bit worse:

On 2x H100:

vllm serve meta-llama/Llama-3.2-1B-Instruct --disable-log-requests --uvicorn-log-level=error --tensor-parallel-size 2
vllm bench serve \
    --backend vllm \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --dataset-name random \
    --random-input-len 512 \
    --random-output-len 2000 \
    --ignore-eos \
    --num-prompts 12 \
    --max-concurrency 2 \
    --seed 42

Before:

============ Serving Benchmark Result ============
Successful requests:                     12        
Maximum request concurrency:             2         
Benchmark duration (s):                  23.64     
Total input tokens:                      6132      
Total generated tokens:                  24000     
Request throughput (req/s):              0.51      
Output token throughput (tok/s):         1015.39   
Peak output token throughput (tok/s):    1030.00   
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          1274.83   
---------------Time to First Token----------------
Mean TTFT (ms):                          23.95     
Median TTFT (ms):                        23.86     
P99 TTFT (ms):                           26.52     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.96      
Median TPOT (ms):                        1.96      
P99 TPOT (ms):                           1.96      
---------------Inter-token Latency----------------
Mean ITL (ms):                           1.96      
Median ITL (ms):                         1.94      
P99 ITL (ms):                            2.69      
==================================================

This PR:

============ Serving Benchmark Result ============
Successful requests:                     12        
Maximum request concurrency:             2         
Benchmark duration (s):                  30.52     
Total input tokens:                      6132      
Total generated tokens:                  24000     
Request throughput (req/s):              0.39      
Output token throughput (tok/s):         786.36    
Peak output token throughput (tok/s):    836.00    
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          987.27    
---------------Time to First Token----------------
Mean TTFT (ms):                          26.53     
Median TTFT (ms):                        25.41     
P99 TTFT (ms):                           32.97     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.53      
Median TPOT (ms):                        2.53      
P99 TPOT (ms):                           2.67      
---------------Inter-token Latency----------------
Mean ITL (ms):                           2.53      
Median ITL (ms):                         2.54      
P99 ITL (ms):                            2.95      
==================================================

With async scheduling enabled, the difference is much smaller but still noticeable:

Before:

============ Serving Benchmark Result ============
Successful requests:                     12        
Maximum request concurrency:             2         
Benchmark duration (s):                  21.50     
Total input tokens:                      6132      
Total generated tokens:                  24000     
Request throughput (req/s):              0.56      
Output token throughput (tok/s):         1116.40   
Peak output token throughput (tok/s):    1130.00   
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          1401.64   
---------------Time to First Token----------------
Mean TTFT (ms):                          17.83     
Median TTFT (ms):                        17.21     
P99 TTFT (ms):                           21.59     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.78      
Median TPOT (ms):                        1.78      
P99 TPOT (ms):                           1.79      
---------------Inter-token Latency----------------
Mean ITL (ms):                           1.78      
Median ITL (ms):                         1.77      
P99 ITL (ms):                            2.23      
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     12        
Maximum request concurrency:             2         
Benchmark duration (s):                  22.25     
Total input tokens:                      6132      
Total generated tokens:                  24000     
Request throughput (req/s):              0.54      
Output token throughput (tok/s):         1078.77   
Peak output token throughput (tok/s):    1118.00   
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          1354.40   
---------------Time to First Token----------------
Mean TTFT (ms):                          20.19     
Median TTFT (ms):                        18.82     
P99 TTFT (ms):                           24.89     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.84      
Median TPOT (ms):                        1.85      
P99 TPOT (ms):                           1.86      
---------------Inter-token Latency----------------
Mean ITL (ms):                           1.84      
Median ITL (ms):                         1.83      
P99 ITL (ms):                            2.24      
==================================================

I think we could still try to fall back to polling the socket after spinning for some number of milliseconds (> step time, though that can vary a lot depending on the model, batch size, etc.), i.e. when "idle".

Or we could even experiment with something adaptive where, based on the predicted forward-pass time, we only start spinning after a small delay (i.e. in cases where the forward pass is longer).

> under true high-throughput conditions: if shm is ready and the thread wins the ring-buffer race, there’s zero extra polling.

I don't think this is related to throughput, it's related to the step time.
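Read as code, the spin-then-fall-back suggestion looks roughly like the sketch below (the 5 ms spin budget is a placeholder, not a measured value; the poller and readiness check are assumed helpers):

```python
import time
import zmq

def wait_hybrid(poller: zmq.Poller, shm_ready, spin_ms: float = 5.0):
    # Spin briefly, long enough to cover a typical step time, so the hot case
    # keeps today's zero wake latency.
    deadline = time.monotonic() + spin_ms / 1000.0
    while time.monotonic() < deadline:
        if shm_ready():
            return
    # Apparently idle: fall back to blocking on the ZMQ socket instead of burning CPU.
    while not shm_ready():
        poller.poll(100)  # timeout in ms, so shm is still re-checked periodically
```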

@noobpwnftw (Author)

@njhill This is so unfortunate. Could try copy=False on the recv path to see if it improves anything.

Now I have written 2 variants:

  1. Gate this entirely under VLLM_SLEEP_WHEN_IDLE, updated in this PR.
  2. An adaptive tick mechanism under a separate ZMQ socket, but with somewhat inconsistent idle → active transition latency (<=100 ms). See here: main...noobpwnftw:vllm:zmq_tick2.

> I don't think this is related to throughput, it's related to the step time.

My point is that if this happens so frequently within Python, then there is probably somewhere else to squeeze the extra ms.
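A rough sketch of what variant 1 amounts to, gating the blocking wake behind the existing VLLM_SLEEP_WHEN_IDLE flag (environment handling simplified; this is not the PR's actual diff, and the three callables are assumed helpers):

```python
import os

SLEEP_WHEN_IDLE = os.getenv("VLLM_SLEEP_WHEN_IDLE", "0") == "1"

def acquire_read(shm_ready, zmq_wait, busy_wait):
    # Only opt into the blocking ZMQ wake when the user chose to trade a little
    # latency for idle CPU; otherwise keep the original spinning behaviour.
    if shm_ready():
        return
    if SLEEP_WHEN_IDLE:
        zmq_wait()   # block on the wake-tick socket
    else:
        busy_wait()  # original spin / sched_yield path
```

(The copy=False suggestion above refers to pyzmq's zero-copy receive, where socket.recv(copy=False) returns a zmq.Frame backed by the underlying message buffer instead of a new bytes object.)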

@njhill (Member) commented Sep 19, 2025

We could also explore the possibility of enabling this when async scheduling is enabled, which we want to make the default soon anyhow.

@noobpwnftw force-pushed the zmq_tick branch 3 times, most recently from 36d9c3f to a762ca0, on September 27, 2025 at 01:37
@noobpwnftw (Author)

I think this version should be ok.

Avoids busy polling and reduces wake-up latency from sleeps.

Signed-off-by: noobpwnftw <guo.bojun@gmail.com>
@njhill (Member) commented Sep 27, 2025

Thanks @noobpwnftw ... could you explain why a new/separate socket is needed?

@noobpwnftw (Author)

To avoid overhead, subscribers now sleep only after going idle, so wake ticks are not always consumed per handoff. A separate socket is therefore needed to correctly read overflowed payloads.
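A minimal sketch of the separation being described, with illustrative socket types and endpoints: wake ticks are best-effort and may go unread when the reader is already awake, so payloads that overflow the shm buffer need their own reliably consumed channel.

```python
import zmq

ctx = zmq.Context.instance()

# Channel 1: wake ticks. A subscriber that is already awake may never read these,
# so nothing that must be consumed exactly once can travel on this socket.
tick_pub = ctx.socket(zmq.PUB)
tick_pub.bind("ipc:///tmp/mq_wake_tick")  # illustrative endpoint

# Channel 2: payloads too large for the shm ring buffer. Every message here must
# be received by the readers, hence the separate socket.
data_pub = ctx.socket(zmq.PUB)
data_pub.bind("ipc:///tmp/mq_overflow")   # illustrative endpoint
```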
