
Conversation

@noobpwnftw commented Sep 11, 2025

Purpose

Avoids busy polling and removes wake-up latency from sleeps.

Test Plan

None

Test Result

Works


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request replaces a busy-polling mechanism with a more efficient event-driven approach using ZMQ for local shared memory communication. This is a good improvement to reduce CPU usage when idle. The implementation is mostly correct, but there is an issue in acquire_read where the poll timeout does not respect the function's overall timeout parameter, which could lead to unnecessary delays.
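For illustration, a minimal sketch of the kind of fix the bot is pointing at, assuming a hypothetical acquire_read that waits on a zmq.Poller (this is not the PR's actual code): the poll interval is derived from the remaining budget so it never exceeds the caller's overall timeout.

```python
import time
import zmq

def acquire_read(poller: zmq.Poller, timeout: float | None = None):
    """Hypothetical sketch: block for a wake event without exceeding `timeout` seconds."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        if deadline is None:
            poll_ms = None  # no overall deadline: block until an event arrives
        else:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return []  # timed out before anything became available
            poll_ms = int(remaining * 1000) + 1  # never outlive the caller's deadline
        events = poller.poll(poll_ms)
        if events:
            return events
```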

@noobpwnftw force-pushed the zmq_tick branch 5 times, most recently from b8b5133 to 8a087e6, on September 12, 2025 at 11:37
@noobpwnftw (Author)

This is a better fix for #16226.

@chaunceyjiang @p12tic

@chaunceyjiang (Collaborator) left a comment

Could you provide a benchmark test?
It seems that this PR causes performance degradation under high throughput conditions.

@noobpwnftw (Author)

> Could you provide a benchmark test? It seems that this PR causes performance degradation under high throughput conditions.

I don’t think that’s the case.

First, this code path is not hot — it runs once per worker schedule/output in MultiprocExecutor. The busy-poll existed only for local IPC (e.g. TP>1 on the same node). For remote sockets it’s already orders of magnitude slower, yet that’s not raised as a concern.

Second, the original code falls back to ZMQ for payloads larger than the shm buffer. In practice (e.g. a 16-GPU node), small chunks go through shm to avoid 16 redundant copies, while large chunks are still copied wholesale if they exceed VLLM_MQ_MAX_CHUNK_BYTES_MB=16. The supposed “optimization” is either negligible or backwards — and in any case, such behavior is unchanged.

Third, under true high-throughput conditions: if shm is ready and the thread wins the ring-buffer race, there’s zero extra polling. The only cost is a single pipe read/write per call, ~20 µs. That’s not even comparable to sched_yield(), let alone the actual overheads of inference and device communication.

Lastly, the old approach injected up to 100 ms latency during idle → active transitions due to time.sleep. This PR removes that cliff entirely.

Given all this, I’m not seeing how “performance degradation” applies here. Could you clarify the specific workload or measurements where you observed it? Without that context, the evidence points to this change improving latency without impacting throughput.
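For readers comparing the two approaches, here is a deliberately simplified sketch (not vLLM's actual MessageQueue code): the old pattern spins briefly and then sleeps, which is where the idle-to-active latency cliff comes from, while the proposed pattern blocks on a ZMQ socket that the writer ticks after each shm handoff.

```python
import time
import zmq

def wait_busy_poll(shm_ready, spin_s=0.001, sleep_s=0.1):
    # Old pattern (simplified): spin on the shm readiness flag, then back off with
    # time.sleep; waking from that sleep is the latency cliff mentioned above.
    start = time.monotonic()
    while not shm_ready():
        if time.monotonic() - start > spin_s:
            time.sleep(sleep_s)

def wait_zmq_tick(sub: zmq.Socket, shm_ready):
    # Proposed pattern (simplified): block on a wake tick that the writer sends
    # after publishing to the shm ring buffer; no spinning, no sleep cliff.
    while not shm_ready():
        sub.recv()
```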

@noobpwnftw (Author)

For completeness: vLLM currently routes small items through SHM and large items through ZMQ — which is inverted from the more typical design (ZMQ for small control messages, SHM for large bulk transfers). My change doesn’t alter that; it only replaces busy-poll/sleep with a level-triggered wake. Any throughput concerns around very large broadcasts are therefore a separate threshold/transport discussion, orthogonal to this PR. That said, I believe you’d be better off with pure ZMQ IPC here, or at least flipping the policy so that small items use ZMQ and large payloads use SHM.
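To make the routing being described concrete, a rough sketch of the size-threshold policy (the environment variable name and 16 MB default come from the discussion above; the ring-buffer API here is hypothetical, and the real implementation differs in detail):

```python
import os
import pickle

MAX_INLINE_BYTES = int(os.getenv("VLLM_MQ_MAX_CHUNK_BYTES_MB", "16")) * 1024 * 1024

def enqueue(obj, shm_ring, zmq_pub):
    # Small payloads go through the shared-memory ring buffer so N local readers
    # avoid N redundant copies; oversized payloads fall back to the ZMQ socket.
    payload = pickle.dumps(obj)
    if len(payload) < MAX_INLINE_BYTES:
        shm_ring.write(payload)  # hypothetical ring-buffer API
    else:
        zmq_pub.send(payload)
```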

@noobpwnftw force-pushed the zmq_tick branch 3 times, most recently from 50fb107 to d1c72ad, on September 15, 2025 at 05:51
@chaunceyjiang (Collaborator)

> For completeness: vLLM currently routes small items through SHM and large items through ZMQ — which is inverted from the more typical design (ZMQ for small control messages, SHM for large bulk transfers). My change doesn’t alter that; it only replaces busy-poll/sleep with a level-triggered wake. Any throughput concerns around very large broadcasts are therefore a separate threshold/transport discussion, orthogonal to this PR. That said, however, I believe you’d be better off with pure ZMQ IPC here, or at least flipping the policy so that small items use ZMQ and large payloads use SHM.

Thank you for the clarification~

/cc @njhill PTAL.

@noobpwnftw force-pushed the zmq_tick branch 2 times, most recently from b7f63e7 to dedfbd9, on September 15, 2025 at 06:14
Collaborator

I remember that during execute_model, it calls broadcast_tensor_dict to broadcast the tensor. I’m not sure if this affects anything.

@noobpwnftw (Author)

The actual tensors are transferred via Torch C bindings, while the Python side only broadcasts metadata through SHM and later gathers outputs post-inference.

You raise a good point about potential impact on end-to-end latency. However, this path is not inside the per-token inference loop, and the added cost in Python is negligible (on the order of ~100 µs ≈ 0.1 ms, which is comparable to ~50 lines of Python). Meanwhile, CPU utilization drops from constant ~100% to ~8%.

Given that, I can’t construct a realistic scenario where this overhead becomes a bottleneck. For it to matter, scheduler–worker IPC would have to dominate performance over both device communication and Python interpreter overhead, which doesn’t seem plausible.
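A schematic of the split being described, assuming the usual torch.distributed pattern rather than vLLM's exact broadcast_tensor_dict code: metadata rides the Python-side message queue (the shm/ZMQ path this PR touches), while tensor payloads go through the collective backend. The message-queue API here is hypothetical.

```python
import torch
import torch.distributed as dist

def broadcast_tensor_dict_sketch(tensor_dict, mq, src=0):
    # Sender: broadcast keys/shapes/dtypes via the message queue, tensor bytes via NCCL.
    if dist.get_rank() == src:
        metadata = {k: (v.shape, v.dtype) for k, v in tensor_dict.items()}
        mq.broadcast(metadata)  # hypothetical message-queue API
        for tensor in tensor_dict.values():
            dist.broadcast(tensor, src=src)
        return tensor_dict
    # Receiver: read metadata from the queue, then receive each tensor collectively.
    metadata = mq.recv()  # hypothetical
    out = {}
    for key, (shape, dtype) in metadata.items():
        tensor = torch.empty(shape, dtype=dtype, device="cuda")
        dist.broadcast(tensor, src=src)
        out[key] = tensor
    return out
```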

@njhill (Member) left a comment

Thanks @noobpwnftw, I agree that we should do something along these lines. I had also thought it's kind of the inverse of how shm would usually be used.

The reason for using shm in this way is to minimize latency in the multi-worker case. It would be good to benchmark TP=8 with a relatively small model and low concurrency to see if there's any impact from this change. I will do that too if I get a chance.

Member

Why suppress the error here?

@noobpwnftw (Author)

Timing out before anything becomes available is returned as an error.

Member

No, it just returns zero in this case.

@njhill (Member) left a comment

I'm good with this if the benchmark shows it's no worse...

@njhill changed the title from "Reuse ZMQ to level trigger local ShmRingBuffer events." to "[Core] Reuse ZMQ to level trigger local ShmRingBuffer events." on Sep 17, 2025
@njhill (Member) commented Sep 18, 2025

@noobpwnftw I ran some benchmarks of a very low-latency case; unfortunately, this does perform quite a bit worse:

On 2x H100:

vllm serve meta-llama/Llama-3.2-1B-Instruct --disable-log-requests --uvicorn-log-level=error --tensor-parallel-size 2
vllm bench serve \
    --backend vllm \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --dataset-name random \
    --random-input-len 512 \
    --random-output-len 2000 \
    --ignore-eos \
    --num-prompts 12 \
    --max-concurrency 2 \
    --seed 42

Before:

============ Serving Benchmark Result ============
Successful requests:                     12        
Maximum request concurrency:             2         
Benchmark duration (s):                  23.64     
Total input tokens:                      6132      
Total generated tokens:                  24000     
Request throughput (req/s):              0.51      
Output token throughput (tok/s):         1015.39   
Peak output token throughput (tok/s):    1030.00   
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          1274.83   
---------------Time to First Token----------------
Mean TTFT (ms):                          23.95     
Median TTFT (ms):                        23.86     
P99 TTFT (ms):                           26.52     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.96      
Median TPOT (ms):                        1.96      
P99 TPOT (ms):                           1.96      
---------------Inter-token Latency----------------
Mean ITL (ms):                           1.96      
Median ITL (ms):                         1.94      
P99 ITL (ms):                            2.69      
==================================================

This PR:

============ Serving Benchmark Result ============
Successful requests:                     12        
Maximum request concurrency:             2         
Benchmark duration (s):                  30.52     
Total input tokens:                      6132      
Total generated tokens:                  24000     
Request throughput (req/s):              0.39      
Output token throughput (tok/s):         786.36    
Peak output token throughput (tok/s):    836.00    
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          987.27    
---------------Time to First Token----------------
Mean TTFT (ms):                          26.53     
Median TTFT (ms):                        25.41     
P99 TTFT (ms):                           32.97     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.53      
Median TPOT (ms):                        2.53      
P99 TPOT (ms):                           2.67      
---------------Inter-token Latency----------------
Mean ITL (ms):                           2.53      
Median ITL (ms):                         2.54      
P99 ITL (ms):                            2.95      
==================================================

With async scheduling enabled, the difference is much smaller but still noticeable:

Before:

============ Serving Benchmark Result ============
Successful requests:                     12        
Maximum request concurrency:             2         
Benchmark duration (s):                  21.50     
Total input tokens:                      6132      
Total generated tokens:                  24000     
Request throughput (req/s):              0.56      
Output token throughput (tok/s):         1116.40   
Peak output token throughput (tok/s):    1130.00   
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          1401.64   
---------------Time to First Token----------------
Mean TTFT (ms):                          17.83     
Median TTFT (ms):                        17.21     
P99 TTFT (ms):                           21.59     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.78      
Median TPOT (ms):                        1.78      
P99 TPOT (ms):                           1.79      
---------------Inter-token Latency----------------
Mean ITL (ms):                           1.78      
Median ITL (ms):                         1.77      
P99 ITL (ms):                            2.23      
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     12        
Maximum request concurrency:             2         
Benchmark duration (s):                  22.25     
Total input tokens:                      6132      
Total generated tokens:                  24000     
Request throughput (req/s):              0.54      
Output token throughput (tok/s):         1078.77   
Peak output token throughput (tok/s):    1118.00   
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          1354.40   
---------------Time to First Token----------------
Mean TTFT (ms):                          20.19     
Median TTFT (ms):                        18.82     
P99 TTFT (ms):                           24.89     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.84      
Median TPOT (ms):                        1.85      
P99 TPOT (ms):                           1.86      
---------------Inter-token Latency----------------
Mean ITL (ms):                           1.84      
Median ITL (ms):                         1.83      
P99 ITL (ms):                            2.24      
==================================================

I think we could still try to fall back to polling the socket after spinning for some number of milliseconds (> step time, though that can vary a lot depending on the model, batch size, etc.), i.e. when "idle".

Or we could even experiment with something adaptive where, based on the predicted forward-pass time, we only start spinning after a small delay (i.e. in cases where the forward pass is longer).

> under true high-throughput conditions: if shm is ready and the thread wins the ring-buffer race, there’s zero extra polling.

I don't think this is related to throughput, it's related to the step time.
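Read as code, the spin-then-fall-back suggestion looks roughly like the sketch below (the 5 ms spin budget is a placeholder, not a measured value; the poller and readiness check are assumed helpers):

```python
import time
import zmq

def wait_hybrid(poller: zmq.Poller, shm_ready, spin_ms: float = 5.0):
    # Spin briefly, long enough to cover a typical step time, so the hot case
    # keeps today's zero wake latency.
    deadline = time.monotonic() + spin_ms / 1000.0
    while time.monotonic() < deadline:
        if shm_ready():
            return
    # Apparently idle: fall back to blocking on the ZMQ socket instead of burning CPU.
    while not shm_ready():
        poller.poll(100)  # timeout in ms, so shm is still re-checked periodically
```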

@noobpwnftw (Author)

@njhill This is so unfortunate. Could try copy=False on the recv path to see if it improves anything.

Now I have written 2 variants:

  1. Gate this entirely under VLLM_SLEEP_WHEN_IDLE, updated in this PR.
  2. An adaptive tick mechanism under a separate ZMQ socket, but with somewhat inconsistent idle → active transition latency (<=100 ms). See here: main...noobpwnftw:vllm:zmq_tick2.

> I don't think this is related to throughput, it's related to the step time.

My point is that if this happens so frequently within Python, then there is probably somewhere else to squeeze the extra ms.
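A rough sketch of what variant 1 amounts to, gating the blocking wake behind the existing VLLM_SLEEP_WHEN_IDLE flag (environment handling simplified; this is not the PR's actual diff, and the three callables are assumed helpers):

```python
import os

SLEEP_WHEN_IDLE = os.getenv("VLLM_SLEEP_WHEN_IDLE", "0") == "1"

def acquire_read(shm_ready, zmq_wait, busy_wait):
    # Only opt into the blocking ZMQ wake when the user chose to trade a little
    # latency for idle CPU; otherwise keep the original spinning behaviour.
    if shm_ready():
        return
    if SLEEP_WHEN_IDLE:
        zmq_wait()   # block on the wake-tick socket
    else:
        busy_wait()  # original spin / sched_yield path
```

(The copy=False suggestion above refers to pyzmq's zero-copy receive, where socket.recv(copy=False) returns a zmq.Frame backed by the underlying message buffer instead of a new bytes object.)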

@njhill (Member) commented Sep 19, 2025

We could also explore the possibility of enabling this when async scheduling is enabled, which we want to make the default soon anyhow.

@noobpwnftw force-pushed the zmq_tick branch 3 times, most recently from 36d9c3f to a762ca0, on September 27, 2025 at 01:37
@noobpwnftw (Author)

I think this version should be ok.

Avoids busy polling and reduces wake-up latency from sleeps.

Signed-off-by: noobpwnftw <guo.bojun@gmail.com>
@njhill (Member) commented Sep 27, 2025

Thanks @noobpwnftw ... could you explain why a new/separate socket is needed?

@noobpwnftw (Author)

To avoid overhead, subscribers now sleep only after going idle, so wake ticks are not always consumed per handoff. A separate socket is therefore needed to correctly read overflowed payloads.
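A minimal sketch of the separation being described, with illustrative socket types and endpoints: wake ticks are best-effort and may go unread when the reader is already awake, so payloads that overflow the shm buffer need their own reliably consumed channel.

```python
import zmq

ctx = zmq.Context.instance()

# Channel 1: wake ticks. A subscriber that is already awake may never read these,
# so nothing that must be consumed exactly once can travel on this socket.
tick_pub = ctx.socket(zmq.PUB)
tick_pub.bind("ipc:///tmp/mq_wake_tick")  # illustrative endpoint

# Channel 2: payloads too large for the shm ring buffer. Every message here must
# be received by the readers, hence the separate socket.
data_pub = ctx.socket(zmq.PUB)
data_pub.bind("ipc:///tmp/mq_overflow")   # illustrative endpoint
```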
