[Core] Reuse ZMQ to level trigger local ShmRingBuffer events. #24618
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset runs automatically, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request replaces a busy-polling mechanism with a more efficient event-driven approach using ZMQ for local shared-memory communication. This is a good improvement to reduce CPU usage when idle. The implementation is mostly correct, but there is an issue in acquire_read where the poll timeout does not respect the function's overall timeout parameter, which could lead to unnecessary delays.
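For illustration, here is a minimal sketch of a deadline-aware wait of the kind the review is asking for, assuming a pyzmq wake socket; the helper name `wait_readable` and its arguments are hypothetical, not the PR's actual acquire_read:

```python
from typing import Optional
import time
import zmq

def wait_readable(sock: zmq.Socket, timeout: Optional[float]) -> bool:
    """Block until `sock` is readable or the overall `timeout` (seconds) expires."""
    deadline = None if timeout is None else time.monotonic() + timeout
    poller = zmq.Poller()
    poller.register(sock, zmq.POLLIN)
    while True:
        if deadline is None:
            remaining_ms = None                     # no overall timeout: block until an event
        else:
            remaining_ms = int((deadline - time.monotonic()) * 1000)
            if remaining_ms <= 0:
                return bool(poller.poll(0))         # deadline hit: one last non-blocking check
        if poller.poll(remaining_ms):
            return True
```

The key point is that each poll uses the time remaining until the caller's deadline rather than a fixed constant.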
This is a better fix for #16226.
Could you provide a benchmark test?
It seems that this PR causes performance degradation under high throughput conditions.
I don't think that's the case.

First, this code path is not hot: it runs once per worker schedule/output.

Second, the original code falls back to ZMQ for payloads larger than the shm buffer. In practice (e.g. a 16-GPU node), small chunks go through shm to avoid 16 redundant copies, while large chunks are still copied wholesale if they exceed that limit.

Third, under true high-throughput conditions: if shm is ready and the thread wins the ring-buffer race, there's zero extra polling. The only cost is a single pipe read/write per call, roughly 20 µs, which is negligible compared to the rest of the step.

Lastly, the old approach injected up to 100 ms of latency during idle → active transitions because of its sleep-based backoff.

Given all this, I'm not seeing how "performance degradation" applies here. Could you clarify the specific workload or measurements where you observed it? Without that context, the evidence points to this change improving latency without impacting throughput.
For completeness: vLLM currently routes small items through SHM and large items through ZMQ, which is inverted from the more typical design (ZMQ for small control messages, SHM for large bulk transfers). My change doesn't alter that; it only replaces the busy-poll/sleep with a level-triggered wake. Any throughput concerns around very large broadcasts are therefore a separate threshold/transport discussion, orthogonal to this PR. That said, I believe you'd be better off with pure ZMQ IPC here, or at least flipping the policy so that small items use ZMQ and large payloads use SHM.
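To make the routing policy described above concrete, a rough sketch under assumed names (`ring_buffer`, `zmq_socket`, `max_chunk_bytes` are illustrative stand-ins, not vLLM's actual identifiers):

```python
import pickle

def enqueue(obj, ring_buffer, zmq_socket, max_chunk_bytes: int) -> None:
    payload = pickle.dumps(obj)
    if len(payload) <= max_chunk_bytes:
        ring_buffer.write(payload)   # small item: one shm write that every local reader maps
    else:
        ring_buffer.write(b"")       # marker telling readers to fetch the payload over ZMQ
        zmq_socket.send(payload)     # large item: falls back to a ZMQ copy per subscriber
```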
Thank you for the clarification~ /cc @njhill PTAL.
I remember that during execute_model, it calls broadcast_tensor_dict to broadcast the tensor. I'm not sure if this affects anything.
The actual tensors are transferred via Torch C bindings, while the Python side only broadcasts metadata through SHM and later gathers outputs post-inference.
You raise a good point about potential impact on end-to-end latency. However, this path is not inside the per-token inference loop, and the added cost in Python is negligible (on the order of ~100 µs, i.e. 0.1 ms, comparable to executing ~50 lines of Python). Meanwhile, CPU utilization drops from a constant ~100% to ~8%.
Given that, I can’t construct a realistic scenario where this overhead becomes a bottleneck. For it to matter, scheduler–worker IPC would have to dominate performance over both device communication and Python interpreter overhead, which doesn’t seem plausible.
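As a rough illustration of the metadata/tensor split described above (a simplified sketch, not vLLM's actual broadcast_tensor_dict; `mq` is an assumed message-queue handle for the SHM path):

```python
import torch
import torch.distributed as dist

def broadcast_tensor_dict(tensor_dict, mq, src: int = 0):
    """Sketch: metadata over the Python message queue, tensor data over NCCL/Gloo."""
    if dist.get_rank() == src:
        metadata = {k: (v.shape, v.dtype) for k, v in tensor_dict.items()}
        mq.broadcast(metadata)                 # small object: shm ring-buffer path
        for t in tensor_dict.values():
            dist.broadcast(t, src=src)         # bulk data: torch C bindings, not Python
        return tensor_dict
    metadata = mq.recv()                       # the wake-up path this PR changes
    out = {k: torch.empty(shape, dtype=dtype) for k, (shape, dtype) in metadata.items()}
    for t in out.values():
        dist.broadcast(t, src=src)
    return out
```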
Thanks @noobpwnftw, I agree that we should do something along these lines, and had also thought that it's kind of the inverse of how shm would usually be used.
The reason for using shm in this way is to minimize latency in the multi-worker case. It would be good to benchmark TP=8 with a relatively small model and low concurrency to see if there's any impact from this change. I will do that too if I get a chance.
Why suppress the error here?
Timing out before anything becomes available is returned as an error.
No, it just returns zero in this case.
I'm good with this if the benchmark shows it's no worse...
@noobpwnftw I ran some benchmarks of a very low-latency case; unfortunately, this does perform quite a bit worse. On 2x H100:
Before:
This PR:
With async scheduling enabled the difference is much smaller but still noticeable: Before:
After:
I think we could still try to fall back to polling the socket after spinning for some number of milliseconds (> the step time, though that can vary a lot depending on the model, batch size, etc.), i.e. when "idle". Or we could even experiment with something adaptive where, based on the predicted forward-pass time, we only start spinning after a small delay (i.e. in cases where the forward pass is longer).
I don't think this is related to throughput; it's related to the step time.
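A rough sketch of the spin-then-poll idea suggested above (illustrative only; `ready`, `sock`, and the 5 ms spin budget are assumptions, not the PR's code):

```python
import time
import zmq

def wait_hybrid(ready, sock: zmq.Socket, spin_s: float = 0.005) -> None:
    """Spin for `spin_s` seconds, then fall back to blocking on the wake socket."""
    spin_deadline = time.monotonic() + spin_s
    while time.monotonic() < spin_deadline:
        if ready():
            return                              # hot path: chunk arrived while spinning
        time.sleep(0)                           # yield without a fixed sleep penalty
    poller = zmq.Poller()
    poller.register(sock, zmq.POLLIN)
    while not ready():                          # idle path: block on the level-triggered socket
        if poller.poll(timeout=100):
            try:
                sock.recv(flags=zmq.DONTWAIT)   # consume one wake tick, then re-check
            except zmq.Again:
                pass
```

The spin budget would need to track the step time, which is exactly the adaptive part suggested above.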
@njhill This is so unfortunate. Could try... Now I have written 2 variants:
My point is that if this happens so frequently within Python, then there is probably somewhere else to squeeze the extra ms.
We could also explore the possibility of enabling this when async scheduling is enabled, which we want to make the default soon anyhow.
I think this version should be OK.
Avoids busy polling and reduces wake-up latency from sleeps. Signed-off-by: noobpwnftw <guo.bojun@gmail.com>
Thanks @noobpwnftw ... could you explain why a new/separate socket is needed?
To avoid overhead, subscribers now sleep only after going idle, so wake ticks are not always consumed per handoff. A separate socket is therefore needed to correctly read overflowed payloads.
Purpose
Avoids busy polling with no wake-up latency from sleeps.
Test Plan
None
Test Result
Works