Labels: usage (How to use vllm)
Your current environment
I start the prefill (P) and decode (D) services with the following parameters.
```shell
# kv_producer
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve /opt/work/Qwen2.5-VL-7B-Instruct-20250829-FP8-Dynamic --host 0.0.0.0 --port 8100 --tensor-parallel-size 1 --served-model-name base_model --dtype auto --max-model-len 10000 --limit-mm-per-prompt '{"image":12}' --compilation-config '{"level": 3,"compilation_mode":"default","compiler":"vllm-compiler","configs":{"model":{"vision_language":{"enable":true,"vision_encoder_compilation_mode":"disable","llm_compilation_mode":"enable"}}}}' --max-num-seqs 20 --gpu-memory-utilization 0.7 --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"2e9","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"0.0.0.0","proxy_port":"30001","http_port":"8100","send_type":"PUT_ASYNC"}}'

# kv_consumer
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve /opt/work/Qwen2.5-VL-7B-Instruct-20250829-FP8-Dynamic --port 8200 --served-model-name base_model --tensor-parallel-size 1 --dtype auto --enable-prefix-caching --enable-chunked-prefill --max-model-len 8000 --limit-mm-per-prompt '{"image":12}' --compilation-config '{"level": 3,"compilation_mode":"default","compiler":"vllm-compiler","configs":{"model":{"vision_language":{"enable":true,"vision_encoder_compilation_mode":"disable","llm_compilation_mode":"enable"}}}}' --max-num-seqs 20 --gpu-memory-utilization 0.7 --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"4e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.30.52.112","proxy_port":"30001","http_port":"8200","send_type":"PUT_ASYNC"}}'
```
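The `--kv-transfer-config` argument must be valid JSON, and `kv_buffer_size` is passed as a string here. A quick sanity check before launching is to parse the value the same way (a minimal sketch using the producer config from the command above; interpreting the string as a float byte count is an assumption, not something taken from the vLLM source):

```python
import json

# The exact --kv-transfer-config JSON passed to the kv_producer above.
producer_cfg = json.loads(
    '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer",'
    '"kv_buffer_size":"2e9","kv_port":"21001",'
    '"kv_connector_extra_config":{"proxy_ip":"0.0.0.0","proxy_port":"30001",'
    '"http_port":"8100","send_type":"PUT_ASYNC"}}'
)

# "2e9" is a string; treat it as a byte count (assumption) to see the buffer size.
buffer_bytes = float(producer_cfg["kv_buffer_size"])
print(producer_cfg["kv_role"])      # kv_producer
print(buffer_bytes / 1e9, "GB")     # 2.0 GB
```

This also catches quoting mistakes (a stray shell quote inside the JSON) before vLLM ever sees the flag.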
After running for a while, the D server reports the following error:
```text
(EngineCore_DP0 pid=3788) WARNING 09-26 02:48:27 [p2p_nccl_engine.py:343] 🔴[PUT]Recv Tensor, Out Of Threshold, 10.53.178.233:22001👈10.30.52.112:21001, data:{'cmd': 'PUT', 'tensor_id': 'chatcmpl-___prefill_addr_10.30.52.112:21001___decode_addr_10.53.178.233:22001_295b5c7968e14e0c9013bae24bc5b5df#language_model.model.layers.10.self_attn.attn', 'shape': [2, 240, 16, 4, 128], 'dtype': 'bfloat16'}, addr:140483699408896
(EngineCore_DP0 pid=3788) WARNING 09-26 02:48:27 [p2p_nccl_engine.py:343] 🔴[PUT]Recv Tensor, Out Of Threshold, 10.53.178.233:22001👈10.30.52.112:21001, data:{'cmd': 'PUT', 'tensor_id': 'chatcmpl-___prefill_addr_10.30.52.112:21001___decode_addr_10.53.178.233:22001_295b5c7968e14e0c9013bae24bc5b5df#language_model.model.layers.11.self_attn.attn', 'shape': [2, 240, 16, 4, 128], 'dtype': 'bfloat16'}, addr:140483707797504
(EngineCore_DP0 pid=3788) WARNING 09-26 02:48:27 [p2p_nccl_engine.py:343] 🔴[PUT]Recv Tensor, Out Of Threshold, 10.53.178.233:22001👈10.30.52.112:21001, data:{'cmd': 'PUT', 'tensor_id': 'chatcmpl-___prefill_addr_10.30.52.112:21001___decode_addr_10.53.178.233:22001_295b5c7968e14e0c9013bae24bc5b5df#language_model.model.layers.12.self_attn.attn', 'shape': [2, 240, 16, 4, 128], 'dtype': 'bfloat16'}, addr:140483716186112
(EngineCore_DP0 pid=3788) WARNING 09-26 02:48:27 [p2p_nccl_engine.py:343] 🔴[PUT]Recv Tensor, Out Of Threshold, 10.53.178.233:22001👈10.30.52.112:21001, data:{'cmd': 'PUT', 'tensor_id': 'chatcmpl-___prefill_addr_10.30.52.112:21001___decode_addr_10.53.178.233:22001_295b5c7968e14e0c9013bae24bc5b5df#language_model.model.layers.13.self_attn.attn', 'shape': [2, 240, 16, 4, 128], 'dtype': 'bfloat16'}, addr:140483724574720
(EngineCore_DP0 pid=3788) WARNING 09-26 02:48:27 [p2p_nccl_engine.py:343] 🔴[PUT]Recv Tensor, Out Of Threshold, 10.53.178.233:22001👈10.30.52.112:21001, data:{'cmd': 'PUT', 'tensor_id': 'chatcmpl-___prefill_addr_10.30.52.112:21001___decode_addr_10.53.178.233:22001_295b5c7968e14e0c9013bae24bc5b5df#language_model.model.layers.14.self_attn.attn', 'shape': [2, 240, 16, 4, 128], 'dtype': 'bfloat16'}, addr:140483732963328
(EngineCore_DP0 pid=3788) WARNING 09-26 02:48:27 [p2p_nccl_engine.py:343] 🔴[PUT]Recv Tensor, Out Of Threshold, 10.53.178.233:22001👈10.30.52.112:21001, data:{'cmd': 'PUT', 'tensor_id': 'chatcmpl-___prefill_addr_10.30.52.112:21001___decode_addr_10.53.178.233:22001_295b5c7968e14e0c9013bae24bc5b5df#language_model.model.layers.15.self_attn.attn', 'shape': [2, 240, 16, 4, 128], 'dtype': 'bfloat16'}, addr:140483741351936
(EngineCore_DP0 pid=3788) Exception in thread Thread-2 (listen_for_requests):
(EngineCore_DP0 pid=3788) Traceback (most recent call last):
(EngineCore_DP0 pid=3788)   File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
(EngineCore_DP0 pid=3788)     self.run()
(EngineCore_DP0 pid=3788)   File "/usr/lib/python3.12/threading.py", line 1012, in run
(EngineCore_DP0 pid=3788)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=3788)   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_engine.py", line 341, in listen_for_requests
(EngineCore_DP0 pid=3788)     addr = self.pool.store_tensor(tensor)
(EngineCore_DP0 pid=3788)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3788)   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/p2p/tensor_memory_pool.py", line 198, in store_tensor
(EngineCore_DP0 pid=3788)     addr = self.allocate(size)
(EngineCore_DP0 pid=3788)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3788)   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/p2p/tensor_memory_pool.py", line 135, in allocate
(EngineCore_DP0 pid=3788)     raise ValueError("Insufficient memory")
(EngineCore_DP0 pid=3788) ValueError: Insufficient memory
```
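For scale: each PUT in the log carries a bfloat16 tensor of shape [2, 240, 16, 4, 128], so the consumer's tensor memory pool can fill up quickly under concurrency. A back-of-envelope estimate (the 28-layer count for the 7B language model is an assumption, not taken from the log):

```python
import math

# Per-layer KV tensor from the warning log: shape [2, 240, 16, 4, 128], bfloat16.
shape = (2, 240, 16, 4, 128)
elem_bytes = 2  # bfloat16 is 2 bytes per element
per_layer = math.prod(shape) * elem_bytes
print(per_layer)  # 7864320 bytes, i.e. ~7.5 MiB per layer

num_layers = 28  # assumed decoder layer count for the 7B language model
per_request = per_layer * num_layers
print(per_request / 1e6)  # ~220 MB of KV per request at this prompt length

kv_buffer = 4e9  # consumer kv_buffer_size from the command above
print(int(kv_buffer // per_request))  # ~18 requests fit before the pool is exhausted
```

If this estimate roughly holds, then with `--max-num-seqs 20` on both sides, twenty in-flight requests can exceed the 4 GB pool, which would match the "Out Of Threshold" warnings followed by `ValueError: Insufficient memory`; raising `kv_buffer_size` on the consumer or lowering `--max-num-seqs` would be the first knobs to try.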
Tested on an RTX 4090 24 GB and an RTX 4090 24 GB/48 GB.
vLLM version: 0.10.2