
[Usage]: running Qwen2.5-VL with PD disaggregation raises ValueError("Insufficient memory") #25759

@bingchen-mi

Description


Your current environment

I start the P (prefill) and D (decode) services with the following parameters.

```shell
# kv_producer
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve /opt/work/Qwen2.5-VL-7B-Instruct-20250829-FP8-Dynamic --host 0.0.0.0 --port 8100 --tensor-parallel-size 1 --served-model-name base_model --dtype auto --max-model-len 10000 --limit-mm-per-prompt '{"image":12}' --compilation-config '{"level": 3,"compilation_mode":"default","compiler":"vllm-compiler","configs":{"model":{"vision_language":{"enable":true,"vision_encoder_compilation_mode":"disable","llm_compilation_mode":"enable"}}}}' --max-num-seqs 20 --gpu-memory-utilization 0.7 --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"2e9","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"0.0.0.0","proxy_port":"30001","http_port":"8100","send_type":"PUT_ASYNC"}}'
```

```shell
# kv_consumer
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve /opt/work/Qwen2.5-VL-7B-Instruct-20250829-FP8-Dynamic --port 8200 --served-model-name base_model --tensor-parallel-size 1 --dtype auto --enable-prefix-caching --enable-chunked-prefill --max-model-len 8000 --limit-mm-per-prompt '{"image":12}' --compilation-config '{"level": 3,"compilation_mode":"default","compiler":"vllm-compiler","configs":{"model":{"vision_language":{"enable":true,"vision_encoder_compilation_mode":"disable","llm_compilation_mode":"enable"}}}}' --max-num-seqs 20 --gpu-memory-utilization 0.7 --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"4e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.30.52.112","proxy_port":"30001","http_port":"8200","send_type":"PUT_ASYNC"}}'
```

After running for a moment, the D server reports the following error:

```text
(EngineCore_DP0 pid=3788) WARNING 09-26 02:48:27 [p2p_nccl_engine.py:343] 🔴[PUT]Recv Tensor, Out Of Threshold, 10.53.178.233:22001👈10.30.52.112:21001, data:{'cmd': 'PUT', 'tensor_id': 'chatcmpl-___prefill_addr_10.30.52.112:21001___decode_addr_10.53.178.233:22001_295b5c7968e14e0c9013bae24bc5b5df#language_model.model.layers.10.self_attn.attn', 'shape': [2, 240, 16, 4, 128], 'dtype': 'bfloat16'}, addr:140483699408896
(EngineCore_DP0 pid=3788) WARNING 09-26 02:48:27 [p2p_nccl_engine.py:343] 🔴[PUT]Recv Tensor, Out Of Threshold, 10.53.178.233:22001👈10.30.52.112:21001, data:{'cmd': 'PUT', 'tensor_id': 'chatcmpl-___prefill_addr_10.30.52.112:21001___decode_addr_10.53.178.233:22001_295b5c7968e14e0c9013bae24bc5b5df#language_model.model.layers.11.self_attn.attn', 'shape': [2, 240, 16, 4, 128], 'dtype': 'bfloat16'}, addr:140483707797504
(EngineCore_DP0 pid=3788) WARNING 09-26 02:48:27 [p2p_nccl_engine.py:343] 🔴[PUT]Recv Tensor, Out Of Threshold, 10.53.178.233:22001👈10.30.52.112:21001, data:{'cmd': 'PUT', 'tensor_id': 'chatcmpl-___prefill_addr_10.30.52.112:21001___decode_addr_10.53.178.233:22001_295b5c7968e14e0c9013bae24bc5b5df#language_model.model.layers.12.self_attn.attn', 'shape': [2, 240, 16, 4, 128], 'dtype': 'bfloat16'}, addr:140483716186112
(EngineCore_DP0 pid=3788) WARNING 09-26 02:48:27 [p2p_nccl_engine.py:343] 🔴[PUT]Recv Tensor, Out Of Threshold, 10.53.178.233:22001👈10.30.52.112:21001, data:{'cmd': 'PUT', 'tensor_id': 'chatcmpl-___prefill_addr_10.30.52.112:21001___decode_addr_10.53.178.233:22001_295b5c7968e14e0c9013bae24bc5b5df#language_model.model.layers.13.self_attn.attn', 'shape': [2, 240, 16, 4, 128], 'dtype': 'bfloat16'}, addr:140483724574720
(EngineCore_DP0 pid=3788) WARNING 09-26 02:48:27 [p2p_nccl_engine.py:343] 🔴[PUT]Recv Tensor, Out Of Threshold, 10.53.178.233:22001👈10.30.52.112:21001, data:{'cmd': 'PUT', 'tensor_id': 'chatcmpl-___prefill_addr_10.30.52.112:21001___decode_addr_10.53.178.233:22001_295b5c7968e14e0c9013bae24bc5b5df#language_model.model.layers.14.self_attn.attn', 'shape': [2, 240, 16, 4, 128], 'dtype': 'bfloat16'}, addr:140483732963328
(EngineCore_DP0 pid=3788) WARNING 09-26 02:48:27 [p2p_nccl_engine.py:343] 🔴[PUT]Recv Tensor, Out Of Threshold, 10.53.178.233:22001👈10.30.52.112:21001, data:{'cmd': 'PUT', 'tensor_id': 'chatcmpl-___prefill_addr_10.30.52.112:21001___decode_addr_10.53.178.233:22001_295b5c7968e14e0c9013bae24bc5b5df#language_model.model.layers.15.self_attn.attn', 'shape': [2, 240, 16, 4, 128], 'dtype': 'bfloat16'}, addr:140483741351936
(EngineCore_DP0 pid=3788) Exception in thread Thread-2 (listen_for_requests):
(EngineCore_DP0 pid=3788) Traceback (most recent call last):
(EngineCore_DP0 pid=3788)   File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
(EngineCore_DP0 pid=3788)     self.run()
(EngineCore_DP0 pid=3788)   File "/usr/lib/python3.12/threading.py", line 1012, in run
(EngineCore_DP0 pid=3788)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=3788)   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_engine.py", line 341, in listen_for_requests
(EngineCore_DP0 pid=3788)     addr = self.pool.store_tensor(tensor)
(EngineCore_DP0 pid=3788)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3788)   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/p2p/tensor_memory_pool.py", line 198, in store_tensor
(EngineCore_DP0 pid=3788)     addr = self.allocate(size)
(EngineCore_DP0 pid=3788)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3788)   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/p2p/tensor_memory_pool.py", line 135, in allocate
(EngineCore_DP0 pid=3788)     raise ValueError("Insufficient memory")
(EngineCore_DP0 pid=3788) ValueError: Insufficient memory
```
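For context, here is a minimal illustrative sketch (not vLLM's actual `tensor_memory_pool` implementation) of the failure mode: a fixed-capacity pool that raises `ValueError("Insufficient memory")` once incoming tensors exceed the configured buffer, together with the arithmetic for how many KV tensors of the logged shape fit in a 4e9-byte `kv_buffer_size`. The `FixedSizePool` class and its bump-pointer allocation are assumptions for illustration only.

```python
# Hypothetical sketch of a fixed-capacity buffer, for illustration only.
class FixedSizePool:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0

    def allocate(self, size: int) -> int:
        # Reject the request once the buffer would overflow,
        # mirroring the error message seen in the traceback.
        if self.used + size > self.capacity:
            raise ValueError("Insufficient memory")
        addr = self.used
        self.used += size
        return addr

# One KV tensor from the log: shape [2, 240, 16, 4, 128], bfloat16 (2 bytes)
tensor_bytes = 2 * 240 * 16 * 4 * 128 * 2  # 7,864,320 bytes ≈ 7.5 MiB

pool = FixedSizePool(capacity_bytes=int(4e9))  # consumer kv_buffer_size = 4e9
stored = 0
try:
    while True:
        pool.allocate(tensor_bytes)
        stored += 1
except ValueError:
    pass

print(stored)  # number of such tensors before the pool overflows
```

This suggests sustained PUT_ASYNC traffic (many layers per request, many in-flight requests) can fill the consumer-side buffer faster than tensors are consumed, at which point `store_tensor` fails as shown above.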

Tested on RTX 4090 24 GB and RTX 4090 24 GB/48 GB GPUs.
vLLM version: 0.10.2

Labels: usage (How to use vllm)