[GPU] Update RemoteTensor support for dynamic model #29061
base: master
Conversation
Force-pushed from d33fa95 to 5ac9bad
Force-pushed from 9e743e9 to a1052ab
@@ -934,7 +934,8 @@ std::vector<cldnn::event::ptr> SyncInferRequest::prepare_output(size_t output_id
     auto device_tensor_et = convert_to_supported_device_type(element_type);
     bool convert_needed = is_convert_required(device_tensor_et, element_type);

-    if (is_remote_tensor_impl && !convert_needed && !is_dynamic) {
+    // Even if the network is dynamic, if user tensor's shape is static, remote tensor can be set as plugin's output tensor
+    if (is_remote_tensor_impl && !convert_needed && !user_tensor_pshape.is_dynamic()) {
         m_plugin_outputs[output_idx] = user_tensor_wrapper;
This change allows setting memory on the network's output primitive, but it seems the last primitive can be optimized out at runtime in some cases, like this:
GPU_Debug: primitive_inst.cpp:1802:prepare_primitive: Execute fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0 (type: fully_connected)
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[0] : gather:Gather_22637
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[1] : constant:self.lm_head.weight
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[2] : constant:self.lm_head.weight/scale
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[3] : constant:self.lm_head.weight/zero_point
GPU_Debug: primitive_inst.cpp:1806:prepare_primitive: -----------------------------------------------------------------
GPU_Debug: primitive_inst.cpp:282:update_shape: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0: update shape is done by other: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_inst.cpp:1304:do_runtime_skip_reorder: [do runtime skip reorder] update shape for user result:Result_26547
GPU_Debug: primitive_inst.cpp:291:update_shape: result:Result_26547: update shape dep [0] : fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0 was: f32:bfyx:?x1x32000:nopad now: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_type_base.h:199:calc_output_layouts: result:Result_26547 input tensor: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_type_base.h:205:calc_output_layouts: result:Result_26547 output tensor: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_inst.cpp:468:update_shape: result:Result_26547: update shape: was: f32:bfyx:?x1x32000:nopad now: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_inst.cpp:1321:do_runtime_skip_reorder: [do runtime skip reorder] set user result:Result_26547 as can_be_optimized
GPU_Debug: primitive_inst.cpp:1219:update_impl: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0 impl update: was: nullptr now: jit:gemm:any
GPU_Debug: primitive_inst.cpp:2249:update_weights: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0: add original weights memory u8:bfyx:32000x4096:nopad to weights cache; cache_size=1/3
GPU_Debug: primitive_inst.cpp:929:realloc_if_needed: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0: realloc output memory.
GPU_Debug: primitive_inst.cpp:930:realloc_if_needed: outputs[0] Current buffer_size=0 Requested buffer_size=32000
GPU_Debug: memory_pool.cpp:178:get_from_non_padded_pool: [fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0(1462): output]
GPU_Debug: memory.cpp:27:MemoryTracker: Allocate 128000 bytes of usm_host allocation type ptr = 0x7fda8c34aa00 (current=1726312; max=4318149292)
GPU_Debug: primitive_inst.cpp:1924:prepare_primitive: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0: execute jit:gemm:any (is_dynamic=0, can_be_optimized=0)
Dump program1_network1_0_fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0_dst0
GPU_Debug: primitive_inst.cpp:1801:prepare_primitive: -----------------------------------------------------------------
GPU_Debug: primitive_inst.cpp:1802:prepare_primitive: Execute result:Result_26547 (type: reorder)
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[0] : fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0
GPU_Debug: primitive_inst.cpp:1806:prepare_primitive: -----------------------------------------------------------------
GPU_Debug: primitive_inst.cpp:282:update_shape: result:Result_26547: update shape is done by other: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_inst.cpp:1188:update_impl: result:Result_26547 Skip impl update: primitive is optimized out
GPU_Debug: primitive_inst.cpp:1924:prepare_primitive: result:Result_26547: execute reorder_data (is_dynamic=1, can_be_optimized=1)
Is this case handled somehow?
Currently, when executing do_runtime_skip_reorder, the runtime optimization is skipped if the output memory has already been allocated and the reorder is an output node. However, since this is limited to the usm_device type for now, the Reorder will be executed without the runtime optimization only if the shared buffer type of the RemoteTensor is usm_device.
if (alloc_type == allocation_type::usm_device && u->is_output()) {
@andrew-k-park it is only effective for the usm_device case, but we need it for all memory types.
I think we may be able to handle it with something like:
- add additional info about the "remote tensor" to the network (i.e., output addresses)
- in the skip-reorder check, if the reorder's output address is the same as a remote tensor address, do not skip
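The two steps above could be sketched as follows. This is a hypothetical, self-contained model: `RemoteOutputRegistry`, `add_remote_output`, and `should_skip_reorder` are made-up names for illustration, not the actual plugin API.

```cpp
#include <unordered_set>

// Hypothetical sketch: the network records the addresses of user-provided
// remote-tensor output buffers, and the runtime skip-reorder pass consults it.
struct RemoteOutputRegistry {
    std::unordered_set<const void*> remote_output_addrs;

    void add_remote_output(const void* addr) { remote_output_addrs.insert(addr); }

    // In the skip-reorder check: a reorder writing directly into a
    // remote-tensor buffer must not be skipped, regardless of allocation type.
    bool should_skip_reorder(const void* reorder_output_addr) const {
        return remote_output_addrs.count(reorder_output_addr) == 0;
    }
};
```

Unlike the current `alloc_type == allocation_type::usm_device` condition, an address-based check like this would cover every memory type the remote tensor can wrap.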
Updated to not apply the runtime skip for the output reorder if it is using shared buffer memory from a remote tensor, as you suggested.
@yeonbok @andrew-k-park, regarding point 2), maybe we can try to assign remote memory directly to the currently processing instance? I think that executing this reorder (or other optimized node) may negate the benefits of using remote tensors at all
Okay, your suggestion is to skip reorder but use remote tensor? I think that is good too :)
So:
- add additional info about the "remote tensor" to the network (i.e., output addresses)
- in the skip-reorder check, if the reorder's output address is the same as a remote tensor address, skip the reorder and set the output memory of the reorder's input node to the remote tensor

@sshlyapn right?
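The combined idea could be modeled roughly like this. It is only a sketch under assumptions: `NodeSketch`, `buffer_ptr`, and `redirect_producer_to_remote` are invented names, and `std::shared_ptr<int>` stands in for `cldnn::memory::ptr`.

```cpp
#include <memory>

using buffer_ptr = std::shared_ptr<int>; // stand-in for cldnn::memory::ptr

struct NodeSketch {
    buffer_ptr output_mem;
};

// If the Result reorder's output is a remote-tensor buffer, hand that buffer
// to the reorder's producer so the producer writes into it directly and the
// reorder can still be optimized out.
bool redirect_producer_to_remote(NodeSketch& producer,
                                 NodeSketch& reorder,
                                 const buffer_ptr& remote_mem) {
    if (reorder.output_mem != remote_mem)
        return false;                     // not writing into remote memory: nothing to redirect
    producer.output_mem = remote_mem;     // producer now fills the remote buffer
    return true;                          // reorder may be skipped without losing the user's data
}
```

This way the reorder stays optimized out while the user still reads results from the remote tensor.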
@yeonbok yeah, thus we will benefit from both reorder optimization & remote tensors
@sshlyapn So how should it be handled if the dependency of the reorder is an fc? A larger memory size than what was allocated from the remote tensor may be needed because of the fake-aligned memory layout.
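The sizing concern can be illustrated numerically. This is only an illustration: the alignment factor 8 is an assumption for the example, not the plugin's actual fake-alignment value.

```cpp
#include <cstddef>

// A dynamic fc output batch may be "fake aligned" upward, so the buffer the
// fc wants can be larger than the remote tensor's exact-size buffer.
constexpr std::size_t fake_align(std::size_t batch, std::size_t align = 8) {
    return (batch + align - 1) / align * align;
}

// Element count the fc would request for an output of shape [batch, feature].
constexpr std::size_t fc_required_elems(std::size_t batch, std::size_t feature) {
    return fake_align(batch) * feature;
}
```

For a user tensor of shape 1x32000, the remote tensor holds 32000 elements, while a batch fake-aligned to 8 would ask for 8x32000, so assigning the remote buffer directly to the fc would be too small.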
Force-pushed from 7462772 to fde1563
@@ -223,6 +225,8 @@ struct network {
     /* Common memory pointer for shape_info */
     memory::ptr _shape_info_ptr;

+    memory::ptr _output_remote_mem_ptr = nullptr;
We can use an unordered_map here, as the network may have multiple outputs with assigned remote tensors.
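The suggested change could look roughly like the following. Names here (`OutputRemoteMems`, `set`, `get`) are hypothetical, and `std::shared_ptr<int>` stands in for `cldnn::memory::ptr`.

```cpp
#include <cstddef>
#include <memory>
#include <unordered_map>

using mem_ptr = std::shared_ptr<int>; // stand-in for cldnn::memory::ptr

// Keep one remote output buffer per output index instead of a single pointer.
struct OutputRemoteMems {
    std::unordered_map<std::size_t, mem_ptr> mems;

    void set(std::size_t output_idx, mem_ptr mem) {
        mems[output_idx] = std::move(mem);
    }

    // Returns nullptr when no remote tensor was assigned to this output.
    mem_ptr get(std::size_t output_idx) const {
        auto it = mems.find(output_idx);
        return it == mems.end() ? nullptr : it->second;
    }
};
```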
Updated to handle multiple output tensors.
if (is_dynamic) {
    // In dynamic case we allocate more than required to check that out tensor is reshaped correctly
    allocated_out_shape[1]++;
}
Why was this removed?
In the dynamic shape scenario, setting the remote tensor's memory as the primitive's output memory triggers an assertion from the following check logic:
check_memory_to_set(*mem_new, ol);
Therefore, this is removed because the shape originally assigned to the tensor should be used as-is and should not be changed manually.
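The failure mode can be modeled minimally as follows. This is an assumption about the check's behavior, not the actual cldnn `check_memory_to_set` implementation: the memory being set is validated against the primitive's output layout, so an over-allocated user shape no longer matches.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical model: over-allocating the user tensor (e.g. the removed
// allocated_out_shape[1]++) makes the memory's element count diverge from
// the output layout's element count and trips a check like this one.
void check_memory_to_set_sketch(std::size_t mem_elems, std::size_t layout_elems) {
    if (mem_elems != layout_elems)
        throw std::invalid_argument("Attempt to set memory with an incompatible layout");
}
```

Under this model, keeping the tensor at its original shape keeps the two element counts equal, which is why the over-allocation was dropped.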
Force-pushed from fde1563 to 85e8fc3
Signed-off-by: Andrew Park <andrew.park@intel.com>
…ry from remote tensor
Force-pushed from 38ea559 to bb3f391
Details:
- … usm_host and copied back to the remote tensor's memory. So clEnqueueMapBuffer and explicit memcpy are host performance overhead for processing the output tensor
- … usm memory to cl_mem to reduce overhead for the in-order queue

Tickets: