[GPU] Update RemoteTensor support for dynamic model #29061

Open

andrew-k-park wants to merge 6 commits into master from improve_dyn_remote_tensor_perf
Conversation

@andrew-k-park andrew-k-park (Contributor) commented Feb 19, 2025

Details:

  • For dynamic models, the plugin's output tensor is not set to the remote tensor and is not pre-allocated
  • As a result, the plugin's output tensor is allocated as usm_host and then copied back to the remote tensor's memory, so the clEnqueueMapBuffer call and the explicit memcpy add host-side overhead when processing the output tensor
  • Enqueue a non-blocking copy from USM memory to cl_mem to reduce this overhead for the in-order queue
  • Even if the network is dynamic, when the user tensor's shape is static, the remote tensor can be set as the plugin's output tensor so the output primitive reuses the remote tensor's memory (see the usage sketch below)
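
For reference, a minimal usage sketch of the scenario this PR targets (model path, output index, and output shape are placeholders): a dynamic model compiled on GPU with a statically shaped RemoteTensor bound as the output, so the plugin can write into the user's device memory directly.

```cpp
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>

int main() {
    ov::Core core;
    // Model with dynamic input/output shapes (path is a placeholder).
    auto model = core.read_model("model.xml");
    auto compiled_model = core.compile_model(model, "GPU");

    // GPU remote context owned by the compiled model.
    auto gpu_context = compiled_model.get_context().as<ov::intel_gpu::ocl::ClContext>();

    // Remote tensor with a *static* shape matching the expected output.
    auto remote_output = gpu_context.create_tensor(ov::element::f32, ov::Shape{1, 1, 32000});

    auto infer_request = compiled_model.create_infer_request();
    // Even though the model is dynamic, the statically shaped remote tensor
    // can be set as the output so the output primitive reuses its memory.
    infer_request.set_output_tensor(0, remote_output);

    // ... set inputs and call infer_request.infer() ...
    return 0;
}
```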


Tickets:

  • 162095

@andrew-k-park andrew-k-park requested review from a team as code owners February 19, 2025 04:47
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Feb 19, 2025
@andrew-k-park andrew-k-park force-pushed the improve_dyn_remote_tensor_perf branch 2 times, most recently from d33fa95 to 5ac9bad Compare February 19, 2025 06:41
@andrew-k-park andrew-k-park force-pushed the improve_dyn_remote_tensor_perf branch 2 times, most recently from 9e743e9 to a1052ab Compare February 19, 2025 11:41
@@ -934,7 +934,8 @@ std::vector<cldnn::event::ptr> SyncInferRequest::prepare_output(size_t output_id
     auto device_tensor_et = convert_to_supported_device_type(element_type);
     bool convert_needed = is_convert_required(device_tensor_et, element_type);

-    if (is_remote_tensor_impl && !convert_needed && !is_dynamic) {
+    // Even if the network is dynamic, if user tensor's shape is static, remote tensor can be set as plugin's output tensor
+    if (is_remote_tensor_impl && !convert_needed && !user_tensor_pshape.is_dynamic()) {
         m_plugin_outputs[output_idx] = user_tensor_wrapper;
Contributor

This change allows setting memory on the network's output primitive, but it seems the last primitive can be optimized out at runtime in some cases like this:

GPU_Debug: primitive_inst.cpp:1802:prepare_primitive: Execute fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0 (type: fully_connected) 
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[0] : gather:Gather_22637
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[1] : constant:self.lm_head.weight
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[2] : constant:self.lm_head.weight/scale
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[3] : constant:self.lm_head.weight/zero_point
GPU_Debug: primitive_inst.cpp:1806:prepare_primitive: -----------------------------------------------------------------
GPU_Debug: primitive_inst.cpp:282:update_shape: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0: update shape is done by other: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_inst.cpp:1304:do_runtime_skip_reorder: [do runtime skip reorder] update shape for user result:Result_26547
GPU_Debug: primitive_inst.cpp:291:update_shape: result:Result_26547: update shape dep [0] : fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0 was: f32:bfyx:?x1x32000:nopad now: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_type_base.h:199:calc_output_layouts: result:Result_26547 input tensor: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_type_base.h:205:calc_output_layouts: result:Result_26547 output tensor: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_inst.cpp:468:update_shape: result:Result_26547: update shape: was: f32:bfyx:?x1x32000:nopad now: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_inst.cpp:1321:do_runtime_skip_reorder: [do runtime skip reorder] set user result:Result_26547 as can_be_optimized
GPU_Debug: primitive_inst.cpp:1219:update_impl: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0 impl update: was: nullptr now: jit:gemm:any
GPU_Debug: primitive_inst.cpp:2249:update_weights: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0: add original weights memory u8:bfyx:32000x4096:nopad to weights cache; cache_size=1/3
GPU_Debug: primitive_inst.cpp:929:realloc_if_needed: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0: realloc output memory. 
GPU_Debug: primitive_inst.cpp:930:realloc_if_needed:  outputs[0]  Current buffer_size=0 Requested buffer_size=32000
GPU_Debug: memory_pool.cpp:178:get_from_non_padded_pool: [fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0(1462): output]
GPU_Debug: memory.cpp:27:MemoryTracker: Allocate 128000 bytes of usm_host allocation type ptr = 0x7fda8c34aa00 (current=1726312; max=4318149292)
GPU_Debug: primitive_inst.cpp:1924:prepare_primitive: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0: execute jit:gemm:any (is_dynamic=0, can_be_optimized=0)
Dump program1_network1_0_fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0_dst0
GPU_Debug: primitive_inst.cpp:1801:prepare_primitive: -----------------------------------------------------------------
GPU_Debug: primitive_inst.cpp:1802:prepare_primitive: Execute result:Result_26547 (type: reorder) 
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[0] : fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0
GPU_Debug: primitive_inst.cpp:1806:prepare_primitive: -----------------------------------------------------------------
GPU_Debug: primitive_inst.cpp:282:update_shape: result:Result_26547: update shape is done by other: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_inst.cpp:1188:update_impl: result:Result_26547 Skip impl update: primitive is optimized out
GPU_Debug: primitive_inst.cpp:1924:prepare_primitive: result:Result_26547: execute reorder_data (is_dynamic=1, can_be_optimized=1)

Is this case handled somehow?

Contributor Author

Currently, when do_runtime_skip_reorder is executed, the runtime optimization is skipped if the output memory has already been allocated and the reorder is an output node. However, since that check is limited to the usm_device type for now, the Reorder is executed without the runtime optimization only if the RemoteTensor's shared buffer is of type usm_device:

if (alloc_type == allocation_type::usm_device && u->is_output()) {

Contributor

@andrew-k-park it is only effective for the usm_device case, but we need it for all memory types.

I think we may be able to handle it something like this (a rough sketch follows the list):

  1. add additional "remote tensor" info to the network (i.e., the output addresses)
  2. in the skip-reorder check, if the reorder's output address is the same as a remote tensor address, do not skip
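
A rough sketch of what that could look like (names such as `_output_remote_mems` and the exact check location are hypothetical, not the actual implementation):

```cpp
// (1) the network remembers which outputs were bound to remote-tensor memory
//     (hypothetical member, e.g. output index -> remote memory)
std::unordered_map<size_t, cldnn::memory::ptr> _output_remote_mems;

// (2) in the skip-reorder check: keep the reorder if its output buffer is
//     actually the user's remote memory, regardless of allocation type
bool writes_to_remote_tensor = false;
for (const auto& [idx, remote_mem] : _output_remote_mems) {
    if (u->output_memory_ptr() && u->output_memory_ptr()->buffer_ptr() == remote_mem->buffer_ptr()) {
        writes_to_remote_tensor = true;
        break;
    }
}
if (writes_to_remote_tensor && u->is_output()) {
    return;  // do not mark this output reorder as can_be_optimized
}
```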

Contributor Author

Updated to not apply the runtime skip to the output reorder if it is using shared buffer memory from the remote tensor, as you suggested.

Contributor

@yeonbok @andrew-k-park, regarding point 2), maybe we can try to assign remote memory directly to the currently processing instance? I think that executing this reorder (or other optimized node) may negate the benefits of using remote tensors at all

Contributor

Okay, your suggestion is to skip reorder but use remote tensor? I think that is good too :)

Contributor

@yeonbok yeonbok Feb 24, 2025

So

  1. add additional "remote tensor" info to the network (i.e., the output addresses)
  2. in the skip-reorder check, if the reorder's output address is the same as a remote tensor address, skip the reorder and set the output memory of the reorder's input node to the remote tensor
    @sshlyapn right?
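
In other words, something along these lines (a sketch only; `remote_mem`, `set_can_be_optimized`, and the dependency access are assumptions, not the actual code):

```cpp
// If this output reorder's target matches a user-provided remote tensor,
// skip the reorder and make its producer write straight into the remote buffer.
if (remote_mem != nullptr) {                      // remote memory registered for this network output
    auto dep = u->dependencies().front().first;   // the reorder's producer (e.g. the FC)
    dep->set_output_memory(remote_mem);           // producer output now aliases the remote buffer
    u->set_can_be_optimized(true);                // the reorder itself becomes a no-op
}
```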

Contributor

@yeonbok yeah, thus we will benefit from both reorder optimization & remote tensors

Contributor Author

@sshlyapn So how should it be handled if the reorder's dependency is an FC? A larger amount of memory than what the remote tensor allocated may be needed because of the fake-aligned memory layout.
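
For illustration, the concern is roughly the following (a hypothetical check, not existing code): the FC's fake-aligned output layout may require more bytes than the remote tensor allocation provides.

```cpp
// Hypothetical size check before redirecting the FC output to the remote buffer.
const auto required_bytes = fc_inst->get_output_layout().bytes_count();  // fake-aligned size
const auto provided_bytes = remote_mem->size();                          // remote tensor allocation
if (required_bytes > provided_bytes) {
    // The remote memory cannot be reused directly for the FC output;
    // fall back to the plugin-allocated buffer plus a copy.
}
```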

@andrew-k-park andrew-k-park force-pushed the improve_dyn_remote_tensor_perf branch 3 times, most recently from 7462772 to fde1563 Compare February 21, 2025 06:24
@@ -223,6 +225,8 @@ struct network {
     /* Common memory pointer for shape_info */
     memory::ptr _shape_info_ptr;

+    memory::ptr _output_remote_mem_ptr = nullptr;
Contributor

We can use an unordered_map here, as the network may have multiple outputs with assigned remote tensors.
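
For example, something like this inside struct network (a sketch; the key type is an assumption, the output index would work as well):

```cpp
// Remote (user-provided) output memory, one entry per network output.
std::unordered_map<primitive_id, memory::ptr> _output_remote_mem_ptrs;
```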

Contributor Author

Updated to handle multiple output tensors.

Comment on lines -387 to -390
-    if (is_dynamic) {
-        // In dynamic case we allocate more than required to check that out tensor is reshaped correctly
-        allocated_out_shape[1]++;
-    }
Contributor

Why was this removed?

Contributor Author

In the dynamic-shape scenario, when the remote tensor's memory is set as the primitive's output memory, an assertion is triggered by the following check logic:

check_memory_to_set(*mem_new, ol);

Therefore, this is removed because the shape originally assigned to the tensor should be used as-is and should not be changed manually.

@andrew-k-park andrew-k-park force-pushed the improve_dyn_remote_tensor_perf branch from fde1563 to 85e8fc3 Compare February 24, 2025 07:43
Signed-off-by: Andrew Park <andrew.park@intel.com>
Signed-off-by: Andrew Park <andrew.park@intel.com>
Signed-off-by: Andrew Park <andrew.park@intel.com>
@andrew-k-park andrew-k-park force-pushed the improve_dyn_remote_tensor_perf branch from 38ea559 to bb3f391 Compare February 24, 2025 10:41