[GPU] Update RemoteTensor support for dynamic model #29061
base: master
Conversation
Force-pushed from d33fa95 to 5ac9bad
Force-pushed from 9e743e9 to a1052ab
@@ -934,7 +934,8 @@ std::vector<cldnn::event::ptr> SyncInferRequest::prepare_output(size_t output_id
     auto device_tensor_et = convert_to_supported_device_type(element_type);
     bool convert_needed = is_convert_required(device_tensor_et, element_type);

-    if (is_remote_tensor_impl && !convert_needed && !is_dynamic) {
+    // Even if the network is dynamic, if user tensor's shape is static, remote tensor can be set as plugin's output tensor
+    if (is_remote_tensor_impl && !convert_needed && !user_tensor_pshape.is_dynamic()) {
         m_plugin_outputs[output_idx] = user_tensor_wrapper;
This change allows setting memory on the network's output primitive, but it seems the last primitive can be optimized out at runtime in some cases, like this:
GPU_Debug: primitive_inst.cpp:1802:prepare_primitive: Execute fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0 (type: fully_connected)
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[0] : gather:Gather_22637
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[1] : constant:self.lm_head.weight
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[2] : constant:self.lm_head.weight/scale
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[3] : constant:self.lm_head.weight/zero_point
GPU_Debug: primitive_inst.cpp:1806:prepare_primitive: -----------------------------------------------------------------
GPU_Debug: primitive_inst.cpp:282:update_shape: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0: update shape is done by other: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_inst.cpp:1304:do_runtime_skip_reorder: [do runtime skip reorder] update shape for user result:Result_26547
GPU_Debug: primitive_inst.cpp:291:update_shape: result:Result_26547: update shape dep [0] : fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0 was: f32:bfyx:?x1x32000:nopad now: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_type_base.h:199:calc_output_layouts: result:Result_26547 input tensor: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_type_base.h:205:calc_output_layouts: result:Result_26547 output tensor: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_inst.cpp:468:update_shape: result:Result_26547: update shape: was: f32:bfyx:?x1x32000:nopad now: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_inst.cpp:1321:do_runtime_skip_reorder: [do runtime skip reorder] set user result:Result_26547 as can_be_optimized
GPU_Debug: primitive_inst.cpp:1219:update_impl: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0 impl update: was: nullptr now: jit:gemm:any
GPU_Debug: primitive_inst.cpp:2249:update_weights: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0: add original weights memory u8:bfyx:32000x4096:nopad to weights cache; cache_size=1/3
GPU_Debug: primitive_inst.cpp:929:realloc_if_needed: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0: realloc output memory.
GPU_Debug: primitive_inst.cpp:930:realloc_if_needed: outputs[0] Current buffer_size=0 Requested buffer_size=32000
GPU_Debug: memory_pool.cpp:178:get_from_non_padded_pool: [fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0(1462): output]
GPU_Debug: memory.cpp:27:MemoryTracker: Allocate 128000 bytes of usm_host allocation type ptr = 0x7fda8c34aa00 (current=1726312; max=4318149292)
GPU_Debug: primitive_inst.cpp:1924:prepare_primitive: fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0: execute jit:gemm:any (is_dynamic=0, can_be_optimized=0)
Dump program1_network1_0_fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0_dst0
GPU_Debug: primitive_inst.cpp:1801:prepare_primitive: -----------------------------------------------------------------
GPU_Debug: primitive_inst.cpp:1802:prepare_primitive: Execute result:Result_26547 (type: reorder)
GPU_Debug: primitive_inst.cpp:1804:prepare_primitive: - inputs[0] : fullyconnectedcompressed:__module.lm_head/ov_ext::linear/MatMul.0
GPU_Debug: primitive_inst.cpp:1806:prepare_primitive: -----------------------------------------------------------------
GPU_Debug: primitive_inst.cpp:282:update_shape: result:Result_26547: update shape is done by other: f32:bfyx:1x1x32000:nopad
GPU_Debug: primitive_inst.cpp:1188:update_impl: result:Result_26547 Skip impl update: primitive is optimized out
GPU_Debug: primitive_inst.cpp:1924:prepare_primitive: result:Result_26547: execute reorder_data (is_dynamic=1, can_be_optimized=1)
Is this case handled somehow?
Currently, when executing do_runtime_skip_reorder, the runtime optimization is skipped if the output memory has already been allocated and the reorder is an output node. However, since this is limited to the usm_device type for now, the Reorder will be executed without the runtime optimization only if the shared buffer type of the RemoteTensor is usm_device.
if (alloc_type == allocation_type::usm_device && u->is_output()) {
@andrew-k-park it is only effective for the usm_device case, but we need it for all memory types.
I think we may be able to handle it with something like:
- add additional info about the "remote tensor" to the network (i.e., output addresses)
- in the skip-reorder check, if the reorder's output address is the same as a remote tensor address, do not skip
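The two steps above could be sketched as follows. This is a hypothetical, self-contained model: `RemoteOutputRegistry`, `add_remote_output`, and `should_skip_reorder` are made-up names for illustration, not the actual plugin API.

```cpp
#include <unordered_set>

// Hypothetical sketch: the network records the addresses of user-provided
// remote-tensor output buffers, and the runtime skip-reorder pass consults it.
struct RemoteOutputRegistry {
    std::unordered_set<const void*> remote_output_addrs;

    void add_remote_output(const void* addr) { remote_output_addrs.insert(addr); }

    // In the skip-reorder check: a reorder writing directly into a
    // remote-tensor buffer must not be skipped, regardless of allocation type.
    bool should_skip_reorder(const void* reorder_output_addr) const {
        return remote_output_addrs.count(reorder_output_addr) == 0;
    }
};
```

Unlike the current `alloc_type == allocation_type::usm_device` condition, an address-based check like this would cover every memory type the remote tensor can wrap.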
Updated to not apply the runtime skip for the output reorder if it is using shared buffer memory from a remote tensor, as you suggested.
@yeonbok @andrew-k-park, regarding point 2), maybe we can try to assign remote memory directly to the currently processing instance? I think that executing this reorder (or other optimized node) may negate the benefits of using remote tensors at all
Okay, your suggestion is to skip reorder but use remote tensor? I think that is good too :)
So:
- add additional info about the "remote tensor" to the network (i.e., output addresses)
- in the skip-reorder check, if the reorder's output address is the same as a remote tensor address, skip the reorder and set the output memory of the reorder's input node to the remote tensor

@sshlyapn right?
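The combined idea could be modeled roughly like this. It is only a sketch under assumptions: `NodeSketch`, `buffer_ptr`, and `redirect_producer_to_remote` are invented names, and `std::shared_ptr<int>` stands in for `cldnn::memory::ptr`.

```cpp
#include <memory>

using buffer_ptr = std::shared_ptr<int>; // stand-in for cldnn::memory::ptr

struct NodeSketch {
    buffer_ptr output_mem;
};

// If the Result reorder's output is a remote-tensor buffer, hand that buffer
// to the reorder's producer so the producer writes into it directly and the
// reorder can still be optimized out.
bool redirect_producer_to_remote(NodeSketch& producer,
                                 NodeSketch& reorder,
                                 const buffer_ptr& remote_mem) {
    if (reorder.output_mem != remote_mem)
        return false;                     // not writing into remote memory: nothing to redirect
    producer.output_mem = remote_mem;     // producer now fills the remote buffer
    return true;                          // reorder may be skipped without losing the user's data
}
```

This way the reorder stays optimized out while the user still reads results from the remote tensor.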
@yeonbok yeah, thus we will benefit from both reorder optimization & remote tensors
@sshlyapn So how should it be handled if the dependency of the reorder is an fc? A larger memory size than what was allocated from the remote tensor may be needed because of the fake-aligned memory layout.
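The sizing concern can be illustrated numerically. This is only an illustration: the alignment factor 8 is an assumption for the example, not the plugin's actual fake-alignment value.

```cpp
#include <cstddef>

// A dynamic fc output batch may be "fake aligned" upward, so the buffer the
// fc wants can be larger than the remote tensor's exact-size buffer.
constexpr std::size_t fake_align(std::size_t batch, std::size_t align = 8) {
    return (batch + align - 1) / align * align;
}

// Element count the fc would request for an output of shape [batch, feature].
constexpr std::size_t fc_required_elems(std::size_t batch, std::size_t feature) {
    return fake_align(batch) * feature;
}
```

For a user tensor of shape 1x32000, the remote tensor holds 32000 elements, while a batch fake-aligned to 8 would ask for 8x32000, so assigning the remote buffer directly to the fc would be too small.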
Force-pushed from 7462772 to fde1563
@@ -223,6 +225,8 @@ struct network {
     /* Common memory pointer for shape_info */
     memory::ptr _shape_info_ptr;

+    memory::ptr _output_remote_mem_ptr = nullptr;
We can use an unordered_map here, as the network may have multiple outputs with assigned remote tensors.
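The suggested change could look roughly like the following. Names here (`OutputRemoteMems`, `set`, `get`) are hypothetical, and `std::shared_ptr<int>` stands in for `cldnn::memory::ptr`.

```cpp
#include <cstddef>
#include <memory>
#include <unordered_map>

using mem_ptr = std::shared_ptr<int>; // stand-in for cldnn::memory::ptr

// Keep one remote output buffer per output index instead of a single pointer.
struct OutputRemoteMems {
    std::unordered_map<std::size_t, mem_ptr> mems;

    void set(std::size_t output_idx, mem_ptr mem) {
        mems[output_idx] = std::move(mem);
    }

    // Returns nullptr when no remote tensor was assigned to this output.
    mem_ptr get(std::size_t output_idx) const {
        auto it = mems.find(output_idx);
        return it == mems.end() ? nullptr : it->second;
    }
};
```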
Updated to handle multiple output tensors.
if (is_dynamic) {
    // In dynamic case we allocate more than required to check that out tensor is reshaped correctly
    allocated_out_shape[1]++;
}
Why was this removed?
In the dynamic shape scenario, setting the remote tensor's memory as the primitive's output memory triggers an assertion from the following check logic:
check_memory_to_set(*mem_new, ol);
Therefore, this is removed because the shape originally assigned to the tensor should be used as-is and should not be changed manually.
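The failure mode can be modeled minimally as follows. This is an assumption about the check's behavior, not the actual cldnn `check_memory_to_set` implementation: the memory being set is validated against the primitive's output layout, so an over-allocated user shape no longer matches.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical model: over-allocating the user tensor (e.g. the removed
// allocated_out_shape[1]++) makes the memory's element count diverge from
// the output layout's element count and trips a check like this one.
void check_memory_to_set_sketch(std::size_t mem_elems, std::size_t layout_elems) {
    if (mem_elems != layout_elems)
        throw std::invalid_argument("Attempt to set memory with an incompatible layout");
}
```

Under this model, keeping the tensor at its original shape keeps the two element counts equal, which is why the over-allocation was dropped.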
Force-pushed from fde1563 to 85e8fc3
Signed-off-by: Andrew Park <andrew.park@intel.com>
…ry from remote tensor
Force-pushed from 38ea559 to bb3f391
Details:
- … usm_host and copied back to the remote tensor's memory. So clEnqueueMapBuffer and explicit memcpy are host performance overhead for processing the output tensor
- … usm memory to cl_mem to reduce overhead for the in-order queue

Tickets: