Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hip] Added hip_device_group_device to the runtime. #18790

Merged
merged 33 commits into from
Dec 11, 2024

Conversation

AWoloszyn
Copy link
Contributor

@AWoloszyn AWoloszyn commented Oct 16, 2024

This gives us an interface for creating a logical device from a set of physical hip devices. In a future PR I plan on removing the normal hip_device ut for now, until the device_group_device is completed and hardened, I am keeping the original around. There are also some optimizations to do for when we have a single device in our device group.

This implementation currently passes CTS (as well as the new CTS tests added for device groups), but there is some work to complete.

  • Fix memory pooling (Will be a follow-up PR)
  • Make sure that collectives work as expected. (Follow-up PR)
  • Optimize our synchronization.
    • Currently synchronization across physical GPUs goes through the host, we should be able to avoid that, but it will take some additional work.
  • Rework the CTS tests a bit so that they are just normal CTS tests that get ignored if needed.
  • Move any cuda-specific bits out back into cuda.
  • fix iree_hal_hip_device_queue_flush which should no longer try and use the work queue.
  • audit all new functions and make sure static is used where necessary.

Copy link
Collaborator

@benvanik benvanik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A doozy! It'll take me a bit to go through all of this but I've sprinkled a few comments in to start.

How much of this code would change if you weren't trying to keep the old hip_device around? Are there any simplifications you could do? If so, do you think you'll remember them or should you tag them with TODO(#XXX) and track that in an issue? Given the complexity here I want to ensure we don't end up with both copies or shadows of the old copy living forever. It doesn't feel like much here would or should change if you have 1 device or N devices and if you're just not sure about your code yet it's ok to let this sit in a branch for a bit while you get it to a point of stability. It's better that then it getting context switched out of your head after it lands and then we end up with lingering design decisions that were made for short term staging.

The major thing I'm concerned about is the several extra vtables as they imply a level of decoupling that brings about a lot of complexity in the code. Updating any signature for any call now requires traversing several layers of indirection in several files (including shared utils and given that it's in utils/ across other backends) and reading the code becomes more difficult. Given that I'm hell-bent on deleting HIP it feels like additional baggage for something that is unlikely to be reused. I know there's a hope of sharing this with CUDA but since that's not currently in the plans and CUDA would be the only mid-term/long-term user of it (maybe) the added cost feels hard to swallow for the project as a whole. Avoiding the vtables and keeping things simple is going to add the least overhead to the project followed second by moving this out of utils/ and keeping it local to the hip target would be best. Shared utils dirs should be for durable things that we want to ossify and be heavily reused both in-tree and out-of-tree - we may need this now but we don't want this forever :)

A good way to reason about HAL code is that is should be optimized for deletion/rewrites/refactorings: what we have will be deleted and rewritten several more times, the API will change as new devices/device types/features are introduced, and it's almost always better to have some duplication than it is to have things tightly coupled across the deletion/rewrite boundaries. A bulk of what's happening here in particular is plumbing, and plumbing pays the highest cost of spaghettification and the lowest cost of duplication (as find/replace can solve the duplication but can't solve the spaghetti).

Happy to chat more about this - I think we can simplify things and keep the scope small to unblock the work requiring this without adding too much extra complexity to the rest of the system. Anything that adds complexity just to HIP is fine and it's just the stuff that bleeds out of hip/ that is my concern.

runtime/src/iree/hal/utils/deferred_command_buffer.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/utils/deferred_work_queue.c Outdated Show resolved Hide resolved
// iree_hal_hip_device_group_device_t
//===----------------------------------------------------------------------===//

typedef enum iree_hip_device_group_device_commandbuffer_type_e {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

command_buffer

runtime/src/iree/hal/drivers/hip/native_executable.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/per_device_information.h Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/event_semaphore.h Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/local_task/cts/CMakeLists.txt Outdated Show resolved Hide resolved
runtime/src/iree/hal/cts/device_group_copy_test.h Outdated Show resolved Hide resolved
runtime/src/iree/hal/cts/device_group_copy_test.h Outdated Show resolved Hide resolved
runtime/src/iree/hal/cts/CMakeLists.txt Outdated Show resolved Hide resolved
@ScottTodd ScottTodd added the hal/hip Runtime HIP HAL backend label Oct 17, 2024
@AWoloszyn AWoloszyn force-pushed the multidevice branch 3 times, most recently from 3fe45ae to f3019a6 Compare November 4, 2024 20:01
runtime/src/iree/hal/queue.h Outdated Show resolved Hide resolved
runtime/src/iree/hal/utils/stream_tracing.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/utils/stream_tracing.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/utils/stream_tracing.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/utils/stream_tracing.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_driver.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_driver.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_driver.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_driver.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_device.h Outdated Show resolved Hide resolved
runtime/src/iree/base/tree.h Outdated Show resolved Hide resolved
runtime/src/iree/base/tree.h Outdated Show resolved Hide resolved
runtime/src/iree/hal/cts/multi_queue_dispatch_test.h Outdated Show resolved Hide resolved
runtime/src/iree/base/tree.h Outdated Show resolved Hide resolved
runtime/src/iree/base/tree.h Outdated Show resolved Hide resolved
runtime/src/iree/base/queue.c Outdated Show resolved Hide resolved
runtime/src/iree/base/queue.c Outdated Show resolved Hide resolved
runtime/src/iree/base/queue.c Outdated Show resolved Hide resolved
runtime/src/iree/base/queue.c Outdated Show resolved Hide resolved
runtime/src/iree/base/queue_test.cc Outdated Show resolved Hide resolved
IREE_ASSERT_ARGUMENT(base_driver);
IREE_ASSERT_ARGUMENT(out_device);

uint64_t multi_count = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not iree_host_size_t


#include "iree/hal/drivers/hip/util/queue.h"

#include "iree/base/api.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to include something in a .c already included in the header

Suggested change
#include "iree/base/api.h"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we are (at least loosely) following the google style guide, but just want to make sure we are intentionally ignoring it here.

Anywhere else we are ignoring it that I should know about?

https://google.github.io/styleguide/cppguide.html#Include_What_You_Use

runtime/src/iree/hal/drivers/hip/util/queue.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/util/queue.h Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/util/queue.h Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_device.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_device.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_device.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_device.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_device.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_allocator.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_allocator.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_allocator.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/event_pool.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/event_pool.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/native_executable.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/native_executable.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_device.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_device.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_device.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_device.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_device.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_allocator.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/hip_allocator.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/event_semaphore.h Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/event_semaphore.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/event_pool.c Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/dispatch_thread.h Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/dispatch_thread.c Outdated Show resolved Hide resolved
runtime/src/iree/hal/drivers/hip/cleanup_thread.c Outdated Show resolved Hide resolved
Copy link
Collaborator

@benvanik benvanik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two minor fixes and then lgtm if you're happy with it and have confirmed the key use cases work - I'd maybe try shopping around to the folks working on sharded things (e.g. @aviator19941 as seen in #19428) and letting them sanity check the branch once you rebase (I don't think any of those things are covered in a CI today and if things don't work for it there may be a risk of rollback given the size of this PR - best to get ahead of it :)


iree_hal_hip_memory_pools_t memory_pools;

// Used in any place we need an event that is already signaled.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this comment for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh hah, removed the event that I had there, but forgot the comment that goes with it.

// iree_hal_command_buffer_t deferred record/replay wrapper
//===----------------------------------------------------------------------===//

// Records a command buffer that records into multiple command buffers
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Records a command buffer that records into multiple command buffers
// Creates a command buffer that records into multiple command buffers

Instead of rebasing each of the individual 30+ changes, rebase
the entire thing, because there were a number of conflicts
against main.

Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
It was submitting command buffers with an empty affinity.
Changed to IREE_HAL_QUEUE_AFFINITY_ANY instead.

Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
This allows us to allocate/deallocate async so long as we are using the
default hip allocator. Based on iree-org#19074

---------

Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
This prevents the long calls to iree_hal_deferred_command_buffer_apply
and hipGraphLaunch from blocking the main thread and makes using
using muliple devices/streams from a single thread more reasonable.

Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
They don't actually work very well as the underlying allocator
has an unbounded amount of slack in assigning allocations,
so depending on the allocation patterns you can end up using
significantly more memory than necessary.

Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
We returned the event after we added it to the semaphore,
but if another thread ended up waiting on the event before we recorded
we would incorrectly read "hipSuccess" thinking the event
was complete, and advancing the semaphore prematurely.

Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
These are disabled before this change, continue to leave them
disabled.

Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
@AWoloszyn AWoloszyn merged commit 0e71e72 into iree-org:main Dec 11, 2024
37 of 39 checks passed
@chrsmcgrr
Copy link
Contributor

@AWoloszyn @benvanik @ScottTodd

hey we noticed this PR carried a llvm-project submodule change. Was this intended? It changed quite a lot of the history.

@AWoloszyn
Copy link
Contributor Author

That was not intended thanks for bringing that up! I will revert the submodule change.

AWoloszyn added a commit to AWoloszyn/iree that referenced this pull request Dec 12, 2024
…18790

Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
@AWoloszyn
Copy link
Contributor Author

#19476 There is the revert

Groverkss pushed a commit that referenced this pull request Dec 12, 2024
…19476)

Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hal/hip Runtime HIP HAL backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants