uring enablement #8

Open
wants to merge 393 commits into
base: main

@ooststep ooststep commented Aug 9, 2024

No description provided.

tmh97 and others added 26 commits October 18, 2024 10:53
Signed-off-by: Thomas Huber <thomas.huber@cornelisnetworks.com>
Replace running of on-merge workflow with a nightly workflow instead.

Signed-off-by: Jack Morrison <jack.morrison@cornelisnetworks.com>
Do not use PR closed events as workflow triggers.
Allow triggering PR events when targeting any branch, not just main.
Change cron schedule to account for UTC.
Improve conditional execution of reusable workflows.

Signed-off-by: Jack Morrison <jack.morrison@cornelisnetworks.com>
Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
This commit adds support for link up/down events in OPX for WFR platforms.

Signed-off-by: Archana Venkatesha <archana.venkatesha@cornelisnetworks.com>
Signed-off-by: Mike Wilkins <michael.wilkins@cornelisnetworks.com>
Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Fix 16B PBC/payload lengths

Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Signed-off-by: Mike Wilkins <michael.wilkins@cornelisnetworks.com>
Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Signed-off-by: Mike Wilkins <michael.wilkins@cornelisnetworks.com>
Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Signed-off-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com>
… rendezvous

Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Add ability to independently tune the minimum threshold to use
expected receive (TID) when sending.

Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Shorten the name field of opx-ci.
Remove schedule-triggered Nightly job.

Signed-off-by: Jack Morrison <jack.morrison@cornelisnetworks.com>
Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Also store pad value not buffer data

Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Signed-off-by: Thomas Huber <thomas.huber@cornelisnetworks.com>
Signed-off-by: Elias Kozah <elias.elkozah@cornelisnetworks.com>
…resulting from ignored context creation error

Signed-off-by: Elias Kozah <Elias.Elkozah@cornelisnetworks.com>
…vous performance

Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
The OPX provider was explicitly setting FI_REMOTE_CQ_DATA on all receive operations; however, it should only set the flag to indicate that the data field contains the completion data provided by the peer as part of their transmit request.

Signed-off-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com>
Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Signed-off-by: Jack Morrison <jack.morrison@cornelisnetworks.com>
shijin-aws and others added 26 commits January 21, 2025 21:19
Currently, efa_base_ep's default rnr_retry is 3, which performs only
a few RNR retries at the firmware level. That default exists
because efa_rdm_ep supports libfabric-level RNR retry.
The efa-direct ep, however, does not support libfabric-level
RNR retry, so it should retry indefinitely
(rnr_retry = 7), which is also the default behavior of
an SRD QP.

Signed-off-by: Shi Jin <sjina@amazon.com>
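
The policy in this commit message can be pictured with a small sketch. It is not the provider's code: the enum, macro, and function names are hypothetical, and only the values 3 and 7 come from the message above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical endpoint kinds; the real provider types differ. */
enum sketch_ep_kind { SKETCH_EP_RDM, SKETCH_EP_DIRECT };

#define SKETCH_RNR_RETRY_FEW      3 /* a few firmware-level retries */
#define SKETCH_RNR_RETRY_INFINITE 7 /* retry indefinitely */

/* Pick the default RNR retry count for an endpoint kind. An endpoint
 * without a software retry path has to rely on indefinite retry. */
static uint8_t sketch_default_rnr_retry(enum sketch_ep_kind kind)
{
    assert(kind == SKETCH_EP_RDM || kind == SKETCH_EP_DIRECT);
    return kind == SKETCH_EP_DIRECT ? SKETCH_RNR_RETRY_INFINITE
                                    : SKETCH_RNR_RETRY_FEW;
}
```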
This commit removes the x86-64 architecture check from the static_assert
conditional compilation directive. The static_assert feature is not
architecture-dependent and should be checked on all platforms that
support it.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
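
A sketch of what a feature-only guard might look like, assuming a C11-style `_Static_assert`. The macro and struct names are hypothetical and are not taken from the patch; the point is only that the guard tests for compiler support and never for a CPU architecture.

```c
#include <assert.h>

/* Hypothetical macro name. The guard checks for static_assert support
 * only; there is no architecture test such as __x86_64__. */
#if defined(static_assert) || \
    (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L)
#define SKETCH_STATIC_ASSERT(cond, msg) _Static_assert(cond, msg)
#else
/* No support: expand to a harmless forward declaration. */
#define SKETCH_STATIC_ASSERT(cond, msg) struct sketch_static_assert_stub
#endif

/* Hypothetical fixed-size header used to exercise the macro. */
struct sketch_hdr {
    unsigned char bytes[16];
};

SKETCH_STATIC_ASSERT(sizeof(struct sketch_hdr) == 16,
                     "sketch_hdr must be 16 bytes");
```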
Store the completion flags and peer address in FI_CONTEXT2 and
retrieve later when writing cq.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
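
The idea of parking per-operation state in the context scratch space can be sketched as follows. This is an illustration under assumptions, not the provider's code: `sketch_context2` is a local stand-in assumed to mirror the shape of libfabric's `struct fi_context2` (eight provider-owned pointers), and the record layout and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in so the sketch builds without libfabric headers. */
struct sketch_context2 {
    void *internal[8];
};

/* Hypothetical record saved when the operation is posted. */
struct sketch_op_info {
    uint64_t completion_flags;
    uint64_t peer_addr;
};

/* The record must fit inside the scratch space. */
_Static_assert(sizeof(struct sketch_op_info) <= sizeof(struct sketch_context2),
               "op info does not fit in the context");

/* Store the flags and peer address at post time. */
static void sketch_op_save(struct sketch_context2 *ctx,
                           uint64_t flags, uint64_t addr)
{
    struct sketch_op_info info = { flags, addr };
    memcpy(ctx->internal, &info, sizeof(info));
}

/* Retrieve them later, for example when writing the CQ entry. */
static struct sketch_op_info sketch_op_load(const struct sketch_context2 *ctx)
{
    struct sketch_op_info info;
    memcpy(&info, ctx->internal, sizeof(info));
    return info;
}
```

Copying with `memcpy` keeps the sketch free of pointer-aliasing concerns; a real provider may simply cast the scratch space to its own struct.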
Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Other memory monitors, such as CUDA, ROCR, and ZE, have a .c file for
the implementation. This change cleans up the util_mem_monitor.c code by
giving the uffd and import monitors their own .c files, aligning them
with the other memory monitor implementations.

Signed-off-by: Mike Uttormark <mike.uttormark@hpe.com>
Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Some memory monitors, such as kdreg2, have a subscription context per MR
cache entry. These memory monitors require unsubscribe to be called for
each freed MR cache entry.

To support this, call unsubscribe when an entry is removed from the MR
cache RB tree.

If a memory monitor does not support a subscription context per MR,
unsubscribe must be implemented as a no-op. Update uffd and rocr memory
monitors accordingly.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
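
The contract described in this commit message can be sketched like this. It is a simplified model rather than the actual MR cache code: all struct and function names are hypothetical, and the RB tree is reduced to an `in_tree` flag.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_entry;

/* Hypothetical monitor interface with a per-entry unsubscribe hook. */
struct sketch_monitor {
    void (*unsubscribe)(struct sketch_monitor *mon,
                        struct sketch_entry *entry);
    int active_subscriptions;
};

struct sketch_entry {
    struct sketch_monitor *monitor;
    int in_tree; /* stands in for membership in the cache tree */
};

/* Monitor that keeps one subscription context per cache entry. */
static void sketch_per_entry_unsubscribe(struct sketch_monitor *mon,
                                         struct sketch_entry *entry)
{
    (void) entry;
    mon->active_subscriptions--;
}

/* Monitor without per-entry context: unsubscribe does nothing. */
static void sketch_noop_unsubscribe(struct sketch_monitor *mon,
                                    struct sketch_entry *entry)
{
    (void) mon;
    (void) entry;
}

/* Removing an entry from the cache always calls unsubscribe, once. */
static void sketch_cache_remove(struct sketch_entry *entry)
{
    assert(entry->monitor && entry->monitor->unsubscribe);
    if (!entry->in_tree)
        return;
    entry->in_tree = 0;
    entry->monitor->unsubscribe(entry->monitor, entry);
}
```

Because the cache calls the hook unconditionally, the no-op variant keeps the removal path identical for every monitor type.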
The ROCR deallocation CB calls rocr_unsubscribe with mm_lock held.
Because rocr_unsubscribe may call free, memhooks (when in use) can
intercept that free while the lock is still held, leading to deadlock.

To avoid this, freeing is deferred until locks are released.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
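
The general pattern of deferring frees until a lock is released can be sketched as below. This is a generic illustration with hypothetical names and a plain pthread mutex, not the ROCR monitor code: entries are detached from the shared list while the lock is held, and `free` is called only after unlocking.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct sketch_node {
    struct sketch_node *next;
};

static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;
static struct sketch_node *sketch_live; /* protected by sketch_lock */

/* Track a new node. Allocation also happens outside the lock. */
static int sketch_track(void)
{
    struct sketch_node *node = malloc(sizeof(*node));

    if (!node)
        return -1;
    pthread_mutex_lock(&sketch_lock);
    node->next = sketch_live;
    sketch_live = node;
    pthread_mutex_unlock(&sketch_lock);
    return 0;
}

/* Detach every node under the lock, then free with no lock held.
 * Returns the number of nodes freed. */
static int sketch_release_all(void)
{
    struct sketch_node *deferred, *next;
    int freed = 0;

    pthread_mutex_lock(&sketch_lock);
    deferred = sketch_live;
    sketch_live = NULL;
    pthread_mutex_unlock(&sketch_lock);

    for (; deferred; deferred = next) {
        next = deferred->next;
        free(deferred); /* safe: an intercepted free cannot re-take the lock we hold */
        freed++;
    }
    return freed;
}
```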
Subscribe, unsubscribe, and valid are callbacks that are currently set
up dynamically. Change them to be set statically.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
An MR cache utilizing kdreg2 will have incorrect MR cache count stats if
unsubscribe is not called.

Signed-off-by: Mike Uttormark <mike.uttormark@hpe.com>
Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Prior to this patch, whenever efa_hmem_info_check_p2p_support_cuda
elected to attempt dmabuf for p2p, it leaked the file descriptor
returned by cuMemGetHandleForAddressRange. As a result, the dmabuf
persisted for the lifetime of the process, even after dereg and after
the memory was released back to the device mempool.

All calls to cuda_get_dmabuf_fd need a corresponding close call.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
For some HMEM ifaces, ofi_hmem_get_dmabuf_fd() may result in a new FD
being allocated. Define ofi_hmem_put_dmabuf_fd() to close FD.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
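
The get/put pairing can be illustrated with ordinary file descriptors. This is a sketch under assumptions, not the libfabric implementation: the `sketch_*` functions are hypothetical stand-ins, and `dup()` is used only to model a "get" that hands the caller a newly allocated FD.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical "get": returns a NEW fd that the caller now owns. */
static int sketch_get_fd(int src_fd, int *out_fd)
{
    int fd = dup(src_fd);

    if (fd < 0)
        return -1;
    *out_fd = fd;
    return 0;
}

/* Hypothetical "put": releases an fd obtained from sketch_get_fd(). */
static int sketch_put_fd(int fd)
{
    return close(fd);
}

/* Every successful get is matched by exactly one put.
 * Returns 0 on success. */
static int sketch_use_region(int src_fd)
{
    int fd;

    if (sketch_get_fd(src_fd, &fd))
        return -1;
    /* ... the region would be registered or used via fd here ... */
    return sketch_put_fd(fd);
}
```

Without the put, each call would consume one descriptor, and a long-running process would eventually exhaust its FD limit.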
With ROCR, callers of ofi_hmem_get_dmabuf_fd() should call
ofi_hmem_put_dmabuf_fd() once the DMA buf region is no longer used.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
With CUDA, callers of ofi_hmem_get_dmabuf_fd() should call
ofi_hmem_put_dmabuf_fd() once the DMA buf region is no longer used.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Performing multiple HSA allocations appears to result in a DMA buf
offset. Verify that the CXI provider can register a DMA buf offset
memory region.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
When an MR is freed, the CXI provider should free the DMA buf FD used for
the ROCR region. Failing to do this will result in FDs being exhausted.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
When an MR is freed, the CXI provider should free the DMA buf FD used for
the CUDA region. Failing to do this will result in FDs being exhausted.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
This allows testing FI_CONTEXT2 in providers that require this
mode bit.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Use cuda_put_dmabuf_fd to close fd

Signed-off-by: Shi Jin <sjina@amazon.com>
Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
When multiple multi-recv buffers are posted, FI_MULTI_RECV was set
on error only if an mrecv entry had already been created,
meaning the buffer was already in use. If the buffer
has not been used yet and a cancellation for this buffer has been
processed, correctly set FI_MULTI_RECV when reporting the error,
indicating that the buffer is no longer in use.

Signed-off-by: Jerome Soumagne <jerome.soumagne@hpe.com>
This ensures that the libcurl dlopen path is correct.

If the user passes '--with-curl=<path>' to configure, then the dlopen of
libcurl should honor that selection and use the file path passed in.

Signed-off-by: John Biddiscombe <biddisco@cscs.ch>
Signed-off-by: John Biddiscombe <biddisco@cscs.ch>
Bumps [actions/stale](https://github.com/actions/stale) from 9.0.0 to 9.1.0.
- [Release notes](https://github.com/actions/stale/releases)
- [Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md)
- [Commits](actions/stale@28ca103...5bef64f)

---
updated-dependencies:
- dependency-name: actions/stale
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.28.1 to 3.28.5.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@b6a472f...f6091c0)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
We may receive uring events before we are fully connected, so
do not try to progress rx until the connection is established.

Signed-off-by: Stephen Oost <stephen.oost@intel.com>
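
A minimal sketch of gating receive progress on connection state. The types and function names are hypothetical and the event handling is reduced to counters. Draining the deferred events once the connection completes is an assumption of this sketch; the commit message only says that rx is not progressed before then.

```c
#include <assert.h>

enum sketch_conn_state { SKETCH_CONNECTING, SKETCH_CONNECTED };

struct sketch_ep {
    enum sketch_conn_state state;
    int pending_rx;    /* events seen but not yet progressed */
    int progressed_rx; /* events handed to the receive path */
};

/* Move all pending receive events to the receive path. */
static void sketch_drain_rx(struct sketch_ep *ep)
{
    assert(ep->state == SKETCH_CONNECTED);
    ep->progressed_rx += ep->pending_rx;
    ep->pending_rx = 0;
}

/* Called for each receive-side event from the ring. */
static void sketch_on_rx_event(struct sketch_ep *ep)
{
    ep->pending_rx++;
    if (ep->state != SKETCH_CONNECTED)
        return; /* too early: leave the event pending */
    sketch_drain_rx(ep);
}

/* Called once the connection handshake has completed. */
static void sketch_on_connected(struct sketch_ep *ep)
{
    ep->state = SKETCH_CONNECTED;
    sketch_drain_rx(ep);
}
```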
The previously used io_uring_prep_readv function does not
support flags. The flags were instead being passed as an offset,
which triggered an illegal seek error.

Signed-off-by: Stephen Oost <stephen.oost@intel.com>
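
The same class of mistake can be reproduced with plain POSIX calls. This is an analogue, not the provider's io_uring code: `pread()` stands in for a positional read whose last argument is an offset, `recv()` stands in for a receive call with a real flags argument, and the `sketch_*` names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Buggy shape: the last argument of a positional read is an OFFSET,
 * so a receive flag passed there is treated as a seek position.
 * Returns 0 on success or the errno value on failure. */
static int sketch_read_flags_as_offset(int sock, void *buf, size_t len)
{
    ssize_t n = pread(sock, buf, len, (off_t) MSG_PEEK);

    return n < 0 ? errno : 0;
}

/* Fixed shape: recv() takes the flags in a dedicated argument.
 * Returns the number of bytes received, or -1 on error. */
static ssize_t sketch_recv_with_flags(int sock, void *buf, size_t len)
{
    return recv(sock, buf, len, MSG_PEEK);
}
```

On a socket, the positional read is expected to fail with ESPIPE, which the C library reports as "Illegal seek", while the receive call with the same flag succeeds.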