-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/gni: Fix use of FI_REMOTE_CQ_DATA. #1240
Conversation
prov/gni/include/gnix_vc.h
Outdated
@@ -118,6 +118,8 @@ enum gnix_vc_conn_req_type { | |||
* @var flags Bitmap used to hold vc schedule state | |||
* @var peer_irq_mem_hndl peer GNI memhndl used for delivering | |||
* GNI_PostCqWrite requests to remote peer | |||
* @var peer_apid TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll fill this TODO later. this is for the xpmem protocol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is for the xpmem work, should it be part of this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just have one comment. Otherwise, looks ok
prov/gni/include/gnix_util.h
Outdated
* | ||
* @return zero if FI_REMOTE_CQ_DATA is not permitted; otherwise one. | ||
*/ | ||
#define _gnix_assert_fi_remote_cq_data(_flags) (((_flags) & FI_REMOTE_CQ_DATA) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This macro doesn't assert. Please use a name more appropriate name. Maybe GNIX_FLAGS_REMOTE_CQ_DATA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about _gnix_allow_fi_remote_cq_data
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's ok with me
Update copyright headers. |
prov/gni/include/gnix_util.h
Outdated
* | ||
* @return zero if FI_REMOTE_CQ_DATA is not permitted; otherwise one. | ||
*/ | ||
#define _gnix_allow_fi_remote_cq_data(_flags) (((_flags) & FI_REMOTE_CQ_DATA) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Macros should be fully capitalized unless there is a good reason for it otherwise. That's how we distinguish macro from function at a glance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both chuck and I are good with this name. Do you still want me to capatilize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chuckfossen Preferences on capitalization for macros?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jswaro @chuckfossen No need to discuss further, i'll just refactor this.
prov/gni/src/gnix_rma.c
Outdated
@@ -250,7 +250,7 @@ int __smsg_rma_data(void *data, void *msg) | |||
struct gnix_fid_ep *ep = vc->ep; | |||
gni_return_t status; | |||
|
|||
if (hdr->flags & FI_REMOTE_CQ_DATA && ep->recv_cq) { | |||
if (_GNIX_ALLOW_FI_REMOTE_CQ_DATA(hdr->flags) && ep->recv_cq) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The leading underscore is only really needed for function names. Macros don't need them. I know we're being picky but I don't think any of our macros have leading underscores. Let's remove it.
71a5822
to
8349a48
Compare
Fixes ofi-cray#1240. Total buffered receive is not well defined, and any attempt to clarify its meaning will result in forcing a specific implementation. The buffer size could be defined such that a single large message would consume the entire buffer, or the size could refer to the total amount of buffering provided for a set of small messages. (For example, if the provider implements a rendezvous protocol, then a single large message would not consume the entire buffer at the receive side). Redefine total_buffered_recv to be used as a hint to the provider as to the amount of unexpected messages that it may need to handle. It is up to the provider to interpret or use this value as needed. The FI_RM_ENABLED mode bit should be used by applications in place of relying on this field. Signed-off-by: Sean Hefty <sean.hefty@intel.com>
It looks like the reference from ztiffany was not intended - it appears to be from an old commit by Sean Hefty. |
I'll wait until #1207 is merged in to update this PR. |
@@ -53,4 +54,9 @@ void calculate_time_difference(struct timeval *start, struct timeval *end, | |||
int *secs_out, int *usec_out); | |||
int dump_cq_error(struct fid_cq *cq, void *context, uint64_t flags); | |||
|
|||
static inline struct gnix_fid_ep *get_gnix_ep(struct fid_ep *fid_ep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this just a define/macro?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leave it. Let's use language constructs over the preprocessor whenever possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conditionally approved based on the other reviewers feedback. There isn't anything here that I don't think we couldn't fix easily later if need be.
You need to rebase and address conflicts. |
Please squash before merge. |
a2c7842
to
21f3ec7
Compare
I'm not sure I understand, could you elaborate?
Yes, I want to wait for #1207 to be merged so I can rebase this PR and replace the new references to FI_REMOTE_CQ_DATA to GNIX_ALLOW_FI_REMOTE_CQ_DATA. |
Sorry, I phrased that poorly. I'm just referring to the additional checks that you made in rdm_rma_check_tcqe. You ensured that they shouldn't be executed when fi_more is set. So in that regard things look good to me. If all you need from #1207 is to rename the references to GNIX_ALLOW_FI_REMOTE_CQ_DATA - I can take care of that after your PR. I still need to add a Test case to #1207, and may not get around to it for a few days yet. |
No worries. Thanks for reviewing these changes.
Okay, I'll check with my reviewers and then merge if they are good with these changes. |
@chuckfossen, @sungeunchoi, @a-abraham: Do you have any more feedback before I merge these changes? |
@@ -297,16 +300,18 @@ int rdm_api_check_data(char *buf1, char *buf2, int len) | |||
|
|||
void rdm_api_check_cqe(struct fi_cq_tagged_entry *cqe, void *ctx, | |||
uint64_t flags, void *addr, size_t len, | |||
uint64_t data) | |||
uint64_t data, struct fid_ep *fid_ep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we maintaining this function if it's not used? In fact, all the rdm_api_check_* functions are not used and should be removed.
@chuckfossen Is there a reason to keep them around?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opening an issue.
prov/gni/test/rdm_atomic.c
Outdated
@@ -481,7 +484,7 @@ void do_min(int len) | |||
|
|||
cr_assert_eq(ret, 1, "fi_cq_read returned %d %d", ret, | |||
dump_cq_error(send_cq[0], target, 0)); | |||
rdm_atomic_check_tcqe(&cqe, target, FI_ATOMIC | FI_WRITE, 0); | |||
rdm_atomic_check_tcqe(&cqe, target, FI_ATOMIC | FI_WRITE, 0, ep[1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check makes no sense. You are checking the data field on the initiating (send) side. In order to test for the remote cq data, you need to request for remote CQ events on ep[1] and look at the events delivered there. This is true of all the checks you added in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Do we want to change these tests to report completions on the remote peer and read those completions or simply pass ep[0] into this test and others like it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't test this code path in many places, so I think at least some of them should change to get a remote event.
The RMA tests are set up for it (in fact, some of them check already), so I'd add it to those and skip the atomic tests for now. You can add an issue to add remote CQ event checking for the atomic tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In thinking about this a little more, I think we need to test both paths, with and without RMA events. In reviewing the tests, if there is an obvious way to test both, let's do that. It might just be a matter of factoring the set up function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that both paths should be tested. One option for testing both paths would be to add new test suites for each existing test suite that sets FI_RMA_EVENT. Then, in each of the existing do functions, of which there are 52, we would need to check for fi rma event and poll the recv CQ if it's set. This last part could be encapsulated in a function. Add new test suites would also not be hard. I will update these tests unless there are objections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, looking more closely in the GNI provider, completion events are not generated on the remote peer for atomic operations, only counters are used. I can't tell from the man page, is this intended? If so, we can simply reset rdm_atomic.c. Sorry for the confusion and thank you for catching this mistake.
prov/gni/test/rdm_dgram_rma.c
Outdated
@@ -776,7 +792,7 @@ void do_write(int len) | |||
return; | |||
} | |||
cr_assert_eq(ret, 1); | |||
rdm_rma_check_tcqe(&cqe, target, FI_RMA | FI_WRITE, 0); | |||
rdm_rma_check_tcqe(&cqe, target, FI_RMA | FI_WRITE, 0, ep[1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same as above. Need to check the remote CQ.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As are many of the rest of these in this file.
@@ -1216,7 +1223,7 @@ static void do_writedata(int len) | |||
|
|||
rdm_rma_check_tcqe(&dcqe, NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update RMA test to check for remote CQ data.
b73abd2
to
fc4cd69
Compare
the LANL CI is failing with:
Is this related to the Criterion update? |
bot:lanl:retest |
We need #1276 merged and to rebase on top of master before the LANL CI will pass. |
- FI_REMOTE_CQ_DATA is only allowed when FI_RMA_EVENT is set on an EPs caps. - Added a macro to check the above condition. - Updated source and unit tests accordingly. Signed-off-by: Evan Harvey <eharvey@lanl.gov>
I don't like the check cqe functions, but in the interest of time, let's do the following.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No code changes.
Signed-off-by: Evan Harvey <eharvey@lanl.gov>
Your feedback has been addressed.
- FI_REMOTE_CQ_DATA is only allowed when FI_RMA_EVENT is set on an EPs caps. - Added a macro to check the above condition. - Updated source and unit tests accordingly. upstream merge of ofi-cray#1240 Signed-off-by: Evan Harvey <eharvey@lanl.gov> (cherry picked from commit ofi-cray/libfabric-cray@90577bd) (cherry picked from commit ofi-cray/libfabric-cray@85d64b6)
Ensure that FI_REMOTE_CQ_DATA is only used when
FI_RMA_EVENT is also set.
Added macro to check the previous condition.
Updated rma source and unit tests to reflect these
changes.
Updated VC struct doxy comment.
Note: I'm not sure we should be ORing in the FI_RMA_EVENT
flag in the changes I made for gnix_ep.c and the rdm* unit
tests. Also, the macro names might need some work.
Fixes #1190 and #1240.
Signed-off-by: Evan Harvey eharvey@lanl.gov