Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/gni: Fix use of FI_REMOTE_CQ_DATA. #1240

Merged
merged 2 commits into from
Mar 17, 2017
Merged

Conversation

e-harvey
Copy link

@e-harvey e-harvey commented Mar 1, 2017

  • Ensure that FI_REMOTE_CQ_DATA is only used when
    FI_RMA_EVENT is also set.

  • Added macro to check the previous condition.

  • Updated rma source and unit tests to reflect these
    changes.

  • Updated VC struct doxy comment.

Note: I'm not sure we should be ORing in the FI_RMA_EVENT
flag in the changes I made for gnix_ep.c and the rdm* unit
tests. Also, the macro names might need some work.

Fixes #1190 and #1240.

Signed-off-by: Evan Harvey eharvey@lanl.gov

@e-harvey e-harvey requested a review from sungeunchoi March 1, 2017 16:47
@hppritcha hppritcha added this to the 1.5.0 milestone Mar 1, 2017
@@ -118,6 +118,8 @@ enum gnix_vc_conn_req_type {
* @var flags Bitmap used to hold vc schedule state
* @var peer_irq_mem_hndl peer GNI memhndl used for delivering
* GNI_PostCqWrite requests to remote peer
* @var peer_apid TODO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll fill this TODO later. this is for the xpmem protocol

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is for the xpmem work, should it be part of this PR?

Copy link

@chuckfossen chuckfossen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have one comment. Otherwise, looks ok

*
* @return zero if FI_REMOTE_CQ_DATA is not permitted; otherwise one.
*/
#define _gnix_assert_fi_remote_cq_data(_flags) (((_flags) & FI_REMOTE_CQ_DATA) \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This macro doesn't assert. Please use a name more appropriate name. Maybe GNIX_FLAGS_REMOTE_CQ_DATA.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about _gnix_allow_fi_remote_cq_data?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's ok with me

@e-harvey
Copy link
Author

e-harvey commented Mar 2, 2017

Update copyright headers.

*
* @return zero if FI_REMOTE_CQ_DATA is not permitted; otherwise one.
*/
#define _gnix_allow_fi_remote_cq_data(_flags) (((_flags) & FI_REMOTE_CQ_DATA) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Macros should be fully capitalized unless there is a good reason for it otherwise. That's how we distinguish macro from function at a glance.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both chuck and I are good with this name. Do you still want me to capatilize?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chuckfossen Preferences on capitalization for macros?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jswaro @chuckfossen No need to discuss further, i'll just refactor this.

@@ -250,7 +250,7 @@ int __smsg_rma_data(void *data, void *msg)
struct gnix_fid_ep *ep = vc->ep;
gni_return_t status;

if (hdr->flags & FI_REMOTE_CQ_DATA && ep->recv_cq) {
if (_GNIX_ALLOW_FI_REMOTE_CQ_DATA(hdr->flags) && ep->recv_cq) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The leading underscore is only really needed for function names. Macros don't need them. I know we're being picky but I don't think any of our macros have leading underscores. Let's remove it.

@e-harvey e-harvey force-pushed the issue1190 branch 2 times, most recently from 71a5822 to 8349a48 Compare March 6, 2017 16:41
ztiffany pushed a commit to ztiffany/libfabric-cray that referenced this pull request Mar 6, 2017
Fixes ofi-cray#1240.

Total buffered receive is not well defined, and any attempt to
clarify its meaning will result in forcing a specific implementation.
The buffer size could be defined such that a single large message
would consume the entire buffer, or the size could refer to the
total amount of buffering provided for a set of small messages.
(For example, if the provider implements a rendezvous protocol, then
a single large message would not consume the entire buffer at the
receive side).

Redefine total_buffered_recv to be used as a hint to the provider
as to the amount of unexpected messages that it may need to handle.
It is up to the provider to interpret or use this value as needed.
The FI_RM_ENABLED mode bit should be used by applications in place
of relying on this field.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
@e-harvey
Copy link
Author

e-harvey commented Mar 8, 2017

It looks like the reference from ztiffany was not intended - it appears to be from an old commit by Sean Hefty.

@e-harvey e-harvey removed the ongoing label Mar 8, 2017
@e-harvey
Copy link
Author

e-harvey commented Mar 8, 2017

I'll wait until #1207 is merged in to update this PR.

@sungeunchoi sungeunchoi requested a review from a-abraham March 8, 2017 18:34
@@ -53,4 +54,9 @@ void calculate_time_difference(struct timeval *start, struct timeval *end,
int *secs_out, int *usec_out);
int dump_cq_error(struct fid_cq *cq, void *context, uint64_t flags);

static inline struct gnix_fid_ep *get_gnix_ep(struct fid_ep *fid_ep)
Copy link
Member

@jswaro jswaro Mar 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this just a define/macro?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave it. Let's use language constructs over the preprocessor whenever possible.

Copy link
Member

@jswaro jswaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditionally approved based on the other reviewers feedback. There isn't anything here that I don't think we couldn't fix easily later if need be.

@jswaro
Copy link
Member

jswaro commented Mar 9, 2017

You need to rebase and address conflicts.

@jswaro
Copy link
Member

jswaro commented Mar 9, 2017

Please squash before merge.

@e-harvey e-harvey force-pushed the issue1190 branch 2 times, most recently from a2c7842 to 21f3ec7 Compare March 10, 2017 02:10
@a-abraham
Copy link

@e-harvey - It looks like you've disabled the functionality you added when the fi_more flag is present. In that case, do you still need to wait for #1207 ?

@e-harvey
Copy link
Author

@e-harvey - It looks like you've disabled the functionality you added when the fi_more flag is present.

I'm not sure I understand, could you elaborate?

In that case, do you still need to wait for #1207 ?

Yes, I want to wait for #1207 to be merged so I can rebase this PR and replace the new references to FI_REMOTE_CQ_DATA to GNIX_ALLOW_FI_REMOTE_CQ_DATA.

@a-abraham
Copy link

Sorry, I phrased that poorly. I'm just referring to the additional checks that you made in rdm_rma_check_tcqe. You ensured that they shouldn't be executed when fi_more is set. So in that regard things look good to me.

If all you need from #1207 is to rename the references to GNIX_ALLOW_FI_REMOTE_CQ_DATA - I can take care of that after your PR. I still need to add a Test case to #1207, and may not get around to it for a few days yet.

@e-harvey
Copy link
Author

Sorry, I phrased that poorly. I'm just referring to the additional checks that you made in rdm_rma_check_tcqe. You ensured that they shouldn't be executed when fi_more is set. So in that regard things look good to me.

No worries. Thanks for reviewing these changes.

If all you need from #1207 is to rename the references to GNIX_ALLOW_FI_REMOTE_CQ_DATA - I can take care of that after your PR. I still need to add a Test case to #1207, and may not get around to it for a few days yet.

Okay, I'll check with my reviewers and then merge if they are good with these changes.

@e-harvey
Copy link
Author

@chuckfossen, @sungeunchoi, @a-abraham: Do you have any more feedback before I merge these changes?

@@ -297,16 +300,18 @@ int rdm_api_check_data(char *buf1, char *buf2, int len)

void rdm_api_check_cqe(struct fi_cq_tagged_entry *cqe, void *ctx,
uint64_t flags, void *addr, size_t len,
uint64_t data)
uint64_t data, struct fid_ep *fid_ep)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we maintaining this function if it's not used? In fact, all the rdm_api_check_* functions are not used and should be removed.

@chuckfossen Is there a reason to keep them around?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can be removed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opening an issue.

@@ -481,7 +484,7 @@ void do_min(int len)

cr_assert_eq(ret, 1, "fi_cq_read returned %d %d", ret,
dump_cq_error(send_cq[0], target, 0));
rdm_atomic_check_tcqe(&cqe, target, FI_ATOMIC | FI_WRITE, 0);
rdm_atomic_check_tcqe(&cqe, target, FI_ATOMIC | FI_WRITE, 0, ep[1]);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check makes no sense. You are checking the data field on the initiating (send) side. In order to test for the remote cq data, you need to request for remote CQ events on ep[1] and look at the events delivered there. This is true of all the checks you added in this file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Do we want to change these tests to report completions on the remote peer and read those completions or simply pass ep[0] into this test and others like it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't test this code path in many places, so I think at least some of them should change to get a remote event.

The RMA tests are set up for it (in fact, some of them check already), so I'd add it to those and skip the atomic tests for now. You can add an issue to add remote CQ event checking for the atomic tests.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In thinking about this a little more, I think we need to test both paths, with and without RMA events. In reviewing the tests, if there is an obvious way to test both, let's do that. It might just be a matter of factoring the set up function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that both paths should be tested. One option for testing both paths would be to add new test suites for each existing test suite that sets FI_RMA_EVENT. Then, in each of the existing do functions, of which there are 52, we would need to check for fi rma event and poll the recv CQ if it's set. This last part could be encapsulated in a function. Add new test suites would also not be hard. I will update these tests unless there are objections.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, looking more closely in the GNI provider, completion events are not generated on the remote peer for atomic operations, only counters are used. I can't tell from the man page, is this intended? If so, we can simply reset rdm_atomic.c. Sorry for the confusion and thank you for catching this mistake.

@@ -776,7 +792,7 @@ void do_write(int len)
return;
}
cr_assert_eq(ret, 1);
rdm_rma_check_tcqe(&cqe, target, FI_RMA | FI_WRITE, 0);
rdm_rma_check_tcqe(&cqe, target, FI_RMA | FI_WRITE, 0, ep[1]);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as above. Need to check the remote CQ.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As are many of the rest of these in this file.

@@ -1216,7 +1223,7 @@ static void do_writedata(int len)

rdm_rma_check_tcqe(&dcqe, NULL,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is okay.

Copy link

@sungeunchoi sungeunchoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update RMA test to check for remote CQ data.

@e-harvey e-harvey force-pushed the issue1190 branch 2 times, most recently from b73abd2 to fc4cd69 Compare March 15, 2017 16:08
@e-harvey
Copy link
Author

the LANL CI is failing with:

cc1: all warnings being treated as errors
make: *** [prov/gni/test/prov_gni_test_gnitest-av.o] Error 1
make: *** Waiting for unfinished jobs....
cc1: all warnings being treated as errors
make: *** [prov/gni/test/prov_gni_test_gnitest-bitmap.o] Error 1

Is this related to the Criterion update?

@e-harvey
Copy link
Author

bot:lanl:retest

@e-harvey
Copy link
Author

We need #1276 merged and to rebase on top of master before the LANL CI will pass.

 - FI_REMOTE_CQ_DATA is only allowed when FI_RMA_EVENT is
 set on an EPs caps.

 - Added a macro to check the above condition.

 - Updated source and unit tests accordingly.

Signed-off-by: Evan Harvey <eharvey@lanl.gov>
@sungeunchoi
Copy link

I don't like the check cqe functions, but in the interest of time, let's do the following.

  • Keep the code as is but add a short comment in all the check cqe functions that were modified to take eps. The comment should say that we should change this to only check for remote CQ data if the flags passed in include FI_RMA_EVENT. That is, get rid of GNIX_ALLOW_FI_REMOTE_CQ_DATA, first look at the flags, then check the ep.

  • Add an issue for the above described change, which includes passing in NULL for the cases when FI_RMA_EVENT is not set.

Copy link

@sungeunchoi sungeunchoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No code changes.

Signed-off-by: Evan Harvey <eharvey@lanl.gov>
@e-harvey e-harvey merged commit 697cc05 into ofi-cray:master Mar 17, 2017
hppritcha pushed a commit to hppritcha/libfabric that referenced this pull request Mar 22, 2017
 - FI_REMOTE_CQ_DATA is only allowed when FI_RMA_EVENT is
 set on an EPs caps.

 - Added a macro to check the above condition.

 - Updated source and unit tests accordingly.

upstream merge of ofi-cray#1240

Signed-off-by: Evan Harvey <eharvey@lanl.gov>
(cherry picked from commit ofi-cray/libfabric-cray@90577bd)
(cherry picked from commit ofi-cray/libfabric-cray@85d64b6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants