Fix hanging in transport layer #121

Merged: 67 commits merged from hanging into master, Jul 5, 2021
Conversation


@Shillaker Shillaker commented Jun 18, 2021

This PR attempts to fix the issues causing the application to hang indefinitely. The hangs were apparently caused by mistakes in the lifecycle management of ZeroMQ sockets in our transport layer, and by fragile timeouts in our tests.

AFAICT we must ensure that:

  • The ZeroMQ context must outlive all sockets (i.e. sockets must be constructed after the context has been opened and destructed before the context is closed). Note that it is insufficient just to close sockets before the context is closed; their destructors must have been called. If this isn't the case, the context shutdown will hang. A minimal sketch of this ordering follows this list.
  • Any two local sockets performing communication must not be destructed before the other socket has completed its side of the communication (e.g. we cannot destruct a push socket before the pull socket has finished its recv call).
  • We do not rely on the destructors of any global variables, as this violates the first point (a global destructor will be called after we've closed the context). If we do have any sockets (or pointers to them) existing in a global scope, we must force their destructors to be called when we're finished with them.
  • We avoid sleeps to reason about threaded operations wherever possible, instead using latches or other synchronisation primitives. If we absolutely must have a sleep, we need to wrap it in a retry to tolerate the underlying operations being several times slower in different environments (i.e. slow CI servers).
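As a minimal, hedged illustration of the first two rules (this is plain cppzmq, not Faabric code; the inproc://example endpoint and the scoping are purely for demonstration, and the socket-option setters assume a reasonably recent cppzmq):

// Minimal sketch, not Faabric code: the context is opened first, both
// sockets are constructed and destructed inside its lifetime, and only
// then is the context closed.
#include <zmq.hpp>

int main()
{
    zmq::context_t context(1);

    {
        // Scope the sockets so their destructors run before context.close()
        zmq::socket_t pull(context, zmq::socket_type::pull);
        pull.set(zmq::sockopt::linger, 0);
        pull.bind("inproc://example");

        zmq::socket_t push(context, zmq::socket_type::push);
        push.set(zmq::sockopt::linger, 0);
        push.connect("inproc://example");

        push.send(zmq::buffer("hello", 5), zmq::send_flags::none);

        zmq::message_t msg;
        auto res = pull.recv(msg, zmq::recv_flags::none);
        (void)res;
    }

    // Safe: no sockets are alive at this point
    context.close();

    return 0;
}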

Previously we were creating multiple contexts, which made it difficult to reason about these rules, so this PR switches to using only one. We were also using ephemeral sockets which would occasionally break the second rule, i.e. be destructed before their communication was complete. Finally, we had the ZeroMQ contexts living in a global scope, as well as sockets related to MPI living in a thread-local global scope; these could not be removed, but we now make sure they are cleaned up properly.

The changes in this PR are:

  • Switch to using cppzmq RAII semantics for managing socket and context shutdown. Previously we were manually closing down sockets, but the library will handle this automatically.
  • Use REQ/REP sockets where we need synchronous client/server behaviour, and PULL/PUSH where we just need to send and forget (a minimal sketch of the synchronous pattern follows this list). Previously we were only using PULL/PUSH and rolling our own synchronous response mechanism, which necessitated an ephemeral PULL socket on the client. This caused port conflicts when clients were running concurrently.
  • Change MessageEndpointServer and MessageEndpointClient to have two sockets each, one for async, one for sync.
  • Go from two types of endpoint (SendMessageEndpoint/RecvMessageEndpoint) to four (SyncSendMessageEndpoint, AsyncSendMessageEndpoint, SyncRecvMessageEndpoint, AsyncRecvMessageEndpoint).
  • Subclasses of MessageEndpointServer now have to implement doAsyncRecv and doSyncRecv accordingly. doSyncRecv returns the response message. Subclasses of MessageEndpointClient can call syncSend and asyncSend, with syncSend taking a pointer to a response which gets populated.
  • Use a single, global message context only ever opened and closed once and used by all threads. Previously we were creating and destroying contexts regularly and using this as a signal for when things were shut down.
  • Use an explicit shutdown message to close a server. This makes it easier to trace where and when servers get closed.
  • Remove MessageContext wrapper object and just use the basic zmq::context_t. Without the need to use contexts for closing things down, we just need a shared pointer to the global context.
  • Add a linger to every socket; this avoids one source of infinite hangs when sockets are still in a valid state with outstanding messages.
  • Remove the MessageEndpoint::open method and merge the opening of the socket into the constructor (it was always called straight afterwards anyway).
  • Remove the need for "persistent" Messages by copying the message body in the snapshot server. This simplifies the Message class and means we can use a stdlib vector rather than a raw pointer, which removes complexity around message lifecycle. We can introduce a zero-copy recv approach for the snapshot server in a later PR.
  • Standardise error checking around zmq operations in the MessageEndpoint subclasses.
  • Remove the DummyStateServer used in tests, replacing it with a simpler test fixture in a single test file. This removes another moving part in the tests (and makes debugging the state tests much easier).
  • Remove sleeps where possible, replacing with a latch. Wrap any remaining sleeps in at least one retry, including the new REQUIRE_RETRY macro used in tests.
  • Change the existing Barrier class to a simpler Latch shared between threads via a shared_ptr.
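The synchronous pattern mentioned above, sketched with plain cppzmq (this is illustrative only, not the MessageEndpointClient/MessageEndpointServer API; the TCP port is arbitrary):

// Minimal REQ/REP sketch: the reply travels back over the same socket
// pair, so the client never needs an ephemeral response socket.
#include <zmq.hpp>

#include <thread>

int main()
{
    zmq::context_t context(1);

    std::thread server([&context] {
        zmq::socket_t rep(context, zmq::socket_type::rep);
        rep.set(zmq::sockopt::linger, 0);
        rep.bind("tcp://127.0.0.1:5555");

        zmq::message_t request;
        auto reqRes = rep.recv(request, zmq::recv_flags::none);
        (void)reqRes;

        rep.send(zmq::buffer("pong", 4), zmq::send_flags::none);
    });

    zmq::socket_t req(context, zmq::socket_type::req);
    req.set(zmq::sockopt::linger, 0);
    req.connect("tcp://127.0.0.1:5555");

    req.send(zmq::buffer("ping", 4), zmq::send_flags::none);

    zmq::message_t response;
    auto repRes = req.recv(response, zmq::recv_flags::none);
    (void)repRes;

    server.join();

    return 0;
}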

MPI-specific:

  • Move sending and receiving of the rank-to-host mapping into the MpiWorld class and avoid ephemeral sockets.
  • Split the broadcasting of rank-to-host messages into a separate function. Previously this was rolled into the world creation, and would still send messages in tests even when mock mode was on. The amount of messaging done implicitly in the world creation was getting difficult to manage, so splitting this out also makes it clearer.
  • Refactor the MPI "remote" server tests, which were hanging regularly and using slightly confusing terminology around "remote" vs. "local", when in fact both worlds are running locally (one is just in a different thread).

We now manage the zmq context in a global shared_ptr, which works fine until it comes to destruction. cppzmq does not seem to like being destructed so late in the application lifecycle and prints an error message along the lines of:

Invalid argument ... (.../mutex.hpp:123)

The solution is to explicitly close the context from the main method. This means we'll have to add the opening and closing of the zmq context to any main method using Faabric for now; a sketch of what this looks like is below. I've added a check in getGlobalMessageContext which errors if the context hasn't been initialised, which should prompt us to add it where necessary.
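Roughly, that bookkeeping might look like the following; the initGlobalMessageContext/closeGlobalMessageContext names and the header path are assumptions for illustration rather than confirmed Faabric API (the PR text only mentions getGlobalMessageContext):

// Hypothetical sketch: the global ZeroMQ context is opened once at the very
// start of main and closed once at the very end, after all sockets are gone.
// Function and header names here are assumed, not taken from the codebase.
#include <faabric/transport/context.h>

int main()
{
    faabric::transport::initGlobalMessageContext();

    // ... start servers, run the application, join all worker threads ...
    // Every socket must have been destructed before the next line runs.

    faabric::transport::closeGlobalMessageContext();

    return 0;
}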

@Shillaker Shillaker self-assigned this Jun 18, 2021
@Shillaker Shillaker changed the title from "Fixes for transport shutdown hanging" to "Simplifying transport layer shutdown" Jun 18, 2021
@@ -67,6 +67,8 @@ class MessageEndpointServer
private:
const int port;

std::unique_ptr<RecvMessageEndpoint> endpoint = nullptr;
Collaborator Author:
This fell out of the refactor away from using the message context to shut down. Now the endpoint server can manage its own endpoint internally.

} catch (zmq::error_t& e) {
SPDLOG_ERROR("Error sending message: {}", e.what());
throw;
}
Collaborator Author:
This if/else was essentially doing the same send operation in both branches, just with different send_flags, so I merged it into one.
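A hedged sketch of the merged shape (names and the surrounding checks are illustrative, not the PR's exact code):

// Illustrative only: a single send path where just the flag varies.
#include <zmq.hpp>

#include <spdlog/spdlog.h>

#include <cstddef>
#include <cstdint>

void doSend(zmq::socket_t& socket, const uint8_t* data, size_t dataSize, bool more)
{
    zmq::send_flags flags =
      more ? zmq::send_flags::sndmore : zmq::send_flags::none;

    try {
        auto res = socket.send(zmq::buffer(data, dataSize), flags);
        if (res != dataSize) {
            SPDLOG_ERROR("Sent different number of bytes than expected");
        }
    } catch (zmq::error_t& e) {
        SPDLOG_ERROR("Error sending message: {}", e.what());
        throw;
    }
}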

// Print default message and rethrow
SPDLOG_ERROR("Error receiving message: {} ({})", e.num(), e.what());
throw;
}
Collaborator Author:
To make things simpler to parse I split recv up into two methods based on whether a size is provided.
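A rough sketch of the two shapes (simplified stand-ins, not the actual MessageEndpoint methods): one receives into a pre-sized buffer when the caller knows the message size, the other lets ZeroMQ allocate when it doesn't:

// Illustrative sketch only; the real methods wrap the result in the
// transport Message class and do more thorough error checking.
#include <zmq.hpp>

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Known size: receive straight into a pre-allocated buffer
std::vector<uint8_t> recvBuffer(zmq::socket_t& socket, size_t size)
{
    std::vector<uint8_t> data(size);

    auto res = socket.recv(zmq::buffer(data.data(), data.size()),
                           zmq::recv_flags::none);
    if (!res.has_value()) {
        throw std::runtime_error("Receive with buffer failed");
    }

    return data;
}

// Unknown size: let ZeroMQ size the message for us
zmq::message_t recvNoBuffer(zmq::socket_t& socket)
{
    zmq::message_t msg;

    auto res = socket.recv(msg, zmq::recv_flags::none);
    if (!res.has_value()) {
        throw std::runtime_error("Receive without buffer failed");
    }

    return msg;
}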

@@ -559,7 +559,7 @@ uint32_t StateKeyValue::waitOnRedisRemoteLock(const std::string& redisKey)
break;
}

std::this_thread::sleep_for(std::chrono::milliseconds(1));
SLEEP_MS(1);
Collaborator Author:
I was dealing with so many sleeps and writing usleep(123 * 1000) that I made a macro.
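Something along these lines (the exact definition in the PR may differ):

#include <unistd.h>

// Plausible definition: hide the repetitive milliseconds-to-microseconds
// conversion behind one readable name
#define SLEEP_MS(ms) usleep((ms) * 1000)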

@Shillaker Shillaker changed the title from "Simplifying transport layer" to "Fix hanging in transport layer" Jun 28, 2021
@Shillaker Shillaker marked this pull request as draft June 28, 2021 17:59

virtual void stop();
void awaitAsyncLatch();
@Shillaker Shillaker (Collaborator Author), Jun 29, 2021:
These setAsyncLatch and awaitAsyncLatch functions are to support testing. Tests frequently check the side effects of sending an async message by sending it, then performing some arbitrary sleep before checking that it's done. This allows us to avoid the sleep and the inherent flakiness of that approach.
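The latch itself (also the replacement for the old Barrier class mentioned above) can be very simple; a minimal sketch, not the Faabric implementation:

// Minimal sketch of a latch shared via shared_ptr: every party calls
// wait(), and all are released once the expected number have arrived.
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <stdexcept>

class Latch
{
  public:
    explicit Latch(size_t countIn)
      : count(countIn)
    {}

    void wait()
    {
        std::unique_lock<std::mutex> lock(mx);

        waiters++;
        if (waiters > count) {
            throw std::runtime_error("Latch already used");
        }

        if (waiters == count) {
            cv.notify_all();
        } else {
            cv.wait(lock, [this] { return waiters >= count; });
        }
    }

  private:
    const size_t count;
    size_t waiters = 0;

    std::mutex mx;
    std::condition_variable cv;
};

// e.g. a test and a server handler could share: auto latch = std::make_shared<Latch>(2);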

@Shillaker Shillaker marked this pull request as ready for review June 29, 2021 13:50
@Shillaker Shillaker requested a review from csegarragonz July 1, 2021 10:45
@csegarragonz csegarragonz (Collaborator) left a comment:
What a beast of a PR, and what a grueling effort as well! Fingers crossed we can now stop re-running the tests until they pass 🤣

Seeing this finished, the changes are actually quite easy to follow.

I've left very minor comments. I guess my biggest concern is with managing a lot of google::protobuf::Message objects; it's not very clear to me why we do that. Other than that, nicely done, LGTM 👍


void recvPush(faabric::transport::Message& body);
std::unique_ptr<google::protobuf::Message> recvPull(
Collaborator:
It feels strange that sometimes we use our own Message wrapper, and other times protobuf's.

Collaborator Author:
Yes, that's a good point, I'll revisit this and see if I can standardise it a bit. I'd like the rule to be: our Message wrapper inside the transport layer, and protobuf/flatbuffers everywhere else.

faabric::transport::Message& header,
faabric::transport::Message& body) = 0;

void sendSyncResponse(google::protobuf::Message* resp);
Collaborator:
Why do we recv a faabric::transport::Message yet send a google::protobuf::Message?

Collaborator Author:
Good point; as said above, I'll look at this and standardise the interface.

include/faabric/util/latch.h (resolved review comments)
uint8_t* buffer = mb.GetBufferPointer(); \
int size = mb.GetSize(); \
send(buffer, size);
#define SEND_FB_MSG(T, mb) \
Collaborator:
Couldn't this live with the other macros? We may eventually need it elsewhere.

Collaborator Author:
Yep, good point.

@@ -3,49 +3,62 @@
#include <faabric/scheduler/SnapshotServer.h>
Collaborator:
Same comments I made to the other server re. function order and return type also apply here.

src/transport/Message.cpp (outdated review comments, resolved)
src/transport/MessageEndpoint.cpp (outdated review comments, resolved)
src/transport/context.cpp (resolved review comments)
@Shillaker Shillaker merged commit 63affca into master Jul 5, 2021
@Shillaker Shillaker deleted the hanging branch July 5, 2021 10:51