Fix hanging in transport layer #121

Merged: 67 commits merged from hanging into master, Jul 5, 2021
Conversation


@Shillaker Shillaker commented Jun 18, 2021

This PR attempts to fix the issues causing the application to hang indefinitely. The hangs were apparently caused by mistakes in the lifecycle management of ZeroMQ sockets in our transport layer, and by fragile timeouts in our tests.

AFAICT we must ensure that:

  • The ZeroMQ context must outlive all sockets (i.e. sockets must be constructed after the context has been opened and destructed before the context is closed). Note that it is insufficient just to close sockets before the context is closed; their destructors must have been called. If this isn't the case, the context shutdown will hang. A minimal sketch of this ordering follows this list.
  • Any two local sockets performing communication must not be destructed before the other socket has completed its side of the communication (e.g. we cannot destruct a push socket before the pull socket has finished its recv call).
  • We do not rely on the destructors of any global variables, as this violates the first point (a global destructor will be called after we've closed the context). If we do have any sockets (or pointers to them) existing in a global scope, we must force their destructors to be called when we're finished with them.
  • We avoid sleeps to reason about threaded operations wherever possible, instead using latches or other synchronisation primitives. If we absolutely must have a sleep, we need to wrap it in a retry to tolerate the underlying operations being several times slower in different environments (i.e. slow CI servers).
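As a minimal, hedged illustration of the first two rules (this is plain cppzmq, not Faabric code; the inproc://example endpoint and the scoping are purely for demonstration, and the socket-option setters assume a reasonably recent cppzmq):

// Minimal sketch, not Faabric code: the context is opened first, both
// sockets are constructed and destructed inside its lifetime, and only
// then is the context closed.
#include <zmq.hpp>

int main()
{
    zmq::context_t context(1);

    {
        // Scope the sockets so their destructors run before context.close()
        zmq::socket_t pull(context, zmq::socket_type::pull);
        pull.set(zmq::sockopt::linger, 0);
        pull.bind("inproc://example");

        zmq::socket_t push(context, zmq::socket_type::push);
        push.set(zmq::sockopt::linger, 0);
        push.connect("inproc://example");

        push.send(zmq::buffer("hello", 5), zmq::send_flags::none);

        zmq::message_t msg;
        auto res = pull.recv(msg, zmq::recv_flags::none);
        (void)res;
    }

    // Safe: no sockets are alive at this point
    context.close();

    return 0;
}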

Previously we were creating multiple contexts, which made it difficult to reason about these rules, so this PR switches to using only one. We were also using ephemeral sockets which would occasionally break the second rule, i.e. be destructed before their communication was complete. Finally, we had the ZeroMQ contexts living in a global scope, as well as sockets related to MPI living in a thread-local global scope; these could not be removed, but we now make sure they are cleaned up properly.

The changes in this PR are:

  • Switch to using cppzmq RAII semantics for managing socket and context shutdown. Previously we were manually closing down sockets, but the library will handle this automatically.
  • Use REQ/REP sockets where we need synchronous client/server behaviour, and PULL/PUSH where we just need to send and forget (a minimal sketch of the synchronous pattern follows this list). Previously we were only using PULL/PUSH and rolling our own synchronous response mechanism, which necessitated an ephemeral PULL socket on the client. This caused port conflicts when clients were running concurrently.
  • Change MessageEndpointServer and MessageEndpointClient to have two sockets each, one for async, one for sync.
  • Go from two types of endpoint (SendMessageEndpoint/RecvMessageEndpoint) to four (SyncSendMessageEndpoint, AsyncSendMessageEndpoint, SyncRecvMessageEndpoint, AsyncRecvMessageEndpoint).
  • Subclasses of MessageEndpointServer now have to implement doAsyncRecv and doSyncRecv accordingly. doSyncRecv returns the response message. Subclasses of MessageEndpointClient can call syncSend and asyncSend, with syncSend taking a pointer to a response which gets populated.
  • Use a single, global message context only ever opened and closed once and used by all threads. Previously we were creating and destroying contexts regularly and using this as a signal for when things were shut down.
  • Use an explicit shutdown message to close a server. This makes it easier to trace where and when servers get closed.
  • Remove MessageContext wrapper object and just use the basic zmq::context_t. Without the need to use contexts for closing things down, we just need a shared pointer to the global context.
  • Add a linger to every socket; this avoids one source of infinite hangs when sockets are still in a valid state with outstanding messages.
  • Remove the MessageEndpoint::open method and merge the opening of the socket into the constructor (it was always called straight afterwards anyway).
  • Remove the need for "persistent" Messages by copying the message body in the snapshot server. This simplifies the Message class and means we can use a stdlib vector rather than a raw pointer, which removes complexity around message lifecycle. We can introduce a zero-copy recv approach for the snapshot server in a later PR.
  • Standardise error checking around zmq operations in the MessageEndpoint subclasses.
  • Remove the DummyStateServer used in tests, replacing it with a simpler test fixture in a single test file. This removes another moving part in the tests (and makes debugging the state tests much easier).
  • Remove sleeps where possible, replacing with a latch. Wrap any remaining sleeps in at least one retry, including the new REQUIRE_RETRY macro used in tests.
  • Change the existing Barrier class to a simpler Latch shared between threads via a shared_ptr.
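The synchronous pattern mentioned above, sketched with plain cppzmq (this is illustrative only, not the MessageEndpointClient/MessageEndpointServer API; the TCP port is arbitrary):

// Minimal REQ/REP sketch: the reply travels back over the same socket
// pair, so the client never needs an ephemeral response socket.
#include <zmq.hpp>

#include <thread>

int main()
{
    zmq::context_t context(1);

    std::thread server([&context] {
        zmq::socket_t rep(context, zmq::socket_type::rep);
        rep.set(zmq::sockopt::linger, 0);
        rep.bind("tcp://127.0.0.1:5555");

        zmq::message_t request;
        auto reqRes = rep.recv(request, zmq::recv_flags::none);
        (void)reqRes;

        rep.send(zmq::buffer("pong", 4), zmq::send_flags::none);
    });

    zmq::socket_t req(context, zmq::socket_type::req);
    req.set(zmq::sockopt::linger, 0);
    req.connect("tcp://127.0.0.1:5555");

    req.send(zmq::buffer("ping", 4), zmq::send_flags::none);

    zmq::message_t response;
    auto repRes = req.recv(response, zmq::recv_flags::none);
    (void)repRes;

    server.join();

    return 0;
}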

MPI-specific:

  • Move sending and receiving of the rank-to-host mapping into the MpiWorld class and avoid ephemeral sockets.
  • Split the broadcasting of rank-to-host messages into a separate function. Previously this was rolled into the world creation, and would still send messages in tests even when mock mode was on. The amount of messaging done implicitly in the world creation was getting difficult to manage, so splitting this out also makes it clearer.
  • Refactor the MPI "remote" server tests, which were hanging regularly and using slightly confusing terminology around "remote" vs. "local", when in fact both worlds are running locally (one is just in a different thread).

We now manage the zmq context in a global shared_ptr, which works fine until it comes to destruction. cppzmq does not seem to like being destructed so late in the application lifecycle and prints an error message along the lines of:

Invalid argument ... (.../mutex.hpp:123)

The solution is to explicitly close the context from the main method. This means we'll have to add the opening and closing of the zmq context to any main method using Faabric for now; a sketch of what this looks like is below. I've added a check in getGlobalMessageContext which errors if the context hasn't been initialised, which should prompt us to add it where necessary.
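Roughly, that bookkeeping might look like the following; the initGlobalMessageContext/closeGlobalMessageContext names and the header path are assumptions for illustration rather than confirmed Faabric API (the PR text only mentions getGlobalMessageContext):

// Hypothetical sketch: the global ZeroMQ context is opened once at the very
// start of main and closed once at the very end, after all sockets are gone.
// Function and header names here are assumed, not taken from the codebase.
#include <faabric/transport/context.h>

int main()
{
    faabric::transport::initGlobalMessageContext();

    // ... start servers, run the application, join all worker threads ...
    // Every socket must have been destructed before the next line runs.

    faabric::transport::closeGlobalMessageContext();

    return 0;
}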

@Shillaker Shillaker self-assigned this Jun 18, 2021
@Shillaker Shillaker changed the title from "Fixes for transport shutdown hanging" to "Simplifying transport layer shutdown" Jun 18, 2021
@@ -67,6 +67,8 @@ class MessageEndpointServer
private:
const int port;

std::unique_ptr<RecvMessageEndpoint> endpoint = nullptr;
Collaborator Author:
This fell out of the refactor away from using the message context to shut down. Now the endpoint server can manage its own endpoint internally.

} catch (zmq::error_t& e) {
SPDLOG_ERROR("Error sending message: {}", e.what());
throw;
}
Collaborator Author:
This if/else was essentially doing the same send operation in both branches, just with different send_flags, so I merged it into one.
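A hedged sketch of the merged shape (names and the surrounding checks are illustrative, not the PR's exact code):

// Illustrative only: a single send path where just the flag varies.
#include <zmq.hpp>

#include <spdlog/spdlog.h>

#include <cstddef>
#include <cstdint>

void doSend(zmq::socket_t& socket, const uint8_t* data, size_t dataSize, bool more)
{
    zmq::send_flags flags =
      more ? zmq::send_flags::sndmore : zmq::send_flags::none;

    try {
        auto res = socket.send(zmq::buffer(data, dataSize), flags);
        if (res != dataSize) {
            SPDLOG_ERROR("Sent different number of bytes than expected");
        }
    } catch (zmq::error_t& e) {
        SPDLOG_ERROR("Error sending message: {}", e.what());
        throw;
    }
}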

// Print default message and rethrow
SPDLOG_ERROR("Error receiving message: {} ({})", e.num(), e.what());
throw;
}
Collaborator Author:
To make things simpler to parse I split recv up into two methods based on whether a size is provided.
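A rough sketch of the two shapes (simplified stand-ins, not the actual MessageEndpoint methods): one receives into a pre-sized buffer when the caller knows the message size, the other lets ZeroMQ allocate when it doesn't:

// Illustrative sketch only; the real methods wrap the result in the
// transport Message class and do more thorough error checking.
#include <zmq.hpp>

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Known size: receive straight into a pre-allocated buffer
std::vector<uint8_t> recvBuffer(zmq::socket_t& socket, size_t size)
{
    std::vector<uint8_t> data(size);

    auto res = socket.recv(zmq::buffer(data.data(), data.size()),
                           zmq::recv_flags::none);
    if (!res.has_value()) {
        throw std::runtime_error("Receive with buffer failed");
    }

    return data;
}

// Unknown size: let ZeroMQ size the message for us
zmq::message_t recvNoBuffer(zmq::socket_t& socket)
{
    zmq::message_t msg;

    auto res = socket.recv(msg, zmq::recv_flags::none);
    if (!res.has_value()) {
        throw std::runtime_error("Receive without buffer failed");
    }

    return msg;
}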

@@ -559,7 +559,7 @@ uint32_t StateKeyValue::waitOnRedisRemoteLock(const std::string& redisKey)
break;
}

std::this_thread::sleep_for(std::chrono::milliseconds(1));
SLEEP_MS(1);
Collaborator Author:
I was dealing with so many sleeps and writing usleep(123 * 1000) that I made a macro.
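Something along these lines (the exact definition in the PR may differ):

#include <unistd.h>

// Plausible definition: hide the repetitive milliseconds-to-microseconds
// conversion behind one readable name
#define SLEEP_MS(ms) usleep((ms) * 1000)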

@Shillaker Shillaker changed the title from "Simplifying transport layer" to "Fix hanging in transport layer" Jun 28, 2021
@Shillaker Shillaker marked this pull request as draft June 28, 2021 17:59

virtual void stop();
void awaitAsyncLatch();
@Shillaker Shillaker (Collaborator Author), Jun 29, 2021:
These setAsyncLatch and awaitAsyncLatch functions are to support testing. Tests frequently check the side effects of sending an async message by sending it, then performing some arbitrary sleep before checking that it's done. This allows us to avoid the sleep and the inherent flakiness of that approach.
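The latch itself (also the replacement for the old Barrier class mentioned above) can be very simple; a minimal sketch, not the Faabric implementation:

// Minimal sketch of a latch shared via shared_ptr: every party calls
// wait(), and all are released once the expected number have arrived.
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <stdexcept>

class Latch
{
  public:
    explicit Latch(size_t countIn)
      : count(countIn)
    {}

    void wait()
    {
        std::unique_lock<std::mutex> lock(mx);

        waiters++;
        if (waiters > count) {
            throw std::runtime_error("Latch already used");
        }

        if (waiters == count) {
            cv.notify_all();
        } else {
            cv.wait(lock, [this] { return waiters >= count; });
        }
    }

  private:
    const size_t count;
    size_t waiters = 0;

    std::mutex mx;
    std::condition_variable cv;
};

// e.g. a test and a server handler could share: auto latch = std::make_shared<Latch>(2);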

@Shillaker Shillaker marked this pull request as ready for review June 29, 2021 13:50
@Shillaker Shillaker requested a review from csegarragonz July 1, 2021 10:45
@csegarragonz csegarragonz (Collaborator) left a comment:
What a beast of a PR, and what a grueling effort as well! Fingers crossed we can now stop re-running the tests until they pass 🤣

Seeing this finished, the changes are actually quite easy to follow.

I've left very minor comments. I guess my biggest concern is with managing a lot of google::protobuf::Message objects; it's not very clear to me why we do that. Other than that, nicely done, LGTM 👍


void recvPush(faabric::transport::Message& body);
std::unique_ptr<google::protobuf::Message> recvPull(
Collaborator:
It feels strange that sometimes we use our own Message wrapper, and other times protobuf's.

Collaborator Author:
Yes, that's a good point, I'll revisit this and see if I can standardise it a bit. I'd like the rule to be: our Message wrapper inside the transport layer, and protobuf/flatbuffers everywhere else.

faabric::transport::Message& header,
faabric::transport::Message& body) = 0;

void sendSyncResponse(google::protobuf::Message* resp);
Collaborator:
Why do we recv a faabric::transport::Message yet send a google::protobuf::Message?

Collaborator Author:
Good point; as said above, I'll look at this and standardise the interface.

include/faabric/util/latch.h (resolved review comments)
uint8_t* buffer = mb.GetBufferPointer(); \
int size = mb.GetSize(); \
send(buffer, size);
#define SEND_FB_MSG(T, mb) \
Collaborator:
Couldn't this live with the other macros? We may eventually need it elsewhere.

Collaborator Author:
Yep, good point.

@@ -3,49 +3,62 @@
#include <faabric/scheduler/SnapshotServer.h>
Collaborator:
Same comments I made to the other server re. function order and return type also apply here.

src/transport/Message.cpp (outdated review comments, resolved)
src/transport/MessageEndpoint.cpp (outdated review comments, resolved)
src/transport/context.cpp (resolved review comments)
@Shillaker Shillaker merged commit 63affca into master Jul 5, 2021
@Shillaker Shillaker deleted the hanging branch July 5, 2021 10:51