[FLINK-22001] Forward exceptions thrown during initialization to the user #15577

rmetzger · 2021-04-12T12:48:55Z

What is the purpose of the change

Exceptions during the initialization (thrown by the JobMaster) were not forwarded to the user. This is now guarded by the JobMasterITCase.testJobManagerInitializationExceptionsAreForwardedToTheUser().
The initialization of the JobMaster (including the scheduler initialization, which runs the source coordinators and other user code) was executed in the JobManager main thread, potentially blocking the JobManager in case of blocking calls in the JobMaster init.

Open TODOs: (due to the amount of changes, and the time spend on the issue already, I'd like to get some initial feedback on the change)

The problem that the JobMaster initialization was executed in the main thread executor was not, and is still not guarded by a test.
The newly introduced JobMasterITCase.testJobManagerInitializationExceptionsAreForwardedToTheUser() is not nice: It relies on an implementation detail.

Brief change log

Introduced JobManagerStatusListener to JobManagerRunner
Reworked the DispatcherJob to track the initialization based on the signals from the JobManagerStatusListener.
Introduced a DispatcherJobStatus class with tracks the status along with the JobManagerRunner future (they are updated together, thus we track them in one class)
Adjusted the DispatcherTests because initialization completion is now dependent on getting leadership.

Verifying this change

Added some tests, adjusted some.

flinkbot · 2021-04-12T12:52:30Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 100c6c5 (Fri May 28 09:10:53 UTC 2021)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

_{Mention the bot in a comment to re-run the automated checks.}

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into to Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2021-04-12T13:05:59Z

CI report:

100c6c5 Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

tillrohrmann · 2021-04-14T13:04:52Z

For the review of the PR it would be great to also capture somewhere the description and analysis of the problem. I know we talked about it, but I already forgot the details ;-)

tillrohrmann · 2021-04-14T17:35:34Z

The problem seems to be the following: Since we create the JobMasterService lazily it can happen that the DispacherJob is in the initialized state (JobManagerRunner being created) but the JobMasterService is not running/has not been created. If now the client polls the DispatcherJob.requestJobStatus(), the system will ask the JobManagerRunner.getJobMasterGateway().requestJob(). The JobManagerRunner.getJobMasterGateway might not be completed.

If now the JobMasterService creation fails, then the JobManagerRunnerImpl will complete the resultFuture which leads to the shut down of the DispatcherJob and then also the JobManagerRunnerImpl. Due to this shut down, the system will complete the leaderGatewayFuture exceptionally which causes the initial DispatcherJob.requestJobStatus to fail with org.apache.flink.util.FlinkException: JobMaster has been shut down..

tillrohrmann

Thanks for creating this PR @rmetzger. I fear that this PR is not going in the right direction, tbh.

I think the underlying problem is that you try to mirror the state of the JobManagerRunner in the DispatcherJob which overly couples these components and possibly not really solves the underlying problem.

If you think about why we introduced the DispatcherJob, then it was to handle the asynchronous creation of the JobManagerRunner. Now the JobManagerRunner can be created synchronously and the asynchronous generation of the JobMasterService has been moved to the JobManagerRunner implementation. Hence, the DispatcherJob is not really doing anything anymore because the JobManagerRunner is always created.

However, what is not created is the JobMasterService. Hence, I would suggest to move the DispatcherJob logic which manages the job accesses when the job is not yet running into the JobManagerRunnerImpl. By doing these accesses under the same lock as the leadership operations, we can ensure that we have a consistent view of the state of the job.

What we probably have to change in the JobManagerRunnerImpl, however, is that we don't create the JobMasterService under the same lock. Otherwise we risk that the Dispatcher's main thread might be blocked while we create the JobMasterService.

tillrohrmann · 2021-04-14T13:32:30Z

flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/DispatcherJob.java

+                                    // This DispatcherJob is pending a termination. Forward
+                                    // cancellation result.
+                                    FutureUtils.forward(
+                                            cancelFuture.thenCompose((ign) -> null),


I think this is not working and produces a NPE. Did you want to use cancelFuture.thenAccept()?

tillrohrmann · 2021-04-14T13:35:55Z

flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/DispatcherJobStatus.java

+
+        @Nullable private final JobStatus jobStatus;
+
+        Status(JobStatus jobStatus) {


@Nullable is missing.

tillrohrmann · 2021-04-14T13:39:31Z

flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java

+                    leadershipOperation.thenRunAsync(
                            ThrowingRunnable.unchecked(
                                    () -> {
                                        synchronized (lock) {
                                            verifyJobSchedulingStatusAndStartJobManager(
                                                    leaderSessionID);
                                        }
-                                    }));
+                                    }),
+                            executor); // run in separate thread to not block main thread on
+            // JobManager initialization.


How does it happen that the main thread executor runs this part of the code?

tillrohrmann · 2021-04-14T13:51:09Z

flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/DispatcherJobTest.java

+
+        assertThat(dispatcherJob.requestJobStatus(TIMEOUT).get(), is(JobStatus.INITIALIZING));
+
+        testContext.setRunning();


Why is this necessary?

tillrohrmann · 2021-04-14T13:51:28Z

flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/DispatcherJobTest.java

+    }
+
+    @Test
+    public void testLeadershipLoss() throws Exception {


It would be good to add what should happen under leadership loss.

tillrohrmann · 2021-04-14T17:42:43Z

flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java

+                DispatcherJob.createFor(jobId, jobGraph.getName(), initializationTimestamp);
+        runningJobs.put(jobId, dispatcherJob);


It feels odd that we first create the DispatcherJob and then tell him about the JobManagerRunner through some callback. Why not first creating the JobManagerRunner and then giving it to the DispatcherJob?

tillrohrmann · 2021-04-14T17:43:30Z

flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/DispatcherJob.java

+    // if the termination future is set, we are signaling that this DispatcherJob is closing / has
+    // been closed


This is not the embodiment of an expressive and clear contract.

tillrohrmann · 2021-04-14T17:55:42Z

flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/DispatcherJob.java

+            return jobStatus.isJobManagerCreatedOrFailed();
+        }
+    }
+
    /** Returns a future completing to the ExecutionGraphInfo of the job. */
    public CompletableFuture<ExecutionGraphInfo> requestJob(Time timeout) {


I think this PR effectively suffers from the same problems as the existing code. What happens if the client polling happens exactly after onJobManagerStarted has been called. In this case, we will call getJobMasterGateway().thenCompose(gateway -> gateway.requestJob(timeout)). If the leader loses leadership before the call can be executed (note that the leadership operation of the JobManagerRunner runs in a separate thread), then it can happen that this call will never be answered or that you get a fresh leaderGatewayFuture which in the future might be completed exceptionally if, for example, the JobMasterService cannot be instantiated.

I think the underlying problem is that you try to maintain a consistent view of the JobManagerRunner within the DispatcherJob but the former runs independent of the latter.

tillrohrmann · 2021-04-14T17:59:03Z

flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/DispatcherJobStatus.java

+public class DispatcherJobStatus {
+    private Status status;
+
+    private CompletableFuture<JobManagerRunner> jobManagerRunnerFuture = new CompletableFuture<>();


Why does this need to be a future? If we transition into the Status.JOB_MANAGER_CREATED_OR_INIT_FAILED, then the JobManagerRunner should be known.

tillrohrmann · 2021-04-14T18:00:59Z

flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java

+            // Execute listener notification asynchronously in the main thread executor to make sure
+            // this "grant leadership" operation is properly completed before the next operation is
+            // started. With the current implementation DispatcherJob might call closeAsync()
+            // leading to concurrent access under the "lock".
+            FutureUtils.assertNoException(
+                    CompletableFuture.runAsync(
+                            () -> jobManagerStatusListener.onJobManagerStarted(this),
+                            mainThreadExecutor));


I think there is design problem here. Somehow the JobManagerRunner needs to know that some calls have to be executed by the mainThreadExecutor. This looks like implementation details which are leaked from the DispatcherJob into the JobManagerRunnerImpl.

rmetzger · 2021-04-15T07:56:27Z

Thanks a lot for your review. Next time, I'll write the issue analysis into the Jira (as I usually do).

I agree with your observation that we should remove DispatcherJob and move the logic from there into JobManagerRunnerImpl. I will rewrite the PR.

rmetzger · 2021-04-16T11:56:02Z

As discussed offline, this is a rough first draft of the revised JobManagerRunnerImpl, pretty much untested.
I'll now start adding tests.

rmetzger · 2021-04-19T11:45:21Z

I've now pushed a version of the change that should have all tests passing, and have a pretty solid test coverage. Looking forward to your feedback @tillrohrmann !

tillrohrmann

Thanks for updating this PR @rmetzger. I think it goes in a very good direction. Moving the responsibility of the DispatcherJob into the JobManagerRunnerImpl is a good decision imo because it allows us to have a consistent view of the actual state.

I had a couple of comments concerning the current implementation I think that some cases and the semantics of some fields are not fully defined yet. I also think that it could help to make the different states of the JobManagerRunnerImpl more explicit and to define how the component should behave in which situation. I haven't managed to create a draft for such a state machine yet but that's what I will continue tomorrow morning with.

tillrohrmann · 2021-04-19T11:21:11Z

flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunner.java

+    CompletableFuture<Acknowledge> cancel(Time timeout);
+
+    CompletableFuture<JobStatus> requestJobStatus(Time timeout);
+
+    CompletableFuture<JobDetails> requestJobDetails(Time timeout);
+
+    CompletableFuture<ExecutionGraphInfo> requestJob(Time timeout);
+
+    boolean isInitialized();


JavaDocs would be nice.

tillrohrmann · 2021-04-19T11:25:45Z

flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java

+                        checkState(cancelFuture != null);
+                        cancelFuture.completeExceptionally(shutdownException);


Why don't we have to close the leaderElectionService in case of a cancellation? Same for the resultFuture. What is the semantics for the resultFuture when the JobManagerRunner is cancelled?

tillrohrmann · 2021-04-19T11:27:48Z

flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java

+                            jobMasterService
+                                    .closeAsync()
+                                    .whenComplete(
+                                            (Void ignored, Throwable throwable) ->
+                                                    onJobManagerTermination(throwable)));


For chaining of shutdown operations there are some utilities like FutureUtils.runAfterwards or so. This makes the handling of exception occurring in the individual steps easier because they are automatically added as suppressed exceptions.

tillrohrmann · 2021-04-19T11:29:09Z

flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java

-                                                t,
-                                                ExceptionUtils.stripCompletionException(throwable));
-                            }
+            return terminationFuture;


Nit: I think it would be better if the completion of the terminationFuture happens in this method. E.g. via using FutureUtils.forward(shutdownFuture, terminationFuture). That way one does not have to look around where the termination future is actually completed.

tillrohrmann · 2021-04-19T11:30:25Z

flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java

+                        // no ongoing JobMaster initialization (waiting for leadership) --> close
+                        onJobManagerTermination(null);
+                    }
+                    // ongoing initialization, we will finish closing once it is done.


Just looking at this method, it is not very clear to me how the terminationFuture is completed in case of an ongoing initialization.

tillrohrmann · 2021-04-19T14:52:15Z

flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImplTest.java

+            Preconditions.checkState(
+                    !blocker.isTriggered(),
+                    "This only works before the initialization has been unblocked");
+            this.initializationException =


If you set things from another thread, then we either need at least a volatile or some other form of synchronization. An alternative to setting these kind of fields while the instance is being used is to configure it when creating an instance of this class.

tillrohrmann · 2021-04-19T15:06:17Z

flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java

@@ -101,6 +105,15 @@

    private volatile CompletableFuture<JobMasterGateway> leaderGatewayFuture;


I would suggest to properly annotate all fields in this class now with their guard.

tillrohrmann · 2021-04-19T15:12:35Z

flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java

@@ -348,17 +462,76 @@ private void startJobMaster(UUID leaderSessionId) throws FlinkException {
                    e);
        }


Wouldn't this also qualify as an initialization failure if we cannot set the correct state in the runningJobsRegistry?

tillrohrmann · 2021-04-19T15:14:13Z

flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java

+                                        new FlinkException(
+                                                "Cancellation failed because JobMaster initialization failed"));
+                            }
+                            jobStatus = JobManagerRunnerJobStatus.JOBMASTER_INITIALIZED;


Why is the JOBMASTER_INITIALIZED if its creation failed?

tillrohrmann · 2021-04-19T15:29:37Z

flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunnerImpl.java

-                    getDescription());
-        }
-    }
-
    @Override
    public void revokeLeadership() {


My gut feeling is that we need to integrate the leadership calls (grant + revoke) into the internal state management of the JobManagerRunnerImpl. For example, after cancelling the JobManagerRunner before a leadership operation has started, it should probably not accept a grantLeadership call.

rmetzger · 2021-04-20T06:36:43Z

Thanks a lot for your review! Your comments are very helpful!

I thought about representing the states in the Runner more nicely as well, as there are quite a few fields that are only set in certain conditions (cancellation during initialization), and many methods distinguish between two cases (initialized vs initializing). I briefly tried setting up State classes, but it felt that I had to carry a lot of stuff around. Not sure if the pattern with the context used in Adaptive Scheduler isn't a bit of overkill here.
I also realize that some states are still not well defined (initialization error) in the current implementation.

rmetzger · 2021-04-22T06:14:36Z

Closing in favor of #15715.

rmetzger added review=description? component=Runtime/Coordination labels Apr 12, 2021

tillrohrmann self-assigned this Apr 14, 2021

tillrohrmann requested changes Apr 14, 2021

View reviewed changes

rmetzger force-pushed the FLINK-22001-forward-jobmaster-exception branch from 0ace107 to c5b5770 Compare April 16, 2021 11:53

[FLINK-22001] Fix forwarding of JobMaster exceptions to user

100c6c5

rmetzger force-pushed the FLINK-22001-forward-jobmaster-exception branch from 9725f3a to 100c6c5 Compare April 19, 2021 11:44

tillrohrmann reviewed Apr 19, 2021

View reviewed changes

rmetzger closed this Apr 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FLINK-22001] Forward exceptions thrown during initialization to the user #15577

[FLINK-22001] Forward exceptions thrown during initialization to the user #15577

rmetzger commented Apr 12, 2021

flinkbot commented Apr 12, 2021 •

edited

Loading

flinkbot commented Apr 12, 2021 •

edited

Loading

tillrohrmann commented Apr 14, 2021

tillrohrmann commented Apr 14, 2021

tillrohrmann left a comment

tillrohrmann Apr 14, 2021

tillrohrmann Apr 14, 2021

tillrohrmann Apr 14, 2021

tillrohrmann Apr 14, 2021

tillrohrmann Apr 14, 2021

tillrohrmann Apr 14, 2021

tillrohrmann Apr 14, 2021

tillrohrmann Apr 14, 2021

tillrohrmann Apr 14, 2021

tillrohrmann Apr 14, 2021

rmetzger commented Apr 15, 2021

rmetzger commented Apr 16, 2021

rmetzger commented Apr 19, 2021

tillrohrmann left a comment

tillrohrmann Apr 19, 2021

tillrohrmann Apr 19, 2021

tillrohrmann Apr 19, 2021

tillrohrmann Apr 19, 2021

tillrohrmann Apr 19, 2021

tillrohrmann Apr 19, 2021

tillrohrmann Apr 19, 2021

tillrohrmann Apr 19, 2021

tillrohrmann Apr 19, 2021

tillrohrmann Apr 19, 2021

rmetzger commented Apr 20, 2021

rmetzger commented Apr 22, 2021


		@Nullable private final JobStatus jobStatus;

		Status(JobStatus jobStatus) {


		assertThat(dispatcherJob.requestJobStatus(TIMEOUT).get(), is(JobStatus.INITIALIZING));

		testContext.setRunning();

		DispatcherJob.createFor(jobId, jobGraph.getName(), initializationTimestamp);
		runningJobs.put(jobId, dispatcherJob);

		// if the termination future is set, we are signaling that this DispatcherJob is closing / has
		// been closed

		checkState(cancelFuture != null);
		cancelFuture.completeExceptionally(shutdownException);

		@@ -101,6 +105,15 @@

		private volatile CompletableFuture<JobMasterGateway> leaderGatewayFuture;

		@@ -348,17 +462,76 @@ private void startJobMaster(UUID leaderSessionId) throws FlinkException {
		e);
		}

[FLINK-22001] Forward exceptions thrown during initialization to the user #15577

[FLINK-22001] Forward exceptions thrown during initialization to the user #15577

Conversation

rmetzger commented Apr 12, 2021

What is the purpose of the change

Brief change log

Verifying this change

flinkbot commented Apr 12, 2021 • edited Loading

Automated Checks

Review Progress

flinkbot commented Apr 12, 2021 • edited Loading

CI report:

tillrohrmann commented Apr 14, 2021

tillrohrmann commented Apr 14, 2021

tillrohrmann left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rmetzger commented Apr 15, 2021

rmetzger commented Apr 16, 2021

rmetzger commented Apr 19, 2021

tillrohrmann left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rmetzger commented Apr 20, 2021

rmetzger commented Apr 22, 2021

flinkbot commented Apr 12, 2021 •

edited

Loading

flinkbot commented Apr 12, 2021 •

edited

Loading