Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-22001] Forward exceptions thrown during initialization to the user #15577

Conversation

rmetzger
Copy link
Contributor

What is the purpose of the change

  • Exceptions during the initialization (thrown by the JobMaster) were not forwarded to the user. This is now guarded by the JobMasterITCase.testJobManagerInitializationExceptionsAreForwardedToTheUser().
  • The initialization of the JobMaster (including the scheduler initialization, which runs the source coordinators and other user code) was executed in the JobManager main thread, potentially blocking the JobManager in case of blocking calls in the JobMaster init.

Open TODOs: (due to the amount of changes, and the time spend on the issue already, I'd like to get some initial feedback on the change)

  • The problem that the JobMaster initialization was executed in the main thread executor was not, and is still not guarded by a test.
  • The newly introduced JobMasterITCase.testJobManagerInitializationExceptionsAreForwardedToTheUser() is not nice: It relies on an implementation detail.

Brief change log

  • Introduced JobManagerStatusListener to JobManagerRunner
  • Reworked the DispatcherJob to track the initialization based on the signals from the JobManagerStatusListener.
  • Introduced a DispatcherJobStatus class with tracks the status along with the JobManagerRunner future (they are updated together, thus we track them in one class)
  • Adjusted the DispatcherTests because initialization completion is now dependent on getting leadership.

Verifying this change

Added some tests, adjusted some.

@flinkbot
Copy link
Collaborator

flinkbot commented Apr 12, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 100c6c5 (Fri May 28 09:10:53 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Apr 12, 2021

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@tillrohrmann tillrohrmann self-assigned this Apr 14, 2021
@tillrohrmann
Copy link
Contributor

For the review of the PR it would be great to also capture somewhere the description and analysis of the problem. I know we talked about it, but I already forgot the details ;-)

@tillrohrmann
Copy link
Contributor

The problem seems to be the following: Since we create the JobMasterService lazily it can happen that the DispacherJob is in the initialized state (JobManagerRunner being created) but the JobMasterService is not running/has not been created. If now the client polls the DispatcherJob.requestJobStatus(), the system will ask the JobManagerRunner.getJobMasterGateway().requestJob(). The JobManagerRunner.getJobMasterGateway might not be completed.

If now the JobMasterService creation fails, then the JobManagerRunnerImpl will complete the resultFuture which leads to the shut down of the DispatcherJob and then also the JobManagerRunnerImpl. Due to this shut down, the system will complete the leaderGatewayFuture exceptionally which causes the initial DispatcherJob.requestJobStatus to fail with org.apache.flink.util.FlinkException: JobMaster has been shut down..

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this PR @rmetzger. I fear that this PR is not going in the right direction, tbh.

I think the underlying problem is that you try to mirror the state of the JobManagerRunner in the DispatcherJob which overly couples these components and possibly not really solves the underlying problem.

If you think about why we introduced the DispatcherJob, then it was to handle the asynchronous creation of the JobManagerRunner. Now the JobManagerRunner can be created synchronously and the asynchronous generation of the JobMasterService has been moved to the JobManagerRunner implementation. Hence, the DispatcherJob is not really doing anything anymore because the JobManagerRunner is always created.

However, what is not created is the JobMasterService. Hence, I would suggest to move the DispatcherJob logic which manages the job accesses when the job is not yet running into the JobManagerRunnerImpl. By doing these accesses under the same lock as the leadership operations, we can ensure that we have a consistent view of the state of the job.

What we probably have to change in the JobManagerRunnerImpl, however, is that we don't create the JobMasterService under the same lock. Otherwise we risk that the Dispatcher's main thread might be blocked while we create the JobMasterService.

// This DispatcherJob is pending a termination. Forward
// cancellation result.
FutureUtils.forward(
cancelFuture.thenCompose((ign) -> null),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not working and produces a NPE. Did you want to use cancelFuture.thenAccept()?


@Nullable private final JobStatus jobStatus;

Status(JobStatus jobStatus) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nullable is missing.

Comment on lines 313 to 322
leadershipOperation.thenRunAsync(
ThrowingRunnable.unchecked(
() -> {
synchronized (lock) {
verifyJobSchedulingStatusAndStartJobManager(
leaderSessionID);
}
}));
}),
executor); // run in separate thread to not block main thread on
// JobManager initialization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it happen that the main thread executor runs this part of the code?


assertThat(dispatcherJob.requestJobStatus(TIMEOUT).get(), is(JobStatus.INITIALIZING));

testContext.setRunning();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary?

}

@Test
public void testLeadershipLoss() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add what should happen under leadership loss.

Comment on lines 402 to 403
DispatcherJob.createFor(jobId, jobGraph.getName(), initializationTimestamp);
runningJobs.put(jobId, dispatcherJob);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels odd that we first create the DispatcherJob and then tell him about the JobManagerRunner through some callback. Why not first creating the JobManagerRunner and then giving it to the DispatcherJob?

Comment on lines 63 to 64
// if the termination future is set, we are signaling that this DispatcherJob is closing / has
// been closed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the embodiment of an expressive and clear contract.

return jobStatus.isJobManagerCreatedOrFailed();
}
}

/** Returns a future completing to the ExecutionGraphInfo of the job. */
public CompletableFuture<ExecutionGraphInfo> requestJob(Time timeout) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR effectively suffers from the same problems as the existing code. What happens if the client polling happens exactly after onJobManagerStarted has been called. In this case, we will call getJobMasterGateway().thenCompose(gateway -> gateway.requestJob(timeout)). If the leader loses leadership before the call can be executed (note that the leadership operation of the JobManagerRunner runs in a separate thread), then it can happen that this call will never be answered or that you get a fresh leaderGatewayFuture which in the future might be completed exceptionally if, for example, the JobMasterService cannot be instantiated.

I think the underlying problem is that you try to maintain a consistent view of the JobManagerRunner within the DispatcherJob but the former runs independent of the latter.

public class DispatcherJobStatus {
private Status status;

private CompletableFuture<JobManagerRunner> jobManagerRunnerFuture = new CompletableFuture<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be a future? If we transition into the Status.JOB_MANAGER_CREATED_OR_INIT_FAILED, then the JobManagerRunner should be known.

Comment on lines 384 to 391
// Execute listener notification asynchronously in the main thread executor to make sure
// this "grant leadership" operation is properly completed before the next operation is
// started. With the current implementation DispatcherJob might call closeAsync()
// leading to concurrent access under the "lock".
FutureUtils.assertNoException(
CompletableFuture.runAsync(
() -> jobManagerStatusListener.onJobManagerStarted(this),
mainThreadExecutor));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is design problem here. Somehow the JobManagerRunner needs to know that some calls have to be executed by the mainThreadExecutor. This looks like implementation details which are leaked from the DispatcherJob into the JobManagerRunnerImpl.

@rmetzger
Copy link
Contributor Author

Thanks a lot for your review. Next time, I'll write the issue analysis into the Jira (as I usually do).

I agree with your observation that we should remove DispatcherJob and move the logic from there into JobManagerRunnerImpl. I will rewrite the PR.

@rmetzger rmetzger force-pushed the FLINK-22001-forward-jobmaster-exception branch from 0ace107 to c5b5770 Compare April 16, 2021 11:53
@rmetzger
Copy link
Contributor Author

As discussed offline, this is a rough first draft of the revised JobManagerRunnerImpl, pretty much untested.
I'll now start adding tests.

@rmetzger rmetzger force-pushed the FLINK-22001-forward-jobmaster-exception branch from 9725f3a to 100c6c5 Compare April 19, 2021 11:44
@rmetzger
Copy link
Contributor Author

I've now pushed a version of the change that should have all tests passing, and have a pretty solid test coverage. Looking forward to your feedback @tillrohrmann !

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this PR @rmetzger. I think it goes in a very good direction. Moving the responsibility of the DispatcherJob into the JobManagerRunnerImpl is a good decision imo because it allows us to have a consistent view of the actual state.

I had a couple of comments concerning the current implementation I think that some cases and the semantics of some fields are not fully defined yet. I also think that it could help to make the different states of the JobManagerRunnerImpl more explicit and to define how the component should behave in which situation. I haven't managed to create a draft for such a state machine yet but that's what I will continue tomorrow morning with.

Comment on lines 64 to 101
CompletableFuture<Acknowledge> cancel(Time timeout);

CompletableFuture<JobStatus> requestJobStatus(Time timeout);

CompletableFuture<JobDetails> requestJobDetails(Time timeout);

CompletableFuture<ExecutionGraphInfo> requestJob(Time timeout);

boolean isInitialized();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JavaDocs would be nice.

Comment on lines +226 to +227
checkState(cancelFuture != null);
cancelFuture.completeExceptionally(shutdownException);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we have to close the leaderElectionService in case of a cancellation? Same for the resultFuture. What is the semantics for the resultFuture when the JobManagerRunner is cancelled?

Comment on lines +219 to +223
jobMasterService
.closeAsync()
.whenComplete(
(Void ignored, Throwable throwable) ->
onJobManagerTermination(throwable)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For chaining of shutdown operations there are some utilities like FutureUtils.runAfterwards or so. This makes the handling of exception occurring in the individual steps easier because they are automatically added as suppressed exceptions.

t,
ExceptionUtils.stripCompletionException(throwable));
}
return terminationFuture;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think it would be better if the completion of the terminationFuture happens in this method. E.g. via using FutureUtils.forward(shutdownFuture, terminationFuture). That way one does not have to look around where the termination future is actually completed.

// no ongoing JobMaster initialization (waiting for leadership) --> close
onJobManagerTermination(null);
}
// ongoing initialization, we will finish closing once it is done.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looking at this method, it is not very clear to me how the terminationFuture is completed in case of an ongoing initialization.

Preconditions.checkState(
!blocker.isTriggered(),
"This only works before the initialization has been unblocked");
this.initializationException =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you set things from another thread, then we either need at least a volatile or some other form of synchronization. An alternative to setting these kind of fields while the instance is being used is to configure it when creating an instance of this class.

@@ -101,6 +105,15 @@

private volatile CompletableFuture<JobMasterGateway> leaderGatewayFuture;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to properly annotate all fields in this class now with their guard.

@@ -348,17 +462,76 @@ private void startJobMaster(UUID leaderSessionId) throws FlinkException {
e);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this also qualify as an initialization failure if we cannot set the correct state in the runningJobsRegistry?

new FlinkException(
"Cancellation failed because JobMaster initialization failed"));
}
jobStatus = JobManagerRunnerJobStatus.JOBMASTER_INITIALIZED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the JOBMASTER_INITIALIZED if its creation failed?

getDescription());
}
}

@Override
public void revokeLeadership() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut feeling is that we need to integrate the leadership calls (grant + revoke) into the internal state management of the JobManagerRunnerImpl. For example, after cancelling the JobManagerRunner before a leadership operation has started, it should probably not accept a grantLeadership call.

@rmetzger
Copy link
Contributor Author

Thanks a lot for your review! Your comments are very helpful!

I thought about representing the states in the Runner more nicely as well, as there are quite a few fields that are only set in certain conditions (cancellation during initialization), and many methods distinguish between two cases (initialized vs initializing). I briefly tried setting up State classes, but it felt that I had to carry a lot of stuff around. Not sure if the pattern with the context used in Adaptive Scheduler isn't a bit of overkill here.
I also realize that some states are still not well defined (initialization error) in the current implementation.

@rmetzger
Copy link
Contributor Author

Closing in favor of #15715.

@rmetzger rmetzger closed this Apr 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants