Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suppress last-ditch download exceptions w/cleanup #10029

Closed
wants to merge 19 commits into from

Conversation

werkt
Copy link
Contributor

@werkt werkt commented Oct 14, 2019

Create an encapsulating DownloadException to represent the aggregate
exception set of an ActionResult download. IOExceptions will be retained
through exception suppression, and the outer exception has a property to
indicate if it only represents a sequence of CacheNotFoundExceptions.

InterruptedExceptions interception is improved to cancel pending work
and wrap, through suppression, any DownloadException that also occurred
during the download. InterruptedException being thrown on the download
control thread, it does not require suppression of further interrupts,
and can represent an outer download exception. Thread interrupt status
is suppressed for cancellations, and conveyed on throw.

These exception wrapping efforts allow non-asynchronous frame
representation in stack traces, and much clearer identification of
sources within remote strategy execution which produce failures based on
remote errors.

Any DownloadException in the last-ditch output download under
handleError in RemoteSpawnRunner is added as suppressed to the
initiating exception. Other exceptions (likely local IO) present clear
immediate traces and do not require specialized treatment.

Create an encapsulating DownloadException to represent the aggregate
exception set of an ActionResult download. IOExceptions will be retained
through exception suppression, and the outer exception has a property to
indicate if it only represents a sequence of CacheNotFoundExceptions.

InterruptedExceptions interception is improved to cancel pending work
and wrap, through suppression, any DownloadException that also occurred
during the download. InterruptedException being thrown on the download
control thread, it does not require suppression of further interrupts,
and can represent an outer download exception. Thread interrupt status
is suppressed for cancellations, and conveyed on throw.

These exception wrapping efforts allow non-asynchronous frame
representation in stack traces, and much clearer identification of
sources within remote strategy execution which produce failures based on
remote errors.

Any DownloadException in the last-ditch output download under
handleError in RemoteSpawnRunner is added as suppressed to the
initiating exception. Other exceptions (likely local IO) present clear
immediate traces and do not require specialized treatment.
@jin jin added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Oct 14, 2019
@hlopko hlopko removed their request for review October 15, 2019 12:23
@buchgr
Copy link
Contributor

buchgr commented Nov 12, 2019

I am not sure how I feel about this change. In part because it seems to be treating a symptom instead of the root cause: If you see a stacktrace from Bazel then I'd consider this a bug. Would you agree?

@werkt
Copy link
Contributor Author

werkt commented Nov 12, 2019

I think that might be an oversimplification of the presentation of a stack trace - fwiw this doesn't change the current behavior of only making these traces visible with --verbose_failures, it simply includes logic that ensures that if a trace is printed as a result of a download failure with --verbose_failures enabled and without --remote_local_fallback, it contains lines that refer to the actual bazel code which invoked the requests which threw exceptions, to provide a bare minimum of identification of what it was doing, as opposed to the grpc-only frames present during client calls.

@buchgr buchgr removed their assignment Jan 9, 2020
@werkt
Copy link
Contributor Author

werkt commented Mar 6, 2020

@jin ping on triage for this

To any future reviewers, this does not add stack traces where they did not exist before. This only improves the quality of existing RuntimeExceptions from skyframe or exceptions printed with --verbose_failures.

@jin jin assigned jin and philwo and unassigned jin Mar 9, 2020
@jin jin requested a review from philwo March 9, 2020 17:35
@jin
Copy link
Member

jin commented Mar 9, 2020

Sorry, was OOO. This looks like a general developer experience improvement. @philwo wdyt?

Copy link
Contributor

@ulfjack ulfjack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It generally looks like an improvement.

add(e);
}

void add(IOException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not override addSuppressed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted a filter on this, so that it couldn't contain/suppress anything but IOExceptions. That is probably unnecessary, but I felt it needed scope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And as it turns out, addSuppressed is final

Copy link
Contributor

@ulfjack ulfjack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a documentation nit

add(e);
}

void add(IOException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add documentation that this works like addSuppressed and that we can't override addSuppressed?

Comment on lines 72 to 77
IOException ioEx;
if (cause instanceof IOException) {
ioEx = new IOException(cause);
} else {
ioEx = (IOException) cause;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
IOException ioEx;
if (cause instanceof IOException) {
ioEx = new IOException(cause);
} else {
ioEx = (IOException) cause;
}
IOException ioEx = cause instanceof IOException ? (IOException) cause : new IOException(cause);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a better solution to this that looks more like what's available in RemoteCache.java, i'll refactor that and replace this.

* its place to selectively filter and record whether all suppressed
* exceptions are CacheNotFoundExceptions
*/
void add(IOException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this can ever be called in a multi-threaded context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It cannot in any of the uses I've added here - all the aggregations happen on a single thread.

werkt and others added 6 commits March 27, 2020 11:12
…rException.java

Co-Authored-By: Ulf Adams <ulfjack@users.noreply.github.com>
All transfers use this routine to wait for future list completion with
optional cancellation, which replaces waitForUploads. Download futures
are strongly encouraged to present IOExceptions through getFromFuture.
Removed unused getFromFuture instance method. Upload cancellations,
whether from inputs or outputs, remain disabled due to bazelbuild#10846.
@werkt
Copy link
Contributor Author

werkt commented Apr 7, 2020

Can we see this through? The problem that this corrects in >1.1.0 is keeping us from upgrading to 3.0.0 or even any 2. @philwo?

@jin
Copy link
Member

jin commented Apr 8, 2020

I can bring it in.

@dws
Copy link
Contributor

dws commented Apr 8, 2020

I can bring it in.

If this could be part of 3.1 that would be excellent.

@jin
Copy link
Member

jin commented Apr 8, 2020

If this could be part of 3.1 that would be excellent.

cc @laurentlb (release manager for 3.1 #11076)

@laurentlb
Copy link
Contributor

Sounds good. Please leave a comment on #11076 with the commit hash.

@werkt
Copy link
Contributor Author

werkt commented Apr 14, 2020

Any movement here? We need a commit in master before we can get ourselves into the running for 3.1

@jin
Copy link
Member

jin commented Apr 14, 2020 via email

@bazel-io bazel-io closed this in aeee3e0 Apr 15, 2020
@laurentlb laurentlb mentioned this pull request Apr 15, 2020
10 tasks
@werkt werkt deleted the remote-traces branch April 15, 2020 13:48
laurentlb pushed a commit that referenced this pull request Apr 15, 2020
Create an encapsulating DownloadException to represent the aggregate
exception set of an ActionResult download. IOExceptions will be retained
through exception suppression, and the outer exception has a property to
indicate if it only represents a sequence of CacheNotFoundExceptions.

InterruptedExceptions interception is improved to cancel pending work
and wrap, through suppression, any DownloadException that also occurred
during the download. InterruptedException being thrown on the download
control thread, it does not require suppression of further interrupts,
and can represent an outer download exception. Thread interrupt status
is suppressed for cancellations, and conveyed on throw.

These exception wrapping efforts allow non-asynchronous frame
representation in stack traces, and much clearer identification of
sources within remote strategy execution which produce failures based on
remote errors.

Any DownloadException in the last-ditch output download under
handleError in RemoteSpawnRunner is added as suppressed to the
initiating exception. Other exceptions (likely local IO) present clear
immediate traces and do not require specialized treatment.

Closes #10029.

PiperOrigin-RevId: 306619678
laurentlb pushed a commit that referenced this pull request Apr 16, 2020
Create an encapsulating DownloadException to represent the aggregate
exception set of an ActionResult download. IOExceptions will be retained
through exception suppression, and the outer exception has a property to
indicate if it only represents a sequence of CacheNotFoundExceptions.

InterruptedExceptions interception is improved to cancel pending work
and wrap, through suppression, any DownloadException that also occurred
during the download. InterruptedException being thrown on the download
control thread, it does not require suppression of further interrupts,
and can represent an outer download exception. Thread interrupt status
is suppressed for cancellations, and conveyed on throw.

These exception wrapping efforts allow non-asynchronous frame
representation in stack traces, and much clearer identification of
sources within remote strategy execution which produce failures based on
remote errors.

Any DownloadException in the last-ditch output download under
handleError in RemoteSpawnRunner is added as suppressed to the
initiating exception. Other exceptions (likely local IO) present clear
immediate traces and do not require specialized treatment.

Closes #10029.

PiperOrigin-RevId: 306619678
werkt pushed a commit to werkt/bazel that referenced this pull request Apr 17, 2020
ActionResult Orphaned Output Directories must be wrapped in
BulkTransferExceptions in order to be eligible for re-execution. This
change prevents an unavoidable build failure introduced with bazelbuild#10029 in
the event that an output directory tree specified in an action result
is missing from the CAS.
werkt pushed a commit to werkt/bazel that referenced this pull request Apr 17, 2020
ActionResult Orphaned Output Directories must be wrapped in
BulkTransferExceptions in order to be eligible for re-execution. This
change prevents an unavoidable build failure introduced with bazelbuild#10029 in
the event that an output directory tree specified in an action result
is missing from the CAS, and adds testing that would detect such a
regression.
bazel-io pushed a commit that referenced this pull request Apr 17, 2020
ActionResult Orphaned Output Directories must be wrapped in
BulkTransferExceptions in order to be eligible for re-execution. This
change prevents an unavoidable build failure introduced with #10029 in
the event that an output directory tree specified in an action result
is missing from the CAS, and adds testing that would detect such a
regression.

Closes #11140.

PiperOrigin-RevId: 307027914
laurentlb pushed a commit that referenced this pull request Apr 17, 2020
Create an encapsulating DownloadException to represent the aggregate
exception set of an ActionResult download. IOExceptions will be retained
through exception suppression, and the outer exception has a property to
indicate if it only represents a sequence of CacheNotFoundExceptions.

InterruptedExceptions interception is improved to cancel pending work
and wrap, through suppression, any DownloadException that also occurred
during the download. InterruptedException being thrown on the download
control thread, it does not require suppression of further interrupts,
and can represent an outer download exception. Thread interrupt status
is suppressed for cancellations, and conveyed on throw.

These exception wrapping efforts allow non-asynchronous frame
representation in stack traces, and much clearer identification of
sources within remote strategy execution which produce failures based on
remote errors.

Any DownloadException in the last-ditch output download under
handleError in RemoteSpawnRunner is added as suppressed to the
initiating exception. Other exceptions (likely local IO) present clear
immediate traces and do not require specialized treatment.

Closes #10029.

PiperOrigin-RevId: 306619678
laurentlb pushed a commit that referenced this pull request Apr 17, 2020
ActionResult Orphaned Output Directories must be wrapped in
BulkTransferExceptions in order to be eligible for re-execution. This
change prevents an unavoidable build failure introduced with #10029 in
the event that an output directory tree specified in an action result
is missing from the CAS, and adds testing that would detect such a
regression.

Closes #11140.

PiperOrigin-RevId: 307027914
@laurentlb
Copy link
Contributor

The fix was cherrypicked on Friday.
If you try the release candidate, let me know if there's any other blocker (we might do the release tomorrow, but can delay it if it helps).

@werkt
Copy link
Contributor Author

werkt commented Apr 20, 2020

@laurentlb we've been building since rc3 was cut and haven't found any other blockers, thanks for the cherry pick!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes team-Remote-Exec Issues and PRs for the Execution (Remote) team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants