Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in AbstractParallelEvaluator when using gRPC cache with --remote_upload_local_results and an error happens during FindMissingBlobs #11019

Closed
brentleyjones opened this issue Mar 25, 2020 · 5 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) stale Issues or PRs that are stale (no activity for 30 days) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@brentleyjones
Copy link
Contributor

brentleyjones commented Mar 25, 2020

Description of the problem / feature request:

When connecting to a gRPC cache, with --remote_upload_local_results, if the cache returns status errors during FindMissing, bazel crashes in AbstractParallelEvaluator$Evaluate.run. No crash occurs with --noremote_upload_local_results.

Here is an example log:

WARNING: Reading from Remote Cache:
UNIMPLEMENTED: AC - Get
Internal error thrown during build. Printing stack trace: java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData{actionLookupKey=@com_google_pro
tobuf//:protobuf BuildConfigurationValue.Key[a53f69ba2e7dbb0e8ac6959a615175899ae468e3bba136b6b0abd494c14bb922] true, actionIndex=25}' (requested by nodes 'ActionLookupData{actio
nLookupKey=@com_google_protobuf//:protobuf BuildConfigurationValue.Key[a53f69ba2e7dbb0e8ac6959a615175899ae468e3bba136b6b0abd494c14bb922] true, actionIndex=53}')
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:515)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:399)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: UNIMPLEMENTED: CAS - FindMissing
        at io.grpc.Status.asRuntimeException(Status.java:532)
        at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:398)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        ... 3 more

You can see from the log that it's the FindMissing call that is failing.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  • Connect to a gRPC cache that throws status errors for all Get, Put, and FindMissingBlobs.
  • Use --remote_upload_local_results.
  • Run a build.
  • The FindMissing status errors and causes a crash.

Changing it so Get and Put always fail, but FindMissingBlobs never does, does not crash.

What operating system are you running Bazel on?

macOS 10.14.6

What's the output of bazel info release?

release 2.1.0

Have you found anything relevant by searching the web?

@jin jin added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged labels Mar 26, 2020
@rmaz
Copy link

rmaz commented Apr 7, 2020

We are also seeing this with broken connections during upload:

./bazelw build //apps/App --remote_cache=https://blah.com/ --remote_upload_local_results=true --remote_download_toplevel --disk_cache= --remote_timeout=800 --remote_retries=10
INFO: Invocation ID: 5ef167bd-4511-41a3-b945-cb474577580d
INFO: Analyzed target //apps/App:App (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
ERROR: /Users/_/projects/ios/apps/App/BUILD:168:1: Linking of rule '//apps/App:App.__internal__.apple_binary' failed due to unexpected I/O exception: null
java.nio.channels.ClosedChannelException
    at com.google.devtools.build.lib.remote.http.AbstractHttpHandler.channelInactive(AbstractHttpHandler.java:176)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221)
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:420)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
    at io.netty.handler.codec.http.HttpClientCodec$Decoder.channelInactive(HttpClientCodec.java:282)
    at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:223)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
    at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1076)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1403)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:827)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:495)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)
Target //apps/App:App failed to build
INFO: Elapsed time: 166.319s, Critical Path: 81.11s
INFO: 2179 processes: 1888 remote cache hit, 291 local.
FAILED: Build did NOT complete successfully

This is with version 2.2.0.

@brentleyjones
Copy link
Contributor Author

With Bazel 3.0.0 (and probably 2.1.0 as well) I also get this crash when there is a broken connection in the middle of a build:

java.io.IOException: io.grpc.StatusRuntimeException: UNAVAILABLE: connection error: desc = "transport: Error while dialing dial tcp 10.101.131.72:9092: connect: connection refused"
	at com.google.devtools.build.lib.remote.GrpcCacheClient.lambda$downloadBlob$10(GrpcCacheClient.java:324)
	at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:175)
	at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:162)
	at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:107)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1024)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:866)
	at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:752)
	at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:185)
	at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:162)
	at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:116)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1024)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:866)
	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:711)
	at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:54)
	at com.google.devtools.build.lib.remote.GrpcCacheClient$2.onError(GrpcCacheClient.java:364)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434)
	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700)
	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:398)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

@ulfjack
Copy link
Contributor

ulfjack commented Sep 5, 2020

@rmaz the issue with the HTTP cache is different, and tracked in #10159.

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) type: bug and removed untriaged labels Dec 9, 2020
@brentleyjones brentleyjones changed the title Crash in AbstractParallelEvaluator when using gRPC cache with --remote_upload_local_results and an error happens during FindMissing Crash in AbstractParallelEvaluator when using gRPC cache with --remote_upload_local_results and an error happens during FindMissingBlobs Feb 2, 2022
@brentleyjones brentleyjones changed the title Crash in AbstractParallelEvaluator when using gRPC cache with --remote_upload_local_results and an error happens during FindMissingBlobs Crash in AbstractParallelEvaluator when using gRPC cache with --remote_upload_local_results and an error happens during FindMissingBlobs Feb 2, 2022
@github-actions
Copy link

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label Jun 13, 2023
@github-actions
Copy link

This issue has been automatically closed due to inactivity. If you're still interested in pursuing this, please reach out to the triage team (@bazelbuild/triage). Thanks!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) stale Issues or PRs that are stale (no activity for 30 days) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

5 participants