Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bazel CI, flaky error: io.netty.handler.codec.UnsupportedMessageTypeException #7464

Closed
meteorcloudy opened this issue Feb 19, 2019 · 25 comments
Assignees
Labels
breakage P1 I'll work on this now. (Assignee required) type: bug

Comments

@meteorcloudy
Copy link
Member

Found in Android Testing:
https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/813#f6c47ea2-aedc-40c7-836f-b5ed4bf36a81

Internal error thrown during build. Printing stack trace: java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData{actionLookupKey=@androidx_appcompat_appcompat_1_0_0//:androidx_appcompat_appcompat_1_0_0 BuildConfigurationValue.Key[40f42623ef123d962c32d7f8917262eb] false, actionIndex=2}' (requested by nodes 'File:[[<execution_root>]bazel-out/android-armeabi-v7a-fastbuild/bin]external/androidx_appcompat_appcompat_1_0_0/_aar/unzipped/resources/androidx_appcompat_appcompat_1_0_0', 'File:[[<execution_root>]bazel-out/android-armeabi-v7a-fastbuild/bin]external/androidx_appcompat_appcompat_1_0_0/_aar/unzipped/assets/androidx_appcompat_appcompat_1_0_0')
--
  | at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:514)
  | at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:370)
  | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  | at java.base/java.lang.Thread.run(Unknown Source)
  | Caused by: io.netty.handler.codec.UnsupportedMessageTypeException: com.google.devtools.build.lib.remote.blobstore.http.DownloadCommand (expected:

Also found in Bazel Watcher:
https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/813#50e4e3cb-82e0-4bb0-ac89-2995dcdbb9db


Internal error thrown during build. Printing stack trace: java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData{actionLookupKey=@io_bazel_rules_go//:stdlib BuildConfigurationValue.Key[8a46327d9cc0e4c8177d8b28029b3954] false, actionIndex=1}' (requested by nodes 'File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/io_bazel_rules_go/darwin_amd64_race_stripped/stdlib%/pkg')
--
  | at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:514)
  | at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:370)
  | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  | at java.base/java.lang.Thread.run(Unknown Source)
  | Caused by: java.lang.UnsupportedOperationException: unsupported message type: DownloadCommand (expected: ByteBuf, FileRegion)


@meteorcloudy meteorcloudy added type: bug P1 I'll work on this now. (Assignee required) breakage labels Feb 19, 2019
@meteorcloudy
Copy link
Member Author

@buchgr Do you know what's going on here?

@meteorcloudy
Copy link
Member Author

FYI @ulfjack @philwo

@meteorcloudy
Copy link
Member Author

The earliest failure I can find is https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/802#f07ca2fa-f603-4e74-8ffb-9bc3b16b9a9b, which happens at Feb, 10th.

@meteorcloudy
Copy link
Member Author

FYI @meisterT

@meteorcloudy
Copy link
Member Author

meteorcloudy commented Feb 19, 2019

The Exception is thrown at

"Unsupported message type: " + StringUtil.simpleClassName(msg)),

@meisterT
Copy link
Member

The last change to the http download handler is 285c03e which could have caused this.

cc @nicolov who authored the change

@nicolov
Copy link
Contributor

nicolov commented Feb 19, 2019

I'm sorry, @buchgr made a few exception handling changes on top of my original change, and I have no idea how those work.

@meteorcloudy
Copy link
Member Author

What I said at https://github.com/bazelbuild/bazel/issues/7464#issuecomment-465145535 is wrong.
The Exception was thrown at AbstractNioByteChannel.java:270 from at com.google.devtools.build.lib.remote.blobstore.http.HttpBlobStore.lambda$get$3(HttpBlobStore.java:417)

meisterT added a commit that referenced this issue Feb 20, 2019
This reverts commit 285c03e.

We try to validate our assumption that this causes #7464.
@meisterT
Copy link
Member

meisterT commented Feb 20, 2019

@meteorcloudy and me tried to repro the failure and failed even when running it 20 times. Our current assumption is that we cannot repro because after a successful run the remote cache is populated and that changes which code paths that are triggered. We also played around with the remote_timeout setting.

The new plan is the following: I created a separate branch including the rollback and triggered the downstream pipeline on it. I'll do so ~twice a day and if we don't see the failure there, we'll roll back in master (and for 0.23).

@lfpino
Copy link
Contributor

lfpino commented Feb 20, 2019

@meteorcloudy @meisterT @ulfjack and I did some inspection on this problem and understood a bit better but no idea yet what's causing it. Describing the way I saw it below:

The problems is that a DownloadCommand is being sent [1] using the SslHandler [2] instead of any of the other handlers [3] and since writeAndFlush accepts an arbitrary Object then it ends up triggering [4] (Forgive the version number, I don't remember which netty we're using).

[1] https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/blobstore/http/HttpBlobStore.java#L417
[2] https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/blobstore/http/HttpBlobStore.java#L229
[3] https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/blobstore/http/HttpBlobStore.java#L267
[4] https://github.com/whg333/netty-4.0.36.Final/blob/master/src/main/java/io/netty/handler/ssl/SslHandler.java#L473

@meteorcloudy
Copy link
Member Author

@Globegitter mentioned this in #7459:

We are using a remote cache in ci via https://github.com/notnoopci/bazel-remote-proxy. The file that is being talked about in the exception: File:[[<execution_root>]bazel-out/k8-fastbuild/bin]indexpage/.nuxt') is a directory that is created by a rule via actions.declare_directory(...)

@meteorcloudy
Copy link
Member Author

With the help of @lfpino , I finally have a reproduce with https://github.com/meteorcloudy/my_tests/tree/master/tree_artifact_test. Will try to confirm if 285c03e is the culprit.

@meteorcloudy
Copy link
Member Author

So bisecting with the reproduce case shows 1532df0 as the culprit.
But there is no clean rollback of this commit.
@benjaminp @ulfjack Can you help rollback ?

@ulfjack
Copy link
Contributor

ulfjack commented Feb 21, 2019

How do you reproduce the issue?

@meteorcloudy
Copy link
Member Author

By building https://github.com/meteorcloudy/my_tests/tree/master/tree_artifact_test with

bazel clean --expunge
bazel build --show_progress_rate_limit=5 --curses=yes --color=yes --verbose_failures --keep_going --jobs=32 --announce_rc --experimental_multi_threaded_digest --sandbox_tmpfs_path=/tmp --remote_timeout=1 --disk_cache= --remote_max_connections=2000 --host_platform_remote_properties_override='properties:{name:"platform" value:"ubuntu1604"}' --google_credentials=/usr/local/google/home/pcloudy/bin/bazel-untrusted.json --remote_http_cache=https://storage.googleapis.com/pcloudy-test //...

@meteorcloudy
Copy link
Member Author

You can reproduce on a CI Linux machine inside docker or on your local machine if you setup Google credentials for bazel-untrusted.

You can download a bazel binary built at a specific commit by gsutil cp gs://bazel-builds/artifacts/ubuntu1404/$commit/bazel ~/bin/bazel-$commit && chmod +x ~/bin/bazel-$commit

@benjaminp
Copy link
Collaborator

I think my commit probably just exposed an underlying issue. In fact, a major rationale for 1532df0 was to stop hiding RuntimeExceptions from programming errors in IOException. A nasty but explicit way to rollback to the previous behavior would be to stick try { ... } catch (UnsupportedMessageTypeException e) { throw new IOException(e) } around the get() implementation of the HttpBlobStore.

I don't suppose bazel-untrusted is going to be available to me, but I looked briefly at the circumstantial evidence. Here's the interesting part of a stacktrace from BuildKite:

Caused by: io.netty.handler.codec.UnsupportedMessageTypeException: com.google.devtools.build.lib.remote.blobstore.http.DownloadCommand (expected: io.netty.buffer.ByteBuf)
        at io.netty.handler.ssl.SslHandler.write(SslHandler.java:710)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:801)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:814)
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:804)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:814)
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:804)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:814)
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794)
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:831)
        at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1041)
        at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:300)
        at com.google.devtools.build.lib.remote.blobstore.http.HttpBlobStore.lambda$get$3(HttpBlobStore.java:417)

From this, you can see the AbstractChannelHandlerContext is skipping two pipeline handlers, probably HttpDownloadHandler and HttpClientCodec, in invokeAndFush. That is presumably because the Channel pipeline is in some half torn-down or setup state.

BTW, it would be great if we could get Bazel's java.log out of BuildKite.

@Reflexe
Copy link
Contributor

Reflexe commented Feb 21, 2019

Interesting; If I understand it correctly; the only way AbstractChannelHandlerContext will skip handler is because it is in an invalid state (INIT, ADD_PENDING, REMOVE_COMPLETE) that would happen only in case that add{Last,First} while the channel wasn't in an event loop (ADD_PENDING) or when the handler has been removed (releaseDownloadChannel).

Now, I'm a little bit tired but I think I could find a few places where releaseDownloadChannel could be called twice on the same channel (for example:

and ). That could lead to our problem (pipeline with invalid handlers).

@benjaminp
Copy link
Collaborator

There's a race where HttpDownloadHandler & HttpUploadHandler fire their finished promises before closing the channel. A channel that's in the process of shutting down could be put back into the pool and handed off to another client.

@meteorcloudy
Copy link
Member Author

meteorcloudy commented Feb 22, 2019

@benjaminp Sounds like you already found the cause, can you send a fix for this? Thanks!
If you still need the Bazel's java.log, I can reproduce on my machine then share it with you.

@meteorcloudy
Copy link
Member Author

ulfjack added a commit to ulfjack/bazel that referenced this issue Feb 22, 2019
@ulfjack
Copy link
Contributor

ulfjack commented Feb 22, 2019

This may fix it:
ulfjack@d670b7c

Can you give that a try?

@meteorcloudy
Copy link
Member Author

Still happening with ulfjack/bazel@d670b7c

@ulfjack
Copy link
Contributor

ulfjack commented Feb 22, 2019

I missed a few places. Here's another try:
ulfjack@9756235

@meteorcloudy
Copy link
Member Author

Still happening with ulfjack/bazel@9756235

Can you push to a branch at https://github.com/bazelbuild/bazel for later commits? It would be easier for me to build the bazel binary at the commit

ulfjack added a commit to ulfjack/bazel that referenced this issue Feb 22, 2019
If addLast is called outside an event loop, then the handler is added in
the 'pending' state. Sending an event to the pipeline does not send it
to the last handler, but to the last _non-pending_ handler. We therefore
have to make sure to involve the event loop _before_ marking the channel
as ready to be used.

Fixes bazelbuild#7464.
@ulfjack ulfjack self-assigned this Feb 22, 2019
ulfjack added a commit to ulfjack/bazel that referenced this issue Feb 22, 2019
If addLast is called outside an event loop, then the handler is added in
the 'pending' state. Sending an event to the pipeline does not send it
to the last handler, but to the last _non-pending_ handler. We therefore
have to make sure to involve the event loop _before_ marking the channel
as ready to be used.

Thanks to @Reflexe who pointed me in the right direction.

Fixes bazelbuild#7464.
laurentlb pushed a commit that referenced this issue Feb 22, 2019
If addLast is called outside an event loop, then the handler is added in
the 'pending' state. Sending an event to the pipeline does not send it
to the last handler, but to the last _non-pending_ handler. We therefore
have to make sure to involve the event loop _before_ marking the channel
as ready to be used.

Thanks to @Reflexe who pointed me in the right direction.

Fixes #7464.

Closes #7509.

GERRIT_CHANGE_ID=
PiperOrigin-RevId: 235184010
laurentlb pushed a commit that referenced this issue Feb 28, 2019
If addLast is called outside an event loop, then the handler is added in
the 'pending' state. Sending an event to the pipeline does not send it
to the last handler, but to the last _non-pending_ handler. We therefore
have to make sure to involve the event loop _before_ marking the channel
as ready to be used.

Thanks to @Reflexe who pointed me in the right direction.

Fixes #7464.

Closes #7509.

GERRIT_CHANGE_ID=
PiperOrigin-RevId: 235184010
laurentlb pushed a commit that referenced this issue Mar 7, 2019
If addLast is called outside an event loop, then the handler is added in
the 'pending' state. Sending an event to the pipeline does not send it
to the last handler, but to the last _non-pending_ handler. We therefore
have to make sure to involve the event loop _before_ marking the channel
as ready to be used.

Thanks to @Reflexe who pointed me in the right direction.

Fixes #7464.

Closes #7509.

GERRIT_CHANGE_ID=
PiperOrigin-RevId: 235184010
laurentlb pushed a commit that referenced this issue Mar 7, 2019
If addLast is called outside an event loop, then the handler is added in
the 'pending' state. Sending an event to the pipeline does not send it
to the last handler, but to the last _non-pending_ handler. We therefore
have to make sure to involve the event loop _before_ marking the channel
as ready to be used.

Thanks to @Reflexe who pointed me in the right direction.

Fixes #7464.

Closes #7509.

GERRIT_CHANGE_ID=
PiperOrigin-RevId: 235184010
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breakage P1 I'll work on this now. (Assignee required) type: bug
Projects
None yet
Development

No branches or pull requests

7 participants