Bazel CI, flaky error: io.netty.handler.codec.UnsupportedMessageTypeException #7464

meteorcloudy · 2019-02-19T12:33:46Z

Found in Android Testing:
https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/813#f6c47ea2-aedc-40c7-836f-b5ed4bf36a81

Internal error thrown during build. Printing stack trace: java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData{actionLookupKey=@androidx_appcompat_appcompat_1_0_0//:androidx_appcompat_appcompat_1_0_0 BuildConfigurationValue.Key[40f42623ef123d962c32d7f8917262eb] false, actionIndex=2}' (requested by nodes 'File:[[<execution_root>]bazel-out/android-armeabi-v7a-fastbuild/bin]external/androidx_appcompat_appcompat_1_0_0/_aar/unzipped/resources/androidx_appcompat_appcompat_1_0_0', 'File:[[<execution_root>]bazel-out/android-armeabi-v7a-fastbuild/bin]external/androidx_appcompat_appcompat_1_0_0/_aar/unzipped/assets/androidx_appcompat_appcompat_1_0_0')
--
  | at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:514)
  | at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:370)
  | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  | at java.base/java.lang.Thread.run(Unknown Source)
  | Caused by: io.netty.handler.codec.UnsupportedMessageTypeException: com.google.devtools.build.lib.remote.blobstore.http.DownloadCommand (expected:

Also found in Bazel Watcher:
https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/813#50e4e3cb-82e0-4bb0-ac89-2995dcdbb9db


Internal error thrown during build. Printing stack trace: java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData{actionLookupKey=@io_bazel_rules_go//:stdlib BuildConfigurationValue.Key[8a46327d9cc0e4c8177d8b28029b3954] false, actionIndex=1}' (requested by nodes 'File:[[<execution_root>]bazel-out/darwin-fastbuild/bin]external/io_bazel_rules_go/darwin_amd64_race_stripped/stdlib%/pkg')
--
  | at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:514)
  | at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:370)
  | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  | at java.base/java.lang.Thread.run(Unknown Source)
  | Caused by: java.lang.UnsupportedOperationException: unsupported message type: DownloadCommand (expected: ByteBuf, FileRegion)

meteorcloudy · 2019-02-19T12:42:07Z

@buchgr Do you know what's going on here?

meteorcloudy · 2019-02-19T12:42:30Z

FYI @ulfjack @philwo

meteorcloudy · 2019-02-19T14:14:11Z

The earliest failure I can find is https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/802#f07ca2fa-f603-4e74-8ffb-9bc3b16b9a9b, which happened on Feb 10th.

meteorcloudy · 2019-02-19T14:14:24Z

FYI @meisterT

meteorcloudy · 2019-02-19T14:18:58Z

~~The Exception is thrown at~~

bazel/src/main/java/com/google/devtools/build/lib/remote/blobstore/http/HttpDownloadHandler.java

Line 66 in 285c03e

"Unsupported message type: " + StringUtil.simpleClassName(msg)),

meisterT · 2019-02-19T14:41:47Z

The last change to the HTTP download handler is 285c03e, which could have caused this.

cc @nicolov who authored the change

nicolov · 2019-02-19T16:08:51Z

I'm sorry, @buchgr made a few exception handling changes on top of my original change, and I have no idea how those work.

meteorcloudy · 2019-02-20T09:50:23Z

What I said at https://github.com/bazelbuild/bazel/issues/7464#issuecomment-465145535 is wrong.
The exception was thrown at AbstractNioByteChannel.java:270, reached from com.google.devtools.build.lib.remote.blobstore.http.HttpBlobStore.lambda$get$3(HttpBlobStore.java:417)

meisterT · 2019-02-20T12:31:12Z

@meteorcloudy and I tried to reproduce the failure and could not, even after running it 20 times. Our current assumption is that we cannot reproduce it because the remote cache is populated after a successful run, which changes which code paths are triggered. We also played around with the remote_timeout setting.

The new plan is the following: I created a separate branch including the rollback and triggered the downstream pipeline on it. I'll do so ~twice a day and if we don't see the failure there, we'll roll back in master (and for 0.23).

lfpino · 2019-02-20T14:52:16Z

@meteorcloudy @meisterT @ulfjack and I inspected this problem and understand it a bit better, but we have no idea yet what's causing it. Here is how I see it:

The problem is that a DownloadCommand is being sent [1] through the SslHandler [2] instead of any of the other handlers [3], and since writeAndFlush accepts an arbitrary Object, it ends up triggering [4]. (Forgive the version number; I don't remember which Netty we're using.)

[1] https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/blobstore/http/HttpBlobStore.java#L417
[2] https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/blobstore/http/HttpBlobStore.java#L229
[3] https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/blobstore/http/HttpBlobStore.java#L267
[4] https://github.com/whg333/netty-4.0.36.Final/blob/master/src/main/java/io/netty/handler/ssl/SslHandler.java#L473

meteorcloudy · 2019-02-20T16:48:12Z

@Globegitter mentioned this in #7459:

We are using a remote cache in ci via https://github.com/notnoopci/bazel-remote-proxy. The file that is being talked about in the exception: File:[[<execution_root>]bazel-out/k8-fastbuild/bin]indexpage/.nuxt') is a directory that is created by a rule via actions.declare_directory(...)

meteorcloudy · 2019-02-21T08:13:43Z

With the help of @lfpino, I finally have a reproduction at https://github.com/meteorcloudy/my_tests/tree/master/tree_artifact_test. I will try to confirm whether 285c03e is the culprit.

meteorcloudy · 2019-02-21T11:45:36Z

Bisecting with the reproduction case shows 1532df0 as the culprit.
But this commit cannot be rolled back cleanly.
@benjaminp @ulfjack Can you help roll it back?

ulfjack · 2019-02-21T13:37:18Z

How do you reproduce the issue?

meteorcloudy · 2019-02-21T13:52:15Z

By building https://github.com/meteorcloudy/my_tests/tree/master/tree_artifact_test with

bazel clean --expunge
bazel build --show_progress_rate_limit=5 --curses=yes --color=yes --verbose_failures --keep_going --jobs=32 --announce_rc --experimental_multi_threaded_digest --sandbox_tmpfs_path=/tmp --remote_timeout=1 --disk_cache= --remote_max_connections=2000 --host_platform_remote_properties_override='properties:{name:"platform" value:"ubuntu1604"}' --google_credentials=/usr/local/google/home/pcloudy/bin/bazel-untrusted.json --remote_http_cache=https://storage.googleapis.com/pcloudy-test //...

meteorcloudy · 2019-02-21T13:54:08Z

You can reproduce on a CI Linux machine inside Docker, or on your local machine if you set up Google credentials for bazel-untrusted.

You can download a Bazel binary built at a specific commit with:

gsutil cp gs://bazel-builds/artifacts/ubuntu1404/$commit/bazel ~/bin/bazel-$commit && chmod +x ~/bin/bazel-$commit

benjaminp · 2019-02-21T18:04:56Z

I think my commit probably just exposed an underlying issue. In fact, a major rationale for 1532df0 was to stop hiding RuntimeExceptions from programming errors in IOException. A nasty but explicit way to roll back to the previous behavior would be to stick try { ... } catch (UnsupportedMessageTypeException e) { throw new IOException(e); } around the get() implementation of the HttpBlobStore.
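A minimal sketch of the wrapping described above. To keep it compilable without Netty, it uses a stand-in `UnsupportedMessageTypeException`; that class, the `IoSupplier` interface, and `wrapUnsupportedMessage` are hypothetical names for illustration, not code from Bazel or Netty.

```java
import java.io.IOException;

// Hypothetical stand-in for io.netty.handler.codec.UnsupportedMessageTypeException
// (an unchecked exception), so this sketch compiles without Netty.
class UnsupportedMessageTypeException extends RuntimeException {
  UnsupportedMessageTypeException(String message) {
    super(message);
  }
}

class WrapSketch {
  // Hypothetical functional interface for a body that may throw IOException.
  interface IoSupplier<T> {
    T get() throws IOException;
  }

  // Runs the body and converts the unchecked exception into a checked
  // IOException, keeping the original exception as the cause.
  static <T> T wrapUnsupportedMessage(IoSupplier<T> body) throws IOException {
    try {
      return body.get();
    } catch (UnsupportedMessageTypeException e) {
      throw new IOException(e);
    }
  }
}
```

Note that this only restores the old error type. It hides the symptom as an I/O failure and does nothing about whatever puts the pipeline into a bad state.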

I don't suppose bazel-untrusted is going to be available to me, but I looked briefly at the circumstantial evidence. Here's the interesting part of a stacktrace from BuildKite:

Caused by: io.netty.handler.codec.UnsupportedMessageTypeException: com.google.devtools.build.lib.remote.blobstore.http.DownloadCommand (expected: io.netty.buffer.ByteBuf)
        at io.netty.handler.ssl.SslHandler.write(SslHandler.java:710)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:801)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:814)
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:804)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:814)
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:804)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:814)
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794)
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:831)
        at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1041)
        at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:300)
        at com.google.devtools.build.lib.remote.blobstore.http.HttpBlobStore.lambda$get$3(HttpBlobStore.java:417)

From this, you can see the AbstractChannelHandlerContext is skipping two pipeline handlers, probably HttpDownloadHandler and HttpClientCodec, in invokeWriteAndFlush. That is presumably because the Channel pipeline is in some half torn-down or half set-up state.

BTW, it would be great if we could get Bazel's java.log out of BuildKite.

Reflexe · 2019-02-21T21:36:42Z

Interesting. If I understand it correctly, the only way AbstractChannelHandlerContext will skip a handler is if the handler is in an invalid state (INIT, ADD_PENDING, REMOVE_COMPLETE). That would happen only if add{Last,First} was called while the channel wasn't in an event loop (ADD_PENDING) or when the handler has been removed (releaseDownloadChannel).

Now, I'm a little bit tired but I think I could find a few places where releaseDownloadChannel could be called twice on the same channel (for example:

bazel/src/main/java/com/google/devtools/build/lib/remote/blobstore/http/HttpBlobStore.java

Line 435 in 285c03e

getAfterCredentialRefresh(download, outerF);

and

bazel/src/main/java/com/google/devtools/build/lib/remote/blobstore/http/HttpBlobStore.java

Line 445 in 285c03e

releaseDownloadChannel(ch);

). That could lead to our problem (pipeline with invalid handlers).

benjaminp · 2019-02-22T07:22:11Z

There's a race where HttpDownloadHandler & HttpUploadHandler fire their finished promises before closing the channel. A channel that's in the process of shutting down could be put back into the pool and handed off to another client.
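A toy, single-threaded sketch of the ordering problem described above. It assumes a listener that returns the channel to a pool if it still looks open when the "finished" promise completes. `ToyChannel`, `PromiseOrderSketch`, and the pool are hypothetical; this is not the HttpBlobStore code, and `CompletableFuture` stands in for Netty's promise.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.CompletableFuture;

// Hypothetical channel that only tracks whether it is open.
class ToyChannel {
  boolean open = true;

  void close() {
    open = false;
  }
}

class PromiseOrderSketch {
  // Returns true if the next client would get either no pooled channel
  // or one that is still open.
  static boolean nextClientGetsUsableChannel(boolean closeBeforeComplete) {
    Deque<ToyChannel> pool = new ArrayDeque<>();
    ToyChannel ch = new ToyChannel();
    CompletableFuture<Void> finished = new CompletableFuture<>();

    // Listener: hand the channel back to the pool if it still looks open.
    // It runs synchronously inside complete() below.
    finished.thenRun(() -> {
      if (ch.open) {
        pool.push(ch);
      }
    });

    if (closeBeforeComplete) {
      ch.close();
      finished.complete(null);
    } else {
      // Problematic order: the listener pools the channel, then it is closed.
      finished.complete(null);
      ch.close();
    }

    ToyChannel next = pool.poll();
    return next == null || next.open;
  }
}
```

In this model, `nextClientGetsUsableChannel(false)` hands out a closed channel, while `nextClientGetsUsableChannel(true)` does not. Later comments in this thread report that reordering alone did not make the failure go away.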

meteorcloudy · 2019-02-22T09:29:33Z

@benjaminp Sounds like you already found the cause. Can you send a fix for this? Thanks!
If you still need Bazel's java.log, I can reproduce on my machine and share it with you.

meteorcloudy · 2019-02-22T09:37:26Z

@benjaminp The java.log is here: https://gist.github.com/meteorcloudy/8002c1fad09544656cb5fc7aae2efa18

ulfjack · 2019-02-22T09:58:32Z

This may fix it:
ulfjack@d670b7c

Can you give that a try?

meteorcloudy · 2019-02-22T10:25:16Z

Still happening with ulfjack/bazel@d670b7c

ulfjack · 2019-02-22T11:32:26Z

I missed a few places. Here's another try:
ulfjack@9756235

meteorcloudy · 2019-02-22T11:46:29Z

Still happening with ulfjack/bazel@9756235

Can you push later commits to a branch at https://github.com/bazelbuild/bazel? That would make it easier for me to build the Bazel binary at the commit.

If addLast is called outside an event loop, then the handler is added in the 'pending' state. Sending an event to the pipeline does not send it to the last handler, but to the last _non-pending_ handler. We therefore have to make sure to involve the event loop _before_ marking the channel as ready to be used. Thanks to @Reflexe who pointed me in the right direction. Fixes #7464. Closes #7509. GERRIT_CHANGE_ID= PiperOrigin-RevId: 235184010
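As a rough illustration of the pending-handler behavior described above, here is a toy model that does not use Netty. `ToyPipeline` and its methods are hypothetical; the real state machine in AbstractChannelHandlerContext has more states and rules. The sketch only shows why a write can land on an earlier handler (such as the SSL handler) until the event loop has run.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model only: hypothetical classes, not Netty's implementation.
class ToyPipeline {
  enum State { ADD_PENDING, ADD_COMPLETE }

  static final class Handler {
    final String name;
    State state = State.ADD_PENDING;

    Handler(String name) {
      this.name = name;
    }
  }

  // Handlers in head-to-tail order; outbound writes travel tail to head.
  private final List<Handler> handlers = new ArrayList<>();
  // Stand-in for the event loop's task queue.
  private final Queue<Runnable> eventLoopTasks = new ArrayDeque<>();

  // Models addLast called outside the event loop: the handler is appended
  // immediately but stays pending until the queued task runs.
  void addLast(String name) {
    Handler h = new Handler(name);
    handlers.add(h);
    eventLoopTasks.add(() -> h.state = State.ADD_COMPLETE);
  }

  // Models adding a handler from inside the event loop: active at once.
  void addLastInEventLoop(String name) {
    Handler h = new Handler(name);
    h.state = State.ADD_COMPLETE;
    handlers.add(h);
  }

  // Runs all queued tasks, completing any pending additions.
  void runEventLoop() {
    Runnable task;
    while ((task = eventLoopTasks.poll()) != null) {
      task.run();
    }
  }

  // Name of the first handler, searching from the tail, that would
  // receive a write; pending handlers are skipped.
  String firstHandlerForWrite() {
    for (int i = handlers.size() - 1; i >= 0; i--) {
      if (handlers.get(i).state == State.ADD_COMPLETE) {
        return handlers.get(i).name;
      }
    }
    return "head";
  }
}
```

With an `ssl` handler added via `addLastInEventLoop` and `codec` and `download` handlers added via `addLast`, `firstHandlerForWrite()` returns `ssl` until `runEventLoop()` is called, and `download` afterwards. That mirrors the reasoning for involving the event loop before marking the channel as ready.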

meteorcloudy added type: bug P1 I'll work on this now. (Assignee required) breakage labels Feb 19, 2019

This was referenced Feb 19, 2019

UnsupportedOperationException: unsupported message type: DownloadCommand #7459

Closed

Release 0.23 - February 2019 #6495

Closed

meteorcloudy mentioned this issue Feb 19, 2019

culprit_finder.py: Add REPEAT_TIMES to detect flaky build failure bazelbuild/continuous-integration#483

Merged

meisterT added a commit that referenced this issue Feb 20, 2019

Revert "remote: fix timeout for http blob store"

616125a

This reverts commit 285c03e. We try to validate our assumption that this causes #7464.

lfpino mentioned this issue Feb 20, 2019

Using invalid --remote_cache crashes Bazel #7478

Closed

ulfjack added a commit to ulfjack/bazel that referenced this issue Feb 22, 2019

Reverse order of closing and completing the promise

d670b7c

May fix bazelbuild#7464.

meteorcloudy pushed a commit that referenced this issue Feb 22, 2019

Reverse order of closing and completing the promise

cd35b4e

May fix #7464.

meteorcloudy mentioned this issue Feb 22, 2019

Remote cache sometimes hangs for the last action #7505

Closed

ulfjack added a commit to ulfjack/bazel that referenced this issue Feb 22, 2019

Reverse order of closing and completing the promise

9756235

May fix bazelbuild#7464.

meteorcloudy pushed a commit that referenced this issue Feb 22, 2019

Reverse order of closing and completing the promise

8e5a30a

May fix #7464.

ulfjack self-assigned this Feb 22, 2019

ulfjack mentioned this issue Feb 22, 2019

Complete channel initialization in the event loop #7509

Closed

bazel-io closed this as completed in f9eb1b5 Feb 22, 2019

philwo mentioned this issue Mar 7, 2019

Initial attempt at direct s3 remote cache #4889

Closed

18 tasks
