Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unexpected java_tests_mac failure #2136

Closed
alyssawilk opened this issue Mar 30, 2022 · 7 comments · Fixed by #2191
Closed

unexpected java_tests_mac failure #2136

alyssawilk opened this issue Mar 30, 2022 · 7 comments · Fixed by #2191
Assignees
Labels

Comments

@alyssawilk
Copy link
Contributor

https://github.com/envoyproxy/envoy-mobile/runs/5744148090?check_suite_focus=true

WARNING: /Users/runner/work/envoy-mobile/envoy-mobile/test/common/jni/BUILD:17:10: in cc_binary rule //test/common/jni:libquic_test_server_jni.so: target '//test/common/jni:libquic_test_server_jni.so' depends on deprecated target '@local_jdk//:jni_header': Don't depend on targets in the JDK workspace; use @bazel_tools//tools/jdk:current_java_runtime instead (see bazelbuild/bazel#5594)

@alyssawilk alyssawilk added the ci label Mar 30, 2022
@ulfjack
Copy link
Contributor

ulfjack commented Mar 30, 2022

The warning is unrelated, as far as I can tell. The problem is that the //test/java/org/chromium/net:net_tests target failed; maybe the testOrderCanceledRequest test deadlocked?

Might be easier to see here:
https://envoy.cluster.engflow.com/invocation/8e00c3c5-5e4a-47ed-a279-f4f0de986f16#targets-Ly90ZXN0L2phdmEvb3JnL2Nocm9taXVtL25ldDpuZXRfdGVzdHM=

@alyssawilk
Copy link
Contributor Author

Ah yep, thats my bad. Also that's a MUCH better dashboard.

Looks like this is chrome test failures, @carloseltuerto can you take a look?

@ulfjack
Copy link
Contributor

ulfjack commented Mar 30, 2022

When I looked at the GitHub page, it was initially all black until I scrolled all the way to the bottom, and the warning was the last visible thing before the black.

@carloseltuerto
Copy link
Contributor

This is the sequence:

  • onHeader callback with endStream=false
  • stream.readData()
  • stream.cancel()
  • onData callback with endStream=true
  • EM erases stream
  • no final callback. At first sight, this is a contract breach.

This error is systematic - it is not flaky on Linux. Something has changed lately? This test has been working for weeks.

[2022-03-30 22:20:57.653][1028342][debug][http] [library/common/http/client.cc:86] [S0] response data for stream (length=23 end_stream=false)
[2022-03-30 22:20:57.653][1028342][debug][http] [library/common/http/client.cc:118] [S0] buffering 23 bytes due to explicit flow control. 23 total bytes buffered.
[2022-03-30 22:20:57.660][1028342][debug][http] [library/common/http/client.cc:184] [S0] received resume data call for 32768 bytes

JavaJNI call: JniLibrary.readData
JavaJNI call: JniLibrary.resetStream

[2022-03-30 22:20:57.660][1028342][debug][http] [library/common/http/client.cc:135] [S0] dispatching to platform response data for stream (length=23 end_stream=true)

main_interface.reset_stream(envoy_stream_t stream)

[2022-03-30 22:20:57.660][1028342][debug][http] [library/common/http/client.cc:585] [S0] erased stream from streams container
[2022-03-30 22:20:57.661][1028342][debug][http] [library/common/http/client.cc:224] [S0] complete stream (success=true)
[2022-03-30 22:20:57.662][1028342][debug][http] [library/common/http/client.cc:344] [S0] destroy stream

@alyssawilk
Copy link
Contributor Author

Ok, finally had some time to dig into this and I'm seeing pretty different behavior locally.
first I can't run these in parallel which I assume is a known limitation?

Running bazel test --define=signal_trace=disabled //test/java/org/chromium/net:net_tests --runs_per_test=2 --local_test_jobs=1 seems to pass. I'll try a couple of hundred runs in parallel and see if I can reproduce.

@carloseltuerto
Copy link
Contributor

carloseltuerto commented Apr 23, 2022

Alyssa, unfortunately, I have sent you on a goose chase. The final callback is being sent to the Java layer - I just did not recognise the log entry:

[2022-03-30 22:20:57.661][1028342][debug][http] [library/common/http/client.cc:224] [S0] complete stream (success=true)

I am really sorry. This is a race condition in Cronvoy.

@carloseltuerto
Copy link
Contributor

cc: @alyssawilk @StefanoDuo
The race condition occurs when a "cancel" is invoked after receiving this network callback onData(buffer, true). Since the endStream occurred, Cronvoy does not attempt to invoke "cancel" - would do no good at this point. Instead, Cronvoy immediately posts a User task to handle the cancellation. Once that User task completes, Cronvoy posts the CronetMetrics. But this is racy: the CronetMetrics may not be available yet: the needed EnvoyFinalStreamIntel is to be received first through this network callback: onComplete. So, if onComplete callback arrives a bit late, then posting the CronetMetrics is not performed. The failing tests waits for the CronetMetrics posting to finish: it times out in that case.

alyssawilk pushed a commit that referenced this issue Apr 25, 2022
At the end of a request, mRequestFinishedListener is used to post the CronetMetrics. Before doing so, two events must have occurred first: the "final user callback" and the "final Network callback". The Thread involved with the last of these two events is in charge of the posting the CronetMetrics - this is intrinsically racy. This PR ensures a proper behaviour.

Description: Solves CI test flakiness: unexpected java_tests_mac failure: test/java/org/chromium/net:net_tests
Risk Level: N/A (Cronet is still in development)
Testing: CI, and local multi runs
Docs Changes: N/A
Release Notes: N/A
Fixes #2136
Signed-off-by: Charles Le Borgne cleborgne@google.com

Signed-off-by: Charles Le Borgne <cleborgne@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants