Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backoff timing verification can flake #115

Closed
dbolduc opened this issue Oct 18, 2023 · 5 comments
Closed

Backoff timing verification can flake #115

dbolduc opened this issue Oct 18, 2023 · 5 comments
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@dbolduc
Copy link
Member

dbolduc commented Oct 18, 2023

googleapis/google-cloud-cpp#12913

=== RUN   TestMutateRows_Retry_ExponentialBackoff
    mutaterows_test.go:347: The three retry delays are: 111ms, 106ms, 353ms
    mutaterows_test.go:350: 
        	Error Trace:	/var/tmp/downloads/cloud-bigtable-clients-test/tests/mutaterows_test.go:350
        	Error:      	"111" is not less than "106"
        	Test:       	TestMutateRows_Retry_ExponentialBackoff
--- FAIL: TestMutateRows_Retry_ExponentialBackoff (0.63s)

C++ has unit tests in our code that verify that the retry policy always increases. There is jitter (randomness) involved.

I think it is very likely that the computer running the build had a hiccup during the first attempt to add an extra few ms of slowness. And the backoff duration on the second attempt happened to be very close to the backoff duration of the first attempt. Thus the proxy detects a failure.

To bandaid, we can add a few ms of tolerance to the backoff expectations. But ultimately we are at the mercy of a time based test.

@dbolduc dbolduc added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Oct 18, 2023
@daniel-sanche
Copy link

daniel-sanche commented Oct 18, 2023

Python also flakes on this test, because the shared gapic api-core library uses random(0, next_max_value) when calculating backoff. I opened a PR to change the backoff calculations to be in line with the tests here, but then I saw that the official grpc service_config spec says to use Python's strategy. So maybe C++ is doing something similar?

@dbolduc
Copy link
Member Author

dbolduc commented Oct 18, 2023

Python also flakes on this test

Oh yeah... this test was probably written against the C++ client.

We should set up the test to pass in the "full jitter" case (0, N]. And by "we" I mean "someone other than me". 👃 👈

So maybe C++ is doing something similar?

Not in bigtable1. We do it in other libraries though.

https://github.com/googleapis/google-cloud-cpp/blob/2c26eb99458e32b22ebcb6c865cfa74b88a7ed5f/google/cloud/bigtable/internal/defaults.cc#L214-L215

For way too much context, see googleapis/google-cloud-cpp#8755

Footnotes

  1. I did not update the bigtable defaults, because I think they are perfectly fine as is. Customers can set a full jitter policy if they want to.

@liujiongxin
Copy link
Collaborator

I think to support the "full jitter" case, the test should drop the time check: different clients use different default, so it's not possible to check the absolute value in an agnostic way. I will make the change to just use logging for now.

@liujiongxin
Copy link
Collaborator

liujiongxin commented Oct 23, 2023

I made the change: eaf9477

In your presubmit, please checkout the new release v0.0.2

@dbolduc
Copy link
Member Author

dbolduc commented Oct 23, 2023

I made the change: eaf9477

Thanks!

In your presubmit, please checkout the new release v0.0.2

The robots did it for us.... Nice! googleapis/google-cloud-cpp#12960

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

3 participants