Increased DNS Timeouts After Upgrading to v1.31.0 #35117
Comments
Huh. Would it be possible to run an experiment with the c-ares version downgraded? If that fixes it, we could downgrade Envoy's dependency, but I think we'd have to file an issue with c-ares to fix the underlying problem.
This could be related: c-ares/c-ares#542, which is part of the c-ares v1.20.0 release. AFAIK Envoy uses the c-ares defaults for timeouts. Should we bump the defaults internally and/or expose them through an option?
@arulthileeban Thanks for the pointers. Re: DNS Timeouts, I tried doing that here and am happy to re-open this PR and scope it to c-ares only for now.
Sure, we can go with that, or I can land #35335, which attempts to just restore the prior behavior. Thoughts?
@alyssawilk You can merge your changes, and I'll also re-open my PR, as I think it's good for users to have these values configurable (we might need to raise them further on our end).
This should theoretically restore the defaults changed in https://github.com/c-ares/c-ares/pull/542/files
Risk Level: medium
Testing:
Docs Changes: n/a
Release Notes: n/a
fixes #35117
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
SGTM, thanks!
@alyssawilk We tried back-porting your changes to restore the previous defaults, but we are still seeing the timeout issue. I even tried increasing the timeout to 10s and am still seeing similar spikes. It looks like something else changed, either in how we report the timeouts or in c-ares itself. The graph covers ~30 days and shows two sets of spikes: the first when we rolled out the new version without your changes, and the second after including the timeout changes.
Could you try backing out https://github.com/envoyproxy/envoy/pull/33711/files (git revert or what have you) and see if that addresses the problem?
@alyssawilk Yes, reverting the c-ares version bump does solve the timeouts issue. Unfortunately, we want the udp_max_queries feature which is only available in the new version :(
I spun up GDB and verified that the channel timeout is being set correctly in c-ares.
@agrawroh - apologies for being late to the party. I just spot-checked our envs and I'm not seeing DNS timeouts with 1.31-Dev. Is there a config_dump you can share for comparison?
Just a hunch, could it be related to this issue? I saw on the Dev channel that the issue was fixed upstream and we patched it.
This should theoretically restore the defaults changed in https://github.com/c-ares/c-ares/pull/542/files
Risk Level: medium
Testing:
Docs Changes: n/a
Release Notes: n/a
fixes envoyproxy#35117
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Martin Duke <martin.h.duke@gmail.com>
cc @yanavlasov @adisuissa as we're discussing c-ares issues right now
Might be. FWIW, the fix will be part of Envoy v1.31.1, and Envoy-dev should already be using c-ares v1.21.0.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Description
We have observed an increase in DNS timeouts after upgrading our Dev/Staging environment from v1.30.2 to v1.31.0 (Dev). All other factors, including traffic load, remained unchanged. We suspect this issue might be related to the recent major version upgrade of c-ares [Reference].
DNS Resolver Events:
An uptick in DNS timeouts after switching to v1.31.0, which goes away once we move back to v1.30.2.
Number of DNS resolutions (didn't change).
cc @alyssawilk @deveshkandpal1224