Increased DNS Timeouts After Upgrading to v1.31.0 #35117

Closed
agrawroh opened this issue Jul 9, 2024 · 17 comments · Fixed by #35335
Labels
area/dns bug stale stalebot believes this issue/PR has not been touched recently

Comments

@agrawroh
Contributor

agrawroh commented Jul 9, 2024

Description

We have observed an increase in DNS timeouts after upgrading our Dev/Staging environment from v1.30.2 to v1.31.0 (Dev). All other factors, including traffic load, remained unchanged. We suspect this issue might be related to the recent major version upgrade of c-ares [Reference].

DNS Resolver Events:

DNS timeouts increase after switching to v1.31.0 and go away once we move back to v1.30.2:

image

The number of DNS resolutions did not change:

Screenshot 2024-07-05 at 10 43 45
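
For anyone trying to reproduce this, a rough way to confirm that timeouts rose while lookup volume stayed flat is to compare the timeout-to-resolution ratio between two snapshots of the admin stats. The sketch below assumes the c-ares resolver counters are named `dns.cares.timeouts` and `dns.cares.resolve_total` and that the stats are dumped as plain `name: value` lines; the helper names and the sample numbers are made up for illustration and are not taken from the graphs above.

```python
# Assumed counter names for Envoy's c-ares resolver; adjust if yours differ.
RESOLVE_TOTAL = "dns.cares.resolve_total"
TIMEOUTS = "dns.cares.timeouts"


def parse_stats(text: str) -> dict:
    """Parse 'name: value' lines, keeping only integer-valued entries."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if not sep:
            continue  # not a 'name: value' line
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            continue  # skip histograms and other non-integer values
    return stats


def timeout_ratio(before: dict, after: dict) -> float:
    """Fraction of resolutions that timed out between two snapshots."""
    resolves = after[RESOLVE_TOTAL] - before[RESOLVE_TOTAL]
    timeouts = after[TIMEOUTS] - before[TIMEOUTS]
    return timeouts / resolves if resolves > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical snapshots taken some time apart on the same instance.
    snap_a = parse_stats("dns.cares.resolve_total: 1000\ndns.cares.timeouts: 100\n")
    snap_b = parse_stats("dns.cares.resolve_total: 2000\ndns.cares.timeouts: 150\n")
    print(f"timeout ratio: {timeout_ratio(snap_a, snap_b):.2%}")
```

Comparing this ratio on a v1.30.2 instance and a v1.31.0 instance under the same load separates "more timeouts per lookup" from "more lookups".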

cc @alyssawilk @deveshkandpal1224

@agrawroh agrawroh added bug triage Issue requires triage labels Jul 9, 2024
@nezdolik nezdolik added area/dns and removed triage Issue requires triage labels Jul 9, 2024
@nezdolik
Member

nezdolik commented Jul 9, 2024

cc @yanavlasov @mattklein123

@alyssawilk
Contributor

Huh. Would it be possible to run an experiment with the c-ares version downgraded again? If so, we could downgrade Envoy's dependency, but I think we'd have to file an issue with c-ares to fix the underlying problem.

@arulthileeban
Contributor

This could be related: c-ares/c-ares#542, which is part of the c-ares v1.20.0 release. AFAIK Envoy uses the c-ares defaults for timeouts. Should we bump the defaults internally and/or expose them through an option?
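
To make the concern concrete, here is a back-of-the-envelope sketch of how much total wait a lookup gets before it is reported as a timeout. The numbers are assumptions based on my reading of c-ares/c-ares#542 (5000 ms and 4 tries before, 2000 ms and 3 tries after), and the model is deliberately simplified: it assumes a single name server and a fixed per-attempt timeout, whereas c-ares may adjust the timeout on later attempts. `ResolverDefaults` and `simple_budget_ms` are hypothetical names, not part of Envoy or c-ares.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolverDefaults:
    timeout_ms: int  # per-attempt timeout in milliseconds
    tries: int       # maximum number of attempts before giving up


def simple_budget_ms(d: ResolverDefaults) -> int:
    """Total wait before giving up, assuming a fixed timeout per attempt."""
    return d.timeout_ms * d.tries


# Assumed defaults (see the lead-in); verify against the c-ares sources.
OLD = ResolverDefaults(timeout_ms=5000, tries=4)
NEW = ResolverDefaults(timeout_ms=2000, tries=3)

if __name__ == "__main__":
    for label, d in (("before c-ares 1.20.0", OLD), ("c-ares 1.20.0 and later", NEW)):
        print(f"{label}: {simple_budget_ms(d)} ms")
```

Under these assumptions, a slow upstream resolver that used to answer within the older budget could now surface as a timeout, which is why restoring or exposing these values looked like a plausible first step.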

@agrawroh
Contributor Author

This could be related: c-ares/c-ares#542, which is part of the c-ares v.1.20.0 release. AFAIK Envoy uses the defaults for timeouts. Should we bump the defaults internally and/or expose it through an option?

@arulthileeban Thanks for the pointers. Re: DNS Timeouts, I tried doing that here and am happy to re-open this PR and scope it to c-ares only for now.

@alyssawilk
Contributor

Sure, we can go with that, or I can land #35335, which attempts to just restore the prior behavior. Thoughts?

@agrawroh
Contributor Author

@alyssawilk You can merge your changes, and I'll also re-open my PR, as I think it's good for users to have these values configurable (we might need to raise them further on our end).

alyssawilk added a commit that referenced this issue Jul 23, 2024
This should theoretically restore the defaults changed in
https://github.com/c-ares/c-ares/pull/542/files

Risk Level: medium
Testing:
Docs Changes: n/a
Release Notes: n/a
fixes #35117

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk
Contributor

SGTM, thanks!

@agrawroh
Contributor Author

@alyssawilk We tried back-porting your changes to restore the previous defaults, but we are still seeing the timeout issue. I even tried increasing the timeout to 10s and am still seeing similar spikes. It looks like something else changed, either in how we report timeouts or in c-ares itself.

Screenshot 2024-07-25 at 17 19 47

The graph covers ~30 days and shows two sets of spikes: the first when we rolled out the new version without your changes, and the second after including the timeout changes.

@alyssawilk alyssawilk reopened this Jul 25, 2024
@alyssawilk
Contributor

Could you try cherrypicking out https://github.com/envoyproxy/envoy/pull/33711/files (git revert or what have you) and see if that addresses the problem?

@agrawroh
Contributor Author

agrawroh commented Aug 1, 2024

Could you try cherrypicking out https://github.com/envoyproxy/envoy/pull/33711/files (git revert or what have you) and see if that addresses the problem?

@alyssawilk Yes, reverting the c-ares version bump does solve the timeouts issue. Unfortunately, we want the udp_max_queries feature which is only available in the new version :(

@alyssawilk
Contributor

I spun up GDB and verified that the channel timeout is being set correctly in c-ares:
(gdb) print channel->timeout
$8 = 5000
So the "supposed fix" looks like it worked.
Without being able to reproduce the error rate, I think I'm effectively stymied. You might have to go poke the c-ares folks and ask them what else changed between the two versions that might have caused this, sorry :-/

@deveshkandpal1224
Contributor

@agrawroh - apologies for being late to the party. I just spot-checked our envs and I'm not seeing DNS timeouts with 1.31-Dev. Is there a config_dump that you can share for comparison?

@agrawroh
Contributor Author

agrawroh commented Aug 2, 2024

Just a hunch, could it be related to this issue?

c-ares/c-ares@a070d78

I saw on the Dev channel that the issue was fixed upstream and that we patched it:

#35511

martinduke pushed a commit to martinduke/envoy that referenced this issue Aug 8, 2024
@alyssawilk
Contributor

cc @yanavlasov @adisuissa as we're discussing c-ares issues right now

@adisuissa
Contributor

Might be.
From looking at c-ares/c-ares#571, I do see some references to timeouts, but I'm not sure if it's the same root cause.

FWIW, the fix will be part of Envoy v1.31.1, and Envoy-dev should already be using c-ares v1.21.0.


github-actions bot commented Sep 8, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 8, 2024

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 15, 2024