DNS resolution failure results in UH / no healthy upstream #31992
We see a lot of timeouts. Is there a way to configure them? The c-ares library has an ARES_OPT_TIMEOUT option, but I don't see a way to override it in Envoy.
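For illustration, a minimal sketch of what a per-attempt timeout and a retry count mean for a resolver. This is not Envoy's or c-ares's API; the function and parameter names are hypothetical, standing in for what ARES_OPT_TIMEOUT (per-attempt deadline) and ARES_OPT_TRIES (attempt count) control in c-ares.

```python
from typing import Callable, List

def resolve_with_retries(
    resolve: Callable[[str, float], List[str]],
    host: str,
    timeout_s: float = 5.0,  # per-attempt deadline (the role ARES_OPT_TIMEOUT plays)
    tries: int = 4,          # total attempts (the role ARES_OPT_TRIES plays)
) -> List[str]:
    """Call `resolve` up to `tries` times, treating any OSError
    (e.g. a timeout) as a retriable failure."""
    last_err = None
    for _ in range(tries):
        try:
            return resolve(host, timeout_s)
        except OSError as err:
            last_err = err
    raise last_err
```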
cc @yanavlasov @mattklein123 as codeowners
@zuercher - this is set to true by default.
@lambdai / @howardjohn - would be glad if you could shed some light on this one. Similar to #20562
It appears that Envoy marks itself ready even when DNS resolution fails with the following error codes from c-ares:
After updating Envoy's code with an increased DNS resolution timeout and an increased number of retries, we saw a 90% reduction in errors. The experiment showed that the issue is rooted in the way Envoy does DNS resolution. My proposal to solve this bug is threefold, with each part providing a fallback if the previous one fails:
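The three parts themselves aren't quoted above, but the "each part is a fallback for the previous one" shape can be sketched generically. This is a hypothetical illustration, not Envoy code; the resolver callables stand in for whatever the three stages would be.

```python
from typing import Callable, List

def resolve_with_fallbacks(
    resolvers: List[Callable[[str], List[str]]], host: str
) -> List[str]:
    """Try each resolver in order; each later entry is the fallback
    for the one before it. Raise only if every stage fails."""
    last_err = None
    for resolve in resolvers:
        try:
            return resolve(host)
        except OSError as err:
            last_err = err
    raise last_err
```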
When there are 2 STRICT_DNS clusters with the same endpoint, Envoy performs 2 DNS resolutions. Can the resolution mechanism be optimized to avoid duplicate lookups?
@zuercher one question on c-ares: say we have 3 STRICT_DNS clusters, does c-ares open 3 persistent connections to the upstream resolver?
@nirvanagit are you setting the DNS cache config? I think you should be able to aim both clusters at one cache and avoid the duplication.
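The deduplication idea can be sketched as a cache shared by all clusters: the first cluster to ask for a hostname pays for the resolution, and later askers within the TTL reuse the result. This is a simplified illustration with hypothetical names, not Envoy's DNS cache implementation.

```python
import time
from typing import Callable, Dict, List, Tuple

class SharedDnsCache:
    """One cache shared by many clusters: only the first lookup for a
    hostname within the TTL actually queries the upstream resolver."""

    def __init__(self, resolve: Callable[[str], List[str]], ttl_s: float = 30.0):
        self._resolve = resolve
        self._ttl_s = ttl_s
        self._entries: Dict[str, Tuple[float, List[str]]] = {}
        self.lookups = 0  # upstream queries actually issued

    def get(self, host: str) -> List[str]:
        now = time.monotonic()
        hit = self._entries.get(host)
        if hit and now - hit[0] < self._ttl_s:
            return hit[1]  # fresh cached result, no upstream query
        addrs = self._resolve(host)
        self.lookups += 1
        self._entries[host] = (now, addrs)
        return addrs
```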
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
@alyssawilk looks like they are using a regular STRICT_DNS cluster, not the dynamic forward proxy, so no caching is involved here? @mattklein123 @alyssawilk Can you please help answer this question: does the dynamic forward proxy maintain a persistent connection for all lookups, or does it tear the connection down after each lookup? One of the problems we are seeing with these STRICT_DNS clusters is that one of the CoreDNS pods is overwhelmed with a lot of connections. Have you seen this?
ah sorry, I'm much more familiar with DFP than strict DNS. I don't even know what this means? The DFP, as it uses the DNS cache, also supports stale DNS: when DNS expires and re-resolution fails, you can configure the cache to keep using the last successful resolve result. It sounds like the problem is that strict DNS doesn't get any of these benefits - may be worth adding that as an optional feature.
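The stale-DNS behavior described above can be sketched as a cache that keeps serving the last successful result when a refresh fails, instead of reporting no healthy upstream. A simplified illustration with hypothetical names, not the DFP's actual cache code:

```python
from typing import Callable, Dict, List

class StaleTolerantCache:
    """On refresh failure, serve the last successful result ("stale DNS")
    rather than failing the lookup outright."""

    def __init__(self, resolve: Callable[[str], List[str]]):
        self._resolve = resolve
        self._last_good: Dict[str, List[str]] = {}

    def get(self, host: str) -> List[str]:
        try:
            addrs = self._resolve(host)
            self._last_good[host] = addrs  # remember the last success
            return addrs
        except OSError:
            if host in self._last_good:
                return self._last_good[host]  # stale result beats UH
            raise  # never resolved successfully: nothing to fall back to
```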
Persistent connection was the wrong choice of words here. When Envoy sends a DNS query to CoreDNS, what we have observed in some environments is that it always sends to one single CoreDNS pod, flooding that pod. Curious whether there is something in Envoy/c-ares that makes it choose the same pod when DNS lookups are done for multiple STRICT_DNS clusters.
#7965 - found this, and a possible fix in c-ares/c-ares#549. Is it OK to add this configuration to DNSResolverConfig?
@alyssawilk ^^ WDYT?
If you've tested that this addresses your problem, SGTM.
#33551 - adding a permanent knob here
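For reference, a hedged sketch of how a c-ares resolver option can be wired into a STRICT_DNS cluster via `typed_dns_resolver_config`. The `udp_max_queries` field and the type URL below are assumptions based on the CaresDnsResolverConfig proto and may not be the exact knob #33551 added; verify the field names against your Envoy version's docs before using this.

```yaml
# Hedged sketch, not a verified config: wiring a c-ares option into a
# STRICT_DNS cluster. Field names are assumptions to check against the
# CaresDnsResolverConfig proto of your Envoy version.
clusters:
- name: upstream_svc
  type: STRICT_DNS
  typed_dns_resolver_config:
    name: envoy.network.dns_resolver.cares
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig
      # Assumption: caps queries per UDP connection (c-ares
      # ARES_OPT_UDP_MAX_QUERIES), forcing new connections so load can
      # spread across CoreDNS pods instead of flooding one.
      udp_max_queries: 100
```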
@nirvanagit how did you manage to set timeouts on dns_resolver_config? Is that for TCP?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Hello @ramaraochavali, how did you apply this configuration to istio-proxy?
Title: Observing DNS resolution timeout, resulting in UH at pod startup of istio proxy