Add config option for state_store.watchLimit #4986

Closed

Conversation

ShimmerGlass (Contributor)

As described in #4984, exceeding watchLimit greatly degrades blocking query performance.
This patch makes the parameter configurable while keeping the current value of 2048 as the default, and emits a warning when the limit is reached.
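To make the described behaviour concrete, here is a minimal, self-contained Go sketch of what "configurable with a default of 2048 plus a warning" could look like. It is illustrative only, not the actual Consul code; the names `config`, `WatchSoftLimit`, `effectiveWatchLimit`, and `checkWatchLimit` are assumptions for this example.

```go
// Illustrative sketch only; not the actual Consul implementation.
package main

import (
	"log"
	"sync"
)

// defaultWatchSoftLimit mirrors the PR's default of 2048 (assumed name).
const defaultWatchSoftLimit = 2048

// config stands in for the agent's runtime configuration (assumed shape).
type config struct {
	WatchSoftLimit int // 0 means "not set, use the default"
}

var warnOnce sync.Once

// effectiveWatchLimit resolves the configured value, keeping 2048 when unset.
func effectiveWatchLimit(c config) int {
	if c.WatchSoftLimit > 0 {
		return c.WatchSoftLimit
	}
	return defaultWatchSoftLimit
}

// checkWatchLimit reports whether a blocking query is within the limit and
// logs a warning (once) when it is not.
func checkWatchLimit(numWatches, limit int) bool {
	if numWatches <= limit {
		return true
	}
	warnOnce.Do(func() {
		log.Printf("[WARN] state: watch limit of %d exceeded (%d watches requested); "+
			"falling back to a coarser table-wide watch", limit, numWatches)
	})
	return false
}

func main() {
	limit := effectiveWatchLimit(config{})    // 2048
	log.Println(checkWatchLimit(3000, limit)) // false, logs the warning once
	log.Println(checkWatchLimit(100, limit))  // true
}
```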

@pierresouchay (Contributor) left a comment

Use constant if possible, LGTM otherwise

@@ -843,6 +843,7 @@ func (b *Builder) Build() (rt RuntimeConfig, err error) {
VerifyOutgoing: b.boolVal(c.VerifyOutgoing),
VerifyServerHostname: b.boolVal(c.VerifyServerHostname),
Watches: c.Watches,
WatchSoftLimit: b.intValWithDefault(c.WatchSoftLimit, 2048),

Can you use the constant watchLimit instead?
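To make the reviewer's suggestion concrete, here is a rough sketch of centralising the default in a named constant instead of repeating the literal 2048 across the builder and the test fixtures. The constant name, package, and placement below are assumptions for illustration, not the actual Consul source (the reviewer refers to an existing state-store constant called watchLimit).

```go
// Hypothetical sketch of the reviewer's suggestion; names are assumptions.
package config

// defaultWatchSoftLimit is the single source of truth for the default,
// replacing the repeated literal 2048.
const defaultWatchSoftLimit = 2048

// In Builder.Build, the field would then read:
//   WatchSoftLimit: b.intValWithDefault(c.WatchSoftLimit, defaultWatchSoftLimit),
// and TestFullConfig would assert against defaultWatchSoftLimit as well.
```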

@@ -3893,7 +3894,8 @@ func TestFullConfig(t *testing.T) {
datacenter = "fYrl3F5d"
key = "sl3Dffu7"
args = ["dltjDJ2a", "flEa7C2d"]
}]
}],
watch_soft_limit = 2048

same, if possible use the constant

@@ -3347,7 +3347,8 @@ func TestFullConfig(t *testing.T) {
"key": "sl3Dffu7",
"args": ["dltjDJ2a", "flEa7C2d"]
}
]
],
"watch_soft_limit": 2048

constant

// WatchSoftLimit is used as a soft limit to cap how many watches we allow
// for a given blocking query. If this is exceeded, then we will use a
// higher-level watch that's less fine-grained.
// Default to 2048

use constant

@@ -4538,6 +4540,7 @@ func TestFullConfig(t *testing.T) {
"args": []interface{}{"dltjDJ2a", "flEa7C2d"},
},
},
WatchSoftLimit: 2048,

use watchLimit

watchLimit controls how many watches are allowed for each blocking
query. This adds a configuration option named watch_soft_limit to
tweak this value.
@ShimmerGlass force-pushed the watch-limit-option branch 2 times, most recently from 47387f2 to 26e8841 on November 23, 2018 at 13:20
@banks (Member) commented Nov 26, 2018

Thanks @Aestek, this looks good at first glance.

Long term I think this is just a symptom of our watching mechanism needing to be completely replaced, but that is somewhere down the priority list, so this makes a good stop-gap for people hitting issues currently.

The reason we have this limit at all was to prevent blocking queries from using unbounded memory and tons of chans and goroutines, so in some cases raising it could actually cause additional churn. The fallback, however, is cheap to hold but triggers often, so it's only actually cheaper if you don't have regular updates or have low numbers of watchers.

Making this tweakable is a reasonable compromise, although we'll need to be careful how we document it: without profiling an actual workload it's not at all clear where the tradeoff lies between greater resource usage by the blocking query versus greater resource usage due to frequent triggering and delivery of large payloads to many watchers.

On that note, could we add documentation for this new config param please? (website/source/docs/agent/options.html.md, off the top of my head!) The exact wording might take some iteration to get right, but maybe look at the warnings we put on the gossip tunables we exposed recently?
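A toy illustration (not Consul code) of the trade-off described in the comment above: a blocking query can register one watch channel per entry it reads, which wakes precisely but whose memory, chans, and goroutines scale with the result size, or it can fall back to a single table-wide channel, which is cheap to hold but fires on every write to the table. The `table` type and `watch` function below are invented for this example.

```go
// Toy model of fine-grained vs. fallback watches; not Consul's implementation.
package main

import "fmt"

type table struct {
	perEntry  map[string]chan struct{} // one channel per watched entry
	tableWide chan struct{}            // coarse fallback, fired on any write
}

// watch returns the channels a blocking query would wait on. Past the soft
// limit it returns only the table-wide channel: fewer resources held, but
// every table change wakes the query and re-sends the (large) result.
func (t *table) watch(keys []string, softLimit int) []chan struct{} {
	if len(keys) > softLimit {
		return []chan struct{}{t.tableWide}
	}
	chans := make([]chan struct{}, 0, len(keys))
	for _, k := range keys {
		if _, ok := t.perEntry[k]; !ok {
			t.perEntry[k] = make(chan struct{}, 1)
		}
		chans = append(chans, t.perEntry[k])
	}
	return chans
}

func main() {
	t := &table{perEntry: map[string]chan struct{}{}, tableWide: make(chan struct{}, 1)}
	fmt.Println(len(t.watch([]string{"a", "b", "c"}, 2))) // 1: over the limit, coarse watch
	fmt.Println(len(t.watch([]string{"a", "b"}, 2)))      // 2: under the limit, fine-grained
}
```

Which side is cheaper depends on churn: with frequent updates the coarse watch wakes every watcher on every change, which is exactly the degradation reported in #4984.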

@ShimmerGlass (Contributor, Author)

Sure @banks !

@ShimmerGlass (Contributor, Author)

@banks done

@pierresouchay (Contributor)

I think @orarnon has the same issue as we did:

He creates an /etc/hosts file using consul-template containing all the nodes; with more than 2048 nodes, performance drops when services flap. Shutting down consul-template restores performance.

@orarnon commented Sep 5, 2019

@pierresouchay Yes, that's exactly right.
Since we render all hosts into our hosts file, we have over 2K nodes at all times, which degrades performance.
When services flap on top of that, it practically kills the Consul cluster. Our only mitigation was to stop Consul-Template on all hosts and let things calm down.
Per Pierre's advice, we'll begin by listing only our data services in our hosts file, which will create several watchers with 5-20 instances each. So we expect a much smaller hosts file with fewer events.

@banks (Member) commented Sep 5, 2019 via email

@hanshasselberg (Member)

Is this still something that you need?

@hanshasselberg self-assigned this on Jan 22, 2020
@hanshasselberg added the waiting-reply label (Waiting on response from Original Poster or another individual in the thread) on Jan 22, 2020
@pierresouchay (Contributor)

@i0rek We have been using this with a value of 8192 for more than a year... it solved most of our issues on large services...

Several people had this issue as well, including @orarnon and @yiliaofan.
From what we saw, until streaming is available this is one of the most important patches for large infrastructures, and having a default limit greater than 2048 would avoid many issues.

An article describing the issue we had: https://medium.com/criteo-labs/anatomy-of-a-bug-when-consul-has-too-much-to-deliver-for-the-big-day-4904d19a46a4

@stale (bot) removed the waiting-reply label (Waiting on response from Original Poster or another individual in the thread) on Jan 22, 2020
@orarnon commented Jan 26, 2020

As @pierresouchay stated, we have encountered this issue multiple times, and it has killed our production environment more than once since we rely on Consul for so many things.

@banks (Member) commented Jan 27, 2020

To fill in the history, this PR predates a bunch of optimization work we did as part of the issue this references.

> From what we saw, until streaming is available this is one of the most important patches for large infrastructures

We shipped a fix that (as far as I understood) made this optimization unnecessary at least for health/service watches in Consul 1.4.4 in March 2019.

@pierresouchay does this patch still make a difference on your clusters since the 1.4.4 optimisation that makes all service health queries use a single watch chan anyway? Were you also exceeding this limit in other types of blocking query?

I thought the only reason we kept the original issue open was to track the more advanced fixes which will eventually be streaming?

I'd still rather ship streaming and improve this issue by orders of magnitude than ship yet another tunable that operators shouldn't need to touch, but if this is actually still necessary even in 1.4.4+, that would be good to know.

@pierresouchay (Contributor)

@banks We still use the blocking query patches and RPC (but we plan to drop them as soon as streaming can be validated on our side). However, some people, such as @orarnon, had issues with the 2048 limit on catalog/nodes for instance (using a consul-template watch to generate hosts files, if I am not mistaken).

Not making the 2048 value configurable is perfectly understandable; however, increasing it to a larger value might make sense until streaming goes mainstream (we have used 8192 for months now without any issue because it suits our current loads, though removing the limit completely would probably be even better). I know 2 different people from 2 companies that had this issue, so this is just the tip of the iceberg. Maybe at least logging when the limit is reached would give more insight into the breakage it creates in the real world.

To me, the value 2048 is just a defensive programming practice (I would naturally have done something similar) that does not stand up to real-world scenarios, since things break very quickly once the limit is reached anyway (the remedy is worse than what it tries to avoid, i.e. multiplying the number of goroutines too much, because the O(n) cost explodes when the limit is reached).

@orarnon commented Jan 29, 2020

Hi,
@pierresouchay is right. Since there is no documentation or logging on this matter, we were puzzled for weeks as to why our Consul cluster was not handling the load despite being over-provisioned and optimized. It caused multiple downtimes in our production environment, to the point where we considered replacing Consul with another solution.
After a short conversation with Pierre, where he explained this issue to me, we reduced our hosts file to include data services only. This change brought the number of hosts below the 2048 threshold.

Since this change, we have not even had a leader change and can sustain volatile changes in our environment without killing Consul.
@banks I urge you to find a streaming solution one way or another. Personally, I have explained this issue to a couple of companies as well and helped them solve that mystery.

@robloxrob

We are experiencing this issue and are working with HashiCorp support on the matter. Given how prevalent this issue seems to be at scale, adding some documentation, notes, or warnings would be useful for future visitors who run into it.

@banks (Member) commented Jan 29, 2020

@pierresouchay @orarnon @robloxrob, thanks for this context - that's really useful info.

We'll talk about the best path forward. I'm leaning towards just changing the limit to something much higher as an interim solution until streaming is available everywhere, given how likely it is that if you have large results you also have lots of churn, which is what makes the fallback to the table index way more costly in these examples.

I'll leave this open until we've decided which way to go.

@robloxrob

@banks Thank you for your consideration in this matter.

@hanshasselberg added the thinking label (More time is needed to research by the Consul Contributors) on Feb 3, 2020
hanshasselberg added a commit that referenced this pull request Feb 3, 2020
The previous value was too conservative and users with many instances
were having problems because of it. This change increases the limit to
8192, which reportedly fixed most of those issues.

Related: #4984, #4986, #5050.
@hanshasselberg (Member)

Thank you for helping us understand the importance of a higher value here. We don't want to add another option for it, because it would be just another thing to fine-tune, and because we assume streaming will fix it. But we went ahead and increased the limit to what @pierresouchay has been suggesting, 8192: #7200.

This is the reason I am closing this PR. Thanks again!
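For reference, the change in #7200 presumably amounts to raising the hard-coded soft limit in the state store. The sketch below is hypothetical: the package, file location, and surrounding comments are assumptions, with the constant name taken from the review comments above that refer to it as watchLimit.

```go
// Hypothetical sketch only; not the actual diff from #7200.
package state

// watchLimit caps how many fine-grained watches a single blocking query may
// register before falling back to a coarser table-wide watch.
// Raised from 2048 to 8192 per the reports in this thread.
const watchLimit = 8192
```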

hanshasselberg added a commit that referenced this pull request Feb 4, 2020