Add config option for state_store.watchLimit #4986

Closed

Conversation

ShimmerGlass (Contributor)

As described in #4984, exceeding watchLimit greatly degrades blocking query performance.
This patch makes the parameter configurable while keeping the current value of 2048 as the default, and emits a warning when the limit is reached.
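To make the described behaviour concrete, here is a minimal, self-contained Go sketch of what "configurable with a default of 2048 plus a warning" could look like. It is illustrative only, not the actual Consul code; the names `config`, `WatchSoftLimit`, `effectiveWatchLimit`, and `checkWatchLimit` are assumptions for this example.

```go
// Illustrative sketch only; not the actual Consul implementation.
package main

import (
	"log"
	"sync"
)

// defaultWatchSoftLimit mirrors the PR's default of 2048 (assumed name).
const defaultWatchSoftLimit = 2048

// config stands in for the agent's runtime configuration (assumed shape).
type config struct {
	WatchSoftLimit int // 0 means "not set, use the default"
}

var warnOnce sync.Once

// effectiveWatchLimit resolves the configured value, keeping 2048 when unset.
func effectiveWatchLimit(c config) int {
	if c.WatchSoftLimit > 0 {
		return c.WatchSoftLimit
	}
	return defaultWatchSoftLimit
}

// checkWatchLimit reports whether a blocking query is within the limit and
// logs a warning (once) when it is not.
func checkWatchLimit(numWatches, limit int) bool {
	if numWatches <= limit {
		return true
	}
	warnOnce.Do(func() {
		log.Printf("[WARN] state: watch limit of %d exceeded (%d watches requested); "+
			"falling back to a coarser table-wide watch", limit, numWatches)
	})
	return false
}

func main() {
	limit := effectiveWatchLimit(config{})    // 2048
	log.Println(checkWatchLimit(3000, limit)) // false, logs the warning once
	log.Println(checkWatchLimit(100, limit))  // true
}
```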

@pierresouchay (Contributor) left a comment

Use constant if possible, LGTM otherwise

@@ -843,6 +843,7 @@ func (b *Builder) Build() (rt RuntimeConfig, err error) {
VerifyOutgoing: b.boolVal(c.VerifyOutgoing),
VerifyServerHostname: b.boolVal(c.VerifyServerHostname),
Watches: c.Watches,
WatchSoftLimit: b.intValWithDefault(c.WatchSoftLimit, 2048),

Can you use the constant watchLimit instead?
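To make the reviewer's suggestion concrete, here is a rough sketch of centralising the default in a named constant instead of repeating the literal 2048 across the builder and the test fixtures. The constant name, package, and placement below are assumptions for illustration, not the actual Consul source (the reviewer refers to an existing state-store constant called watchLimit).

```go
// Hypothetical sketch of the reviewer's suggestion; names are assumptions.
package config

// defaultWatchSoftLimit is the single source of truth for the default,
// replacing the repeated literal 2048.
const defaultWatchSoftLimit = 2048

// In Builder.Build, the field would then read:
//   WatchSoftLimit: b.intValWithDefault(c.WatchSoftLimit, defaultWatchSoftLimit),
// and TestFullConfig would assert against defaultWatchSoftLimit as well.
```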

@@ -3893,7 +3894,8 @@ func TestFullConfig(t *testing.T) {
datacenter = "fYrl3F5d"
key = "sl3Dffu7"
args = ["dltjDJ2a", "flEa7C2d"]
}]
}],
watch_soft_limit = 2048

same, if possible use the constant

@@ -3347,7 +3347,8 @@ func TestFullConfig(t *testing.T) {
"key": "sl3Dffu7",
"args": ["dltjDJ2a", "flEa7C2d"]
}
]
],
"watch_soft_limit": 2048

constant

// WatchSoftLimit is used as a soft limit to cap how many watches we allow
// for a given blocking query. If this is exceeded, then we will use a
// higher-level watch that's less fine-grained.
// Default to 2048

use constant

@@ -4538,6 +4540,7 @@ func TestFullConfig(t *testing.T) {
"args": []interface{}{"dltjDJ2a", "flEa7C2d"},
},
},
WatchSoftLimit: 2048,

use watchLimit

watchLimit controls how many watches are allowed for each blocking
query. This adds a configuration option named watch_soft_limit to
tweak this value.
@ShimmerGlass force-pushed the watch-limit-option branch 2 times, most recently from 47387f2 to 26e8841 on November 23, 2018 at 13:20
@banks (Member) commented Nov 26, 2018

Thanks @Aestek, this looks good at first glance.

Long term I think this is just a symptom of our watching mechanism needing to be completely replaced, but that is somewhere down the priority list, so this makes a good stop-gap for people hitting issues currently.

The reason we have this limit at all was to prevent blocking queries from using unbounded memory and tons of chans and goroutines, so in some cases raising it could actually cause additional churn. The fallback, however, is cheap to hold but triggers often, so it's only actually cheaper if you don't have regular updates or have low numbers of watchers.

Making this tweakable is a reasonable compromise, although we'll need to be careful how we document it: without profiling an actual workload it's not at all clear where the tradeoff lies between greater resource usage by the blocking query versus greater resource usage due to frequent triggering and delivery of large payloads to many watchers.

On that note, could we add documentation for this new config param please? (website/source/docs/agent/options.html.md, off the top of my head!) The exact wording might take some iteration to get right, but maybe look at the warnings we put on the gossip tunables we exposed recently?
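A toy illustration (not Consul code) of the trade-off described in the comment above: a blocking query can register one watch channel per entry it reads, which wakes precisely but whose memory, chans, and goroutines scale with the result size, or it can fall back to a single table-wide channel, which is cheap to hold but fires on every write to the table. The `table` type and `watch` function below are invented for this example.

```go
// Toy model of fine-grained vs. fallback watches; not Consul's implementation.
package main

import "fmt"

type table struct {
	perEntry  map[string]chan struct{} // one channel per watched entry
	tableWide chan struct{}            // coarse fallback, fired on any write
}

// watch returns the channels a blocking query would wait on. Past the soft
// limit it returns only the table-wide channel: fewer resources held, but
// every table change wakes the query and re-sends the (large) result.
func (t *table) watch(keys []string, softLimit int) []chan struct{} {
	if len(keys) > softLimit {
		return []chan struct{}{t.tableWide}
	}
	chans := make([]chan struct{}, 0, len(keys))
	for _, k := range keys {
		if _, ok := t.perEntry[k]; !ok {
			t.perEntry[k] = make(chan struct{}, 1)
		}
		chans = append(chans, t.perEntry[k])
	}
	return chans
}

func main() {
	t := &table{perEntry: map[string]chan struct{}{}, tableWide: make(chan struct{}, 1)}
	fmt.Println(len(t.watch([]string{"a", "b", "c"}, 2))) // 1: over the limit, coarse watch
	fmt.Println(len(t.watch([]string{"a", "b"}, 2)))      // 2: under the limit, fine-grained
}
```

Which side is cheaper depends on churn: with frequent updates the coarse watch wakes every watcher on every change, which is exactly the degradation reported in #4984.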

@ShimmerGlass (Contributor, Author)

Sure @banks !

@ShimmerGlass (Contributor, Author)

@banks done

@pierresouchay (Contributor)

I think @orarnon has the same issue as we did:

He creates an /etc/hosts file using consul-template containing all the nodes; with more than 2048 nodes, performance drops when services flap. Shutting down consul-template restores performance.

@orarnon commented Sep 5, 2019

@pierresouchay Yes, that's exactly right.
Since we render all hosts into our hosts file, we have over 2K nodes at all times, which degrades performance.
When services flap on top of that, it practically kills the Consul cluster. Our only mitigation was to stop Consul-Template on all hosts and let things calm down.
Per Pierre's advice, we'll begin by listing only our data services in our hosts file, which will create several watchers with 5-20 instances each. So we expect a much smaller hosts file with fewer events.

@banks (Member) commented Sep 5, 2019 via email

@hanshasselberg (Member)

Is this still something that you need?

@hanshasselberg self-assigned this on Jan 22, 2020
@hanshasselberg added the waiting-reply label (Waiting on response from Original Poster or another individual in the thread) on Jan 22, 2020
@pierresouchay (Contributor)

@i0rek We have been using this with a value of 8192 for more than a year... it solved most of our issues on large services...

Several people had this issue as well, including @orarnon and @yiliaofan.
From what we saw, until streaming is available this is one of the most important patches for large infrastructures, and having a default limit greater than 2048 would avoid many issues.

An article describing the issue we had: https://medium.com/criteo-labs/anatomy-of-a-bug-when-consul-has-too-much-to-deliver-for-the-big-day-4904d19a46a4

@stale (bot) removed the waiting-reply label (Waiting on response from Original Poster or another individual in the thread) on Jan 22, 2020
@orarnon commented Jan 26, 2020

As @pierresouchay stated, we have encountered this issue multiple times, and it has killed our production environment more than once since we rely on Consul for so many things.

@banks (Member) commented Jan 27, 2020

To fill in the history, this PR predates a bunch of optimization work we did as part of the issue this references.

> From what we saw, until streaming is available this is one of the most important patches for large infrastructures

We shipped a fix that (as far as I understood) made this optimization unnecessary at least for health/service watches in Consul 1.4.4 in March 2019.

@pierresouchay does this patch still make a difference on your clusters since the 1.4.4 optimisation that makes all service health queries use a single watch chan anyway? Were you also exceeding this limit in other types of blocking query?

I thought the only reason we kept the original issue open was to track the more advanced fixes which will eventually be streaming?

I'd still rather ship streaming and improve this issue by orders of magnitude than ship yet another tunable that operators shouldn't need to touch, but if this is actually still necessary even in 1.4.4+, that would be good to know.

@pierresouchay (Contributor)

@banks We still use the blocking query patches and RPC (but we plan to drop them as soon as streaming can be validated on our side). However, some people, such as @orarnon, had issues with the 2048 limit on catalog/nodes for instance (using a consul-template watch to generate hosts files, if I am not mistaken).

Not making the 2048 value configurable is perfectly understandable; however, increasing it to a larger value might make sense until streaming goes mainstream (we have used 8192 for months now without any issue because it suits our current loads, though removing the limit completely would probably be even better). I know 2 different people from 2 companies that had this issue, so this is just the tip of the iceberg. Maybe at least logging when the limit is reached would give more insight into the breakage it creates in the real world.

To me, the value 2048 is just a defensive programming practice (I would naturally have done something similar) that does not stand up to real-world scenarios, since things break very quickly once the limit is reached anyway (the remedy is worse than what it tries to avoid, i.e. multiplying the number of goroutines too much, because the O(n) cost explodes when the limit is reached).

@orarnon commented Jan 29, 2020

Hi,
@pierresouchay is right. Since there is no documentation or logging on this matter, we were puzzled for weeks as to why our Consul cluster was not handling the load despite being over-provisioned and optimized. It caused multiple downtimes in our production environment, to the point where we considered replacing Consul with another solution.
After a short conversation with Pierre, where he explained this issue to me, we reduced our hosts file to include data services only. This change brought the number of hosts below the 2048 threshold.

Since this change, we have not even had a leader change and can sustain volatile changes in our environment without killing Consul.
@banks I urge you to find a streaming solution one way or another. Personally, I have explained this issue to a couple of companies as well and helped them solve that mystery.

@robloxrob

We are experiencing this issue and are working with HashiCorp support on the matter. Given how prevalent this issue seems to be at scale, adding some documentation, notes, or warnings would be useful for future visitors who run into it.

@banks (Member) commented Jan 29, 2020

@pierresouchay @orarnon @robloxrob, thanks for this context - that's really useful info.

We'll talk about the best path forward. I'm leaning towards just changing the limit to something much higher as an interim solution until streaming is available everywhere, given how likely it is that if you have large results you also have lots of churn, which is what makes the fallback to the table index way more costly in these examples.

I'll leave this open until we've decided which way to go.

@robloxrob

@banks Thank you for your consideration in this matter.

@hanshasselberg added the thinking label (More time is needed to research by the Consul Contributors) on Feb 3, 2020
hanshasselberg added a commit that referenced this pull request Feb 3, 2020
The previous value was too conservative and users with many instances
were having problems because of it. This change increases the limit to
8192, which reportedly fixed most of those issues.

Related: #4984, #4986, #5050.
@hanshasselberg (Member)

Thank you for helping us understand the importance of a higher value here. We don't want to add another option for it, because it would be just another thing to fine-tune, and because we assume streaming will fix it. But we went ahead and increased the limit to what @pierresouchay has been suggesting, 8192: #7200.

This is the reason I am closing this PR. Thanks again!
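For reference, the change in #7200 presumably amounts to raising the hard-coded soft limit in the state store. The sketch below is hypothetical: the package, file location, and surrounding comments are assumptions, with the constant name taken from the review comments above that refer to it as watchLimit.

```go
// Hypothetical sketch only; not the actual diff from #7200.
package state

// watchLimit caps how many fine-grained watches a single blocking query may
// register before falling back to a coarser table-wide watch.
// Raised from 2048 to 8192 per the reports in this thread.
const watchLimit = 8192
```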

hanshasselberg added a commit that referenced this pull request Feb 4, 2020