Add config option for state_store.watchLimit #4986
Conversation
Use constant if possible, LGTM otherwise
agent/config/builder.go
Outdated
@@ -843,6 +843,7 @@ func (b *Builder) Build() (rt RuntimeConfig, err error) {
 VerifyOutgoing: b.boolVal(c.VerifyOutgoing),
 VerifyServerHostname: b.boolVal(c.VerifyServerHostname),
 Watches: c.Watches,
+WatchSoftLimit: b.intValWithDefault(c.WatchSoftLimit, 2048),
Can you use the constant watchLimit instead?
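For illustration, here is a minimal Go sketch of the pattern the reviewer is asking for: define the default once as a named constant and pass it to the defaulting helper instead of repeating the literal 2048. The package layout, the standalone `intValWithDefault` helper, and the `main` function are hypothetical stand-ins, not the actual Consul source.

```go
package main

import "fmt"

// watchLimit is the named constant the reviewer refers to; in this sketch it
// is defined locally so the default value lives in exactly one place.
const watchLimit = 2048

// intValWithDefault mimics the helper seen in the diff above: it returns the
// configured value when one is set, and the supplied default otherwise.
func intValWithDefault(v *int, def int) int {
	if v == nil {
		return def
	}
	return *v
}

func main() {
	var unset *int     // no watch_soft_limit in the config
	configured := 8192 // explicit watch_soft_limit
	fmt.Println(intValWithDefault(unset, watchLimit))       // 2048
	fmt.Println(intValWithDefault(&configured, watchLimit)) // 8192
}
```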
agent/config/runtime_test.go
Outdated
@@ -3893,7 +3894,8 @@ func TestFullConfig(t *testing.T) {
 datacenter = "fYrl3F5d"
 key = "sl3Dffu7"
 args = ["dltjDJ2a", "flEa7C2d"]
-}]
+}],
+watch_soft_limit = 2048
same, if possible use the constant
agent/config/runtime_test.go
Outdated
@@ -3347,7 +3347,8 @@ func TestFullConfig(t *testing.T) {
 "key": "sl3Dffu7",
 "args": ["dltjDJ2a", "flEa7C2d"]
 }
-]
+],
+"watch_soft_limit": 2048
constant
agent/config/runtime.go
Outdated
// WatchSoftLimit is used as a soft limit to cap how many watches we allow
// for a given blocking query. If this is exceeded, then we will use a
// higher-level watch that's less fine-grained.
// Default to 2048
use constant
agent/config/runtime_test.go
Outdated
@@ -4538,6 +4540,7 @@ func TestFullConfig(t *testing.T) {
 "args": []interface{}{"dltjDJ2a", "flEa7C2d"},
 },
 },
+WatchSoftLimit: 2048,
use watchLimit
watchLimit controls how many watches are allowed for each blocking query. This adds a configuration option named watch_soft_limit to tweak this value.
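As a rough illustration of how the new key would be consumed, here is a small self-contained Go sketch that parses a JSON agent-config fragment containing `watch_soft_limit` and falls back to the 2048 default when the key is absent. The struct and defaulting logic are simplified assumptions, not Consul's actual config code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// agentConfig models only the field discussed in this PR; the real agent
// configuration contains many more fields.
type agentConfig struct {
	WatchSoftLimit *int `json:"watch_soft_limit"`
}

// effectiveLimit returns the configured soft limit, or the assumed 2048
// default when the key is omitted from the config.
func effectiveLimit(raw []byte) (int, error) {
	var cfg agentConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return 0, err
	}
	if cfg.WatchSoftLimit != nil {
		return *cfg.WatchSoftLimit, nil
	}
	return 2048, nil
}

func main() {
	// Hypothetical config fragments: one raising the limit (commenters below
	// report running with 8192 in production), one leaving it unset.
	for _, raw := range []string{`{"watch_soft_limit": 8192}`, `{}`} {
		limit, err := effectiveLimit([]byte(raw))
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-30s -> effective limit %d\n", raw, limit)
	}
}
```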
Force-pushed from 47387f2 to 26e8841
Thanks @Aestek, this looks good at a first glance. Long term I think this is just a symptom of our watching mechanism needing to be completely replaced, but that is somewhere down the priority list, so this makes a good stop-gap for people hitting issues currently. The reason we have this limit at all was to prevent blocking queries using unbounded memory and tons of chans and goroutines, so in some cases raising it could actually cause additional churn; however, the fallback is cheap but triggers often, so it's only actually cheaper if you don't have regular updates or have low numbers of watchers. Making this tweakable is a reasonable compromise, although we'll need to be careful how we document it, since it's not at all clear without profiling on an actual workload where the tradeoff lies between greater resource usage by the blocking query versus greater resource usage due to frequent triggering and delivery of large payloads to many watchers. On that note, could we add documentation for this new config param please?
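To make the tradeoff above concrete, here is a toy Go model (not Consul's implementation) of a soft-limited watch set: below the limit each result gets its own notification channel, above it the query falls back to a single coarse, table-wide channel that is cheap to set up but fires on every change to the table.

```go
package main

import "fmt"

// watchSet is a toy stand-in for the notification channels a blocking query
// would track while waiting for its results to change.
type watchSet struct {
	itemChans []chan struct{} // one channel per matched item (fine-grained)
	tableChan chan struct{}   // single channel for the whole table (coarse fallback)
}

// collectWatches applies the soft limit: fine-grained watches up to the limit,
// then a single table-wide watch. The coarse watch wakes the query on any
// table change, which is why it is only cheaper with low churn or few watchers.
func collectWatches(itemCount, softLimit int) watchSet {
	if itemCount <= softLimit {
		chans := make([]chan struct{}, itemCount)
		for i := range chans {
			chans[i] = make(chan struct{})
		}
		return watchSet{itemChans: chans}
	}
	return watchSet{tableChan: make(chan struct{})}
}

func main() {
	for _, n := range []int{100, 5000} {
		ws := collectWatches(n, 2048)
		if ws.tableChan != nil {
			fmt.Printf("%d items: fell back to one coarse table watch\n", n)
		} else {
			fmt.Printf("%d items: %d fine-grained watches\n", n, len(ws.itemChans))
		}
	}
}
```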
Sure @banks!
Force-pushed from 26e8841 to 8bc4a30
@banks done
I think @orarnon has the same issue as we did: he creates a
@pierresouchay Yes, that's exactly right.
Thanks for the updates, Or Arnon.
For the record we held out on this PR as we are getting closer to replacing
the watching mechanism entirely with a much much more efficient version in
the near future. We'll note that here when it is ready (hopefully soon!).
Paul
…On Thu, Sep 5, 2019 at 12:06 PM Or Arnon ***@***.***> wrote:
@pierresouchay <https://github.com/pierresouchay> Yes, that's exactly
right.
Since we render all hosts into our hosts file, we have over 2K nodes at
all times, which degrades performance.
When we add services flapping, it practically kills the Consul cluster. Our
only mitigation was to stop Consul-Template on all hosts and let things
calm down.
Per Pierre's advice, we'll begin with listing only our data services in
our hosts file, which will create several watchers but with 5-20 instances
each. So we expect a much smaller hosts file with fewer events.
Is this still something that you need?
@i0rek We are using this with a value of 8192 for more than a year... it solved most of our issues on large services... Several people had this issue as well, including @orarnon and @yiliaofan. An article describing the issue we had: https://medium.com/criteo-labs/anatomy-of-a-bug-when-consul-has-too-much-to-deliver-for-the-big-day-4904d19a46a4
As @pierresouchay stated, we have encountered this issue multiple times, and it killed our production environment more than once as we rely on Consul for so many things.
To fill in the history, this PR predates a bunch of optimization work we did as part of the issue this references.
We shipped a fix that (as far as I understood) made this optimization unnecessary, at least for health/service watches, in Consul 1.4.4 in March 2019. @pierresouchay does this patch still make a difference on your clusters since the 1.4.4 optimisation that makes all service health queries a single watch chan anyway? Were you also exceeding this limit in other types of blocking query? I thought the only reason we kept the original issue open was to track the more advanced fixes which will eventually be streaming? I'd still rather ship streaming and improve this issue by orders of magnitude than ship yet another tunable that operators shouldn't need to have to touch, but if this is actually still necessary even in 1.4.4+ that would be good to know.
@banks We still use the blocking query patches and RPC (but we plan to drop them as soon as streaming can be validated on our side). However, some people, such as @orarnon, had issues with the 2048 limit on catalog/nodes for instance (using a consul-template watch to generate hosts files, if I am not mistaken). Not making the 2048 value configurable is perfectly understandable; however, increasing it to a larger value (we have used 8192 for months now without any issue because it suits our current loads, but removing it completely would probably be even better) might make sense until streaming becomes mainstream (I know two different people from two companies that had this issue, so this is just the tip of the iceberg). Maybe at least providing a log when the limit is reached would give more insight into the breakage it creates in the real world. To me, the value 2048 is just a defensive programming practice (I would have done a similar thing naturally) that does not stand up to real-world scenarios, as it breaks very quickly once the limit is reached anyway (the remedy is worse than what it tries to avoid, i.e. multiplying the number of goroutines too much, because the O(n) cost explodes when the limit is reached).
Hi, since this change we did not even have a leader change and could sustain volatile changes in our environment without killing Consul.
We are experiencing this issue and are working with HashiCorp support on the matter. Given that this issue seems so prevalent at scale, adding some documentation, notes, or warnings would be useful for future visitors who run into it.
@pierresouchay @orarnon @robloxrob, thanks for this context - that's really useful info. We'll talk about the best path forward. I'm leaning towards just changing the limit to something much higher as an interim solution until streaming is available everywhere, given how likely it is that if you have large results you also have lots of churn, which is what makes the fallback to the table index way more costly in these examples. I'll leave this open until we've decided which way to go.
@banks Thank you for your consideration in this matter.
Thank you for helping us understand the importance of setting this value higher. We don't want to add another option for it because it is just another thing to fine-tune and because we assume streaming will fix it. But we went ahead and increased it to what @pierresouchay has been suggesting. This is the reason I am closing this PR. Thanks again!
As described in #4984, exceeding watchLimit greatly degrades blocking query performance.
This patch makes this parameter configurable while keeping the current value of 2048 as the default, and displays a warning when the limit is reached.
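For readers wondering what the warning mentioned above might look like, here is a hedged Go sketch of the idea: log when a blocking query needs more watches than the configured soft limit, so the fallback is visible to operators. The function name, message wording, and log format are illustrative assumptions, not Consul's actual output.

```go
package main

import "log"

const defaultWatchSoftLimit = 2048

// warnIfWatchLimitExceeded emits an operator-visible warning when a blocking
// query would exceed the soft limit and therefore fall back to a coarser,
// table-level watch. Names and message text are illustrative only.
func warnIfWatchLimitExceeded(neededWatches, softLimit int) {
	if neededWatches > softLimit {
		log.Printf("[WARN] state: watch limit exceeded (%d > %d); falling back to a coarser watch, consider raising watch_soft_limit",
			neededWatches, softLimit)
	}
}

func main() {
	warnIfWatchLimitExceeded(3000, defaultWatchSoftLimit)
}
```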