
feat(executor): Switch to use SDK and poll-based resource status checking #5364

Merged 4 commits into argoproj:master on Mar 16, 2021
Conversation

@terrytangyuan (Member) commented Mar 11, 2021

See the discussion in #4467. The main changes are:

  1. Switched to using the k8s SDK instead of kubectl, so the executor is more configurable and friendlier to monitoring tools.
  2. Switched to polling instead of watching, so we have more control in cases where the API server is unstable. The poll interval is configurable via an environment variable (see the sketch after this list).
  3. Refactored and cleaned up the code (much of it was legacy code introduced by the use of kubectl) so it is more concise and maintainable.
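For context on item 2, here is a minimal sketch of poll-based status checking with an interval read from an environment variable. It assumes a dynamic client; the env var name RESOURCE_STATE_CHECK_INTERVAL, the default interval, and the condition logic are placeholders for illustration, not necessarily what this PR ships:

package example

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/dynamic"
)

// waitForResource polls an arbitrary resource until it reaches a terminal
// state, instead of holding a long-lived watch against the API server.
func waitForResource(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, namespace, name string) error {
	interval := 5 * time.Second // hypothetical default
	if v := os.Getenv("RESOURCE_STATE_CHECK_INTERVAL"); v != "" { // hypothetical env var name
		if d, err := time.ParseDuration(v); err == nil {
			interval = d
		}
	}
	// Each tick is one bounded GET; a dropped connection costs a single
	// failed request rather than a torn-down watch.
	return wait.PollImmediateUntil(interval, func() (bool, error) {
		obj, err := client.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// Evaluate success/failure conditions on obj here; return
		// (true, nil) once the resource reaches a terminal state.
		_ = obj
		return false, nil
	}, ctx.Done())
}

The trade-off versus watch is discussed further down in this thread: each poll is a bounded request/response, so an unstable API server costs at most one failed GET per tick.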

cc @alexec @jessesuen

Signed-off-by: terrytangyuan terrytangyuan@gmail.com


…king

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
@alexec alexec self-assigned this Mar 15, 2021
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
This reverts commit 9c7954d.

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
@@ -90,6 +90,9 @@ func initExecutor() *executor.WorkflowExecutor {
 	clientset, err := kubernetes.NewForConfig(config)
 	checkErr(err)
 
+	restClient := clientset.RESTClient()
@terrytangyuan (Member, Author) commented:
@alexec We have a winner here! We can instantiate the RESTClient from the existing clientset without introducing a new direct dependency (k8s.io/apiextensions-apiserver was already in go.sum partly for this reason). The rest of the code remains the same.

This is also what kubectl uses under the hood (should've looked into its implementation in the first place!).
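A hedged sketch of that pattern (only clientset.RESTClient() comes from the diff above; the CRD group/version/resource path is made up for illustration):

package example

import (
	"context"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// getResourceJSON reuses the clientset's RESTClient to GET an arbitrary
// resource, with no new direct client dependency.
func getResourceJSON(config *rest.Config, namespace, name string) ([]byte, error) {
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return nil, err
	}
	// Same call as in the diff above.
	restClient := clientset.RESTClient()
	// Hypothetical CRD path, for illustration only.
	return restClient.Get().
		AbsPath("/apis/example.com/v1", "namespaces", namespace, "widgets", name).
		DoRaw(context.TODO())
}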

@codecov codecov bot commented Mar 16, 2021

Codecov Report

Merging #5364 (13f08e6) into master (078feee) will increase coverage by 4.65%.
The diff coverage is 28.30%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5364      +/-   ##
==========================================
+ Coverage   11.60%   16.26%   +4.65%     
==========================================
  Files          84      243     +159     
  Lines       32932    43657   +10725     
==========================================
+ Hits         3821     7099    +3278     
- Misses      28584    35586    +7002     
- Partials      527      972     +445     
Impacted Files Coverage Δ
cmd/argo/commands/clustertemplate/create.go 14.28% <0.00%> (+0.56%) ⬆️
cmd/argo/commands/clustertemplate/delete.go 0.00% <0.00%> (ø)
cmd/argo/commands/clustertemplate/get.go 0.00% <0.00%> (ø)
cmd/argo/commands/clustertemplate/lint.go 0.00% <0.00%> (ø)
cmd/argo/commands/clustertemplate/list.go 0.00% <0.00%> (ø)
cmd/argo/commands/clustertemplate/root.go 0.00% <ø> (ø)
cmd/argo/commands/common.go 23.07% <0.00%> (-7.70%) ⬇️
cmd/argo/commands/completion.go 0.00% <ø> (ø)
cmd/argo/commands/cron/create.go 0.00% <0.00%> (ø)
cmd/argo/commands/cron/delete.go 0.00% <0.00%> (ø)
... and 276 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e0f71f3...e8e0f16.

@alexec (Contributor) commented Mar 16, 2021

I'm not sure we want to change from watch. Watch gives rapid feedback for short-running resources, and it only transfers the data needed to do so. Instead, could we focus on making the watch code more robust? There are a few examples in the code of a pattern like the following that could be reused:

for {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case event, ok := <-w.ResultChan():
		if !ok {
			return nil // channel closed; re-establish the watch
		}
		_ = event // other checks on the event go here
	}
}

@terrytangyuan (Member, Author) commented Mar 16, 2021

That's one of the items I brought up for discussion in #4467.

We'd like to stay away from long-running connections via kubectl get -w for various reasons, e.g. to get better control over status-checking calls. We observe constant connection drops on some of our clusters, and the watch has to be re-established every time, so the -w doesn't help much in this case.

In other words, maintaining watches for custom resources is not very controllable, especially when used with executors like k8sapi, where etcd and the API server are under pressure. If the number of custom resources gets really large and the connections to the API server drop, we have to re-establish watches for all of them, which could put the cluster in a bad state.

Here's the reply from @jessesuen in that thread:

So are you suggesting we poll instead of watch? I think that would be reasonable and less taxing on API server. I guess the next question is how will users get the ability to control this interval

So as long as we are able to control the interval, we should be fine. We've experimented with this at very large scale, and this more controllable approach works well for us.

@alexec alexec added this to the v3.1 milestone Mar 16, 2021
@alexec alexec merged commit ed957dd into argoproj:master Mar 16, 2021
@terrytangyuan terrytangyuan deleted the k8s-sdk-resource-status branch March 16, 2021 17:52
@simster7 simster7 mentioned this pull request Mar 22, 2021
@simster7 simster7 mentioned this pull request Mar 29, 2021