
feat(executor): Switch to use SDK and poll-based resource status checking #5364

Merged 4 commits into argoproj:master on Mar 16, 2021
Conversation

@terrytangyuan (Member) commented Mar 11, 2021

See the discussion in #4467. The main changes are:

  1. Switched to using the k8s SDK instead of kubectl, so the executor is more configurable and friendlier to monitoring tools.
  2. Switched to polling instead of watching, so we have more control in cases where the API server is unstable. The poll interval is configurable via an environment variable (see the sketch after this list).
  3. Refactored and cleaned up the code (much of it was legacy code introduced by the use of kubectl) so it is more concise and maintainable.
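For context on item 2, here is a minimal sketch of poll-based status checking with an interval read from an environment variable. It assumes a dynamic client; the env var name RESOURCE_STATE_CHECK_INTERVAL, the default interval, and the condition logic are placeholders for illustration, not necessarily what this PR ships:

package example

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/dynamic"
)

// waitForResource polls an arbitrary resource until it reaches a terminal
// state, instead of holding a long-lived watch against the API server.
func waitForResource(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, namespace, name string) error {
	interval := 5 * time.Second // hypothetical default
	if v := os.Getenv("RESOURCE_STATE_CHECK_INTERVAL"); v != "" { // hypothetical env var name
		if d, err := time.ParseDuration(v); err == nil {
			interval = d
		}
	}
	// Each tick is one bounded GET; a dropped connection costs a single
	// failed request rather than a torn-down watch.
	return wait.PollImmediateUntil(interval, func() (bool, error) {
		obj, err := client.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// Evaluate success/failure conditions on obj here; return
		// (true, nil) once the resource reaches a terminal state.
		_ = obj
		return false, nil
	}, ctx.Done())
}

The trade-off versus watch is discussed further down in this thread: each poll is a bounded request/response, so an unstable API server costs at most one failed GET per tick.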

cc @alexec @jessesuen

Signed-off-by: terrytangyuan terrytangyuan@gmail.com


…king

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
@alexec alexec self-assigned this Mar 15, 2021
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
This reverts commit 9c7954d.

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
@@ -90,6 +90,9 @@ func initExecutor() *executor.WorkflowExecutor {
 	clientset, err := kubernetes.NewForConfig(config)
 	checkErr(err)
 
+	restClient := clientset.RESTClient()
@terrytangyuan (Member, Author) commented:
@alexec We have a winner here! We can instantiate the RESTClient from the existing clientset without introducing a new direct dependency (k8s.io/apiextensions-apiserver was already in go.sum partly for this reason). The rest of the code remains the same.

This is also what kubectl uses under the hood (should've looked into its implementation in the first place!).
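A hedged sketch of that pattern (only clientset.RESTClient() comes from the diff above; the CRD group/version/resource path is made up for illustration):

package example

import (
	"context"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// getResourceJSON reuses the clientset's RESTClient to GET an arbitrary
// resource, with no new direct client dependency.
func getResourceJSON(config *rest.Config, namespace, name string) ([]byte, error) {
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return nil, err
	}
	// Same call as in the diff above.
	restClient := clientset.RESTClient()
	// Hypothetical CRD path, for illustration only.
	return restClient.Get().
		AbsPath("/apis/example.com/v1", "namespaces", namespace, "widgets", name).
		DoRaw(context.TODO())
}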

@codecov codecov bot commented Mar 16, 2021

Codecov Report

Merging #5364 (13f08e6) into master (078feee) will increase coverage by 4.65%.
The diff coverage is 28.30%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5364      +/-   ##
==========================================
+ Coverage   11.60%   16.26%   +4.65%     
==========================================
  Files          84      243     +159     
  Lines       32932    43657   +10725     
==========================================
+ Hits         3821     7099    +3278     
- Misses      28584    35586    +7002     
- Partials      527      972     +445     
Impacted Files Coverage Δ
cmd/argo/commands/clustertemplate/create.go 14.28% <0.00%> (+0.56%) ⬆️
cmd/argo/commands/clustertemplate/delete.go 0.00% <0.00%> (ø)
cmd/argo/commands/clustertemplate/get.go 0.00% <0.00%> (ø)
cmd/argo/commands/clustertemplate/lint.go 0.00% <0.00%> (ø)
cmd/argo/commands/clustertemplate/list.go 0.00% <0.00%> (ø)
cmd/argo/commands/clustertemplate/root.go 0.00% <ø> (ø)
cmd/argo/commands/common.go 23.07% <0.00%> (-7.70%) ⬇️
cmd/argo/commands/completion.go 0.00% <ø> (ø)
cmd/argo/commands/cron/create.go 0.00% <0.00%> (ø)
cmd/argo/commands/cron/delete.go 0.00% <0.00%> (ø)
... and 276 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e0f71f3...e8e0f16.

@alexec (Contributor) commented Mar 16, 2021

I'm not sure we want to change from watch. Watch gives rapid feedback for short-running resources, and it only transfers the data needed to do so. Instead, could we focus on making the watch code more robust? There are a few examples in the code of a pattern like the following that could be reused:

for {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case event, ok := <-w.ResultChan():
		if !ok {
			return nil // channel closed; re-establish the watch
		}
		_ = event // other checks on the event go here
	}
}

@terrytangyuan (Member, Author) commented Mar 16, 2021

That's one of the items I brought up for discussion in #4467.

We'd like to stay away from long-running connections via kubectl get -w for various reasons, e.g. to get better control over status-checking calls. We observe constant connection drops on some of our clusters, and the watch has to be re-established every time, so the -w doesn't help much in this case.

In other words, maintaining watches for custom resources is not very controllable, especially when used with executors like k8sapi, where etcd and the API server are under pressure. If the number of custom resources gets really large and the connections to the API server drop, we have to re-establish watches for all of them, which could put the cluster in a bad state.

Here's the reply from @jessesuen in that thread:

So are you suggesting we poll instead of watch? I think that would be reasonable and less taxing on API server. I guess the next question is how will users get the ability to control this interval

So as long as we are able to control the interval, we should be fine. We've experimented with this at very large scale, and this more controllable approach works well for us.

@alexec alexec added this to the v3.1 milestone Mar 16, 2021
@alexec alexec merged commit ed957dd into argoproj:master Mar 16, 2021
@terrytangyuan terrytangyuan deleted the k8s-sdk-resource-status branch March 16, 2021 17:52
@simster7 simster7 mentioned this pull request Mar 22, 2021
@simster7 simster7 mentioned this pull request Mar 29, 2021