
Async batch #50546

Merged (29 commits) on Feb 25, 2019

Conversation

dincamihai
Contributor

@dincamihai dincamihai commented Nov 16, 2018

What does this PR do?

Implementation of RFC#0002

What issues does this PR fix or reference?

When doing a request to /run endpoint with the following payload:

{
    "client": "local_async",
    "eauth": "auto",
    "username": "admin",
    "password": "admin",
    "tgt": "*",
    "fun": "cmd.run",
    "arg": ["sleep $((RANDOM % 30)) && echo hello"],
    "timeout": 5,
    "gather_job_timeout": 5,
    "batch_delay": 1,
    "batch": "2"
}

A response is returned immediately. It is currently empty (please ignore that for now), but it essentially looks like this: https://github.com/saltstack/salt/blob/7f6e5d89a48bccf19cb45ed52a3a54f5cc53f400/salt/master.py#L2070-L2077
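For illustration, such a request could be placed like this (a minimal sketch using only the standard library; the URL and port are assumptions based on a default salt-api setup and are not part of this PR):

```python
import json
import urllib.request

# The payload from the PR description: an async batch over all minions,
# two at a time, with a 1s delay between batches.
PAYLOAD = {
    "client": "local_async",
    "eauth": "auto",
    "username": "admin",
    "password": "admin",
    "tgt": "*",
    "fun": "cmd.run",
    "arg": ["sleep $((RANDOM % 30)) && echo hello"],
    "timeout": 5,
    "gather_job_timeout": 5,
    "batch_delay": 1,
    "batch": "2",
}

def run_async_batch(url="http://localhost:8000/run"):
    """POST the payload to salt-api's /run endpoint.

    The URL is a placeholder; adjust it for your setup. With this PR,
    the call returns immediately while the batch keeps running on the
    master.
    """
    req = urllib.request.Request(
        url,
        data=json.dumps(PAYLOAD).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```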

After returning the response, the batch executes a test.ping. Once the test.ping is done, a salt/batch/<batch-jid>/start event is published with the following data: https://github.com/saltstack/salt/blob/7d4cae95f3fd54f10ad79544d4c9a2074ab11ed4/salt/cli/batch_async.py#L190-L194

When the batch finishes, it fires a salt/batch/<batch-jid>/done event: https://github.com/saltstack/salt/blob/7d4cae95f3fd54f10ad79544d4c9a2074ab11ed4/salt/cli/batch_async.py#L199-L205

Previous Behavior

The response was only returned after the batch job had finished executing.

New Behavior

The response is returned immediately and the batch job continues to run as Tornado coroutines.
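The fire-and-forget shape of the new behavior can be sketched as follows. This sketch uses asyncio for brevity, while the actual implementation runs as Tornado coroutines on the master's IOLoop; all names here are illustrative, not Salt's API:

```python
import asyncio

async def run_batch(jid):
    """Stand-in for the real batch work (ping + batched publishes)."""
    await asyncio.sleep(0)  # yield to the event loop
    return f"{jid}/done"

async def handle_run_request():
    """Return an (empty) response immediately; keep the batch running."""
    jid = "20190225000000000000"  # illustrative jid
    # Schedule the batch on the event loop WITHOUT awaiting it.
    task = asyncio.ensure_future(run_batch(jid))
    return {"return": [{}]}, task

async def demo():
    response, task = await handle_run_request()
    # The caller sees the response before the batch has finished.
    assert not task.done()
    return response, await task

if __name__ == "__main__":
    print(asyncio.run(demo()))
```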

Tests written?

TODO

@dincamihai dincamihai requested a review from a team as a code owner November 16, 2018 16:26
@ghost ghost requested review from a team November 16, 2018 16:26
@dincamihai
Contributor Author

@moio JFYI and, of course, ideas are welcome.

@moio
Contributor

moio commented Nov 20, 2018

What is the value of gather_job_timeout used in cmd_iter in the two cases? Is it the same or different?

@dincamihai
Contributor Author

@moio it is the same value, 10s.
Increasing it to 30s for the API call does not result in more minions replying to the ping; it just waits longer for the minion that did not reply.
See below the effect of the 30s gather_job_timeout: it waits from 10:39:29,861 to 10:39:59,969.

2018-11-20 10:39:24,734 - master._do_batching - running
2018-11-20 10:39:24,735 - master._run_batches - running
2018-11-20 10:39:24,735 - master._run_batches - done
2018-11-20 10:39:24,757 - batch.__gather_minions - 30
2018-11-20 10:39:24,760 - master.publish - 20181120103924759543 test.ping * []
2018-11-20 10:39:24,816 - master._return - 20181120103924759543 id_JfQHm test.ping
2018-11-20 10:39:24,821 - master._return - 20181120103924759543 id_QRXlv test.ping
2018-11-20 10:39:24,834 - master._return - 20181120103924759543 id_afdeZ test.ping
2018-11-20 10:39:24,852 - master._return - 20181120103924759543 id_qyUbO test.ping
2018-11-20 10:39:29,837 - master.publish - 20181120103929834283 saltutil.find_job ['id_LPOqE'] ['20181120103924759543']
2018-11-20 10:39:29,861 - master._return - 20181120103929834283 id_LPOqE saltutil.find_job
2018-11-20 10:39:59,969 - master.publish - 20181120103959966032 cmd.run ['id_QRXlv', 'id_afdeZ'] ['sleep $((RANDOM % 3)) && echo hello']                                                                                                                                      
2018-11-20 10:40:01,248 - master._return - 20181120103959966032 id_QRXlv cmd.run
2018-11-20 10:40:02,578 - master.publish - 20181120104002577099 cmd.run ['id_JfQHm'] ['sleep $((RANDOM % 3)) && echo hello']
2018-11-20 10:40:02,860 - master._return - 20181120104002577099 id_JfQHm cmd.run
2018-11-20 10:40:03,632 - master.publish - 20181120104003630855 cmd.run ['id_qyUbO'] ['sleep $((RANDOM % 3)) && echo hello']
2018-11-20 10:40:05,678 - master.publish - 20181120104005674763 saltutil.find_job ['id_afdeZ'] ['20181120103959966032']
2018-11-20 10:40:05,723 - master._return - 20181120104005674763 id_afdeZ saltutil.find_job
2018-11-20 10:40:08,743 - master.publish - 20181120104008741501 saltutil.find_job ['id_qyUbO'] ['20181120104003630855']
2018-11-20 10:40:08,788 - master._return - 20181120104008741501 id_qyUbO saltutil.find_job
2018-11-20 10:40:38,926 - batch.run - done
2018-11-20 10:40:38,927 - master._return - 20181120103924759543 id_LPOqE test.ping
2018-11-20 10:40:38,933 - master._return - 20181120103959966032 id_afdeZ cmd.run
2018-11-20 10:40:38,936 - master._return - 20181120104003630855 id_qyUbO cmd.run

At some point the test.ping returns stop coming in; then, when batching starts, the batch job returns arrive, and only after the batch finishes do the remaining initial ping returns appear.

@moio
Contributor

moio commented Nov 20, 2018

Do the pings reach the minions in the first place? What does salt-run state.event pretty=True say?

@dincamihai
Contributor Author

dincamihai commented Nov 21, 2018

This PR now includes the refactoring from #49818.

@dincamihai force-pushed the async-batch branch 3 times, most recently from 9b937a3 to 293aded on November 28, 2018 14:03
@dincamihai
Contributor Author

dincamihai commented Nov 28, 2018

@moio @cachedout I have updated my PR.
I decided that the old Batch implementation is not useful here, so I implemented the async version of batching almost from scratch.
It uses run_job_async to do the initial ping and to publish the job in batches. It also registers an event handler to gather the returns.

NOTE: the async batching does not include SSH minions (we still have to see how that can be implemented).

NOTE: there is still some cleanup to be done, but I wanted to get some opinions as soon as possible (that's why I removed the WIP).
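The scheduling idea (keep at most batch-size minions in flight, topping up as returns arrive) can be sketched roughly like this; the sketch uses asyncio and all names are illustrative, not the actual batch_async API:

```python
import asyncio

async def batch_run(minions, batch_size, batch_delay, execute):
    """Run a job on `minions`, at most `batch_size` at a time.

    `execute` is a coroutine standing in for publishing the job to one
    minion and waiting for its return event.
    """
    minions = list(minions)
    active = set()
    returns = {}
    while minions or active:
        # Top up the active set to batch_size.
        while minions and len(active) < batch_size:
            active.add(asyncio.ensure_future(execute(minions.pop(0))))
        # As soon as any minion returns, the next one can be scheduled.
        done, active = await asyncio.wait(
            active, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            minion, ret = task.result()
            returns[minion] = ret
        await asyncio.sleep(batch_delay)
    return returns

async def fake_execute(minion):
    """Pretend every minion answers 'hello' after a short delay."""
    await asyncio.sleep(0.01)
    return minion, "hello"

if __name__ == "__main__":
    out = asyncio.run(batch_run(["m1", "m2", "m3"], 2, 0, fake_execute))
    print(out)
```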

@dincamihai changed the title from [WIP] Async batch to Async batch on Nov 28, 2018
@cachedout
Contributor

@dincamihai Can we start by having you fix these lint errors? https://jenkinsci.saltstack.com/job/pr-lint/job/PR-50546/13/warnings52Result/new/file.-1608698382/

@dincamihai force-pushed the async-batch branch 6 times, most recently from 02360ab to df155a7 on December 4, 2018 08:40
Contributor

@moio moio left a comment


Hey @dincamihai, this looks good!

I noted some questions/doubts inline. Please pardon my Python inexperience; some points might be trivial or downright wrong.

Apart from those, and from the tests which I understand are being written, the main conceptual question is about the following note in the PR description:

The response is empty (please ignore that for now) but it basically looks like this: {"return": [{}]}

What the caller is interested in, after the call is placed, is the list of minions expected to return (those that responded in time to the initial ping and that are thus included in the batch queue).

I can see two ways to address this:

  1. we make the initial ping part blocking, thus the response could contain a minion list (and still proceed asynchronously for the bulk of the batch)
  2. we return nothing/true/any other dummy value, and then provide a way to get to this list (I can presently only imagine via an event)

Frankly speaking, looking at current Uyuni needs, option 1 would fit better. Still, I will not hide the downside: any request could take up to gather_job_timeout seconds to return (the worst case, which happens when at least one targeted minion takes more than gather_job_timeout to respond, or is down).

How difficult would it be to implement one approach or the other, or even to make it configurable?
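Option 1 would essentially bound the synchronous part by gather_job_timeout, e.g. (an illustrative asyncio sketch, not actual Salt code):

```python
import asyncio

async def ping_minion(minion, delay):
    """Stand-in for one minion answering the initial test.ping."""
    await asyncio.sleep(delay)
    return minion

async def gather_minions(pings, gather_job_timeout):
    """Wait at most gather_job_timeout for ping replies.

    Minions that do not answer in time are simply left out of the
    batch, so the caller gets the expected-minion list back after at
    most gather_job_timeout seconds.
    """
    tasks = [asyncio.ensure_future(p) for p in pings]
    done, pending = await asyncio.wait(tasks, timeout=gather_job_timeout)
    for task in pending:
        task.cancel()  # give up on slow or down minions
    return sorted(t.result() for t in done)

if __name__ == "__main__":
    pings = [ping_minion("fast", 0.01), ping_minion("slow", 10)]
    print(asyncio.run(gather_minions(pings, 0.1)))  # prints ['fast']
```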

Do you see aspects I am forgetting about in this discussion?

Do Salt maintainers have any specific remark?

Thanks again for all the efforts here!

Review threads were opened on: salt/cli/batch.py, salt/cli/batch_async.py, salt/client/__init__.py, salt/master.py
@thatch45
Contributor

Just to cover my bases, I am good with this PR. @DmitryKuzmenko won't be back until Monday, and I would still like him to take a look. But tests are passing and I think this is good to merge.

Contributor

@DmitryKuzmenko DmitryKuzmenko left a comment


I like this async approach! 👍
