
Async batch #50546

Merged (29 commits) on Feb 25, 2019

Conversation

dincamihai
Contributor

@dincamihai dincamihai commented Nov 16, 2018

What does this PR do?

Implementation of RFC#0002

What issues does this PR fix or reference?

When doing a request to /run endpoint with the following payload:

{
    "client": "local_async",
    "eauth": "auto",
    "username": "admin",
    "password": "admin",
    "tgt": "*",
    "fun": "cmd.run",
    "arg": ["sleep $((RANDOM % 30)) && echo hello"],
    "timeout": 5,
    "gather_job_timeout": 5,
    "batch_delay": 1,
    "batch": "2"
}

A response is returned immediately. It is currently empty (please ignore that for now), but it essentially looks like this: https://github.com/saltstack/salt/blob/7f6e5d89a48bccf19cb45ed52a3a54f5cc53f400/salt/master.py#L2070-L2077
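For illustration, such a request could be placed like this (a minimal sketch using only the standard library; the URL and port are assumptions based on a default salt-api setup and are not part of this PR):

```python
import json
import urllib.request

# The payload from the PR description: an async batch over all minions,
# two at a time, with a 1s delay between batches.
PAYLOAD = {
    "client": "local_async",
    "eauth": "auto",
    "username": "admin",
    "password": "admin",
    "tgt": "*",
    "fun": "cmd.run",
    "arg": ["sleep $((RANDOM % 30)) && echo hello"],
    "timeout": 5,
    "gather_job_timeout": 5,
    "batch_delay": 1,
    "batch": "2",
}

def run_async_batch(url="http://localhost:8000/run"):
    """POST the payload to salt-api's /run endpoint.

    The URL is a placeholder; adjust it for your setup. With this PR,
    the call returns immediately while the batch keeps running on the
    master.
    """
    req = urllib.request.Request(
        url,
        data=json.dumps(PAYLOAD).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```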

After returning the response, the batch executes a test.ping. Once the test.ping is done, a salt/batch/<batch-jid>/start event is published with the following data: https://github.com/saltstack/salt/blob/7d4cae95f3fd54f10ad79544d4c9a2074ab11ed4/salt/cli/batch_async.py#L190-L194

When the batch finishes, it fires a salt/batch/<batch-jid>/done event: https://github.com/saltstack/salt/blob/7d4cae95f3fd54f10ad79544d4c9a2074ab11ed4/salt/cli/batch_async.py#L199-L205

Previous Behavior

The response was only returned after the batch job had finished executing.

New Behavior

The response is returned immediately and the batch job continues to run as Tornado coroutines.
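The fire-and-forget shape of the new behavior can be sketched as follows. This sketch uses asyncio for brevity, while the actual implementation runs as Tornado coroutines on the master's IOLoop; all names here are illustrative, not Salt's API:

```python
import asyncio

async def run_batch(jid):
    """Stand-in for the real batch work (ping + batched publishes)."""
    await asyncio.sleep(0)  # yield to the event loop
    return f"{jid}/done"

async def handle_run_request():
    """Return an (empty) response immediately; keep the batch running."""
    jid = "20190225000000000000"  # illustrative jid
    # Schedule the batch on the event loop WITHOUT awaiting it.
    task = asyncio.ensure_future(run_batch(jid))
    return {"return": [{}]}, task

async def demo():
    response, task = await handle_run_request()
    # The caller sees the response before the batch has finished.
    assert not task.done()
    return response, await task

if __name__ == "__main__":
    print(asyncio.run(demo()))
```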

Tests written?

TODO

@dincamihai dincamihai requested a review from a team as a code owner November 16, 2018 16:26
@ghost ghost requested review from a team November 16, 2018 16:26
@dincamihai
Contributor Author

@moio JFYI and, of course, ideas are welcome.

@moio
Contributor

moio commented Nov 20, 2018

What is the value of gather_job_timeout used in cmd_iter in the two cases? Is it the same or different?

@dincamihai
Contributor Author

@moio it is the same value, 10s.
Increasing it to 30s for the API call does not result in more minions replying to the ping; it just waits longer for the minion that did not reply.
See below the effect of the 30s gather_job_timeout: it waits from 10:39:29,861 to 10:39:59,969.

2018-11-20 10:39:24,734 - master._do_batching - running
2018-11-20 10:39:24,735 - master._run_batches - running
2018-11-20 10:39:24,735 - master._run_batches - done
2018-11-20 10:39:24,757 - batch.__gather_minions - 30
2018-11-20 10:39:24,760 - master.publish - 20181120103924759543 test.ping * []
2018-11-20 10:39:24,816 - master._return - 20181120103924759543 id_JfQHm test.ping
2018-11-20 10:39:24,821 - master._return - 20181120103924759543 id_QRXlv test.ping
2018-11-20 10:39:24,834 - master._return - 20181120103924759543 id_afdeZ test.ping
2018-11-20 10:39:24,852 - master._return - 20181120103924759543 id_qyUbO test.ping
2018-11-20 10:39:29,837 - master.publish - 20181120103929834283 saltutil.find_job ['id_LPOqE'] ['20181120103924759543']
2018-11-20 10:39:29,861 - master._return - 20181120103929834283 id_LPOqE saltutil.find_job
2018-11-20 10:39:59,969 - master.publish - 20181120103959966032 cmd.run ['id_QRXlv', 'id_afdeZ'] ['sleep $((RANDOM % 3)) && echo hello']                                                                                                                                      
2018-11-20 10:40:01,248 - master._return - 20181120103959966032 id_QRXlv cmd.run
2018-11-20 10:40:02,578 - master.publish - 20181120104002577099 cmd.run ['id_JfQHm'] ['sleep $((RANDOM % 3)) && echo hello']
2018-11-20 10:40:02,860 - master._return - 20181120104002577099 id_JfQHm cmd.run
2018-11-20 10:40:03,632 - master.publish - 20181120104003630855 cmd.run ['id_qyUbO'] ['sleep $((RANDOM % 3)) && echo hello']
2018-11-20 10:40:05,678 - master.publish - 20181120104005674763 saltutil.find_job ['id_afdeZ'] ['20181120103959966032']
2018-11-20 10:40:05,723 - master._return - 20181120104005674763 id_afdeZ saltutil.find_job
2018-11-20 10:40:08,743 - master.publish - 20181120104008741501 saltutil.find_job ['id_qyUbO'] ['20181120104003630855']
2018-11-20 10:40:08,788 - master._return - 20181120104008741501 id_qyUbO saltutil.find_job
2018-11-20 10:40:38,926 - batch.run - done
2018-11-20 10:40:38,927 - master._return - 20181120103924759543 id_LPOqE test.ping
2018-11-20 10:40:38,933 - master._return - 20181120103959966032 id_afdeZ cmd.run
2018-11-20 10:40:38,936 - master._return - 20181120104003630855 id_qyUbO cmd.run

At some point the test.ping returns stop coming in; then, when batching starts, the batch job returns arrive, and only after the batch finishes do the remaining initial ping returns appear.

@moio
Contributor

moio commented Nov 20, 2018

Do the pings reach the minions in the first place? What does salt-run state.event pretty=True say?

@dincamihai
Contributor Author

dincamihai commented Nov 21, 2018

This PR now includes the refactoring from #49818.

@dincamihai force-pushed the async-batch branch 3 times, most recently from 9b937a3 to 293aded on November 28, 2018 14:03
@dincamihai
Contributor Author

dincamihai commented Nov 28, 2018

@moio @cachedout I have updated my PR.
I decided that the old Batch implementation is not useful here, so I implemented the async version of batching almost from scratch.
It uses run_job_async to do the initial ping and to publish the job in batches. It also registers an event handler to gather the returns.

NOTE: the async batching does not include SSH minions (we still have to see how that can be implemented).

NOTE: there is still some cleanup to be done, but I wanted to get some opinions as soon as possible (that's why I removed the WIP).
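The scheduling idea (keep at most batch-size minions in flight, topping up as returns arrive) can be sketched roughly like this; the sketch uses asyncio and all names are illustrative, not the actual batch_async API:

```python
import asyncio

async def batch_run(minions, batch_size, batch_delay, execute):
    """Run a job on `minions`, at most `batch_size` at a time.

    `execute` is a coroutine standing in for publishing the job to one
    minion and waiting for its return event.
    """
    minions = list(minions)
    active = set()
    returns = {}
    while minions or active:
        # Top up the active set to batch_size.
        while minions and len(active) < batch_size:
            active.add(asyncio.ensure_future(execute(minions.pop(0))))
        # As soon as any minion returns, the next one can be scheduled.
        done, active = await asyncio.wait(
            active, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            minion, ret = task.result()
            returns[minion] = ret
        await asyncio.sleep(batch_delay)
    return returns

async def fake_execute(minion):
    """Pretend every minion answers 'hello' after a short delay."""
    await asyncio.sleep(0.01)
    return minion, "hello"

if __name__ == "__main__":
    out = asyncio.run(batch_run(["m1", "m2", "m3"], 2, 0, fake_execute))
    print(out)
```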

@dincamihai changed the title from [WIP] Async batch to Async batch on Nov 28, 2018
@cachedout
Contributor

@dincamihai Can we start by having you fix these lint errors? https://jenkinsci.saltstack.com/job/pr-lint/job/PR-50546/13/warnings52Result/new/file.-1608698382/

@dincamihai force-pushed the async-batch branch 6 times, most recently from 02360ab to df155a7 on December 4, 2018 08:40
Contributor

@moio moio left a comment


Hey @dincamihai, this looks good!

I noted some questions/doubts inline. Please pardon my Python inexperience; some points might be trivial or downright wrong.

Apart from those, and from the tests which I understand are being written, the main conceptual question is about the following note in the PR description:

The response is empty (please ignore that for now) but it basically looks like this: {"return": [{}]}

What the caller is interested in, after the call is placed, is the list of minions expected to return (those that responded in time to the initial ping and that are thus included in the batch queue).

I can see two ways to address this:

  1. we make the initial ping part blocking, thus the response could contain a minion list (and still proceed asynchronously for the bulk of the batch)
  2. we return nothing/true/any other dummy value, and then provide a way to get to this list (I can presently only imagine via an event)

Frankly speaking, looking at current Uyuni needs, option 1 would fit better. Still, I will not hide the downside: any request could take up to gather_job_timeout seconds to return (the worst case, which happens when at least one targeted minion takes more than gather_job_timeout to respond, or is down).

How difficult would it be to implement one approach or the other, or even to make it configurable?
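Option 1 would essentially bound the synchronous part by gather_job_timeout, e.g. (an illustrative asyncio sketch, not actual Salt code):

```python
import asyncio

async def ping_minion(minion, delay):
    """Stand-in for one minion answering the initial test.ping."""
    await asyncio.sleep(delay)
    return minion

async def gather_minions(pings, gather_job_timeout):
    """Wait at most gather_job_timeout for ping replies.

    Minions that do not answer in time are simply left out of the
    batch, so the caller gets the expected-minion list back after at
    most gather_job_timeout seconds.
    """
    tasks = [asyncio.ensure_future(p) for p in pings]
    done, pending = await asyncio.wait(tasks, timeout=gather_job_timeout)
    for task in pending:
        task.cancel()  # give up on slow or down minions
    return sorted(t.result() for t in done)

if __name__ == "__main__":
    pings = [ping_minion("fast", 0.01), ping_minion("slow", 10)]
    print(asyncio.run(gather_minions(pings, 0.1)))  # prints ['fast']
```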

Do you see aspects I am forgetting about in this discussion?

Do Salt maintainers have any specific remark?

Thanks again for all the efforts here!

Review threads were opened on: salt/cli/batch.py, salt/cli/batch_async.py, salt/client/__init__.py, salt/master.py
@thatch45
Contributor

Just to cover my bases, I am good with this PR. @DmitryKuzmenko won't be back until Monday, and I would still like him to take a look. But tests are passing and I think this is good to merge.

Contributor

@DmitryKuzmenko DmitryKuzmenko left a comment


I like this async approach! 👍
