fuzzer: introduce fuzzer runner #28

Ekleog · 2022-02-02T13:27:30Z

The fuzzer runs as a separate service on NayDuck worker machines and
run fuzzing tests while the worker is idle. When worker starts
executing a test it signals this to the fuzzer which then stops and
waits for the worker to finish.

The comments around the top of the file are there mostly to help
toggle between ‘run on GCP’ and ‘run on local machine’.

The API this exposes is:

on port 7055, an HTTP server that can in particular pause/resume
fuzzing upon hitting /pause and /resume;
on port 5507, Prometheus metrics; and
when a new artefact is found (~= a new crash is found), the
fuzz_artifacts_found metric gets bumped by 1

For use by Near-One/nayduck#28

Ekleog · 2022-02-02T14:14:12Z

See also near/nearcore#6232 that defines the config file this PR reads

fuzzers/main.py

mina86

Could you add this script to check.sh and run check.sh on the code? At the moment this needs more docstrings and also types would be nice.

fuzzers/main.py

Ekleog · 2022-02-04T16:43:52Z

Hmm so if I run check.sh locally it seems like it only runs the pytest line. I tried running mypy and pylint manually, but it turns out for some reason pylint crashes after execution and mypy returns some errors like “this module doesn't have this element” which is clearly wrong given the code actually works.

That said I tried to fix as many of the issues as I could find, what do you think about the updated PR?

mina86 · 2022-02-04T16:49:28Z

Do you have parallel installed? apt install parallel. You’ll also need yapf: python3 -m pip install -U yapf. Mypy is a bit of a pain. You might need to edit some fils in stubs directory.

fuzzers/main.py

mina86 · 2022-02-04T17:12:06Z

fuzzers/main.py

+        # Wait for the fuzzer to complete (ie. either crash or requested to stop)
+        while proc.poll() == None and not exit_event.is_set():
+            time.sleep(0.5)
+            new_time = time.monotonic()
+            fuzz_time.inc(new_time - last_time)
+            last_time = new_time


By the way, at any event, proc.wait(timeout=1) would get rid of time.sleep().

fuzzers/main.py

For use by Near-One/nayduck#28

Ekleog · 2022-02-14T18:51:27Z

Ok so I just finished basically rewriting this PR in order to not have one thread per fuzzing process as requested above. I also implemented, I think, most of the review comments above. (I had some time given I learned the new release was actually planned for today, much earlier than I expected, so I just started a one-off VM for the urgent test)

Hopefully this is ok-ish? I still need to go through the check script again, but it's too late for today so I'll look at it tomorrow, and wanted to push that out as soon as possible :)

mina86

LG. I haven’t thought through the pausing and resuming in regards to race conditions but it’s probably fine as well. Build being done outside of pausing is a bit of an issue though.

Either way, it’s probably good enough to test in production. I reckon the easiest would be to spawn a new machine or grab one of the existing workers and kill the NayDuck worker running there (though this needs to be done carefully since killing a service will page us) and run the fuzzer without worker on the same machine. We can then get the pausing and resuming integrated and get the fuzzing running on all machines.

fuzzers/main.py

mina86 · 2022-02-15T04:18:37Z

fuzzers/main.py

+        log_path: pathlib.Path,
+        log_filepath: pathlib.Path,


log_path isn’t really needed here though, is it? It’s always log_filepath.parent, no? Also, wouldn’t we want to print full log_filepath in log messages rather than just the directory path such that we could get rid of self.log_path?

So it's actually not that, and the names were pretty bad; I've just renamed log_path to log_relpath and log_filepath to log_fullpath; hopefully things make more sense now :)

And the reason why we need log_relpath is in order to be able to point the crash report to the full log file of the crash, as it'll be on GCS.

The one that could be made without would be log_fullpath that could be replaced with LOGS_DIR / log_relpath, but I tried to limit the usage of global variables as much as possible in order to make the code more legible − which definitely was not the case with the wrong variable names, but should be better now 😅

fuzzers/main.py

mina86

Yeah, I find GitHub interface confusing so it is possible I’ve been looking at old code at times.

fuzzers/main.py

mina86 · 2022-02-16T14:48:51Z

OK, let’s revert the blob change for now. It looks like a bit bigger change to really be sensible. Moving just the call to authorise doesn’t buy us much if fuzzer still uses its own credentials file.

Ekleog · 2022-02-16T16:02:27Z

Got it, done :) (that said there's a chance that with machine credentials it'd be possible to just remove the call to gcloud because it'd have gsutil working by default… not sure, that'd need checking)

workers/utils.py

workers/worker.py

mina86 · 2022-02-18T14:31:01Z

workers/utils.py

+
+FUZZER_CMD_PORT = 7055
+
+class _PausedFuzzers:


We probably need some way to configure or detect whether fuzzer is
running. This is why back at the beginning I suggested using Unix
sockets for communications since then NayDuck worker could see if the
socket exists and don’t bother trying to pause if it doesn’t.

We probably would also like some kind of retry logic, i.e. try pausing
and unpausing a few times on failures. It is communicating over
localhost so chances of network failures are zero but still maybe it’s
worth doing something like:

last_exception = None for n in range(3): try: requests.get(...).raise_for_status() break except requests.exceptions.RequestException as ex: last_exception = ex time.sleep(1.5**n) else: action = 'pause' # 'unpause' print(f'Failed to {action} fuzzers: {last_exception}', file=sys.stderr)

Lastly, there’s one more theoretical race condition I kinda though not
really worry about. If fuzzer starts after NayDuck starts running the
test, fuzzer would then go ahead and start doing its job. Perhaps we
need some kind of lock file as well?

I've implemented the backoff strategy you mentioned, and “solved” the race condition by adjusting the systemd unit file — now I think the only issue that could happen is if the fuzzer were to start before the worker but to take more than the maximum backoff duration to actually spin up the webserver, which would sound surprising to me.

(A solution to avoid this problem would be to integrate the fuzzer with systemd more so systemd only considers it as having started up once the http server is up, but I'm not sure it's really worth it actually doing so, given how unlikely that scenario sounds to me)

This doesn’t fully solve the issue. A fuzzer could be restarted while worker is running but it does seem like it would help.

Right, but I think a fuzzer restart could happen only with manual intervention restarting the systemd service? (because the daily automated fuzzer restarts shouldn't happen when the fuzzers are paused)
And so if there's manual intervention it's the responsibility of the one actually doing the manual operation, as there's no way to guard against the user eg. just shutting down the worker 😅

The fuzzing service could crashs and be restarted by systemd. And technically speaking in extreme cases even if worker is started after the fuzzer, the fuzzer could take a while to start the HTTP server. We probably can live with this but strictly speaking this isn’t a full solution.

You're right, I hadn't thought of a crash-induced restart. Let's come back to this on the first day there'll be issues then :)

Ekleog · 2022-02-28T18:51:30Z

I've just pushed two commits, one to fix an issue I noticed in practice during the week I was off (I forgot provisioning the zuliprc and it silently didn't notify for the runtime-tester false positive that happened), and one to handle your review comments.

I have restarted the testing with the first commit, and plan on testing the second commit tomorrow by spinning up a patched-like-this nayduck-worker on worker1 alongside the nayduck-fuzzer currently running there.

Ekleog · 2022-03-02T13:10:14Z

I've also just added a timeout flag to handle the sigstop time better given tonight the worker paused the fuzzer for ~1500s, which is more than the 1200s default timeout of libfuzzer.
I set the timeout to 8000s, more than 2 hours which IIUC is nayduck's maximum job duration, but not infinity either in order to still detect infinite loops.
The code with this latest commit is currently running on worker1, which'll allow for checking whether timeouts still trigger next night.

workers/utils.py

fuzzers/main.py

mina86 · 2022-03-02T19:10:16Z

workers/utils.py

+
+FUZZER_CMD_PORT = 7055
+
+class _PausedFuzzers:


This doesn’t fully solve the issue. A fuzzer could be restarted while worker is running but it does seem like it would help.

Ekleog · 2022-03-08T12:32:42Z

i've just implemented your review comments, with this I think we're ready to test this PR on a few more machines :)

automation/setup-host.sh

fuzzers/main.py

Ekleog

Thank you for your review, and reworking and landing the workers changes! I've just finished handling your review comments and rebased on top of master; I'll then go back to my note to self (copied below) and your comment about the duplication of the code to spin up a fuzzer, and then hopefully this will be ready to go :)

Note to self: I also need to 1. make sure we reload eg. flags change when bumping a checkout, and 2. allow overriding release branch options from master to be able to disable / change flags of release branch fuzzing without having to go through the backporting process

automation/setup-host.sh

fuzzers/main.py

…e branch config files from master and deduplicate some code

fuzzers/main.py

Ekleog · 2022-05-11T18:52:16Z

This commit (4db4006) is currently running on workers 1 to 10 (included), and seems to be working properly :)

fuzzers/main.py

workers/utils.py

Ekleog · 2022-05-13T15:52:47Z

Just pushed a commit fixing the last pylint warnings, the only remaining things are these two mypy errors I don't know how to fix:

fuzzers/main.py:942: error: Call to untyped function (unknown) in typed context
fuzzers/main.py:960: error: Exception must be derived from BaseException

mina86 · 2022-05-13T16:04:42Z

I’m getting.

fuzzers/main.py:861:20: W4701: Iterated list 'fuzzers' is being modified inside for loop body, consider iterating through a copy of it instead. (modified-iterating-list)
fuzzers/main.py:672:0: I0021: Useless suppression of 'invalid-name' (useless-suppression)

fuzzers/main.py

Ekleog · 2022-05-13T16:25:16Z

For modified-iterating-list, it's indeed what I'm trying to do. Does python have issues with it the way C++ does where it's generating UB if one tries to do it?

As for the invalid-name suppression, on my version of pylint I'm getting a warning if I actually remove it, so… :/ Don't care either way, so I'll remove it as T being an invalid name makes no sense to me either.

mina86 · 2022-05-13T16:48:13Z

For modified-iterating-list, it's indeed what I'm trying to do. Does python have issues with it the way C++ does where it's generating UB if one tries to do it?

It’s not UB in the sense it is in C++ but it is undefined. Observe:

>>> for n in lst:
...     if n % 2 == 0: lst.remove(n)
...     print(n)
... 
0
2
4
6
8

Ekleog · 2022-05-13T16:52:32Z

Uhhhh ok I must say I'm more than surprised it isn't even detected by mypy or pylint 2.12.2 that I'm using, but good to know, thanks! :)

mina86 · 2022-05-13T16:56:36Z

As for the invalid-name suppression, on my version of pylint I'm getting a warning if I actually remove it, so… :/ Don't care either way, so I'll remove it as T being an invalid name makes no sense to me either.

Looks like this was fixed in 2.13.0: pylint-dev/pylint#5894

Just python3 -m pip install -U pylint.

Ekleog requested a review from mina86 February 2, 2022 13:27

Ekleog added a commit to Ekleog/nearcore that referenced this pull request Feb 2, 2022

fuzz: introduce fuzzer configuration

f5c823d

For use by Near-One/nayduck#28

Ekleog added a commit to Ekleog/nearcore that referenced this pull request Feb 2, 2022

fuzz: introduce fuzzer configuration

004fa95

For use by Near-One/nayduck#28

Ekleog mentioned this pull request Feb 2, 2022

fuzz: introduce fuzzer configuration near/nearcore#6232

Merged

mina86 reviewed Feb 2, 2022

View reviewed changes

Ekleog force-pushed the fuzzer branch from 032b5a1 to b989c11 Compare February 3, 2022 14:16

mina86 reviewed Feb 4, 2022

View reviewed changes

near-bulldozer bot pushed a commit to near/nearcore that referenced this pull request Feb 10, 2022

fuzz: introduce fuzzer configuration (#6232)

5ce5b4e

For use by Near-One/nayduck#28