fuzzer: introduce fuzzer runner #28

Merged on May 13, 2022 (62 commits)

Conversation

Ekleog
Contributor

@Ekleog Ekleog commented Feb 2, 2022

The fuzzer runs as a separate service on NayDuck worker machines and
runs fuzzing tests while the worker is idle. When the worker starts
executing a test, it signals this to the fuzzer, which then stops and
waits for the worker to finish.

The comments around the top of the file are there mostly to help
toggle between ‘run on GCP’ and ‘run on local machine’.

The API this exposes is:

  • on port 7055, an HTTP server that can in particular pause/resume
    fuzzing upon hitting /pause and /resume;
  • on port 5507, Prometheus metrics; and
  • when a new artefact is found (~= a new crash is found), the
    fuzz_artifacts_found metric gets bumped by 1
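
A minimal sketch of what the pause/resume endpoint described above could look like, using only the Python standard library. The `ControlHandler` class, the `running` event, the `start_control_server` helper and the use of GET requests are assumptions made for illustration, not the code in fuzzers/main.py; the Prometheus side is left out.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set while fuzzing may run; cleared while the worker is executing a test.
running = threading.Event()
running.set()


class ControlHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: /pause and /resume toggle the `running` flag."""

    def do_GET(self):
        if self.path == '/pause':
            running.clear()
        elif self.path == '/resume':
            running.set()
        else:
            self.send_error(404)
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'ok\n')

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_control_server(port=7055):
    """Starts the control server in a daemon thread and returns it."""
    server = ThreadingHTTPServer(('127.0.0.1', port), ControlHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The fuzzing loop would then check `running.is_set()` (or block on `running.wait()`) before starting or resuming fuzzer processes.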

@Ekleog Ekleog requested a review from mina86 February 2, 2022 13:27
Ekleog added a commit to Ekleog/nearcore that referenced this pull request Feb 2, 2022
Ekleog added a commit to Ekleog/nearcore that referenced this pull request Feb 2, 2022
@Ekleog
Contributor Author

Ekleog commented Feb 2, 2022

See also near/nearcore#6232, which defines the config file this PR reads.

Contributor

@mina86 mina86 left a comment

Could you add this script to check.sh and run check.sh on the code? At the moment this needs more docstrings and also types would be nice.

@Ekleog
Contributor Author

Ekleog commented Feb 4, 2022

Hmm so if I run check.sh locally it seems like it only runs the pytest line. I tried running mypy and pylint manually, but it turns out for some reason pylint crashes after execution and mypy returns some errors like “this module doesn't have this element” which is clearly wrong given the code actually works.

That said I tried to fix as many of the issues as I could find, what do you think about the updated PR?

@mina86
Contributor

mina86 commented Feb 4, 2022

Do you have parallel installed? apt install parallel. You’ll also need yapf: python3 -m pip install -U yapf. Mypy is a bit of a pain. You might need to edit some files in the stubs directory.

fuzzers/main.py Outdated
Comment on lines 136 to 141
# Wait for the fuzzer to complete (ie. either crash or requested to stop)
while proc.poll() == None and not exit_event.is_set():
    time.sleep(0.5)
    new_time = time.monotonic()
    fuzz_time.inc(new_time - last_time)
    last_time = new_time
Contributor

By the way, at any event, proc.wait(timeout=1) would get rid of time.sleep().
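
A sketch of how the loop quoted above might look with `proc.wait(timeout=...)` in place of `time.sleep()`. The function name `wait_for_exit` and the `record_elapsed` callback (standing in for `fuzz_time.inc`) are hypothetical; this is one possible shape, not the code that was merged.

```python
import subprocess
import time


def wait_for_exit(proc, exit_event, record_elapsed, poll_interval=1.0):
    """Waits until `proc` exits or `exit_event` is set.

    Calls `record_elapsed(seconds)` after every wake-up and returns the
    process exit code, or None if the process is still running.
    """
    last_time = time.monotonic()
    returncode = None
    while returncode is None and not exit_event.is_set():
        try:
            # Returns as soon as the process exits, otherwise raises
            # TimeoutExpired after `poll_interval` seconds.
            returncode = proc.wait(timeout=poll_interval)
        except subprocess.TimeoutExpired:
            pass
        new_time = time.monotonic()
        record_elapsed(new_time - last_time)
        last_time = new_time
    return returncode
```

Compared with sleeping for a fixed interval, this wakes up immediately when the fuzzer exits, while still checking `exit_event` at least once per `poll_interval`.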

near-bulldozer bot pushed a commit to near/nearcore that referenced this pull request Feb 10, 2022
@Ekleog
Contributor Author

Ekleog commented Feb 14, 2022

Ok so I just finished basically rewriting this PR in order to not have one thread per fuzzing process as requested above. I also implemented, I think, most of the review comments above. (I had some time given I learned the new release was actually planned for today, much earlier than I expected, so I just started a one-off VM for the urgent test)

Hopefully this is ok-ish? I still need to go through the check script again, but it's too late for today so I'll look at it tomorrow, and wanted to push that out as soon as possible :)

Contributor

@mina86 mina86 left a comment

LG. I haven’t thought through the pausing and resuming in regards to race conditions but it’s probably fine as well. Build being done outside of pausing is a bit of an issue though.

Either way, it’s probably good enough to test in production. I reckon the easiest would be to spawn a new machine or grab one of the existing workers and kill the NayDuck worker running there (though this needs to be done carefully since killing a service will page us) and run the fuzzer without worker on the same machine. We can then get the pausing and resuming integrated and get the fuzzing running on all machines.

fuzzers/main.py Outdated
Comment on lines 260 to 261
log_path: pathlib.Path,
log_filepath: pathlib.Path,
Contributor

log_path isn’t really needed here though, is it? It’s always log_filepath.parent, no? Also, wouldn’t we want to print full log_filepath in log messages rather than just the directory path such that we could get rid of self.log_path?

Contributor Author

So it's actually not that, and the names were pretty bad; I've just renamed log_path to log_relpath and log_filepath to log_fullpath; hopefully things make more sense now :)

And the reason why we need log_relpath is in order to be able to point the crash report to the full log file of the crash, as it'll be on GCS.

The one we could do without is log_fullpath, which could be replaced with LOGS_DIR / log_relpath, but I tried to limit the usage of global variables as much as possible in order to make the code more legible, which definitely was not the case with the wrong variable names, but should be better now 😅

Contributor

@mina86 mina86 left a comment

Yeah, I find GitHub interface confusing so it is possible I’ve been looking at old code at times.

@mina86
Contributor

mina86 commented Feb 16, 2022

OK, let’s revert the blob change for now. It looks like it would take a somewhat bigger change to really be sensible. Moving just the call to authorise doesn’t buy us much if the fuzzer still uses its own credentials file.

@Ekleog
Contributor Author

Ekleog commented Feb 16, 2022

Got it, done :) (that said there's a chance that with machine credentials it'd be possible to just remove the call to gcloud because it'd have gsutil working by default… not sure, that'd need checking)

@Ekleog Ekleog force-pushed the fuzzer branch 3 times, most recently from 1a75e56 to 5328719 Compare February 17, 2022 17:20
workers/utils.py Outdated

FUZZER_CMD_PORT = 7055

class _PausedFuzzers:
Contributor

We probably need some way to configure or detect whether the fuzzer is
running. This is why back at the beginning I suggested using Unix
sockets for communications, since then the NayDuck worker could see if
the socket exists and not bother trying to pause if it doesn’t.

We probably would also like some kind of retry logic, i.e. try pausing
and unpausing a few times on failures. It is communicating over
localhost so chances of network failures are zero but still maybe it’s
worth doing something like:

last_exception = None
for n in range(3):
    try:
        requests.get(...).raise_for_status()
        break
    except requests.exceptions.RequestException as ex:
        last_exception = ex
        time.sleep(1.5**n)
else:
    action = 'pause' # 'unpause'
    print(f'Failed to {action} fuzzers: {last_exception}',
          file=sys.stderr)

Lastly, there’s one more theoretical race condition I kinda, though not
really, worry about. If the fuzzer starts after NayDuck starts running
the test, the fuzzer would then go ahead and start doing its job.
Perhaps we need some kind of lock file as well?
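
The retry idea from the snippet earlier in this comment can be written as a small standalone helper. This is a sketch under assumptions: `retry_with_backoff` and its parameters are hypothetical names, and the broad `except Exception` stands in for the narrower `requests.exceptions.RequestException` used in that snippet.

```python
import time


def retry_with_backoff(action, attempts=3, base=1.5, sleep=time.sleep):
    """Calls `action()` up to `attempts` times with exponential backoff.

    Sleeps base**n seconds after the n-th failure (n = 0, 1, ...).
    Returns None on success, or the last exception if every attempt failed.
    """
    last_exception = None
    for n in range(attempts):
        try:
            action()
            return None
        except Exception as ex:  # a real caller would catch something narrower
            last_exception = ex
            sleep(base**n)
    return last_exception
```

Passing `sleep` as a parameter keeps the helper testable without real delays; a caller would report the returned exception the same way the `else:` branch of the original loop does.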

Contributor Author

I've implemented the backoff strategy you mentioned, and “solved” the race condition by adjusting the systemd unit file — now I think the only issue that could happen is if the fuzzer were to start before the worker but to take more than the maximum backoff duration to actually spin up the webserver, which would sound surprising to me.

Contributor Author

@Ekleog Ekleog Feb 28, 2022

(A solution to avoid this problem would be to integrate the fuzzer with systemd more so systemd only considers it as having started up once the http server is up, but I'm not sure it's really worth it actually doing so, given how unlikely that scenario sounds to me)
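
For reference, a sketch of what that tighter systemd integration could involve. It assumes the unit would be switched to `Type=notify`, in which case systemd passes a `NOTIFY_SOCKET` environment variable and waits for a `READY=1` datagram before treating the service as started. The `notify_ready` helper is hypothetical and is not part of this PR.

```python
import os
import socket


def notify_ready():
    """Tells systemd the service is ready; returns False if not applicable.

    Meant to be called right after the HTTP server has bound its port.
    """
    path = os.environ.get('NOTIFY_SOCKET')
    if not path:
        return False  # not started by systemd with Type=notify
    if path.startswith('@'):
        # A leading '@' denotes a Linux abstract-namespace socket.
        path = '\0' + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b'READY=1', path)
    return True
```

With this in place, a worker unit ordered after the fuzzer unit would only start once the pause/resume endpoint is reachable.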

Contributor

This doesn’t fully solve the issue. A fuzzer could be restarted while worker is running but it does seem like it would help.

Contributor Author

Right, but I think a fuzzer restart could happen only with manual intervention restarting the systemd service? (because the daily automated fuzzer restarts shouldn't happen when the fuzzers are paused)
And so if there's manual intervention it's the responsibility of the one actually doing the manual operation, as there's no way to guard against the user eg. just shutting down the worker 😅

Contributor

The fuzzing service could crash and be restarted by systemd. And technically speaking in extreme cases even if worker is started after the fuzzer, the fuzzer could take a while to start the HTTP server. We probably can live with this but strictly speaking this isn’t a full solution.

Contributor Author

You're right, I hadn't thought of a crash-induced restart. Let's come back to this on the first day there'll be issues then :)

@Ekleog
Contributor Author

Ekleog commented Feb 28, 2022

I've just pushed two commits: one to fix an issue I noticed in practice during the week I was off (I forgot to provision the zuliprc, so it silently didn't notify for the runtime-tester false positive that happened), and one to handle your review comments.

I have restarted the testing with the first commit, and plan on testing the second commit tomorrow by spinning up a patched-like-this nayduck-worker on worker1 alongside the nayduck-fuzzer currently running there.

@Ekleog
Contributor Author

Ekleog commented Mar 2, 2022

I've also just added a timeout flag to handle the sigstop time better given tonight the worker paused the fuzzer for ~1500s, which is more than the 1200s default timeout of libfuzzer.
I set the timeout to 8000s, more than 2 hours which IIUC is nayduck's maximum job duration, but not infinity either in order to still detect infinite loops.
The code with this latest commit is currently running on worker1, which'll allow for checking whether timeouts still trigger next night.
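
To illustrate the interaction described above, here is a sketch of pausing a process with SIGSTOP/SIGCONT together with the raised timeout. The helper names are hypothetical, and how the flag actually reaches the fuzzer binary in fuzzers/main.py is not shown here; only libFuzzer's `-timeout=N` flag (in seconds) and the numbers from the comment are taken as given.

```python
import os
import signal

LIBFUZZER_DEFAULT_TIMEOUT_S = 1200  # libFuzzer default mentioned above
FUZZ_TIMEOUT_S = 8000  # value chosen above: longer than a 2 hour job


def timeout_flag(seconds=FUZZ_TIMEOUT_S):
    """Returns the libFuzzer flag that sets the per-input timeout."""
    return f'-timeout={seconds}'


def pause(pid):
    """Freezes the process; wall-clock time keeps passing while it is stopped."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Lets a previously stopped process continue."""
    os.kill(pid, signal.SIGCONT)
```

Because a stopped process does not observe the time it spent frozen, a pause longer than the timeout can look like a hung input once the process resumes, which is why the timeout has to exceed the longest expected pause.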

@Ekleog
Contributor Author

Ekleog commented Mar 8, 2022

I've just implemented your review comments; with this I think we're ready to test this PR on a few more machines :)

Contributor Author

@Ekleog Ekleog left a comment

Thank you for your review, and reworking and landing the workers changes! I've just finished handling your review comments and rebased on top of master; I'll then go back to my note to self (copied below) and your comment about the duplication of the code to spin up a fuzzer, and then hopefully this will be ready to go :)

Note to self: I also need to 1. make sure we reload changes (eg. to flags) when bumping a checkout, and 2. allow overriding release branch options from master, so that we can disable or change the flags of release-branch fuzzing without going through the backporting process

…e branch config files from master and deduplicate some code
@Ekleog
Contributor Author

Ekleog commented May 11, 2022

This commit (4db4006) is currently running on workers 1 to 10 (inclusive), and seems to be working properly :)

@Ekleog
Contributor Author

Ekleog commented May 13, 2022

Just pushed a commit fixing the last pylint warnings, the only remaining things are these two mypy errors I don't know how to fix:

fuzzers/main.py:942: error: Call to untyped function (unknown) in typed context
fuzzers/main.py:960: error: Exception must be derived from BaseException

@mina86
Contributor

mina86 commented May 13, 2022

I’m getting:

fuzzers/main.py:861:20: W4701: Iterated list 'fuzzers' is being modified inside for loop body, consider iterating through a copy of it instead. (modified-iterating-list)
fuzzers/main.py:672:0: I0021: Useless suppression of 'invalid-name' (useless-suppression)

@Ekleog
Contributor Author

Ekleog commented May 13, 2022

For modified-iterating-list, it's indeed what I'm trying to do. Does python have issues with it the way C++ does where it's generating UB if one tries to do it?

As for the invalid-name suppression, on my version of pylint I'm getting a warning if I actually remove it, so… :/ Don't care either way, so I'll remove it as T being an invalid name makes no sense to me either.

@mina86
Contributor

mina86 commented May 13, 2022

> For modified-iterating-list, it's indeed what I'm trying to do. Does python have issues with it the way C++ does where it's generating UB if one tries to do it?

It’s not UB in the sense it is in C++ but it is undefined. Observe:

>>> for n in lst:
...     if n % 2 == 0: lst.remove(n)
...     print(n)
... 
0
2
4
6
8
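
The session above does not show how `lst` was created; the output is consistent with `lst = list(range(10))`, which is the assumption used in this sketch. It reproduces the skipping behaviour and shows two common ways around it. The function names are made up for illustration.

```python
def drop_even_while_iterating(values):
    """Mirrors the session above: removes from the list being iterated."""
    visited = []
    for n in values:
        if n % 2 == 0:
            values.remove(n)  # shifts later items left; the next one is skipped
        visited.append(n)
    return visited


def drop_even_via_copy(values):
    """Iterates over a copy, so every original element is visited."""
    visited = []
    for n in list(values):
        if n % 2 == 0:
            values.remove(n)
        visited.append(n)
    return visited


def drop_even_rebuild(values):
    """Rebuilds the list in place with a comprehension."""
    values[:] = [n for n in values if n % 2 != 0]
```

With `[0, 2, 4, 6]` as input, the first variant visits only 0 and 4 and leaves `[2, 6]` behind, so even numbers survive; the other two leave an empty list.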

@Ekleog
Contributor Author

Ekleog commented May 13, 2022

Uhhhh ok I must say I'm more than surprised it isn't even detected by mypy or pylint 2.12.2 that I'm using, but good to know, thanks! :)

@mina86
Contributor

mina86 commented May 13, 2022

> As for the invalid-name suppression, on my version of pylint I'm getting a warning if I actually remove it, so… :/ Don't care either way, so I'll remove it as T being an invalid name makes no sense to me either.

Looks like this was fixed in 2.13.0: pylint-dev/pylint#5894

Just python3 -m pip install -U pylint.

@mina86 mina86 merged commit b9b5d14 into Near-One:master May 13, 2022