fuzzer: introduce fuzzer runner #28
Conversation
See also near/nearcore#6232, which defines the config file this PR reads.
Could you add this script to check.sh and run check.sh on the code? At the moment this needs more docstrings, and types would be nice too.
Hmm, so when I run check.sh I hit issues. That said, I tried to fix as many of the issues as I could find; what do you think about the updated PR?
Do you have parallel installed?
fuzzers/main.py (Outdated)

    # Wait for the fuzzer to complete (ie. either crash or requested to stop)
    while proc.poll() == None and not exit_event.is_set():
        time.sleep(0.5)
        new_time = time.monotonic()
        fuzz_time.inc(new_time - last_time)
        last_time = new_time
By the way, at any rate, proc.wait(timeout=1) would get rid of the time.sleep().
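A minimal sketch of what that suggestion could look like, reusing the names from the snippet above (proc, exit_event, fuzz_time and last_time are assumed from the surrounding code, not defined here):

    import subprocess
    import time

    # proc.wait() raises subprocess.TimeoutExpired while the process is
    # still alive, so the timeout doubles as the old time.sleep(0.5).
    while not exit_event.is_set():
        try:
            proc.wait(timeout=1)
            break  # fuzzer exited (crashed or finished)
        except subprocess.TimeoutExpired:
            new_time = time.monotonic()
            fuzz_time.inc(new_time - last_time)
            last_time = new_time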
Ok, so I just finished basically rewriting this PR so as not to have one thread per fuzzing process, as requested above. I also implemented, I think, most of the review comments above. (I had some time, given I learned the new release was actually planned for today, much earlier than I expected, so I just started a one-off VM for the urgent test.) Hopefully this is ok-ish? I still need to go through the check script again, but it's too late for today so I'll look at it tomorrow; I wanted to push this out as soon as possible :)
LG. I haven’t thought through the pausing and resuming with regard to race conditions, but it’s probably fine as well. The build being done outside of pausing is a bit of an issue though.

Either way, it’s probably good enough to test in production. I reckon the easiest would be to spawn a new machine, or grab one of the existing workers and kill the NayDuck worker running there (though this needs to be done carefully, since killing a service will page us), and run the fuzzer without a worker on the same machine. We can then get the pausing and resuming integrated and get the fuzzing running on all machines.
fuzzers/main.py (Outdated)

    log_path: pathlib.Path,
    log_filepath: pathlib.Path,
log_path isn’t really needed here though, is it? It’s always log_filepath.parent, no? Also, wouldn’t we want to print the full log_filepath in log messages, rather than just the directory path, so that we could get rid of self.log_path?
So it's actually not that, and the names were pretty bad; I've just renamed log_path to log_relpath and log_filepath to log_fullpath; hopefully things make more sense now :)

And the reason why we need log_relpath is to be able to point the crash report at the full log file of the crash, as it'll be on GCS. The one we could do without is log_fullpath, which could be replaced with LOGS_DIR / log_relpath, but I tried to limit the usage of global variables as much as possible in order to make the code more legible; that definitely was not the case with the wrong variable names, but should be better now 😅
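To make the naming concrete, here is a hypothetical sketch of the relationship being described; LOGS_DIR and the example paths are illustrative assumptions, not taken from the actual code:

    import pathlib

    LOGS_DIR = pathlib.Path('/var/lib/fuzzers/logs')  # assumed local log root

    # Relative path, as referenced from crash reports once uploaded to GCS:
    log_relpath = pathlib.Path('master/runtime-fuzzer/fuzz-0.log')
    # Full local path; always derivable as LOGS_DIR / log_relpath:
    log_fullpath = LOGS_DIR / log_relpath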
Yeah, I find the GitHub interface confusing, so it is possible I’ve been looking at old code at times.

OK, let’s revert the blob change for now. It looks like a somewhat bigger change than would really be sensible here. Moving just the call to authorise doesn’t buy us much if the fuzzer still uses its own credentials file.
Got it, done :) (That said, there's a chance that with machine credentials it'd be possible to just remove the call to gcloud, because gsutil would work by default… not sure, that'd need checking.)
Force-pushed from 1a75e56 to 5328719.
workers/utils.py (Outdated)

    FUZZER_CMD_PORT = 7055

    class _PausedFuzzers:
We probably need some way to configure or detect whether a fuzzer is running. This is why back at the beginning I suggested using Unix sockets for communication, since then the NayDuck worker could see whether the socket exists and not bother trying to pause if it doesn’t (a sketch of this idea follows after this comment).

We probably would also like some kind of retry logic, i.e. try pausing and unpausing a few times on failures. It is communicating over localhost so chances of network failures are zero, but still, maybe it’s worth doing something like:

    last_exception = None
    for n in range(3):
        try:
            requests.get(...).raise_for_status()
            break
        except requests.exceptions.RequestException as ex:
            last_exception = ex
            time.sleep(1.5**n)
    else:
        action = 'pause'  # or 'unpause'
        print(f'Failed to {action} fuzzers: {last_exception}',
              file=sys.stderr)

Lastly, there’s one more theoretical race condition which I kinda, though not really, worry about. If the fuzzer starts after NayDuck starts running a test, the fuzzer would then go ahead and start doing its job. Perhaps we need some kind of lock file as well?
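A minimal sketch of the socket-detection idea mentioned above, purely illustrative; the socket path, message format, and function name are all assumptions:

    import pathlib
    import socket

    FUZZER_SOCKET = pathlib.Path('/run/nayduck-fuzzer.sock')  # assumed path

    def pause_fuzzers_if_running() -> None:
        # If the socket file is absent, no fuzzer runs on this machine,
        # so there is nothing to pause.
        if not FUZZER_SOCKET.exists():
            return
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.connect(str(FUZZER_SOCKET))
            sock.sendall(b'pause\n')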
I've implemented the backoff strategy you mentioned, and “solved” the race condition by adjusting the systemd unit file. Now I think the only issue that could happen is if the fuzzer were to start before the worker but take more than the maximum backoff duration to actually spin up the webserver, which would surprise me.
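For illustration, the unit-file adjustment could look something like the following; the unit names and paths here are assumptions, not the actual files from this PR:

    [Unit]
    Description=NayDuck fuzzer runner
    # Start the fuzzer before the NayDuck worker, so its /pause endpoint
    # is (most likely) already up by the time the worker tries to reach it.
    Before=nayduck-worker.service

    [Service]
    ExecStart=/usr/local/bin/nayduck-fuzzer
    Restart=on-failure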
(A solution to avoid this problem would be to integrate the fuzzer with systemd more closely, so that systemd only considers it as having started once the HTTP server is up, but I'm not sure it's really worth doing, given how unlikely that scenario sounds to me.)
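For reference, a sketch of what that tighter integration might look like under systemd's Type=notify semantics, talking to the NOTIFY_SOCKET protocol directly; this is an assumption about how it could be wired up, not code from the PR:

    import os
    import socket

    def notify_systemd_ready() -> None:
        """Tell systemd (under Type=notify) that startup is complete."""
        addr = os.environ.get('NOTIFY_SOCKET')
        if not addr:
            return  # not running under systemd with Type=notify
        if addr.startswith('@'):
            addr = '\0' + addr[1:]  # abstract socket namespace
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
            sock.sendto(b'READY=1', addr)

    # This would be called right after the fuzzer's HTTP server starts
    # listening, so systemd orders dependent units after that point.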
This doesn’t fully solve the issue: a fuzzer could be restarted while the worker is running. But it does seem like it would help.
Right, but I think a fuzzer restart could happen only through manual intervention restarting the systemd service? (Because the daily automated fuzzer restarts shouldn't happen while the fuzzers are paused.)

And if there's manual intervention, it's the responsibility of whoever is actually doing the manual operation, as there's no way to guard against the user e.g. just shutting down the worker 😅
The fuzzing service could crash and be restarted by systemd. And, technically speaking, in extreme cases even if the worker is started after the fuzzer, the fuzzer could take a while to start the HTTP server. We can probably live with this, but strictly speaking it isn’t a full solution.
You're right, I hadn't thought of a crash-induced restart. Let's come back to this on the first day there are issues, then :)

I've just pushed two commits: one to fix an issue I noticed in practice during the week I was off (I forgot to provision the zuliprc, and it silently didn't notify for the runtime-tester false positive that happened), and one to handle your review comments. I have restarted the testing with the first commit, and plan on testing the second commit tomorrow by spinning up a patched-like-this nayduck-worker on worker1, alongside the nayduck-fuzzer currently running there.

I've also just added a timeout flag to handle the SIGSTOP time better, given that tonight the worker paused the fuzzer for ~1500s, which is more than the 1200s default timeout of libFuzzer.
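As an illustration of that last point, raising libFuzzer's per-input timeout when spawning the target could look roughly like this; the exact command, target name, and value are assumptions (libFuzzer's -timeout defaults to 1200 seconds, and a SIGSTOPped process keeps accumulating wall-clock time against it):

    import subprocess

    # Allow inputs to appear to run for up to an hour of wall-clock time,
    # leaving headroom over the observed ~1500s pauses (assumed value).
    FUZZ_TIMEOUT_SECS = 3600

    proc = subprocess.Popen([
        'cargo', 'fuzz', 'run', 'my-fuzz-target', '--',
        f'-timeout={FUZZ_TIMEOUT_SECS}',
    ])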
I've just implemented your review comments; with this I think we're ready to test this PR on a few more machines :)
Thank you for your review, and for reworking and landing the workers changes! I've just finished handling your review comments and rebased on top of master; I'll then go back to my note to self (copied below) and your comment about the duplication of the code to spin up a fuzzer, and then hopefully this will be ready to go :)

Note to self: I also need to 1. make sure we reload e.g. flag changes when bumping a checkout, and 2. allow overriding release-branch options from master, to be able to disable or change flags of release-branch fuzzing without having to go through the backporting process.
…e branch config files from master and deduplicate some code
This commit (4db4006) is currently running on workers 1 to 10 (inclusive), and seems to be working properly :)
Just pushed a commit fixing the last pylint warnings; the only remaining things are these two mypy errors I don't know how to fix:
Here’s what I’m getting:
For modified-iterating-list, it's indeed what I'm trying to do. Does Python have issues with it the way C++ does, where it's UB if one tries to do it? As for the invalid-name suppression, on my version of pylint I'm getting a warning if I actually remove it, so… :/ I don't care either way, so I'll remove it, as T being an invalid name makes no sense to me either.
It’s not UB in the sense it is in C++, but it is undefined. Observe:
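(The snippet originally shown here was lost in extraction; what follows is a guess at the kind of demonstration that was meant, showing that removing elements while iterating silently skips items:)

    lst = [1, 2, 3, 4]
    for x in lst:
        lst.remove(x)
    print(lst)  # prints [2, 4]: half the elements were never visited,
                # because the iterator's index keeps advancing over the
                # shifted list.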
Uhhh, ok, I must say I'm more than surprised it isn't even detected by mypy or the pylint 2.12.2 that I'm using, but good to know, thanks! :)
Looks like this was fixed in pylint 2.13.0: pylint-dev/pylint#5894
The fuzzer runs as a separate service on NayDuck worker machines and runs fuzzing tests while the worker is idle. When the worker starts executing a test, it signals this to the fuzzer, which then stops and waits for the worker to finish.

The comments around the top of the file are there mostly to help toggle between ‘run on GCP’ and ‘run on local machine’.

The API this exposes is:
- fuzzing pauses and resumes upon hitting /pause and /resume;
- the fuzz_artifacts_found metric gets bumped by 1 for each artifact the fuzzer finds.
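As a usage illustration, the worker side could drive this API roughly as follows; the port matches FUZZER_CMD_PORT = 7055 from workers/utils.py above, while everything else is an assumption:

    import requests

    FUZZER_CMD_PORT = 7055  # from workers/utils.py

    # Before the worker starts executing a test:
    requests.get(f'http://127.0.0.1:{FUZZER_CMD_PORT}/pause').raise_for_status()

    # ... run the NayDuck test ...

    # Once the worker is done:
    requests.get(f'http://127.0.0.1:{FUZZER_CMD_PORT}/resume').raise_for_status()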