-
-
Notifications
You must be signed in to change notification settings - Fork 31k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-34060: Report system load when running test suite for Windows #8357
Conversation
ede0a5a
to
da62440
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this! And I'll be happy to have support for job objects in there too :)
Looking at the test runs, the numbers seem to be consistent with other platforms, but since I don't have as good a feel for what to expect here, I'd like someone else who's been involved in this or the previous PR to sign off as well.
Modules/_winapi.c
Outdated
Py_UNICODE *job_name) | ||
/*[clinic end generated code: output=ddb90fccf4f363b9 input=f1350a07abdab61c]*/ | ||
{ | ||
HANDLE job_handle = CreateJobObjectW(NULL, job_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should surround the non-Python parts of this function with _Py_BEGIN_ALLOW_THREADS
and _Py_END_ALLOW_THREADS
to release the GIL. Similarly for the Assign function.
Lib/test/libregrtest/main.py
Outdated
self.getloadavg = lambda: os.getloadavg()[0] | ||
elif sys.platform == 'win32': | ||
self.load_tracker = WindowsLoadTracker() | ||
self.getloadavg = lambda: self.load_tracker.getloadavg() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can just be self.getloadavg = self.load_tracker.getloadavg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And since there are no later references to self.load_tracker
, that doesn't need to be kept on the object (the references via the method will keep it from being deallocated)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks :)
Yeah I'd definitely like to get @vstinner's take on it since he is likely familiar with normal load values. |
Modules/_winapi.c
Outdated
|
||
if (job_handle == NULL) { | ||
PyErr_SetFromWindowsErr(GetLastError()); | ||
return NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HANDLE_return_converter
requires INVALID_HANDLE_VALUE
for an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks.
Did I say thatI? A thread is fine here. regrtest already uses threads to run tests in subprocesses when using -jN. faulthandler also uses a C thread to implement an hard timeout (dumping the Python traceback on timeout). regrtest is full of threads :-) Overlapped IO may be more complicated than a thread, no? |
Lib/test/libregrtest/utils.py
Outdated
|
||
# Spawn off the load monitor | ||
command = ['typeperf', COUNTER_NAME, '-si', str(SAMPLING_INTERVAL)] | ||
self.p = subprocess.Popen(command, stdout=command_stdout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python still supports Windows 7, which only allows a process to be assigned to one job at a time. If the OS version is prior to NT 6.2 (i.e. sys.getwindowsversion() < (6, 2)
), use the creation flag
CREATE_BREAKAWAY_FROM_JOB
(0x01000000) to try to break the child process out of the current job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah thank you, I've been testing on Windows 10 and didn't read the documentation carefully enough, is there any particular reason to scope it down to <NT 6.2? Will there be a problem with just using CREATE_BREAKAWAY_FROM_JOB
as the creationflags
all the time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no reason to break the child out of the job hierarchy in Windows 8+, especially since jobs are used more frequently (e.g. to implement silos, such as for Windows containers).
Also, trying to break away from the parent job complicates creating the process. I forgot to mention that Popen
will fail with a PermissionError
if the child isn't allowed to break away. This can be handled by retrying without CREATE_BREAKAWAY_FROM_JOB
. In this case you could skip calling AssignProcessToJobObject
. For example (untested):
CREATE_BREAKAWAY_FROM_JOB = 0x01000000
cflags = 0 if sys.getwindowsversion() >= (6, 2) else CREATE_BREAKAWAY_FROM_JOB
assign_to_job = True
try:
self.p = subprocess.Popen(command, stdout=command_stdout,
creationflags=cflags)
except PermissionError:
if cflags == 0:
raise
self.p = subprocess.Popen(command, stdout=command_stdout)
assign_to_job = False
if assign_to_job:
_winapi.AssignProcessToJobObject(job_group, self.p._handle)
Lib/test/libregrtest/utils.py
Outdated
self.p = subprocess.Popen(command, stdout=command_stdout) | ||
|
||
# Attach it to our job group | ||
_winapi.AssignProcessToJobObject(job_group, self.p._handle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this in a try/except that handles failure either by ignoring it or with a warning. AssignProcessToJobObject
will fail in Windows 7 if the parent process (i.e. the current Python process) is in a job that doesn't allow child processes to break away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll issue a warning since this to handle a rather rare case of the interpreter actually crashing in -jN
mode.
Modules/_winapi.c
Outdated
// we only support a subset of all the limits possible, can be expanded | ||
// in the future if needed. | ||
limit.BasicLimitInformation.LimitFlags = | ||
(DWORD) getulong(limit_info, "LimitFlags"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should reflect the WinAPI structure, i.e. the BasicLimitInformation
field. Also, it would be nice to have PyWin32 interoperability, which does the latter and also uses dicts instead of simple namespaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so I based this on CreateProcess
which also uses attributes in a passed in object for this information. What do you mean by reflect the structure? Do you mean support all the fields for BasicLimitInformation
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, here is how subprocess passes uses it with CreateProcess
Lines 129 to 137 in 06ca3f0
class STARTUPINFO: | |
def __init__(self, *, dwFlags=0, hStdInput=None, hStdOutput=None, | |
hStdError=None, wShowWindow=0, lpAttributeList=None): | |
self.dwFlags = dwFlags | |
self.hStdInput = hStdInput | |
self.hStdOutput = hStdOutput | |
self.hStdError = hStdError | |
self.wShowWindow = wShowWindow | |
self.lpAttributeList = lpAttributeList or {"handle_list": []} |
I just used simple namespace because I was being lazy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes a dict is used, e.g. STARTUPINFO
uses a dict for the new lpAttributeList
support in 3.7. win32job
uses dicts, but it's fine to stick with namespaces instead. I still would rather LimitFlags
be set in a BasicLimitInformation
attribute instead of at the top level. _winapi is low-level and undocumented, so it's best if it mirrors the actual API, if it's not overly awkward. I didn't meant to support all fields, however. _winapi gets extended only as required.
This is the failure that shows up when using a thread:
|
fc8c05d
to
63b57b2
Compare
So I think I jumped the gun early with the job grouping stuff. There was a much easier solution to dealing with interpreter crashes in This is now actually just pure python and consequently a lot simpler. |
Lib/test/libregrtest/main.py
Outdated
self.getloadavg = None | ||
if hasattr(os, 'getloadavg'): | ||
self.getloadavg = lambda: os.getloadavg()[0] | ||
elif sys.platform == 'win32' and (self.ns.slaveargs is None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to import the class there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Lib/test/libregrtest/main.py
Outdated
@@ -533,6 +534,13 @@ def main(self, tests=None, **kwargs): | |||
def _main(self, tests, kwargs): | |||
self.ns = self.parse_args(kwargs) | |||
|
|||
self.getloadavg = None | |||
if hasattr(os, 'getloadavg'): | |||
self.getloadavg = lambda: os.getloadavg()[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike lambda. Would you mind to define a method here using "def" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer lambda for this sort of short stuff but done
Lib/test/libregrtest/utils.py
Outdated
@@ -54,3 +58,94 @@ def printlist(x, width=70, indent=4, file=None): | |||
print(textwrap.fill(' '.join(str(elt) for elt in sorted(x)), width, | |||
initial_indent=blanks, subsequent_indent=blanks), | |||
file=file) | |||
|
|||
BUFSIZE = 8192 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put the new code in a new file, like libregrtest/winloadavg.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Lib/test/libregrtest/utils.py
Outdated
COUNTER_NAME = r'\System\Processor Queue Length' | ||
|
||
""" | ||
This class asynchronously interacts with the `typeperf` command to read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put the docstring inside the class.
Lib/test/libregrtest/utils.py
Outdated
|
||
def __del__(self): | ||
self.p.kill() | ||
self.p.wait() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't use del() but add a close() method for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, are you sure about this? It would require us to keep a reference to load_tracker
and it needs an:
if self.load_tracker:
self.load_tracker.stop()
because we don't have a load tracker to stop on non windows systems.
Lib/test/libregrtest/utils.py
Outdated
return self.load | ||
|
||
# Process the backlog of load values | ||
for line in typeperf_output.splitlines(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the last line is incomplete: doesn't end with a newline character? Maybe you should put it back into a buffer, and concatenate it to the output, next time. Maybe use .splitlines(True) to check if there is a newline character?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think its worth adding the extra complexity to handle this. Worst case is we miss a single point or two of data and the load number is slightly off.
Lib/test/libregrtest/utils.py
Outdated
|
||
BUFSIZE = 8192 | ||
LOAD_FACTOR_1 = 0.9200444146293232478931553241 | ||
SAMPLING_INTERVAL = 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment to document the unit (seconds, no?).
Lib/test/libregrtest/utils.py
Outdated
# Close our copy of the write end of the pipe | ||
os.close(command_stdout) | ||
|
||
def read_output(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible that this function is only called every 5 minutes. I expect a lot of output in this case.
Why not running this function inside a thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This failure shows up when running in a thread: #8357 (comment)
Even at 5 minutes, that's 60 points of data. Even if the typeperf command puts out 100 bytes of output per line its not enough to saturate the buffer. And even if it does, those data points will get picked up eventually by the next call.
Lib/test/libregrtest/main.py
Outdated
self.getloadavg = lambda: os.getloadavg()[0] | ||
elif sys.platform == 'win32' and (self.ns.slaveargs is None): | ||
load_tracker = WindowsLoadTracker() | ||
self.getloadavg = load_tracker.getloadavg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I seems like you spawn a subprocess even in test worker processes, whereas you wrote that it's useless. This code should be moved after handling slaveargs, no? Maybe move the slaveargs handling code earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow, we already know if the runner is a slave based on the parsing of the arguments. The arguments have been parsed by this point, so this should be fine. I also tested and under task manager only one typeperf instance shows up under the main python process unlike before.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase And if you don't make the requested changes, you will be poked with soft cushions! |
6990d2c
to
a51f181
Compare
I have made the requested changes; please review again |
It looks like there was a lot of interest and activity on this PR a few months ago and @zooba had approved it. @ammaraskar, could you resolve the merge conflict? I think that might be all that is needed, along with @vstinner's approval for merging. Thanks! |
It seems like my comments have been addressed.
I'm sorry but I don't have the bandwidth right to review this change (test it manually). @ammaraskar: You have to update your PR, there is now a conflict. @zware: If you are confident that the change is good, please go ahead and merge it (once CI tests and the conflict is solved). |
I haven't researched whether this is the best way to do this to any extent, but this looks fine to me. |
While Windows exposes the system processor queue length: the raw value used for load calculations on Unix systems, it does not provide an API to access the averaged value. Hence to calculate the load we must track and average it ourselves. We can't use multiprocessing or a thread to read it in the background while the tests run since using those would conflict with test_multiprocessing and test_xxsubprocess. Thus, we use Window's asynchronous IO API to run the tracker in the background with it sampling at the correct rate. When we wish to access the load we check to see if there's new data on the stream, if there is, we update our load values.
fb0a14d
to
31a71df
Compare
31a71df
to
a8df864
Compare
Thanks @ammaraskar! |
* Clean up code which checked presence of os.{stat,lstat,chmod} (GH-11643) (cherry picked from commit 8377cd4) * bpo-36725: regrtest: add TestResult type (GH-12960) * Add TestResult and MultiprocessResult types to ensure that results always have the same fields. * runtest() now handles KeyboardInterrupt * accumulate_result() and format_test_result() now takes a TestResult * cleanup_test_droppings() is now called by runtest() and mark the test as ENV_CHANGED if the test leaks support.TESTFN file. * runtest() now includes code "around" the test in the test timing * Add print_warning() in test.libregrtest.utils to standardize how libregrtest logs warnings to ease parsing the test output. * support.unload() is now called with abstest rather than test_name * Rename 'test' variable/parameter to 'test_name' * dash_R(): remove unused the_module parameter * Remove unused imports (cherry picked from commit 4d29983) * bpo-36725: Refactor regrtest multiprocessing code (GH-12961) Rewrite run_tests_multiprocess() function as a new MultiprocessRunner class with multiple methods to better report errors and stop immediately when needed. Changes: * Worker processes are now killed immediately if tests are interrupted or if a test does crash (CHILD_ERROR): worker processes are killed. * Rewrite how errors in a worker thread are reported to the main thread. No longer ignore BaseException or parsing errors silently. * Remove 'finished' variable: use worker.is_alive() instead * Always compute omitted tests. Add Regrtest.get_executed() method. (cherry picked from commit 3cde440) * bpo-36719: regrtest always detect uncollectable objects (GH-12951) regrtest now always detects uncollectable objects. Previously, the check was only enabled by --findleaks. The check now also works with -jN/--multiprocess N. --findleaks becomes a deprecated alias to --fail-env-changed. (cherry picked from commit 75120d2) * bpo-34060: Report system load when running test suite for Windows (GH-8357) While Windows exposes the system processor queue length, the raw value used for load calculations on Unix systems, it does not provide an API to access the averaged value. Hence to calculate the load we must track and average it ourselves. We can't use multiprocessing or a thread to read it in the background while the tests run since using those would conflict with test_multiprocessing and test_xxsubprocess. Thus, we use Window's asynchronous IO API to run the tracker in the background with it sampling at the correct rate. When we wish to access the load we check to see if there's new data on the stream, if there is, we update our load values. (cherry picked from commit e16467a) * bpo-36719: Fix regrtest re-run (GH-12964) Properly handle a test which fail but then pass. Add test_rerun_success() unit test. (cherry picked from commit 837acc1) * bpo-36719: regrtest closes explicitly WindowsLoadTracker (GH-12965) Regrtest.finalize() now closes explicitly the WindowsLoadTracker instance. (cherry picked from commit 00db7c7)
This is (mostly) a pure Python implementation of the other PR. It leverages the
typeperf
command which monitors performance counters and outputs them at a given interval. So every 5 seconds,typeperf
can output the processor queue length into stdout.subprocess.stdout.readline
is a blocking call. Using a thread seemed like an obvious solution, but we can't achieve this with multiprocessing or a thread, because like Victor speculated in the previous bug report on this, it conflicts withtest_multiprocessing
andtest_threading
. Hence, I opted to use the asynchronous/overlapped IO API which was designed forasync
. Most of the diff actually just pertains to using this rather low level API.This is almost a pure python implementation but there was one edge case where this would fail. Namely, when the python interpreter running the test suite crashes, this leaves an orphaned
typeperf
process running which refuses to die. This means that when the test suite is run with-j x
and this situation happens:The big test coordinating python process will wait forever on the crashed python and consequently
typeperf
to terminate, which just doesn't happen by default in Windows. After reading up on the APIs, the right way to fix this is by using a Job Object to ask the OS to kill the child when the parent dies. Hence, there is a change in_winapi
to make this happen. Unlike the last PR, this API is actually reusable and fit to be exposed to the public. It could even allow implementing things like bpo-5115 to be a lot easier.https://bugs.python.org/issue34060