Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-120754: Add a strace helper and test set of syscalls for open().read(), Take 2 #123413

Merged
merged 12 commits into from
Nov 3, 2024

Conversation

cmaloney
Copy link
Contributor

@cmaloney cmaloney commented Aug 28, 2024

Reapply GH-121143 with additional changes after it was reverted in GH-123303. Investigation why the bot broke + suggestions for improvements in #121143 (comment)

Note: This needs a buildbot trigger + pass before merge, see: #121143 (comment).

Changes from GH-121143:

  1. Added statx to set of allowed syscall forms (Should make Raspian bot pass).
  2. Check that the fd returned from the open call is passed to all future calls. This helps ensure that the stat call uses the file descriptor rather than the filename to avoid TOCTOU isuses.
  3. Update the Path().read_bytes() test case to additionally validate the reduction inisatty/ioctl + seek calls from GH-120754: Disable buffering in Path.read_bytes #122111.
  4. Better diagnostic assertion messages from @gpshead, so when the test fails have first information immediately available. Makes remote CI debugging much simpler.

NOTE: This needs a full buildbot test pass before merge, see: python#121143 (comment).

1. Added `statx` to set of allowed syscall forms (Should make Raspian bot pass).
2. Check that the `fd` returned from `open` is passed to all future calls. This helps ensure things like the `stat` call uses the file descriptor rather than the `filename` to avoid TOCTOU isuses.
3. Update the `Path().read_bytes()` test case to additionally validate the reduction in`isatty`/`ioctl` + `seek` calls from python#122111
4. Better diagnostic assertion messagess from @gpshead, so when the test fails have first information immediately available. Makes remote CI debugging much simpler.
@hauntsaninja hauntsaninja added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Aug 28, 2024
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @hauntsaninja for commit e196d3d 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Aug 28, 2024
@cmaloney
Copy link
Contributor Author

I looked into the buildbot failures and they don't seem related to the changes I make here (full details below). Raspbian bot which failed post-merge on last PR, does pass here.

Investigating buildbot failures:

  1. AMD64 Arch Linux Usan:
    - C compilation issue, seems unrelated to these changes
  2. PPC64LE RHEL8 Refleaks
    - test.test_concurrent_futures.test_shutdown timed out after 3 hour 49 min
    - seems to be: test_concurrent_futures.test_shutdown: AssertionError: b'' != b'apple' #112542
  3. test.test_asyncio.test_server.TestServer2.test_abort_clients failures
    • Also occur in main on same buildbots (ex. https://buildbot.python.org/#/builders/29/builds/6842)
    • AMD64 Fedora Stable Clang Installed
    • AMD64 Fedora Stable Clang PR
    • AMD64 Fedora Stable LTO + PGO
    • AMD64 Fedora Stable LTO PR
    • AMD64 Fedora Stable PR
    • note: I can get this same failure on my Arch Linux locally on main, may try investigating a bit more. Can't seem to find a issue for it currently.
test.test_asyncio.test_server.TestServer2.test_abort_clients ``` 0:01:43 load avg: 8.22 [353/479/1] test.test_asyncio.test_server failed (1 failure) test_start_server_1 (test.test_asyncio.test_server.ProactorStartServerTests.test_start_server_1) ... skipped 'Windows only' test_start_server_1 (test.test_asyncio.test_server.SelectorStartServerTests.test_start_server_1) ... ok test_start_unix_server_1 (test.test_asyncio.test_server.SelectorStartServerTests.test_start_unix_server_1) ... ok test_abort_clients (test.test_asyncio.test_server.TestServer2.test_abort_clients) ... FAIL test_close_clients (test.test_asyncio.test_server.TestServer2.test_close_clients) ... ok test_wait_closed_basic (test.test_asyncio.test_server.TestServer2.test_wait_closed_basic) ... ok test_wait_closed_race (test.test_asyncio.test_server.TestServer2.test_wait_closed_race) ... ok test_unix_server_addr_cleanup (test.test_asyncio.test_server.UnixServerCleanupTests.test_unix_server_addr_cleanup) ... ok test_unix_server_cleanup_gone (test.test_asyncio.test_server.UnixServerCleanupTests.test_unix_server_cleanup_gone) ... ok test_unix_server_cleanup_prevented (test.test_asyncio.test_server.UnixServerCleanupTests.test_unix_server_cleanup_prevented) ... ok test_unix_server_cleanup_replaced (test.test_asyncio.test_server.UnixServerCleanupTests.test_unix_server_cleanup_replaced) ... ok test_unix_server_sock_cleanup (test.test_asyncio.test_server.UnixServerCleanupTests.test_unix_server_sock_cleanup) ... ok

======================================================================
FAIL: test_abort_clients (test.test_asyncio.test_server.TestServer2.test_abort_clients)

Traceback (most recent call last):
File "/home/buildbot/buildarea/pull_request.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.14/unittest/async_case.py", line 93, in _callTestMethod
result = self._callMaybeAsync(method)
File "/home/buildbot/buildarea/pull_request.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.14/unittest/async_case.py", line 119, in _callMaybeAsync
return self._asyncioRunner.run(
~~~~~~~~~~~~~~~~~~~~~~~^
func(*args, **kwargs),
^^^^^^^^^^^^^^^^^^^^^^
context=self._asyncioTestContext,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/buildbot/buildarea/pull_request.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.14/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
File "/home/buildbot/buildarea/pull_request.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.14/asyncio/base_events.py", line 721, in run_until_complete
return future.result()
~~~~~~~~~~~~~^^
File "/home/buildbot/buildarea/pull_request.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.14/test/test_asyncio/test_server.py", line 249, in test_abort_clients
self.assertNotEqual(s_wr.transport.get_write_buffer_size(), 0)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 0 == 0


Ran 12 tests in 0.250s

FAILED (failures=1, skipped=1)
test test.test_asyncio.test_server failed

@cmaloney
Copy link
Contributor Author

Bots which failed around test_asyncio should pass now with #123423

@cmaloney
Copy link
Contributor Author

cmaloney commented Oct 3, 2024

The Tests / Address sanitizer failure seems to be a case that needs to be handled, investigating it (an mmap shows up in the middle of the I/O calls). Could this get the no news tag (just changes python test. module pieces)

@cmaloney
Copy link
Contributor Author

cmaloney commented Oct 8, 2024

Could this get the "no news" label? (Internal changes). I think this is ready for a more full review again + run bots (merged current main in recently)

Recent changes:

  1. AddressSanitizer + LeakSanitizer can't be used under strace / ptrace (And new versions print a very helpful hint about this), added a skip for that so "Tests / Address sanitizer" should pass now.
  2. Updated for Avoid calling isatty() for most open() calls #90102 to validate the no isatty call

@hauntsaninja hauntsaninja added skip news 🔨 test-with-buildbots Test PR w/ buildbots; report in status section labels Nov 2, 2024
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @hauntsaninja for commit 9e8b23d 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Nov 2, 2024
@cmaloney
Copy link
Contributor Author

cmaloney commented Nov 2, 2024

Buildbots look fairly clean. Three failed (AMD64 Windows10 PR, ARM64 Windows Non-Debug PR, ARM64 Windows PR), all with issues in pegen (AttributeError: module 'token' has no attribute 'FSTRING_START'). Other Windows buildbots passed, and Windows Github action passed as well.

Copy link
Contributor

@hauntsaninja hauntsaninja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with this change!

It looks like if unknown syscalls are introduced the current test is fine with that. If you wanted to make a follow-up PR tightening that up, I'd be in favour of it. (I acknowledge that that maybe contradicts the advice in #121143 (comment) , but I think the similar name logic helps address the maintainability concern)

@hauntsaninja hauntsaninja merged commit 9ad57cf into python:main Nov 3, 2024
115 of 118 checks passed
@cmaloney cmaloney deleted the cmaloney/strace_take_2 branch November 3, 2024 03:40
@cmaloney
Copy link
Contributor Author

cmaloney commented Nov 3, 2024

This does prevent additional fstat and ioctl/isatty calls which were the two big performance changes in gh-120754. The seek is in theory all that remains in there but not checked for, and the BufferedIO changes I'm working on should eliminate that. Will be a bit to even get a full approach+proposal around BufferedIO pieces, is refactoring how the middle layer of the I/O stack works and touches a lot more Python C internals (locks, critical sections, ...). With that will look at adding the "no unexpected calls" check, as well as likely add some tests around more specific optimized cases (ex. stdout/stderr just write/append+flush, stdin just read, ...).

@mgorny
Copy link
Contributor

mgorny commented Nov 20, 2024

All the newly added tests fail on Gentoo Linux with sandbox enabled. From a quick check, the test wrongly assumes that the first syscall reported by strace would be an open* call, but it doesn't even verify that — it just blindly takes the return value of the first syscall ever and assumes it would be fd.

If I add some prints to get the list of syscalls, I get e.g. the following:

Check that the set of system calls produced by the I/O stack is what ... StraceEvent(syscall='faccessat2', args=['AT_FDCWD', '"@test_1752124_tmp\\303\\246"', 'F_OK', 'AT_EACCESS'], returncode='0')
StraceEvent(syscall='mmap', args=['NULL', '8208', 'PROT_READ|PROT_WRITE', 'MAP_PRIVATE|MAP_ANONYMOUS', '-1', '0'], returncode='0x7fe693892000')
StraceEvent(syscall='mmap', args=['NULL', '8208', 'PROT_READ|PROT_WRITE', 'MAP_PRIVATE|MAP_ANONYMOUS', '-1', '0'], returncode='0x7fe693892000')
StraceEvent(syscall='getcwd', args=['"/var/tmp/portage/dev-lang/python-3.14.0_alpha2/work/Python-3.14.0a2/build/test_python_1752124\\303\\246"', '8192'], returncode='96')
StraceEvent(syscall='newfstatat', args=['AT_FDCWD', '"/var/tmp/portage/dev-lang/python-3.14.0_alpha2/work/Python-3.14.0a2/build/test_python_1752124\\303\\246"', '{st_mode=S_IFDIR|0755', 'st_size=38', '...}', 'AT_SYMLINK_NOFOLLOW'], returncode='0')
StraceEvent(syscall='newfstatat', args=['AT_FDCWD', '"/var/tmp/portage/dev-lang/python-3.14.0_alpha2/work/Python-3.14.0a2/build/test_python_1752124\\303\\246"', '{st_mode=S_IFDIR|0755', 'st_size=38', '...}', 'AT_SYMLINK_NOFOLLOW'], returncode='0')
StraceEvent(syscall='mmap', args=['NULL', '8208', 'PROT_READ|PROT_WRITE', 'MAP_PRIVATE|MAP_ANONYMOUS', '-1', '0'], returncode='0x7fe69388f000')
StraceEvent(syscall='getcwd', args=['"/var/tmp/portage/dev-lang/python-3.14.0_alpha2/work/Python-3.14.0a2/build/test_python_1752124\\303\\246"', '1024'], returncode='96')
StraceEvent(syscall='readlink', args=['"/var/tmp/portage/dev-lang/python-3.14.0_alpha2/work/Python-3.14.0a2/build/test_python_1752124\\303\\246/@test_1752124_tmp\\303\\246"', '0x7ffecf5289f0', '1023'], returncode='-1 EINVAL (Invalid argument)')
StraceEvent(syscall='openat', args=['AT_FDCWD', '"@test_1752124_tmp\\303\\246"', 'O_RDONLY|O_CLOEXEC'], returncode='3')
StraceEvent(syscall='fstat', args=['3', '{st_mode=S_IFREG|0644', 'st_size=13', '...}'], returncode='0')
StraceEvent(syscall='lseek', args=['3', '0', 'SEEK_CUR'], returncode='0')
StraceEvent(syscall='read', args=['3', '"Hello', 'World!"', '14'], returncode='13')
StraceEvent(syscall='read', args=['3', '""', '1'], returncode='0')
StraceEvent(syscall='close', args=['3'], returncode='0')

i.e. there's a bunch of extra "preparatory" calls before the openat() you're expecting, that should be ignored.

@mgorny
Copy link
Contributor

mgorny commented Nov 20, 2024

Besides, given how fragile this thing is by design, there should be an easy way to disable it (e.g. via -u-strace).

@mgorny
Copy link
Contributor

mgorny commented Nov 20, 2024

It's broken on musl (without any special environment) too:

Check that the set of system calls produced by the I/O stack is what ... StraceEvent(syscall='open', args=['"@test_26640_tmp\\303\\246"
', 'O_RDONLY|O_LARGEFILE|O_CLOEXEC'], returncode='3')
StraceEvent(syscall='fcntl', args=['3', 'F_SETFD', 'FD_CLOEXEC'], returncode='0')
StraceEvent(syscall='fstat', args=['3', '{st_mode=S_IFREG|0644', 'st_size=13', '...}'], returncode='0')
StraceEvent(syscall='mmap', args=['NULL', '16384', 'PROT_READ|PROT_WRITE', 'MAP_PRIVATE|MAP_ANONYMOUS', '-1', '0'], returncode='0x7efd4
027f000')
StraceEvent(syscall='lseek', args=['3', '0', 'SEEK_CUR'], returncode='0')
StraceEvent(syscall='read', args=['3', '"Hello', 'World!"', '14'], returncode='13')
StraceEvent(syscall='read', args=['3', '""', '1'], returncode='0')
StraceEvent(syscall='close', args=['3'], returncode='0')

@cmaloney
Copy link
Contributor Author

cmaloney commented Nov 20, 2024

Happy to work on making the state better on Gentoo. There are tradeoffs in the fragility, and I'm happy to iterate to get everything working well. Unfortunately not having a check like this has resulted in syscalls being re-added unintentionally (ex. bpo-21679 removed a double stat and gh-120754 removed a double stat again).

In the short term I think the --ignore / --ignorefile arguments should allow you to disable this individual test in the same spot you could add a -u argument to enable/disable specific test resources.

There looks like three cases to investigate/work on from this:

  1. Test on Gentoo + Musl + make pass there (there's one extra mmap which doesn't refer to the fd which appears).
  2. Probably disable under gentoo + sandbox, it looks like in that environment significant runtime changes are introduced in the compiled binary. That there are two newfstatat calls in that case worries me / will need to go investigate source.
  3. Potentially add a specialized resource switch -u to enable/disable more broadly.

For 3, did anything in test_subprocess fail for you? test_vfork_used_when_expected uses the same strace testing infrastructure.

For all three of those I think the standard python process is either open a new issue or reopen the gh-120754 one. Happy to open one so can track these changes and get a fix landed.

@mgorny
Copy link
Contributor

mgorny commented Nov 20, 2024

On a glibc system, only test_fileio tests failed. On musl, I haven't run other tests — just this one, since I suspect it'd be affected too.

Could you open the bug, please? I still have lots to do today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants