
Adding support for sanitised builds #182

Merged: 50 commits merged into master from the sanitisers branch on Nov 30, 2021

Conversation

@csegarragonz (Collaborator) commented Nov 25, 2021

In this PR I introduce a task to build the tests with different sanitisers.

I also add a workflow that runs in parallel to the tests and checks for newly introduced sanitiser errors. It checks against a set of files committed in this PR that contain the output of running the sanitisers as of today.

After offline discussion, I think the changes that allow for sanitised builds are to be kept. The actual workflow file is a different discussion: before using it, the codebase would have to be in a state where we trigger no warnings. The workflow file is usable and ready to be merged in; it only runs on workflow_dispatch, i.e. if manually triggered from the UI.
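
For illustration, a minimal skeleton of such a manually-triggered workflow could look like the following. This is a sketch rather than the actual contents of the workflow file; the runner setup and job layout are assumptions, and only the "inv dev.cmake -b Debug" build step appears verbatim later in this PR.

name: Sanitisers

# run only when triggered manually from the Actions UI
on: workflow_dispatch

jobs:
  address-sanitiser:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: "Build dependencies to be shared by all sanitiser runs"
        run: inv dev.cmake -b Debug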

The low-hanging fruit to reduce a bunch of the warnings:

  • Address Sanitiser: Redis::dequeueBase memory leak #183
  • Thread Sanitiser: disable ZeroMQ-related code and look into our latch implementation.
  • Undefined Sanitiser: most warnings come from unaligned ints in snapshots.
  • Leak Sanitiser: the leak sanitiser is a subset of the address sanitiser, so we could consider not using it at all.
  • Memory Sanitiser: don't use it in any hypothetical workflow, and instead turn -Wuninitialized and -Wmaybe-uninitialized into errors.

@csegarragonz csegarragonz self-assigned this Nov 25, 2021
@csegarragonz changed the title from "Sanitisers" to "Adding support for sanitised builds" Nov 25, 2021
@eigenraven (Collaborator)

I'd have a couple of general comments on this:

  • In CMake, add an explicit None/empty-string elseif branch, and use message(FATAL_ERROR) on unrecognised values to catch typos.
  • Just check the return code from the programs; the logs will sometimes contain things like total bytes allocated, so there's no point in trying to diff them.
  • If you are capturing logs into a file, pipe through tee instead of redirecting with > so they also show up in the Actions log, otherwise debugging will be difficult. You can also use an always() condition to run a specific step, like a cat of the log file, regardless of whether the build succeeds or fails (see the sketch below).
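
A minimal sketch of that last point, with the step names, test binary path and log path all illustrative rather than taken from the workflow:

# tee keeps the output visible in the Actions log while also writing the file;
# "shell: bash" enables pipefail, so the step still fails on the test binary's exit code
- name: "Run the sanitised tests"
  shell: bash
  run: ./bin/faabric_tests 2>&1 | tee /tmp/sanitiser.log

# if the output is instead redirected with >, dump the file in a step that runs
# whether or not the test step succeeded
- name: "Print the captured sanitiser log"
  if: always()
  run: cat /tmp/sanitiser.log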

@csegarragonz (Collaborator, Author)

@KubaSz thanks for your points; some replies:

  • In CMake, add an explicit None/empty-string elseif branch, and use message(FATAL_ERROR) on unrecognised values to catch typos.

We usually run CMake through Python tasks, and I've added the typo checking there.

  • Just check the return code from the programs; the logs will sometimes contain things like total bytes allocated, so there's no point in trying to diff them.

Oh, so you are saying the output of sanitised runs is non-deterministic? That could be an issue. The problem is that there are currently many warnings, and what I was trying to do was check only for newly added ones.

I was deliberately not using tee for that same reason: the output at the moment is unintelligible.

@eigenraven (Collaborator) commented Nov 25, 2021

  • In CMake, add an explicit None/empty-string elseif branch, and use message(FATAL_ERROR) on unrecognised values to catch typos.
    We usually run CMake through Python tasks, and I've added the typo checking there.

True, I just like having stringly-typed values checked at every step if it doesn't impact performance. The scripts might evolve at some point, the CMake and Python sides might get out of sync, and without a check in CMake a typo could silently pass the tests without the sanitiser enabled.

  • Just check the return code from the programs; the logs will sometimes contain things like total bytes allocated, so there's no point in trying to diff them.

Oh, so you are saying the output of sanitised runs is non-deterministic? That could be an issue. The problem is that there are currently many warnings, and what I was trying to do was check only for newly added ones.

I don't remember the exact format of those outputs, but at least the leak checker did output general memory statistics when I last used it, so the output would be mostly deterministic, but it would change with any code change due to added/removed allocations.

I was deliberately not using tee for that same reason: the output at the moment is unintelligible.

Even if it is hard to read, it's useful to have it, and in the Actions view you can just collapse that section if you don't want to look at it.

@csegarragonz force-pushed the sanitisers branch 2 times, most recently from 135b66d to 495b26a, on November 25, 2021 at 16:08
@eigenraven (Collaborator)

This is now ready for review. The only thing I would potentially look into before merging is the MPI TSan failures (currently disabled in the ignore-list file): send/recv do some things with buffers that I don't quite understand, and TSan sees it as a data race. It might be a false positive (if the synchronisation happens through sockets, which is invisible to TSan), or it might not.

@eigenraven eigenraven requested a review from Shillaker November 29, 2021 10:13
@eigenraven (Collaborator)

@csegarragonz It would be good if you reviewed this too; I can't request a review via GitHub as you opened the PR.

@Shillaker (Collaborator) left a comment

@KubaSz @csegarragonz awesome work guys, this looks great. A few nitpicky comments mixed in with some small changes/questions.

.github/workflows/sanitisers.yml (outdated, resolved)
- uses: actions/cache@v2
with:
path: '~/.conan'
key: ${{ runner.os }}-${{ steps.get-conan-version.outputs.conan-version }}-${{ hashFiles('cmake/ExternalProjects.cmake') }}
Collaborator:

How large is the resulting cache file? I know there's a limit on GHA caches, at which point the oldest keys get evicted; I just want to make sure this isn't causing us to hit that limit.

Collaborator:

430 MB; the limit is 10 GB per repo (https://github.com/actions/cache), and this should be a cache hit most of the time as the dependency versions don't change all that often.

@csegarragonz (Collaborator, Author) Nov 29, 2021:

I also think this will yield a great speedup to the runtime of the general tests if we use the caching there (#187), but @KubaSz it looks as though we may be using the wrong key: https://github.com/faasm/faabric/runs/4337385997?check_suite_focus=true#step:7:4

(if the logs are to be trusted, of course)

Could you double-check?

Collaborator:

Huh, seems that hashFiles doesn't work, let me look into it

Collaborator:

I had to replace it with manual hashing in bash + sha256sums; it works in the most recent commit. @csegarragonz this change will probably need porting over to #187.
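
For reference, one way the manual hashing could be wired up. The step id and output name here are illustrative, the cache block is copied from the hunk above with only the hashFiles() part swapped out, and the get-conan-version step is assumed to exist as in the original workflow:

- name: "Hash the dependency specification"
  id: hash-deps
  # sha256sum prints "<hash>  <file>"; keep only the hash and expose it as a step output
  run: echo "::set-output name=deps-hash::$(sha256sum cmake/ExternalProjects.cmake | cut -d' ' -f1)"

- uses: actions/cache@v2
  with:
    path: '~/.conan'
    key: ${{ runner.os }}-${{ steps.get-conan-version.outputs.conan-version }}-${{ steps.hash-deps.outputs.deps-hash }}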

tasks/dev.py (outdated, resolved)
src/util/testing.cpp (outdated, resolved)
- name: "Build dependencies to be shared by all sanitiser runs"
run: inv dev.cmake -b Debug

address-sanitiser:
Collaborator:

Do we need a job for each different sanitizer? This file is quite bloated as a result, so I'd be interested to know the time difference in running them sequentially in a single job, repeating the call to dev.sanitise and faabric_tests each time. The GHA docs say each job in a workflow runs in a "fresh instance" of the VM, but I don't know whether this means a completely separate VM on a different host, or if it's actually just cramming lots of VMs together on the same allocated physical resources (in which case it might be counterproductive).

Collaborator:

It all runs in parallel, and each sanitizer requires a full rebuild of the source (but not dependencies), so this saves about 20 mins of time on tests

@Shillaker (Collaborator) Nov 30, 2021:

Ok cool, is that in theory or in practice? I'm just surprised that GHA gives out resources that freely. I don't think we should change it, am just curious.
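
As an aside, a build matrix is one way to keep the per-sanitiser parallelism while trimming the duplicated YAML the review comment mentions above; a rough sketch, where the sanitiser names and the dev.sanitise invocation are assumptions rather than what the PR actually does:

jobs:
  sanitisers:
    runs-on: ubuntu-latest
    strategy:
      # keep the other sanitisers running even if one of them reports errors
      fail-fast: false
      matrix:
        sanitiser: [Address, Thread, Undefined]
    steps:
      - uses: actions/checkout@v2
      - name: "Build and run the tests with the ${{ matrix.sanitiser }} sanitiser"
        # hypothetical task arguments; each matrix job still pays for a full source rebuild
        run: inv dev.sanitise ${{ matrix.sanitiser }}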

src/redis/Redis.cpp (outdated, resolved)
src/redis/Redis.cpp (resolved)
src/scheduler/Executor.cpp (outdated, resolved)
src/scheduler/Scheduler.cpp (resolved)
src/state/StateKeyValue.cpp (outdated, resolved)
# Catch2 allocates in its signal handler, this prevents showing the wrong crash report
signal:*

# TODO: Remove: There's something weird going on with MPI code I don't understand
@csegarragonz (Collaborator, Author):

So, for me to see the races TSan is complaining about, I'd just have to re-run with the next line commented out, right?

Collaborator:

Yes, it would terminate on the first detected race. For me that was in the AllGather implementation
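
For context, TSan picks up its runtime suppressions and error handling from the TSAN_OPTIONS environment variable; a hedged sketch of how such a run might be wired, where the suppressions file name, test binary path and halt_on_error setting are assumptions rather than anything stated in this PR:

- name: "Run the tests under ThreadSanitizer"
  run: ./bin/faabric_tests
  env:
    # suppressions= points TSan at a runtime suppressions file like the one quoted above;
    # halt_on_error=1 makes the process exit on the first reported race
    TSAN_OPTIONS: "suppressions=./thread-sanitizer-ignorelist.txt halt_on_error=1"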

@csegarragonz (Collaborator, Author)

The only thing I would add to Simon's comments is to see if you can bump the faabric dependency in faasm as well; just to double check nothing is broken.

@Shillaker (Collaborator) left a comment

Awesome LGTM 👍

@eigenraven (Collaborator)

It looks like a faasm distributed test fails with this merged (on top of PR faasm/faasm#540):
@csegarragonz Does this look like it could be caused directly by my changes, or is it something to do with the races in distributed tests?

[20:44:04:619] [1] [info] =============================================
[20:44:04:619] [1] [info] TEST: Test running an MPI function spanning two hosts
[20:44:04:619] [1] [info] =============================================
[20:44:04:619] [1] [debug] Resetting scheduler
[20:44:04:619] [1] [debug] Resetting scheduler thread-local cache for thread 1
[20:44:04:620] [1] [debug] Resetting dirty tracking after pushing diffs mpi/mpi_bcast
[20:44:04:620] [1] [debug] Resetting dirty tracking
[20:44:04:620] [1] [debug] Scheduling 1/1 calls to mpi/mpi_bcast locally
[20:44:04:622] [1] [debug] Scaling mpi/mpi_bcast from 0 -> 1
[20:44:04:622] [1] [debug] Starting executor 172.18.0.7_966818832
[20:44:04:622] [1] [debug] WAVM module cache initialising mpi/mpi_bcast
[20:44:04:622] [1] [debug] Loading main module mpi/mpi_bcast
[20:44:04:631] [1] [debug] Instantiating module mpi/mpi_bcast  
[20:44:04:633] [1] [debug] Finished instantiating module mpi/mpi_bcast  
[20:44:04:633] [1] [debug] Opened preopened fd at /
[20:44:04:633] [1] [debug] Opened preopened fd at .
[20:44:04:633] [1] [debug] Creating 2 thread stacks
[20:44:04:633] [1] [debug] Successfully executed __wasm_call_ctors for mpi/mpi_bcast
[20:44:04:633] [1] [debug] Successfully executed zygote for mpi/mpi_bcast
[20:44:04:633] [1] [debug] heap_top=11075584 initial_pages=169 initial_table=5
[20:44:04:648] [1] [debug] Wrote snapshot mpi/mpi_bcast_reset to fd 96
[20:44:04:648] [37] [debug] Thread pool thread 172.18.0.7_966818832:0 starting up
[20:44:04:648] [37] [debug] Added thread id 37 to /sys/fs/cgroup/cpu/faasm/tasks
[20:44:04:648] [37] [debug] Not using network ns, support off
[20:44:04:649] [37] [debug] S - args_sizes_get - 4194296 4194300
[20:44:04:649] [37] [debug] S - sbrk - 0
[20:44:04:649] [37] [debug] S - sbrk - 65536
[20:44:04:649] [37] [debug] S - args_get - 11075632 11075600
[20:44:04:649] [37] [debug] S - MPI_Init (create) 0 0
[20:44:04:649] [37] [debug] Initialising world 966818837
[20:44:04:650] [37] [debug] Adding new function call client for 172.18.0.6
[20:44:04:652] [37] [debug] Resetting dirty tracking after pushing diffs mpi/mpi_bcast
[20:44:04:652] [37] [debug] Resetting dirty tracking
[20:44:04:652] [37] [debug] Scheduling 2/3 calls to mpi/mpi_bcast on 172.18.0.6
[20:44:04:653] [37] [debug] Scheduling 1/3 calls to mpi/mpi_bcast locally
[20:44:04:653] [37] [debug] Scaling mpi/mpi_bcast from 1 -> 2
[20:44:04:653] [37] [debug] Starting executor 172.18.0.7_966818849
[20:44:04:662] [37] [debug] MPI-0 S - MPI_Comm_rank 4196616 4194284
[20:44:04:662] [37] [debug] MPI-0 S - MPI_Comm_size 4196616 4194280
[20:44:04:662] [37] [debug] MPI-0 S - MPI_Bcast 4194240 4 4196620 2 4196616
[20:44:04:663] [38] [debug] Thread pool thread 172.18.0.7_966818849:0 starting up
[20:44:04:663] [38] [debug] Added thread id 38 to /sys/fs/cgroup/cpu/faasm/tasks
[20:44:04:663] [38] [debug] Not using network ns, support off
[20:44:04:663] [38] [debug] S - args_sizes_get - 4194296 4194300
[20:44:04:663] [38] [debug] S - sbrk - 0
[20:44:04:663] [38] [debug] S - sbrk - 65536
[20:44:04:663] [38] [debug] S - args_get - 11075632 11075600
[20:44:04:663] [38] [debug] S - MPI_Init (join) 0 0
[20:44:04:663] [38] [debug] MPI-1 S - MPI_Comm_rank 4196616 4194284
[20:44:04:663] [38] [debug] MPI-1 S - MPI_Comm_size 4196616 4194280
[20:44:04:663] [38] [debug] MPI-1 S - MPI_Bcast 4194240 4 4196620 2 4196616
[20:44:04:820] [37] [debug] S - fd_fdstat_get - 1 4194200 (/dev/stdout)
Broadcast succeeded
[20:44:04:821] [37] [debug] MPI-0 S - MPI_Finalize
[20:44:04:821] [37] [debug] Resetting after mpi/mpi_bcast:966818830 (snap key mpi/mpi_bcast_reset)
[20:44:04:823] [38] [debug] S - fd_fdstat_get - 1 4194200 (/dev/stdout)
Broadcast succeeded
[20:44:04:823] [38] [debug] MPI-1 S - MPI_Finalize
[20:44:04:824] [37] [debug] Memory already correct size for snapshot (11075584)
[20:44:04:824] [38] [debug] Resetting after mpi/mpi_bcast:966818840 (snap key mpi/mpi_bcast_reset)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dist_tests is a Catch v2.13.7 host application.
Run with -? for options

-------------------------------------------------------------------------------
Test running an MPI function spanning two hosts
-------------------------------------------------------------------------------
/usr/local/code/faasm/tests/dist/mpi/test_remote_execution.cpp:13
...............................................................................

/usr/local/code/faasm/tests/dist/mpi/test_remote_execution.cpp:44: FAILED:
  REQUIRE( hosts.size() == 2 )
with expansion:
  3 == 2

@csegarragonz (Collaborator, Author) commented Nov 30, 2021

@KubaSz yes, this should be fixed by #181. But wait, you said you patched on top of faasm/faasm#540; that should already contain #181 in a way 🤔

@eigenraven (Collaborator)

@csegarragonz It won't contain it, as that PR is a couple of commits behind on faabric, and the way the faabric replacement works is that it fully replaces the tree with this one. I will merge master and rerun the tests just to make sure; then it'll be ready to merge (I can't merge it myself as I don't have permissions on this repo).

@eigenraven (Collaborator)

This is a moment where a tool like https://github.com/bors-ng/bors-ng is useful, but I don't think faasm is at a scale where it's worth setting up yet; this is just a temporary influx of PRs with all of us working on some changes right now.

@csegarragonz (Collaborator, Author)

@KubaSz yes, this is ready to go; once it's ready, ping me here and I will merge it. Thanks again, this looks great.

@eigenraven (Collaborator)

Huh: "Warning: Docker pull failed with exit code 143, back off 2.786 seconds before retry." when fetching containers for ubsan

@eigenraven (Collaborator)

@csegarragonz I think it's ready to merge: it fails on the same things your PR was failing on in faasm, and only in the dist-tests, so I don't think it needs to be blocked by that.

@csegarragonz csegarragonz merged commit 27ffd58 into master Nov 30, 2021
@csegarragonz csegarragonz deleted the sanitisers branch November 30, 2021 11:00